Let’s say you are surveying 100 people out of 10,000. You want to analyse the data from your sample of 100 to get answers about the likely behaviours and preferences of the overall 10,000-person population.
Part of your project focuses on equity among sexual orientations. You don’t want to leave anyone out, and you know that a question about sexual orientation where people select ‘heterosexual or homosexual’ isn’t inclusive enough. You consult experts and the local community and decide to include ‘Heterosexual, Gay, Lesbian, Bisexual, Pansexual, or Asexual’ as options in that question.
Once your responses have come in, you have data from respondents across each of those categories. However, only a few respondents identified as bisexual, and only one person each identified as pansexual and asexual. When trying to analyse the data to represent the responses of all these orientations, you realize that you have so little data from some categories that you can’t say anything statistically relevant about them. You can’t extrapolate the preferences and likely opinions of all asexual-identifying people in your population of 10,000 from one person’s data.
Rather than completely discount the categories in which you have very few responses, you decide it’s better to combine them into an amalgamated category, so that they can be better represented. When you publish your findings, you frame your results as Heterosexual, Homosexual and Other, the very thing you were trying to avoid. People are mad and hurt that they aren’t well represented and feel lumped into an ‘other’ category. Respondents who took your survey feel cheated by being asked detailed questions that you just combined anyway.
This kind of ‘collapsing’ or ‘amalgamating’ of data categories happens all the time, and not just with sexual orientation. Almost all demographic questions are susceptible to being limited in the survey or condensed in the analysis: race, ethnicity, gender, language, etc. Imagine how difficult and how statistically useless it would be to list all possible spoken languages as options on a survey. How can we be inclusive without making minority categories so small that only the majority data has statistical relevance?
Competing Priorities:
- It’s important that the diversity among your respondents is given respect.
- It’s important that the results you show be statistically meaningful.
Option 1: Collapsing
The first ethical issue when collapsing data categories after the initial analysis, for example into ‘Heterosexual/Non-Heterosexual’, is that it frames the categories so that heterosexual is normal and everyone else is “other”. Second, it categorizes your respondents in a way that they did not categorize themselves, removing the agency of choice that you offered them. The least ethical instances of collapsing occur when people use a lot of inclusive categories on the public-facing survey just to appear inclusive, covering their own butts with the public while planning to collapse the data anyway.
From a mathematical point of view, collapsing the sexual orientation into two groups is a problem because your results get a lot less accurate. The attitudes and behaviors might vary a lot between gay, lesbian, or bisexual respondents, which is important to measure and acknowledge. Your results will bury this if you report only on Hetero/Non-Hetero.
Option 2: Not Collapsing
If you have a bunch of data categories with a small number of responses, it’s going to reduce the statistical certainty of what you can say about your overall population. It’s not acceptable to say something like “73% of Heterosexual and 88% of Asexual identifying people are in favor of the new law” when you have hundreds of responses from one identity and only a handful from the other. You have to report your findings with their statistical confidence, which depends heavily on the number of respondents in that category.
Not collapsing therefore leads to an issue where only the majority categories, the ones with lots of responses, have strong statistical meaning. Your efforts to be inclusive actually weaken the voice of the least represented groups. Of course, in many cases you will just have a difference in statistical confidence between groups. If your respondents are in three categories at 60%, 30% and 10%, you can still report on each of them, as long as you show the difference in statistical confidence between them.
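To make the link between group size and confidence concrete, here is a minimal sketch that computes an approximate 95% interval for a proportion in groups of different sizes. It uses the Wilson score interval, which is one common choice among several and not necessarily the method your project would pick. The function name `wilson_interval` and all of the counts are made up for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if n == 0:
        raise ValueError("no respondents in this category")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical counts: (respondents in favour, respondents in the category)
groups = {
    "60 respondents": (44, 60),
    "30 respondents": (22, 30),
    "10 respondents": (7, 10),
    "1 respondent": (1, 1),
}

for name, (yes, n) in groups.items():
    low, high = wilson_interval(yes, n)
    print(f"{name}: {yes}/{n} in favour, 95% interval {low:.0%} to {high:.0%}")
```

With a single respondent, the interval runs from roughly 21% to 100%, which is another way of saying that one answer tells you almost nothing about the wider group. The intervals for the larger groups are much narrower, and reporting them side by side shows your audience how much weight each figure can carry.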
So, what to do? Of course, as always, the answer is “It Depends.” It depends on the research question you’re trying to answer. It depends on what the people you’re working with need to know. It depends on how the people you’re collecting data from feel about their representation. Those are just a few of the factors.
What You Can Do:
- Decide on how to deal with this before crafting your survey and before analysis.
- Report your results in more than one way, including collapsed, uncollapsed, and hybrid perspectives.
- Be transparent about the dilemmas, compromises and choices you are addressing with your data team, your survey respondents, and your audience.
Deciding how to approach this issue in the Project Design phase, before creating your survey or conducting your analysis, is the first way to dramatically increase the equity of your project. If you decide that it’s most important to have three categories at the end that have strong statistical confidence, you can design a system for that. Let’s say you’ve decided in advance to report three categories: the one with the most respondents, the one with the second most respondents, and a combination of all the remaining respondents. This gives you several advantages. It will allow you to still ask about more than two or three categories on the survey, increasing inclusiveness. It will allow you to not assume what those three categories are; you don’t know that heterosexual respondents will always outnumber bisexual respondents in all surveys. It will allow you to tell survey respondents how you intend to analyse the data so they don’t feel mistreated when you do combine some categories. There are all kinds of ways to address this issue in Project Design, including what questions to include, how to weight categories, how to report categories, and more. Deciding in advance about your project’s systems and best practices will help you sidestep many equity issues.
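The "top two plus everyone else" rule described above can be written down before any data arrives. The sketch below is one possible way to do it, not a prescribed method. The function name `collapse_to_top`, the label for the combined group, and the response counts are all hypothetical.

```python
from collections import Counter

def collapse_to_top(responses, keep=2, remainder_label="Combined remaining categories"):
    """Keep the `keep` largest categories and merge the rest into one group.

    The kept categories are chosen from the data, not assumed in advance.
    Ties are broken by which category appears first in `responses`.
    """
    counts = Counter(responses)
    top = {label for label, _ in counts.most_common(keep)}
    collapsed = Counter()
    for label, n in counts.items():
        collapsed[label if label in top else remainder_label] += n
    return dict(collapsed)

# Hypothetical answers from a sample of 100 respondents
responses = (
    ["Heterosexual"] * 70 + ["Gay"] * 12 + ["Lesbian"] * 9
    + ["Bisexual"] * 7 + ["Pansexual"] * 1 + ["Asexual"] * 1
)

print(collapse_to_top(responses))
```

Because the rule is fixed in advance and the categories are picked by the counts, you can describe it to respondents on the survey itself, and the same code works unchanged if a different category turns out to be the largest.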
You don’t have to report your findings in only one way. You can break out your results in one way that shows the strongest statistical meaning; maybe bisexuals, asexuals, and lesbians do feel the same way about an issue, and combining their data gives their responses a stronger voice. You can then also show all the categories individually, while including information about the statistical confidence, to show your audience that your survey did include the pansexual orientation, even if you didn’t get any respondents in that category. You can create a hybrid where some of the categories are collapsed in ways you think are the most meaningful. Maybe you report the orientations grouped by their responses: most likely to say yes: lesbian, bisexual and pansexual orientations; most likely to say no: gay, straight and asexual orientations. Offering your audience all the information increases their confidence in your reporting and methodology while simultaneously strengthening your results.
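As a rough illustration of reporting the same data more than one way, the sketch below summarises one set of answers both uncollapsed and with a hybrid grouping, always showing the number of respondents next to each share. Everything here is hypothetical: the `summarise` helper, the `HYBRID` mapping, and the counts are invented for the example.

```python
OFFERED = ["Heterosexual", "Gay", "Lesbian", "Bisexual", "Pansexual", "Asexual"]

# Hypothetical hybrid view: merge two categories, leave the rest as asked
HYBRID = {"Lesbian": "Lesbian + Bisexual", "Bisexual": "Lesbian + Bisexual"}

def summarise(rows, offered, grouping=None):
    """Return {group: (n_yes, n_total)}, including offered options with no respondents."""
    grouping = grouping or {}
    totals = {grouping.get(c, c): [0, 0] for c in offered}
    for category, said_yes in rows:
        entry = totals.setdefault(grouping.get(category, category), [0, 0])
        entry[0] += int(said_yes)
        entry[1] += 1
    return {group: tuple(v) for group, v in totals.items()}

def answers(category, yes, total):
    """Build `total` rows for a category, `yes` of them in favour."""
    return [(category, True)] * yes + [(category, False)] * (total - yes)

# Hypothetical responses; nobody selected Pansexual in this sample
rows = (
    answers("Heterosexual", 45, 71) + answers("Gay", 5, 12)
    + answers("Lesbian", 7, 9) + answers("Bisexual", 5, 7)
    + answers("Asexual", 0, 1)
)

for title, grouping in [("Uncollapsed", None), ("Hybrid", HYBRID)]:
    print(title)
    for group, (yes, n) in summarise(rows, OFFERED, grouping).items():
        share = f"{yes / n:.0%} in favour" if n else "no respondents"
        print(f"  {group}: {share} (n={n})")
```

Keeping the zero-response option in the uncollapsed view shows that the category was offered, and printing `n` beside every percentage lets readers judge for themselves how much each figure can support.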
Lastly, you have to be transparent about this issue with all stakeholders of your project: the people working on it, the people involved in it, and the audience of your findings. Data science with humans is inherently full of difficult decisions and compromises. Don’t craft surveys that remove agency from respondents, don’t hide differences in statistical confidence between categories, and don’t conceal assumptions and choices you’ve made. This can be difficult because often you are inclined to do these things to protect equity. Letting people see how you’ve grappled with issues like this will only increase trust and true equity in your data projects.