If we want to Democratize, Demonstrate, and Demystify data, we need to pay attention to the words we use.
In data science, there are often a lot of specific, unusual, or downright tricky words, terms, and phrases. Sometimes a term that means one thing in general means a different thing in scientific usage. Also, there is a lot of mystique and allure around being a “data person” so there’s always the temptation to use fancy words as a status symbol. We’re making a point to try to distinguish these different types of words.
Some words used around data are actually describing something pretty unique and we think it could be very helpful for people interested in data to understand their meaning.
Some terms mean one thing in scientific usage and a different thing in common everyday use.
The third category is fancy terms that can often be replaced by much more commonly understood words.
We’re setting up this ‘Data Speak Decoder’ to help anyone navigate through these three types of terms. We’d love to get your additions as we build this lexicon. Email them to Heather. We also really need some help adding non-English words to this list. If you can help us with that, email us.
DATA TERMS TO KNOW
Anonymized
(Suggested by Marika)
Anonymization is an important term to understand when working with data, particularly with an eye towards data equity. Essentially, anonymization means taking data that has been collected from individuals and changing it so that it can no longer be linked to a specific individual. There are a few different ways of doing this, some more successful than others. Sometimes anonymization means making the data permanently unrelated to any specific individual. Other times it means making it identifiable only with a unique secret code. Usually anonymization applies to two kinds of identifiers: direct and indirect. Direct identifiers are the obvious variables, such as name, address, or telephone number, which specifically point to a participant. Indirect identifiers are variables that could reveal an individual when pieced together, for example by cross-referencing occupation, employer, and location.
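To make the two approaches and the two kinds of identifiers a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the field names, the secret key, and the helper functions `pseudonymize`, `strip_identifiers`, and `unique_combinations` are all invented, and real anonymization takes far more care than this.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical records: every name and value below is invented for illustration.
records = [
    {"name": "Ana Diaz", "phone": "555-0101", "occupation": "nurse",
     "employer": "City Hospital", "location": "Springfield", "score": 7},
    {"name": "Ben Okoro", "phone": "555-0102", "occupation": "nurse",
     "employer": "City Hospital", "location": "Springfield", "score": 4},
    {"name": "Chen Li", "phone": "555-0103", "occupation": "beekeeper",
     "employer": "Self-employed", "location": "Springfield", "score": 9},
]

DIRECT = ("name", "phone")
INDIRECT = ("occupation", "employer", "location")


def pseudonymize(record, secret_key):
    """Swap the direct identifiers for a code that depends on a secret key."""
    message = "|".join(record[field] for field in DIRECT).encode("utf-8")
    code = hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:12]
    kept = {k: v for k, v in record.items() if k not in DIRECT}
    return {"code": code, **kept}


def strip_identifiers(record):
    """Drop direct and indirect identifiers altogether."""
    return {k: v for k, v in record.items() if k not in DIRECT + INDIRECT}


def unique_combinations(rows):
    """Return the indirect-identifier combinations that appear exactly once."""
    counts = Counter(tuple(row[field] for field in INDIRECT) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


key = b"keep-this-somewhere-safe"  # hypothetical secret key
coded = [pseudonymize(r, key) for r in records]      # "secret code" approach
stripped = [strip_identifiers(r) for r in records]   # "permanently unrelated" approach
risky = unique_combinations(coded)                   # combinations only one person has
```

In this made-up dataset the two nurses share the same occupation, employer, and location, so that combination does not single either of them out. The beekeeper's combination appears only once, so even with the name and phone number replaced by a code, those three indirect identifiers together could still point to one person.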
Bias
When talking about equity in data science, the word bias comes up a lot. And I personally think that confusion around this word can be traced to the heart of some of the myths around the objectivity of data and statistics. Certain statistical methods that are considered to be technically powerful can produce what are called “unbiased estimates” (we’ll talk about the word estimates in a minute). So these “unbiased estimates” which data scientists are so proud of must be objective and correct, yes? No. In this instance the word “unbiased” means that if you repeated the calculation on new data many, many times, the results would average out very close to the number you are trying to estimate. In a statistical sense, biased or unbiased does not describe how correct, how representative, or how accurate any one result is. In everyday usage, however, bias refers to preconceptions and prejudices that affect a person’s view of the world. So oftentimes a statistically biased estimate is the most unbiased result in the everyday sense, because a statistically biased estimate often contains the least amount of racism, sexism, etc.
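The statistical meaning of “unbiased” can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the population is simulated (a standard normal, so its true variance is 1), and the sample size, number of repeats, and the helper `spread` are all invented for this example. It compares two common ways of estimating variance, one dividing by n and one dividing by n - 1.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
TRUE_VARIANCE = 1.0     # variance of the simulated population (standard normal)
SAMPLE_SIZE = 5
REPEATS = 20000


def spread(sample, divisor):
    """Sum of squared distances from the sample mean, divided by `divisor`."""
    centre = sum(sample) / len(sample)
    return sum((x - centre) ** 2 for x in sample) / divisor


divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(REPEATS):
    sample = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    divide_by_n.append(spread(sample, SAMPLE_SIZE))
    divide_by_n_minus_1.append(spread(sample, SAMPLE_SIZE - 1))

# Average each estimator over all the repeats.
average_biased = sum(divide_by_n) / REPEATS            # settles near 0.8
average_unbiased = sum(divide_by_n_minus_1) / REPEATS  # settles near 1.0

print(f"true variance:           {TRUE_VARIANCE}")
print(f"average, divide by n:    {average_biased:.3f}")
print(f"average, divide by n-1:  {average_unbiased:.3f}")
```

Over many repeats the divide-by-(n - 1) version averages out close to the true value, which is all that “unbiased” means here. Neither number says anything about who was included in the data, who was left out, or whether the question being asked was a fair one.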
DATA TERMS WITH MULTIPLE MEANINGS
Error
To a statistician, error is a measure of an estimate’s precision. To everyone else, errors are just mistakes. Error is really a sort of difference: the difference between an observed value and its corresponding theoretical value. So it doesn’t mean ‘a mistake’. In pretty much any statistical analysis, there will be an error associated with each observation. That doesn’t mean the experimenters or statisticians messed up, just that there is some random variability around the theoretical or expected results.
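Here is a tiny Python sketch of error as a difference. The observations and the “theory” (a straight line, `2 * x + 1`) are both invented for illustration. The point is only that each observation has its own error, even though nobody made a mistake.

```python
# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4, 5]
observed = [3.2, 4.7, 7.1, 9.4, 10.6]


def theoretical(x):
    """Hypothetical model of what we expect to see for a given x."""
    return 2 * x + 1


# Error for each observation: observed value minus theoretical value.
errors = [round(y - theoretical(x), 2) for x, y in zip(xs, observed)]
mean_error = sum(errors) / len(errors)

for x, y, e in zip(xs, observed, errors):
    print(f"x={x}  observed={y}  theoretical={theoretical(x)}  error={e:+.1f}")
```

Some of the errors are positive and some are negative, and in this made-up example they roughly cancel out on average. That scatter is the random variability the definition is talking about.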
Impact
(Suggested by Kwame)
Impact is a tricky word when working with data. It has been abused and misused often to avoid equity issues, so it’s good to take some extra time when you see someone using it. In one everyday sense, impact is when something hits or makes contact with another thing. The data sector uses a version of this to mean having an influential or strong effect on something. Impact is not inherently a positive thing. What counts as evidence of impact is, and should be, controversial.
DATA TERMS TO AVOID
Big Data
(Suggested by Jen)
Much like significant, big data is a term we should stop using. There is no longer one coherent meaning for it. It was once used to mean something about the actual size of a dataset. But now it is used as a synonym for the size of a dataset, the method of data collection regardless of the size of the dataset, the method of analysis regardless of the size of the dataset, a buzzword to attract either positive or negative attention, and a form of either bragging or bullying. Let’s stop using it.
Normal
Another word to reconsider using is normal. It’s just not worth all the implications for a very small amount of technical statistical accuracy. The dictionary definition of the word includes stuff like “conforming to a type, standard, or regular pattern”, as in: my normal routine involves having a cup of tea as soon as I wake up. However, in common usage the word “normal” has strongly positive and negative associations. Just check out Urban Dictionary. From a technical statistical perspective, normal means data that follows a bell-shaped curve. There is nothing inherently good or bad about this pattern. However, it’s very easy to start thinking of normal or non-normal data as good data or bad data, or as data that does or doesn’t conform to the expectations of a dominant or sometimes oppressive class of society.
Significant
This is probably one of the most problematic terms involved with data. In many everyday English language conversations, when we say something “is significant” we usually mean that it’s meaningful or important. However, in strictly statistical terms, “significant” has a very technical meaning. Put very simply, it means that a calculation produced a p-value smaller than some chosen threshold. At this point even the American Statistical Association thinks we need to stop using this word in reporting on research.
It provides these very clear guidelines:
- Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p < 0.05).
- Don’t believe that an association or effect exists just because it was statistically significant.
- Don’t believe that an association or effect is absent just because it was not statistically significant.
- Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
- Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).
And most importantly: do not say “statistically significant” or use any variant, words, asterisks, or other statistical trickery to convey the same message.
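To see why the threshold in the first guideline is called arbitrary, here is a minimal Python sketch. It is an illustration under assumptions, not a recommended analysis: the two groups of measurements are invented, and the p-value comes from a simple exact permutation test written for this example (the function `permutation_p_value` is hypothetical, not from any library).

```python
from itertools import combinations

# Hypothetical measurements for two small groups (invented numbers).
group_a = [5, 6, 7, 8]
group_b = [1, 2, 3, 4]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b):
    """Share of all possible regroupings whose gap in means is at least the observed gap."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    total = 0
    extreme = 0
    for picked in combinations(range(len(pooled)), len(a)):
        first = [pooled[i] for i in picked]
        second = [pooled[i] for i in range(len(pooled)) if i not in picked]
        total += 1
        if abs(mean(first) - mean(second)) >= observed:
            extreme += 1
    return extreme / total


p = permutation_p_value(group_a, group_b)

# The same p-value checked against two commonly used thresholds.
verdicts = {threshold: p < threshold for threshold in (0.05, 0.01)}

print(f"p-value: {p:.4f}")
print(verdicts)
```

With these invented numbers the p-value is 2 out of 70 regroupings, roughly 0.029. It falls below 0.05 but not below 0.01, so the very same result would be labelled “significant” or “not significant” depending only on which threshold someone happened to pick.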