The analysis of human resources data typically relies on computer databases that were built to process transactions. Their purpose normally centers on administration and recordkeeping, so the variables available for analysis are not necessarily the ones an analyst would choose for the purpose at hand. As a result, critical analysis variables are often missing. This can lead to "spurious correlations," a common and serious interpretation fallacy. For example, suppose that the missing critical variable is correlated with race, age, or gender. Then any other variable that correlates with the critical variable will probably also be correlated with race, age, or gender. These correlations are spurious because their primary cause is the missing critical variable. Nonetheless, such spurious correlations are at times used as indicators of discrimination. The purpose of this paper is to illustrate the widespread occurrence of spurious correlations.
My favorite example is the following:
- Get data on all the fires in San Francisco for the last ten years.
- Correlate the number of fire engines at each fire with the damage in dollars at each fire.
The reason I like this example is that the conclusion is so absurd: taken at face value, the positive correlation suggests that fire engines cause damage. Anyone will quickly recognize that both variables result from, and are correlated with, the overall size of the fire. However, many spurious correlations do not seem absurd, and some seem compelling.
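The fire example can be sketched with simulated data. The code below is only an illustration under stated assumptions, not an analysis of real San Francisco records: the fire-size index, the noise levels, the dollar scale, and the function names (`simulate_fires`, `pearson`, `residuals`) are all hypothetical. By construction, engines and damage each depend only on fire size. The sketch computes the raw correlation between engines and damage, and then the partial correlation after the linear effect of fire size has been removed from both.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after fitting a least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate_fires(n=5000, seed=0):
    """Generate hypothetical fires; all numbers are made up for illustration."""
    rng = random.Random(seed)
    sizes, engines, damages = [], [], []
    for _ in range(n):
        size = rng.uniform(1, 10)  # assumed fire-size index (the critical variable)
        # Engines dispatched depend on fire size only, plus noise.
        engines.append(max(1, round(size + rng.gauss(0, 1))))
        # Dollar damage depends on fire size only, plus noise.
        damages.append(10_000 * size + rng.gauss(0, 10_000))
        sizes.append(size)
    return sizes, engines, damages


sizes, engines, damages = simulate_fires()

# Raw correlation, as if fire size were missing from the database.
raw_r = pearson(engines, damages)

# Partial correlation: correlate what is left of each variable
# once the linear effect of fire size has been removed.
partial_r = pearson(residuals(engines, sizes), residuals(damages, sizes))

print(f"engines vs. damage, size ignored:    r = {raw_r:.2f}")
print(f"engines vs. damage, size controlled: r = {partial_r:.2f}")
```

In this simulation the raw correlation is strongly positive, while the partial correlation is close to zero. The point carries over to transaction databases: when the critical variable was never recorded, only the first number can be computed, and nothing in the data flags it as spurious.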