Disclaimer: The above pic is not at all related to the article, however it might help you to remember the significance of what we are about to discuss.
Suppose you have a large dataset and you analyze it in groups, observing the same trend in each group. However, when you combine the groups, the trend either disappears or reverses. This phenomenon is termed Simpson’s Paradox.
For a better understanding, let’s take the following example, in which we look at Lisa and Bart and their weekly performance in completing assignments.
In the first week, Lisa did not complete the only assignment she got, while Bart completed 1 of the 4 assignments he received. Clearly, Bart performed better, with a 25% completion rate as opposed to Lisa’s 0%. In the second week, Lisa completed 3 of her 4 assignments, while Bart completed the only assignment he received. Again, Bart has the edge, with a completion rate of 100% compared to Lisa’s 75%.
From the above, we can infer that in each week Bart has a better completion percentage than Lisa. Analyzing further, we sum the assignments completed over the two weeks, and this is what we observe:
Overall, Lisa has completed 3 of the 5 assignments assigned to her, while Bart has completed 2 of his 5 assignments. Therefore, Lisa has the higher overall completion rate, 60% compared to Bart’s 40%. This might seem a bit weird at first, but once we know the causes that lead to Simpson’s Paradox, we can watch for it in any analysis involving ratios and percentages.
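The arithmetic in this example is easy to check with a few lines of Python. The sketch below is only an illustration: the names `records`, `weekly_rates` and `overall_rate` are made up for this article, and the numbers are the ones from the Lisa and Bart example.

```python
from fractions import Fraction

# (completed, assigned) per week, taken from the Lisa and Bart example.
# The dictionary layout is an assumption made for this illustration.
records = {
    "Lisa": [(0, 1), (3, 4)],
    "Bart": [(1, 4), (1, 1)],
}

def weekly_rates(weeks):
    """Completion rate for each week, kept as exact fractions."""
    return [Fraction(done, total) for done, total in weeks]

def overall_rate(weeks):
    """Completion rate after pooling all weeks: total completed / total assigned."""
    done = sum(d for d, _ in weeks)
    total = sum(t for _, t in weeks)
    return Fraction(done, total)

for name, weeks in records.items():
    per_week = ", ".join(f"{float(r):.0%}" for r in weekly_rates(weeks))
    print(f"{name}: weekly {per_week}; overall {float(overall_rate(weeks)):.0%}")
```

Note that the overall rate is computed from the pooled counts, not by averaging the weekly percentages. Averaging the percentages would give Bart 62.5% and Lisa 37.5%, which hides the reversal.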
When does Simpson’s Paradox occur?
Simpson’s Paradox can occur when one of the following holds:
- When the denominators (sample sizes) of the ratios are not the same across groups.
- When only percentages are reported, and not the underlying ratios (counts).
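Why unequal denominators matter can be seen by writing the overall rate as a weighted average of the group rates. The notation here is an assumption introduced for this sketch: c_i is the number of completed assignments in week i, n_i the number assigned, r_i = c_i / n_i the weekly rate, and N the total number assigned.

```latex
r_{\text{overall}}
  = \frac{\sum_i c_i}{\sum_i n_i}
  = \sum_i \frac{n_i}{N}\, r_i ,
  \qquad N = \sum_i n_i .

% Lisa: most of her weight falls on her good week
r_{\text{Lisa}} = \tfrac{1}{5}\cdot 0 + \tfrac{4}{5}\cdot\tfrac{3}{4} = \tfrac{3}{5} = 60\%

% Bart: most of his weight falls on his bad week
r_{\text{Bart}} = \tfrac{4}{5}\cdot\tfrac{1}{4} + \tfrac{1}{5}\cdot 1 = \tfrac{2}{5} = 40\%
```

Bart has the higher rate in each week, but the weights n_i / N differ between the two: four of Lisa’s five assignments fall in the week where rates are high, while four of Bart’s five fall in the week where rates are low. If both had the same number of assignments in each week, the weights would match and the reversal could not happen.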
How does it affect decision making?
Whether you choose the aggregated numbers or the segregated numbers for your analysis determines what decision you take. The choice also depends on the type of data you have and the type of numbers you need. Say, in the current example, week 1 contained Maths assignments and week 2 contained Science assignments. If you want to know the better-performing candidate for Maths alone, or for Science alone, you would prefer Bart, since he has the higher completion rate in each week. If instead you want to know the better-performing candidate irrespective of the subject of the assignments, you would use the aggregated values, which favour Lisa.
Choosing between the aggregated and the segregated numbers depends largely on domain knowledge about the dataset and on the problem we are trying to solve.
As per Wikipedia, psychological interest in Simpson’s paradox seeks to explain why people deem sign reversal to be impossible at first, offended by the idea that an action preferred both under one condition and under its negation should be rejected when the condition is unknown. However, it is still not clear where people get this strong intuition from.
Simpson’s Paradox is also known as the Yule–Simpson effect. Decision making in its presence can be simplified using causal Bayesian networks (the back-door criterion), which will be discussed in the advanced part on Simpson’s Paradox. Although the paradox may seem to have only a small impact on analysis, it becomes troublesome when the domain information and the problem statement are taken casually.