In the previous note, we learned about data science and the core activities used to extract meaningful information from data. In this note, we will learn how to read and understand data and how to reason about relationships among variables.
An important part of a data scientist's job is to explain what they see in their data: what is causing the effects that were observed? And if they recommend a course of action, what might the results be? Questions about cause and effect therefore arise frequently.
For example, an article published in heart.bmj.com says that chocolate is good for the heart. How did the authors come to this conclusion? They may have conducted a few experiments and published the report based on their findings.
How do we make sense of such claims, and how do we carry out such analysis by ourselves? To do so, we have to observe.
Observation
- Individuals, study subjects, participants, units: we observe individuals, who are also called study subjects, participants, or units. In the example above, the individuals were European adults. Individuals can be people, cars, countries, and so on.
- Treatment: each individual receives a treatment; in the example above, it was chocolate consumption. The treatment can be thought of as what is actually applied to the individuals in the study.
- Outcome: in the case of the above example, it was heart disease.
In the example above, the researchers asked whether there is any relation between chocolate consumption and heart disease, that is, between treatment and outcome. The formal word for such a relation is association, and it covers any relation or link. To find an association, we need observed data on both the treatment and the outcome; the data does not have to come from an experiment.
The example above does not prove a cause-and-effect relationship between chocolate and a reduced risk of heart disease. A different question is: does chocolate consumption reduce heart disease? This is a question of causality, and it is typically not easy to answer. So we saw an association, but establishing causality requires more effort. In other words, association tells us where to find an outcome, while causality tells us what we could change to bring it about or prevent it.
Association
Association is a statistical relationship between two variables (two different observed quantities), and two variables may be associated without any causal relationship. For example, in a given year there may be a statistical association between suicidal tendencies among teenagers and the number of love-story movies released. However, this obviously does not mean that love-story movies cause suicidal tendencies among teenagers to increase.
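One simple way to put a number on an association between a yes/no treatment and a yes/no outcome is to compare the outcome rate in the two groups. The sketch below does this in Python. The counts are invented purely for illustration and are not taken from the chocolate study, and the function names are our own.

```python
def outcome_rate(cases, total):
    """Share of individuals in a group who had the outcome."""
    return cases / total


def risk_difference(treated_cases, treated_total, control_cases, control_total):
    """Outcome rate in the treated group minus the rate in the untreated group."""
    return (outcome_rate(treated_cases, treated_total)
            - outcome_rate(control_cases, control_total))


# Hypothetical counts (assumption): 1000 chocolate eaters with 120 cases of
# heart disease, and 1000 non-eaters with 170 cases.
rd = risk_difference(120, 1000, 170, 1000)
print(f"risk difference: {rd:.3f}")
```

A value far from zero indicates an association. By itself it says nothing about whether the treatment caused the difference.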
Causality
Causation means that the treatment produces the effect. A cause must be associated with the outcome, but simply demonstrating an association is not enough to establish causality.
To establish causality, the most important step is to compare two groups that are similar in every respect apart from the treatment, and then look at the difference in their outcomes.
If the two groups differ systematically in some way other than the treatment, we may have trouble identifying which of the differences leads to the difference in outcomes. This often happens in an observational study, where we observe differences between self-selected groups of people.
The effect of alcohol on babies' birth weight is a question that calls for an observational study, because no woman can be asked to drink during pregnancy for the sake of an experiment.
Here we are trying to see whether drinking alcohol during pregnancy affects the baby's birth weight. We cannot randomly select women to drink during their pregnancy and then see what happens to their children; they are unlikely to agree. So, in this case, we might observe the group of women who drink during their pregnancy and the group of women who don't, and compare their babies. But of course, the women who have chosen to drink or not to drink are also making other choices, and some of those other choices may be responsible for any difference that we observe between the babies of the two groups. When underlying differences that we cannot see lead us astray, they are called confounding factors.
Confounding
A confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association. Confounding is a causal concept and, as such, cannot be described in terms of correlations or associations.
Here is another example. Some decades ago, it was believed that drinking coffee causes lung cancer, because lung cancer rates were higher among coffee drinkers than among people who did not drink coffee. But the confounding variable was smoking: in those days, people sat in cafes, smoked, and drank coffee. We now know through biology that smoking causes lung cancer. So, when we have an observational study, we should watch out for confounding factors.
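The coffee example can be reproduced with a small simulation. The sketch below builds an artificial population in which smoking raises both the chance of drinking coffee and the chance of lung cancer, while coffee itself has no effect at all. All probabilities are made-up assumptions chosen only to make the pattern visible.

```python
import random


def simulate(n=100_000, seed=0):
    """Return a list of (smoker, coffee, cancer) tuples for n simulated people."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.30
        # Assumption: smokers are more likely to drink coffee.
        coffee = rng.random() < (0.80 if smoker else 0.30)
        # Assumption: cancer risk depends on smoking only, never on coffee.
        cancer = rng.random() < (0.15 if smoker else 0.01)
        rows.append((smoker, coffee, cancer))
    return rows


def cancer_rate(rows, coffee, smoker=None):
    """Cancer rate among people with the given coffee habit (and smoking status, if given)."""
    group = [r for r in rows
             if r[1] == coffee and (smoker is None or r[0] == smoker)]
    return sum(r[2] for r in group) / len(group)


rows = simulate()

# Comparing all coffee drinkers with all non-drinkers shows a clear gap.
crude_gap = cancer_rate(rows, True) - cancer_rate(rows, False)

# Comparing within smokers only, and within non-smokers only, the gap nearly vanishes.
smoker_gap = cancer_rate(rows, True, smoker=True) - cancer_rate(rows, False, smoker=True)
nonsmoker_gap = cancer_rate(rows, True, smoker=False) - cancer_rate(rows, False, smoker=False)

print(f"all people:   {crude_gap:+.3f}")
print(f"smokers:      {smoker_gap:+.3f}")
print(f"non-smokers:  {nonsmoker_gap:+.3f}")
```

Coffee and cancer are associated in the whole population, yet once we compare like with like on smoking, the association disappears. In a real observational study we often cannot do this, because the confounder may not have been measured.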
So how do we create two groups that are similar apart from the treatment? The easiest answer is to assign individuals to the groups at random. With random assignment, we can also account mathematically for the variability in the assignment and for the sizes of the differences. This kind of experiment is called a randomized controlled experiment.
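To see why random assignment helps, we can sketch it in Python. The population below is artificial: each unit carries a hidden trait (here called smoker, with an assumed 30% share) that the experimenter does not look at. The helper names are our own, and the snippet only illustrates the idea; it is not a full experimental design.

```python
import random


def randomize(units, seed=0):
    """Shuffle the units and split them into a treatment and a control group."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy so the original order is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def share_smokers(group):
    """Share of units in the group that carry the hidden trait."""
    return sum(u["smoker"] for u in group) / len(group)


# Hypothetical population (assumption): 10,000 units, about 30% smokers.
rng = random.Random(1)
population = [{"id": i, "smoker": rng.random() < 0.30} for i in range(10_000)]

treatment, control = randomize(population)
balance_gap = share_smokers(treatment) - share_smokers(control)

print(f"smokers in treatment: {share_smokers(treatment):.3f}")
print(f"smokers in control:   {share_smokers(control):.3f}")
```

Because the split ignores every characteristic of the units, the hidden trait ends up in roughly equal shares in both groups, so it cannot systematically explain a difference in outcomes between them.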
This note is published under the CC BY-NC-SA 4.0 license.