Descriptive Statistics – Measures of association – Pearson’s Chi-Squared Statistics

In the earlier note on contingency tables, we saw how to create a contingency table and interpret the relationship between two variables through marginal and conditional frequency distributions, which can be obtained from either absolute or relative frequencies.

In this note, we will learn a few statistical tools that use a contingency table to show the association between two categorical (count) variables: Pearson’s Chi-Squared statistic, Cramer’s V statistic, and the Contingency Coefficient. They quantify the degree of association between variables, much as the correlation coefficient does for continuous variables and the rank correlation coefficient does for ordinal or ranked data.

Pearson’s Chi-Squared Statistics

It is used to measure the association between two variables in a contingency table. For a k x l contingency table, the \tilde{\chi}^2 statistic is given as follows.

\tilde{\chi}^2 = \sum_{i=1}^k \sum_{j=1}^l \left [ \dfrac{\left [  n_{ij} - \frac{n_{i+} n_{+j}}{n} \right ]^2}{\frac {n_{i+} n_{+j}}{n}} \right ] ;  0 \leq \tilde{\chi}^2 \leq n [ min(k,l) - 1]

Marginal Frequencies

  • n_{i+} = \sum_{j=1}^l n_{ij} is the marginal frequency of the i-th category of X: the index i is held fixed while j runs from 1 to l.
  • n_{+j} = \sum_{i=1}^k n_{ij} is the marginal frequency of the j-th category of Y: the index j is held fixed while i runs from 1 to k.

Total Frequency

  • n = \sum_{i=1}^k n_{i+} = \sum_{j=1}^l n_{+j} = \sum_{i=1}^k \sum_{j=1}^l n_{ij}

Absolute Frequencies

  • n_{ij} is the absolute (joint) frequency of the i-th category of X occurring together with the j-th category of Y. The joint frequency distribution tells how the two variables behave jointly.
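The definitions above translate almost line by line into code. The following is a minimal sketch in plain Python, not a reference implementation: the function names `pearson_chi2` and `chi2_upper_bound` and the counts in `counts` are made up for illustration, and the code assumes every row and column total is positive.

```python
def pearson_chi2(table):
    """Pearson's chi-squared statistic for a k x l table of absolute frequencies."""
    row_totals = [sum(row) for row in table]        # n_{i+}
    col_totals = [sum(col) for col in zip(*table)]  # n_{+j}
    n = sum(row_totals)                             # total frequency n
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Frequency expected in cell (i, j) if X and Y were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (n_ij - expected) ** 2 / expected
    return chi2


def chi2_upper_bound(table):
    """Largest possible value of the statistic: n * (min(k, l) - 1)."""
    n = sum(sum(row) for row in table)
    k, l = len(table), len(table[0])
    return n * (min(k, l) - 1)


# Hypothetical 2 x 2 counts, chosen only to exercise the functions.
counts = [[10, 20],
          [30, 40]]
print(round(pearson_chi2(counts), 4))  # 0.7937
print(chi2_upper_bound(counts))        # 100
```

Here the statistic is far closer to 0 than to the upper bound of 100.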

Interpretation of Pearson’s Chi-Squared Statistics

  • A value of \tilde{\chi}^2 close to zero implies a weak association between the two variables.
  • A value of \tilde{\chi}^2 close to n \times [min(k,l) - 1] implies a strong association between the two variables. Here min(k,l) is the smaller of the number of rows and the number of columns of the contingency table.
  • Values in between indicate a low, moderate, or high degree of association, depending on where they fall in this range.

The \tilde{\chi}^2 statistic is symmetric in the sense that its value does not depend on which variable is defined as X and which as Y.
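The two extremes of the range and the symmetry property can be checked numerically. The sketch below uses three hypothetical tables (none of them comes from the examples in this note): one whose rows are exactly proportional, one with all counts on the diagonal, and a 2 x 3 table that is compared with its transpose.

```python
import math


def pearson_chi2(table):
    """Pearson's chi-squared statistic for a table of absolute frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (n_ij - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, n_ij in enumerate(row)
    )


# Hypothetical tables, made up for this illustration.
independent = [[10, 20], [20, 40]]   # rows are proportional: no association
perfect = [[50, 0], [0, 50]]         # X determines Y completely, n = 100
table_2x3 = [[3, 5, 2], [4, 1, 5]]
transposed = [list(col) for col in zip(*table_2x3)]  # swap the roles of X and Y

chi2_independent = pearson_chi2(independent)
chi2_perfect = pearson_chi2(perfect)
print(chi2_independent)  # 0.0, the lower end of the range
print(chi2_perfect)      # 100.0, equal to n * (min(2, 2) - 1)
print(math.isclose(pearson_chi2(table_2x3), pearson_chi2(transposed)))  # True
```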

Example 1: A sample of 100 students was chosen and divided into two groups, weak and strong in academics. Some of the students were given tuition. We would like to see whether tuition helped improve the students’ academic performance. The data has been compiled in the following contingency table.

Contingency Table - Example 1
  • \tilde{\chi}^2 =  \left [ \frac{100 \times (40 \times 30 - 20 \times 10)^2}{50 \times 50 \times 40 \times 60}\right ] \approx 16.67
  •  n \times (min(k,l) - 1) =  100 \times (min(2,2) - 1) = 100

Pearson’s Chi-squared statistic is 16.67, which is close to neither zero nor the maximum of 100. So our interpretation is that the association between the two variables is moderate. This interpretation is subjective, as there is no fixed rule that says where a weak association ends and a moderate one begins.
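The calculation in Example 1 uses the shortcut form of the statistic for a 2 x 2 table with cells a, b (first row) and c, d (second row): n(ad - bc)^2 divided by the product of the four marginal totals. A minimal sketch follows. The cell layout a = 40, b = 10, c = 20, d = 30 is an assumption read off the numerator and denominator of the formula above, and `chi2_2x2` is a hypothetical helper name.

```python
def chi2_2x2(a, b, c, d):
    """Shortcut for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_col_product = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / row_col_product


# Assumed layout: a*d = 40*30, b*c = 10*20, marginal totals 50, 50, 60, 40.
chi2_example1 = chi2_2x2(40, 10, 20, 30)
print(round(chi2_example1, 2))  # 16.67
```

The shortcut gives the same value as summing the four cell contributions of the general k x l formula; it only saves arithmetic.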

Example 2: The following data on 20 persons records their age category and their response to the taste of a soft drink. The drink was served to children, young persons, and elderly persons, and each person rated its taste as good or bad.

Data set of 20 persons

We have constructed the following 2 x 3 contingency table from the data; the marginal frequencies appear as the row and column totals.

Contingency Table  - Example 2
  • \tilde{\chi}^2 = 0.278
  •  n \times (min(k,l) - 1) =  20 \times (min(2,3) - 1) = 20

Pearson’s Chi-squared statistic is 0.278, which is close to zero and so indicates a weak association. A limitation of Pearson’s Chi-squared statistic is that its range depends on the sample size and on the size of the contingency table, so the same value can mean different things in different situations.

Cramer therefore rescaled the statistic and proposed Cramer’s V statistic for a k x l contingency table.
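The dependence on sample size is easy to see numerically: multiplying every cell of a table by 10 leaves all proportions unchanged, yet the statistic grows tenfold. The sketch below uses a hypothetical 2 x 2 table (`base`) made up for this purpose.

```python
import math


def pearson_chi2(table):
    """Pearson's chi-squared statistic for a table of absolute frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (n_ij - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, n_ij in enumerate(row)
    )


# Hypothetical counts: same proportions, different sample sizes.
base = [[12, 8], [9, 11]]                          # n = 40
scaled = [[10 * x for x in row] for row in base]   # n = 400

chi2_base = pearson_chi2(base)
chi2_scaled = pearson_chi2(scaled)
print(round(chi2_base, 3))                         # 0.902
print(math.isclose(chi2_scaled, 10 * chi2_base))   # True
```

The pattern of association is identical in both tables, which is why a measure that divides out n is easier to compare across data sets.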

Cramer’s V Statistics

V = \sqrt{\frac{\tilde{\chi}^2}{n \times (min(k,l) - 1)}} ;   0 \leq V \leq 1  

The advantage of the V statistic is that it is simpler to interpret, as its values always lie between 0 and 1.

Interpretation of Cramer’s V Statistics

  • A value of V close to zero implies a weak association between the two variables.
  • A value of V close to 1 implies a strong association between the two variables.
  • Other values indicate a moderate association between the two variables.

For Example 1, \tilde{\chi}^2 = 16.67, so V = \sqrt{\frac{16.67}{100}} \approx 0.41. This again shows a moderate association. 

For Example 2, \tilde{\chi}^2 = 0.278, so V = \sqrt{\frac{0.278}{20}} \approx 0.12. This shows a weak association: taste does not depend much on age. 
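Both V values can be reproduced with a few lines of Python. This is a sketch rather than a library routine: `cramers_v` is a hypothetical helper that takes an already computed statistic, the Example 1 statistic is recomputed from the formula given earlier, and the Example 2 statistic is the reported value 0.278. Results are rounded to two decimals.

```python
import math


def cramers_v(chi2, n, k, l):
    """Cramer's V for a k x l table with total frequency n."""
    return math.sqrt(chi2 / (n * (min(k, l) - 1)))


# Example 1: statistic recomputed from the formula in the text (about 16.67).
chi2_example1 = 100 * (40 * 30 - 20 * 10) ** 2 / (50 * 50 * 40 * 60)
v_example1 = cramers_v(chi2_example1, n=100, k=2, l=2)

# Example 2: statistic taken as reported in the text.
v_example2 = cramers_v(0.278, n=20, k=2, l=3)

print(round(v_example1, 2))  # 0.41
print(round(v_example2, 2))  # 0.12
```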

Contingency Coefficient

The corrected version of Pearson’s contingency coefficient is:

 C_{corr} = \frac{C}{C_{max}} ; 0 \leq C_{corr} \leq 1, where C = \sqrt{\frac{\tilde{\chi}^2}{\tilde{\chi}^2 + n}}, C_{max} = \sqrt{\frac{min(k,l) - 1}{min(k,l)}}  .

Interpretation of Contingency Coefficient Statistics

  • A value of C_{corr} close to zero implies a weak association between the two variables.
  • A value of C_{corr} close to 1 implies a strong association between the two variables.
  • Other values indicate a moderate association between the two variables.

For Example 1, \tilde{\chi}^2 = 16.67. So,

  • C = \sqrt{\frac{16.67}{16.67 + 100}} \approx 0.378
  • C_{max} = \sqrt{\frac{min(2,2) - 1}{min(2,2)}} \approx 0.707
  • C_{corr} = \frac{0.378}{0.707} \approx 0.535

The value of C_{corr} \approx 0.535 again shows a moderate association between the two variables. 
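The same calculation can be sketched in Python. The function names are hypothetical, and the Example 1 statistic is entered as the exact fraction 100/6 (about 16.67) so that rounding happens only when printing.

```python
import math


def contingency_coefficient(chi2, n):
    """Pearson's contingency coefficient C."""
    return math.sqrt(chi2 / (chi2 + n))


def corrected_contingency_coefficient(chi2, n, k, l):
    """C divided by its maximum C_max for a k x l table."""
    m = min(k, l)
    c_max = math.sqrt((m - 1) / m)
    return contingency_coefficient(chi2, n) / c_max


chi2_example1 = 100 / 6  # Example 1 statistic, about 16.67
c = contingency_coefficient(chi2_example1, n=100)
c_corr = corrected_contingency_coefficient(chi2_example1, n=100, k=2, l=2)
print(round(c, 3))       # 0.378
print(round(c_corr, 3))  # 0.535
```

Dividing by C_max matters because the uncorrected C can never reach 1: for a 2 x 2 table its largest possible value is about 0.707.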

References

  1. Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
