Descriptive Statistics – Measures of association – Contingency tables

In the earlier note on descriptive statistics, we saw how to measure the association between continuous variables and between ranked (ordinal) variables using the correlation coefficient and the rank correlation coefficient, respectively.

In this note, we will measure the association between discrete variables, whose data are counts, using contingency tables.

Bivariate Frequency Table

When data is collected and tabulated for two variables, this representation is called a bivariate frequency table. We will learn how to create frequency tables with the help of the example below.

Suppose we want to know whether boys and girls differ in their preference between mathematics and biology. If gender makes no difference, we expect the proportions of boys and girls opting for each subject to be nearly the same. Data on such questions are obtained as frequencies.

A measure based on frequency data or summarized frequency data is needed to study the association between two such variables. Suppose the data is obtained as follows:

Given input data

Looking at the raw data alone, we cannot draw any statistically meaningful conclusion. To extract meaningful information, we will summarize and analyze the data using a contingency table.

Contingency Table

Contingency tables summarize the observed frequencies to describe the relationship between two categorical variables. In statistics, a contingency table (also known as a cross-tabulation or crosstab) is a table in matrix format that displays the joint frequency distribution of the variables. In the example below, the contingency table summarizes the data as follows:

  • The numbers of male and female students preferring maths are 4 and 2, respectively.
  • The numbers of male and female students preferring biology are 1 and 3, respectively.
  • The total number of male students (preferring either maths or biology) is 5.
  • The total number of female students (preferring either maths or biology) is 5.
Contingency table
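
As an illustration, the counting step can be sketched in Python. The raw records below are hypothetical: the original data set is not reproduced here, so the list is constructed only to match the counts listed above, and the names `records`, `genders`, and `subjects` are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical raw records (gender, subject), constructed only so that
# the counts agree with the contingency table described in the text.
records = (
    [("Male", "Maths")] * 4
    + [("Male", "Biology")] * 1
    + [("Female", "Maths")] * 2
    + [("Female", "Biology")] * 3
)

genders = ["Male", "Female"]
subjects = ["Maths", "Biology"]

# Count how often each (gender, subject) pair occurs.
counts = Counter(records)

# Arrange the counts as rows (gender) by columns (subject).
table = [[counts[(g, s)] for s in subjects] for g in genders]

for g, row in zip(genders, table):
    print(g, row, "row total:", sum(row))
```

Each inner list is one row of the contingency table, and the row total is the number of students of that gender.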

Contingency Table Representation

Let X and Y be two discrete variables, where x_1, x_2, x_3, \cdots, x_k are the k classes of X and y_1, y_2, y_3, \cdots, y_l are the l classes of Y. Let n_{ij} be the frequency of the (i,j)^{th} cell, corresponding to (x_i, y_j), for i = 1, 2, 3, \cdots, k and j = 1, 2, 3, \cdots, l. These frequencies can be presented in the following k x l contingency table.

General representation of contingency table
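
Since the figure itself is not reproduced in the text, the following LaTeX array is one layout consistent with the notation above. It is a sketch, not necessarily the original figure: the "Total" row and column hold the marginal frequencies n_{i+} and n_{+j} and the total frequency n, which are defined in the sections that follow.

```latex
\begin{array}{c|cccc|c}
             & y_1    & y_2    & \cdots & y_l    & \text{Total} \\ \hline
x_1          & n_{11} & n_{12} & \cdots & n_{1l} & n_{1+} \\
x_2          & n_{21} & n_{22} & \cdots & n_{2l} & n_{2+} \\
\vdots       & \vdots & \vdots & \ddots & \vdots & \vdots \\
x_k          & n_{k1} & n_{k2} & \cdots & n_{kl} & n_{k+} \\ \hline
\text{Total} & n_{+1} & n_{+2} & \cdots & n_{+l} & n
\end{array}
```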

When the data on two variables are summarized in a contingency table, several characteristics of the data can be studied.

Marginal Frequency

The marginal frequency distribution tells how the values of one variable behave in the joint distribution. 

  • n_{i+} = \sum_{j=1}^l n_{ij} is the marginal frequency of class x_i of X: i is held fixed while j runs from 1 to l. The values n_{1+}, n_{2+}, \cdots, n_{k+} form the marginal frequency distribution of X.
  • n_{+j} = \sum_{i=1}^k n_{ij} is the marginal frequency of class y_j of Y: j is held fixed while i runs from 1 to k. The values n_{+1}, n_{+2}, \cdots, n_{+l} form the marginal frequency distribution of Y.

Total Frequency

  • n = \sum_{i=1}^k n_{i+} = \sum_{j=1}^l n_{+j} = \sum_{i=1}^k \sum_{j=1}^l n_{ij}
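
The marginal and total frequencies can be sketched in a few lines of Python. The helper name `marginals` is an assumption made for this illustration, and the 2 x 2 table is the students example from earlier in this note (rows: male, female; columns: maths, biology).

```python
def marginals(table):
    """Return (row totals n_{i+}, column totals n_{+j}, total frequency n)."""
    # n_{i+}: sum over j with i held fixed
    row_totals = [sum(row) for row in table]
    # n_{+j}: sum over i with j held fixed (zip(*table) iterates over columns)
    col_totals = [sum(col) for col in zip(*table)]
    # n: the sum of either set of marginals gives the same total
    n = sum(row_totals)
    return row_totals, col_totals, n


table = [[4, 1],   # male:   maths, biology
         [2, 3]]   # female: maths, biology

row_totals, col_totals, n = marginals(table)
print(row_totals, col_totals, n)  # [5, 5] [6, 4] 10
```

Summing the row totals, the column totals, or all cells directly gives the same n, which is what the identity above states.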

Absolute Frequency

  • n_{ij} is the absolute (joint) frequency of the cell (x_i, y_j). The collection of all n_{ij} is the joint frequency distribution of X and Y, which tells how the values of both variables behave jointly.

Relative Frequency

If the relative frequency is used instead of absolute frequency, then similar information is provided by the joint relative frequency distribution, marginal relative frequency distribution, and conditional relative frequency distribution.

The advantage of relative frequencies over absolute ones is that they always sum to one and each of them lies between 0 and 1, just like probabilities. A table of relative frequencies can therefore be read as an empirical probability distribution of the discrete variables.

The relative frequency f_{ij} of the (i,j)^{th} cell, corresponding to (x_i, y_j), is as follows:

  • f_{ij} = \frac{n_{ij}}{n} is the joint relative frequency of the cell (x_i, y_j). Together, these values form the joint relative frequency distribution of X and Y.
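
A minimal sketch of the joint relative frequencies, again using the students table from earlier in this note (the variable names are assumptions made for illustration):

```python
table = [[4, 1],   # male:   maths, biology
         [2, 3]]   # female: maths, biology

# Total frequency n: sum of all cells
n = sum(sum(row) for row in table)

# f_ij = n_ij / n for every cell
rel = [[n_ij / n for n_ij in row] for row in table]

for row in rel:
    print([round(f, 2) for f in row])

# The joint relative frequencies sum to one (up to floating-point rounding).
total = sum(sum(row) for row in rel)
print(round(total, 10))
```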

Conditional Frequency

Conditional frequency distribution tells how the values of one variable behave when another variable is kept fixed. 

  • f_{i|j}(X|Y = y_j) = \frac{n_{ij}}{n_{+j}}, conditional frequency distribution of X given Y = y_j.
  • f_{j|i}(Y|X = x_i) = \frac{n_{ij}}{n_{i+}}, conditional frequency distribution of Y given X = x_i.
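
Both conditional distributions can be sketched as follows. The function names `x_given_y` and `y_given_x` are hypothetical helpers, and the table is the students example from earlier in this note.

```python
def x_given_y(table):
    """f_{i|j} = n_ij / n_{+j}: divide each cell by its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[n_ij / col_totals[j] for j, n_ij in enumerate(row)]
            for row in table]


def y_given_x(table):
    """f_{j|i} = n_ij / n_{i+}: divide each cell by its row total."""
    return [[n_ij / sum(row) for n_ij in row] for row in table]


table = [[4, 1],   # male:   maths, biology
         [2, 3]]   # female: maths, biology

cond_cols = x_given_y(table)   # each column sums to one
cond_rows = y_given_x(table)   # each row sums to one

# Example: among the 6 students preferring maths, 4 are male.
print(round(cond_cols[0][0], 3))
# Example: among the 5 male students, 4 prefer maths.
print(round(cond_rows[0][0], 3))
```

The two results differ because the denominators differ: the first divides by a column total, the second by a row total.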

Understanding of contingency table with an example

A soft drink was served to children, young persons, and elderly persons, and each of them rated its taste as good or bad. The following 2 x 3 contingency table was formed by compiling the data.

Contingency Table

In the above example, there are two variables, Person and Taste, with three and two classes, respectively. The Person variable has the classes Children, Young Persons, and Elderly Persons, whereas the Taste variable has the classes Good and Bad. Using the contingency table, we can study the association between the two variables in several ways.

We can see that Taste and Person are not independent, as the elderly hold different opinions from young persons about the taste of the soft drink. So, the two variables are associated with each other.

There are different types of measures of association. Which measure is appropriate depends on the types of the variables, as we have already seen in the earlier notes on descriptive statistics.

Marginal Frequency

  • The marginal frequency of Taste = Good is 60: 60 persons say the soft drink tastes good.
  • The marginal frequency of Taste = Bad is 40: 40 persons say it tastes bad.
  • The marginal frequencies of Person are 30 children, 45 young persons, and 25 elderly persons.

Each marginal frequency answers a question about one variable alone. It is calculated by fixing a class of that variable and summing over all classes of the other variable. For example, to the question of how many persons say that the soft drink tastes good, the answer is 60, the marginal frequency of Good.
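
As a quick consistency check, both sets of marginal frequencies must add up to the same total frequency n. The sketch below uses only the marginal values quoted above; the dictionary names are assumptions made for illustration.

```python
# Marginal frequencies quoted in the text
taste_totals = {"Good": 60, "Bad": 40}
person_totals = {"Children": 30, "Young": 45, "Elderly": 25}

# Summing either marginal distribution gives the total frequency n.
n_from_taste = sum(taste_totals.values())
n_from_person = sum(person_totals.values())

print(n_from_taste, n_from_person)  # 100 100
```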

Relative Frequency

The same contingency table can also be expressed in relative frequencies, as shown in the diagram below. Once we have the relative frequencies, we can also answer conditional queries.

Relative frequencies on contingency table

The joint frequency distribution tells how the values of both the variables behave jointly whereas marginal frequency distribution tells how the values of one variable behave in the joint distribution.
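
The marginal relative frequencies can be sketched by dividing each marginal frequency by the total. Only values stated in this note are used: the marginal totals, plus the two cells for children (20 good, 10 bad) that appear in the conditional examples. The remaining cells are not given in the text and are deliberately left out. The helper `relative` is a hypothetical name.

```python
def relative(freqs):
    """Divide every frequency by the total so that the results sum to one."""
    total = sum(freqs.values())
    return {label: count / total for label, count in freqs.items()}


taste_totals = {"Good": 60, "Bad": 40}
person_totals = {"Children": 30, "Young": 45, "Elderly": 25}

taste_rel = relative(taste_totals)     # marginal relative frequencies of Taste
person_rel = relative(person_totals)   # marginal relative frequencies of Person

# Joint relative frequencies for the two known cells: n_ij / n
n = sum(taste_totals.values())
children_good_rel = 20 / n
children_bad_rel = 10 / n

print({k: round(v, 2) for k, v in taste_rel.items()})
print({k: round(v, 2) for k, v in person_rel.items()})
print(children_good_rel, children_bad_rel)
```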

Conditional Frequency

Conditional frequency distribution tells how the values of one variable behave when another variable is kept fixed. A few examples are shown below.

  • Among the persons who say that the soft drink tastes good, what fraction are children? 20/60 = 33.3%: of the 60 persons (children, young, and elderly) who say it tastes good, 20 are children.
  • Among the persons who say that the soft drink tastes bad, what fraction are children? 10/40 = 25%: of the 40 persons (children, young, and elderly) who say it tastes bad, 10 are children.
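
The two ratios above condition on Taste: their denominators, 60 and 40, are marginal frequencies of Taste. Conditioning on Person instead answers a different question, namely what fraction of the children say that the drink tastes good. The sketch below contrasts the two directions using only the counts given in this note; the variable names are assumptions made for illustration.

```python
# Counts stated in the text
n_children_good = 20
n_children_bad = 10
n_good = 60          # marginal frequency of Taste = Good
n_bad = 40           # marginal frequency of Taste = Bad
n_children = n_children_good + n_children_bad  # marginal frequency of Children (30)

# Conditioning on Taste: share of children among persons giving each rating
children_given_good = n_children_good / n_good
children_given_bad = n_children_bad / n_bad

# Conditioning on Person: share of each rating among the children
good_given_children = n_children_good / n_children
bad_given_children = n_children_bad / n_children

print(round(children_given_good, 3), round(children_given_bad, 3))
print(round(good_given_children, 3), round(bad_given_children, 3))
```

The same cell count (20) appears in both numerators; only the denominator changes, which is why the choice of the fixed variable must be stated with any conditional frequency.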

References

  1. Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
