Descriptive Statistics – Measures of Central Tendency – Median

The median is one of the measures of the central tendency of data. It divides the sorted observations into two equal parts: at least fifty percent of the values are greater than or equal to the median, and at least fifty percent are less than or equal to it.

Median in Descriptive Statistics

In simple words, the median is the value that divides the total frequency into two equal parts. For example, if the median of our frequency distribution is 100, then at least 50% of the values are greater than or equal to 100, and at least 50% are less than or equal to 100.

It is a more robust measure than the arithmetic mean when the data contain extreme observations or outliers. Let us first understand how extreme observations or outliers affect the mean and median values.

  • Suppose we have three observations [2, 4, 6]. Their mean is (2+4+6)/3 = 4, and their median is also 4. Now suppose one of the observations changes, so the data become [2, 4, 100]. The mean is now (2+4+100)/3 ≈ 35.3, whereas the median is still 4. A single extreme value changes the mean drastically but leaves the median unchanged.
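
The comparison above can be reproduced with a minimal sketch using Python's standard `statistics` module. The variable names are our own and are only for illustration.

```python
from statistics import mean, median

original = [2, 4, 6]
altered = [2, 4, 100]  # same data with one extreme value

original_mean, original_median = mean(original), median(original)
altered_mean, altered_median = mean(altered), median(altered)

# The mean jumps when the outlier is introduced; the median does not move.
print(f"original: mean={original_mean:.1f}, median={original_median}")
print(f"altered:  mean={altered_mean:.1f}, median={altered_median}")
```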

We will learn, understand and compute the median for ungrouped and grouped data using the Python programming language, with the tips dataset for the grouped case.

Median (ungrouped data)

The calculation of the median for ungrouped data (a discrete variable) is very straightforward. Let us consider n observations x_1, x_2, ..., x_n arranged in ascending order such that x_i <= x_j whenever i <= j. If the observations are not already sorted in ascending order, sort them first. There are two cases to find the median:

Case 1: When n is odd, we first calculate the position of the median and then pick the value at that position, as shown in the equation below.

\bar{X} =  X \left [ \dfrac{n+1}{2} \right ]

For example, suppose there are n = 7 observations: [11, 12, 13, 14, 15, 16, 17]. The median position is (7+1)/2 = 4. The value at the 4th position is 14, so the median of our observations is 14.

Case 2: When n is even, we first calculate the two middle positions, n/2 and (n+2)/2, pick the values at those positions, and take their average as the median, as shown in the equation below.

\bar{X} = \left [ \dfrac{X \left [ \dfrac{n}{2} \right ] + X \left [ \dfrac{n+2}{2} \right ] }{2} \right ]

For example, suppose there are n = 8 observations: [11, 12, 13, 14, 15, 16, 17, 18]. The two middle positions are 8/2 = 4 and (8+2)/2 = 5. The values at the 4th and 5th positions are 14 and 15, respectively, so the median of our observations is (14+15)/2 = 14.5.

Compute Median for ungrouped data using Python

# Observation list
x = [11, 13, 12, 14, 15, 17, 16]

# Find the length of the list.
x_len = len(x)

# Sort the observations in ascending order.
x.sort()

# Calculate the median.
# The two cases are distinguished with the modulo operator.
if x_len % 2 == 0:
    # Case 2: the number of observations is even.
    # Average the values at positions n/2 and (n+2)/2 (1-based),
    # which are list indices n/2 - 1 and n/2 (0-based).
    x_med = (x[x_len // 2 - 1] + x[x_len // 2]) / 2
else:
    # Case 1: the number of observations is odd.
    # Pick the value at position (n+1)/2 (1-based),
    # which is list index (n+1)/2 - 1 (0-based).
    x_med = x[(x_len + 1) // 2 - 1]

print(x_med)

In the above code, we used a Python list to store the observations and calculated the median by hand, without using a library function for the median itself.
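
As a cross-check, the same two cases can be run through `statistics.median` from the Python standard library, which sorts the data internally and averages the two middle values when the count is even. This is a minimal sketch; the list names are our own.

```python
from statistics import median

odd_observations = [11, 13, 12, 14, 15, 17, 16]          # n = 7 (Case 1)
even_observations = [11, 12, 13, 14, 15, 16, 17, 18]     # n = 8 (Case 2)

odd_median = median(odd_observations)
even_median = median(even_observations)

print(odd_median)   # 14
print(even_median)  # 14.5
```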

Median (grouped data)

Whenever we have grouped data, or observations from a continuous variable, the first step is to create a frequency table. It consists of classes, and we assume that the observations within each class are uniformly distributed over that class.
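
To make the frequency table concrete, here is a minimal standard-library sketch. The helper `frequency_table` and the `bills` values are hypothetical and made up for illustration. It assumes positive values, classes that start at 0, and right-closed classes (left, right].

```python
import math

def frequency_table(values, width):
    """Count positive observations in equal-width, right-closed classes (left, right]."""
    n_classes = math.ceil(max(values) / width)
    counts = [0] * n_classes
    for v in values:
        # A value v in (k*width, (k+1)*width] belongs to class k.
        k = max(math.ceil(v / width) - 1, 0)
        counts[k] += 1
    total = len(values)
    # Each row: (class limits, absolute frequency, relative frequency)
    return [((k * width, (k + 1) * width), c, c / total) for k, c in enumerate(counts)]

bills = [3.5, 7.2, 9.9, 10.0, 12.4, 15.8, 18.1, 21.0, 24.6, 29.3]  # made-up values
table = frequency_table(bills, width=10)
for (left, right), count, rel in table:
    print(f"({left}, {right}]  count={count}  relative={rel:.2f}")
```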

Suppose there are k classes A_1, A_2, A_3, ..., A_k, and n_i is the number of observations in the i^{th} class A_i. The first step is to determine the median class A_m, the class that contains the median value. It is located with the same positional technique used for ungrouped data: A_m is the first class whose cumulative frequency reaches the median position. The median is then interpolated within A_m using the equation below.

\bar{X} =  e_{m-1} + \frac {d_m}{f_m} \left ( 0.5 - \sum_{j=1}^{m-1}{f_j}  \right )

Where:

  •  e_{m-1} is the lower limit of class A_m
  •  d_m is the width (upper limit – lower limit) of class A_m
  •  f_{m} is the relative frequency of class A_m
  •  f_{j} is the relative frequency of class A_j, for j = 1, ..., m-1
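
The formula can be sketched as a small function. This is only an illustration under stated assumptions: `grouped_median` is a hypothetical helper, the classes have equal width, and the median class is taken to be the first class whose cumulative relative frequency reaches 0.5. The counts are made up.

```python
def grouped_median(lower_limits, width, counts):
    """Interpolate the median from equal-width classes and their absolute frequencies."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    relative = [c / total for c in counts]

    cumulative = 0.0  # sum of f_j over the classes before the current one
    for e_lower, f_m in zip(lower_limits, relative):
        if cumulative + f_m >= 0.5:
            # Median class found: apply e_{m-1} + (d_m / f_m) * (0.5 - sum f_j).
            return e_lower + (width / f_m) * (0.5 - cumulative)
        cumulative += f_m
    raise ValueError("median class not found")

# Hypothetical classes (0, 10], (10, 20], (20, 30], (30, 40]
example_median = grouped_median([0, 10, 20, 30], width=10, counts=[5, 8, 4, 3])
print(f"Median {example_median:.2f}")
```

Here the relative frequencies are 0.25, 0.40, 0.20 and 0.15, so the median class is (10, 20] and the result is 10 + (10/0.40) * (0.5 - 0.25) = 16.25.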

Compute Median for grouped data using Python

import seaborn as sns
import pandas as pd
import math

# Load tips dataset from seaborn package
tips_data = sns.load_dataset('tips')


# Largest bill (rounded up) and the width of each of the roughly ten classes
max_bill_amount = int(math.ceil(tips_data['total_bill'].max()))
interval_len = int(math.ceil(max_bill_amount/10))

# Create the class interval limits automatically
total_bins = list(range(0, max_bill_amount+interval_len, interval_len))

# Assign every bill to its class interval (left, right]
total_bills_groupby_interval = pd.cut(x=tips_data['total_bill'], bins=total_bins)

# Calculate absolute frequency (observed=False keeps empty classes in the table)
absolute_frequency_table = tips_data.groupby(total_bills_groupby_interval, observed=False)['total_bill'].count()

# Renaming headers and put into dataframe
frequency_table = pd.DataFrame({'Class Interval':absolute_frequency_table.index, 'Absolute Frequency':absolute_frequency_table.values})

# Calculate Relative Frequency
frequency_table["Relative Frequency"] = frequency_table['Absolute Frequency']/frequency_table['Absolute Frequency'].sum()


# Position of the median among the sorted observations
# Find whether the total number of observations is even or odd using the modulo operator.
x_len = absolute_frequency_table.sum()
if x_len % 2 == 0:
    # Even case: midpoint of positions n/2 and (n+2)/2.
    x_med = (x_len/2 + (x_len+2)/2)/2
else:
    # Odd case: position (n+1)/2.
    x_med = (x_len+1)/2

# Calculate cumulative frequency
frequency_table["Cumulative Frequency"] = frequency_table["Absolute Frequency"].cumsum()

# Median class interval: the first class whose cumulative frequency reaches x_med
median_of_class_interval_index = int((frequency_table["Cumulative Frequency"] >= x_med).idxmax())

            
# Calculation of median
left = frequency_table['Class Interval'].apply(lambda x: x.left).astype(float)

e_m_1 = left[median_of_class_interval_index]
d_m = interval_len
f_m = frequency_table["Relative Frequency"][median_of_class_interval_index]

# Sum the relative frequencies of all classes before the median class
sum_f_m_1 = 0
for index in range(0, median_of_class_interval_index):
    sum_f_m_1 += frequency_table["Relative Frequency"][index]

# Formula to calculate median
median = e_m_1 + (d_m/f_m) * (0.5 - sum_f_m_1)

print("Median {:.1f}".format(median))
frequency_table

In the above code, we computed the median using the formula from the grouped-data section. For the same dataset, the weighted mean was 19.81, whereas the grouped median is interpolated inside the median class (12, 18] and is therefore smaller than the mean. The two estimates differ because a few large bills pull the mean upward but hardly move the median.

Figure: Median – Frequency Distribution Table
