Descriptive Statistics – Measures of Central Tendency – Mean

The mean is one of the measures of central tendency of data. Data is often described as either ungrouped or grouped. Ungrouped data is given as individual data points; in other words, the observations are typically collected from a discrete variable. Grouped data, in contrast, is given in intervals, and the observations are typically collected from a continuous variable.

In this note, we will learn how to find the arithmetic mean for ungrouped and grouped data, and we will practice these concepts using the Python programming language.

Arithmetic Mean (ungrouped data)

The arithmetic mean of ungrouped data is the sum of the N observations divided by N. It is given by the equation below.

\bar X = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_{1}+x_{2}+\cdots+x_{N}}{N}

where X is the variable name and x_{1}, x_{2}, \cdots, x_{N} are the N observations or data points.
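
As a quick check of the formula, here is a minimal sketch on a handful of made-up tip amounts (the values are illustrative assumptions, not taken from the tips dataset). It computes the mean by hand and compares it with the standard library's `statistics.mean`.

```python
from statistics import mean

# Hypothetical tip amounts (illustrative values, not from the tips dataset)
tips = [1.0, 2.0, 3.0, 4.0, 5.0]

# Arithmetic mean: sum of the N observations divided by N
n = len(tips)
manual_mean = sum(tips) / n

# The standard library should give the same result
library_mean = mean(tips)

print(manual_mean, library_mean)
```

Both values should agree, since `statistics.mean` applies the same definition.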

Arithmetic Mean using Python

In the earlier note, we worked with the tips dataset. In this note, we will use the same dataset to calculate the arithmetic mean of the tips the waiter received from customers.

import seaborn as sns

# Load an example dataset
tips_data = sns.load_dataset("tips")

# Print the information about dataset
print(tips_data.info())

# To calculate mean of tips received from 244 customers
tips_data['tip'].mean()

The mean tip is approximately 2.998 dollars, that is, about 3 dollars. A histogram also helps us visualize and understand the data. Using Python's matplotlib library with the seaborn theme, we drew a histogram to understand the pattern of tips received. We used the following Python code to draw it.

import math
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_theme()

# Load an example dataset
tips_data = sns.load_dataset("tips")

# Range is equal to highest number - lowest number
max_range = tips_data['tip'].max() - tips_data['tip'].min()

# Number of bins: the range rounded up, so each bin is roughly one dollar wide
bins_size = math.ceil(max_range)

# Setting the dimensions of figure
plt.figure(figsize=(10, 4), dpi=100)

# Plot the histogram
plt.hist(tips_data['tip'], color='green', edgecolor='white', bins=bins_size)

# To set the title
plt.title('Histogram for tips received')

# To set the x and y axis labels.
plt.xlabel('Tips')
plt.ylabel('Number of customers')
plt.show()

Figure: Histogram using Python

From the histogram, we can see that most customers gave tips in the range of approximately 2-4 dollars.

Arithmetic Mean (grouped data)

When the observations of a variable are categorized into class intervals, the data is called grouped data. The data is divided into intervals of suitable width, and each class interval has a lower and an upper limit.

To handle grouped data, we need to create a frequency table (crosstabs in Python) with the midpoint, absolute frequency, and relative frequency of each class interval.

  • Midpoint is computed by adding the lower and upper limits of a class interval and dividing the sum by two.
  • Absolute frequency is computed by counting the number of data points that lie in that class interval.
  • Relative frequency is computed by dividing the absolute frequency by the total number of data points.
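
The three quantities above can be sketched on a small example. The bill amounts and the bin edges below are made-up assumptions chosen so the arithmetic is easy to follow; only `pd.cut` and standard pandas operations are used.

```python
import pandas as pd

# Hypothetical bill amounts (illustrative values, not from the tips dataset)
bills = pd.Series([3.5, 7.0, 8.5, 10.0, 13.0, 14.5, 15.0, 17.5])

# Assign each bill to a class interval: (0, 6], (6, 12], (12, 18]
binned = pd.cut(bills, bins=[0, 6, 12, 18])

# Absolute frequency: number of data points in each class interval,
# sorted by interval so the rows follow the bin order
absolute = binned.value_counts().sort_index()

table = pd.DataFrame({
    "Class Interval": [str(iv) for iv in absolute.index],
    "Absolute Frequency": absolute.values,
})

# Midpoint: (lower limit + upper limit) / 2
table["Mid Point"] = [(iv.left + iv.right) / 2 for iv in absolute.index]

# Relative frequency: absolute frequency / total number of data points
table["Relative Frequency"] = (
    table["Absolute Frequency"] / table["Absolute Frequency"].sum()
)

print(table)
```

The relative frequencies always sum to 1, which is a handy sanity check on any frequency table.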

In the example code below, we use the total_bill variable of the tips dataset, derive the class intervals, and create a frequency table by calculating the absolute and relative frequencies. The class intervals divide the bill amounts into categories, and the absolute frequency tells us how many customers had a bill amount that belongs to each category.

Frequency Distribution Table using Python

import seaborn as sns
import pandas as pd
import math

# Load tips dataset from seaborn package
tips_data = sns.load_dataset('tips')

# Largest bill rounded up, and an interval width that gives about 10 intervals
max_bill_amount = math.ceil(tips_data['total_bill'].max())
interval_len = math.ceil(max_bill_amount / 10)

# Create the class interval edges automatically
total_bins = list(range(0, max_bill_amount + interval_len, interval_len))

# Assign each bill to its class interval
total_bills_groupby_interval = pd.cut(x=tips_data['total_bill'], bins=total_bins)

# Calculate absolute frequency (observed=False keeps empty intervals in the table)
absolute_frequency_table = tips_data.groupby(total_bills_groupby_interval, observed=False)['total_bill'].count()

# Renaming headers and put into dataframe
frequency_table = pd.DataFrame({'Class Interval': absolute_frequency_table.index, 'Absolute Frequency': absolute_frequency_table.values})

# Calculate Relative Frequency
frequency_table['Relative Frequency'] = frequency_table['Absolute Frequency'] / frequency_table['Absolute Frequency'].sum()

frequency_table

Table: Frequency Table using Tips dataset

Conclusion: From the frequency table, we can read, for example, that 87 customers paid a total bill in the range of 12 to 18 dollars, and so on for the other intervals.

Weighted Arithmetic Mean using Python
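
For reference, the mean of grouped data is commonly written as a frequency-weighted mean of the class midpoints. The symbols here are notation assumed for this sketch: k is the number of class intervals, m_i is the midpoint of the i-th interval, f_i is its absolute frequency, and N is the total number of observations.

```latex
\bar X = \frac{\sum_{i=1}^{k} f_i \, m_i}{\sum_{i=1}^{k} f_i} = \frac{1}{N}\sum_{i=1}^{k} f_i \, m_i
```

The code in this section follows the same steps: it computes each midpoint, multiplies it by the absolute frequency, and divides the total by the number of observations.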

import seaborn as sns
import pandas as pd
import math

# Load tips dataset from seaborn package
tips_data = sns.load_dataset('tips')

# Largest bill rounded up, and an interval width that gives about 10 intervals
max_bill_amount = math.ceil(tips_data['total_bill'].max())
interval_len = math.ceil(max_bill_amount / 10)

# Create the class interval edges automatically
total_bins = list(range(0, max_bill_amount + interval_len, interval_len))

# Assign each bill to its class interval
total_bills_groupby_interval = pd.cut(x=tips_data['total_bill'], bins=total_bins)

# Calculate absolute frequency (observed=False keeps empty intervals in the table)
absolute_frequency_table = tips_data.groupby(total_bills_groupby_interval, observed=False)['total_bill'].count()

# Renaming headers and put into dataframe
frequency_table = pd.DataFrame({'Class Interval': absolute_frequency_table.index, 'Absolute Frequency': absolute_frequency_table.values})

# Calculate Relative Frequency
frequency_table['Relative Frequency'] = frequency_table['Absolute Frequency'] / frequency_table['Absolute Frequency'].sum()

# Find midpoint value of each interval
left = frequency_table['Class Interval'].apply(lambda x: x.left).astype(float)
right = frequency_table['Class Interval'].apply(lambda x: x.right).astype(float)
frequency_table['Mid Point'] = (left + right) / 2

# Weighted arithmetic mean: sum of (frequency * midpoint) divided by the number of observations
frequency_table['Absolute Freq * Mid Point'] = frequency_table['Mid Point'] * frequency_table['Absolute Frequency']
weighted_mean_of_total_bills = frequency_table['Absolute Freq * Mid Point'].sum() / tips_data.shape[0]
print(weighted_mean_of_total_bills)

In the grouped-data section above, we used class intervals to divide the total bill amounts paid by customers into ranges such as 0-6, 6-12, 12-18, and so on. From the absolute frequency column of the frequency distribution table, we can read which interval contains the most customers. In our example, it was 87 customers in the range of 12-18 dollars. The weighted mean value is 19.81.
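
Because every observation is replaced by the midpoint of its class interval, the grouped (weighted) mean is an approximation of the mean of the raw values. The sketch below illustrates this on made-up bill amounts with assumed bin edges; none of the numbers come from the tips dataset.

```python
# Hypothetical bill amounts (illustrative values, not from the tips dataset)
bills = [3.5, 7.0, 8.5, 10.0, 13.0, 14.5, 15.0, 17.5]

# Assumed class interval edges: (0, 6], (6, 12], (12, 18]
edges = [0, 6, 12, 18]

weighted_sum = 0.0
for lower, upper in zip(edges[:-1], edges[1:]):
    # Absolute frequency of the interval (lower, upper]
    frequency = sum(1 for b in bills if lower < b <= upper)
    # Midpoint of the interval
    midpoint = (lower + upper) / 2
    weighted_sum += frequency * midpoint

# Grouped mean uses midpoints; exact mean uses the raw values
grouped_mean = weighted_sum / len(bills)
exact_mean = sum(bills) / len(bills)

print(grouped_mean, exact_mean)
```

The two results are close but not identical, and narrower class intervals generally bring the grouped mean closer to the exact one.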

A frequency distribution can also be partitioned to get an idea of how the values are concentrated across it. There are several ways to partition a distribution and interpret the concentration of values.

References

  1. Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
