The mean is one of the measures of the central tendency of data. Data is often described as ungrouped or grouped. Ungrouped data is given as individual data points; such observations typically come from a discrete variable. Grouped data is given in intervals; such observations typically come from a continuous variable.
In this note, we will learn how to find the arithmetic mean of ungrouped and grouped data, and we will practice these concepts using the Python programming language.
Arithmetic Mean (ungrouped data)
The arithmetic mean of ungrouped data is the sum of the N observations divided by N. It is given by the equation below.
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
where $X$ is the variable name and $x_1, x_2, \ldots, x_N$ are the N observations or data points.
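As a quick illustration of the formula, the sketch below computes the mean of a few made-up tip amounts by hand and compares it with Python's built-in `statistics.mean`. The values in `sample_tips` are hypothetical and chosen only for this example; they are not taken from any dataset.

```python
from statistics import mean

# Hypothetical observations (illustrative only, not from the tips dataset)
sample_tips = [1.0, 2.5, 3.0, 3.5, 5.0]

# Sum of the N observations divided by N
n = len(sample_tips)
arithmetic_mean = sum(sample_tips) / n

# The standard library gives the same value
library_mean = mean(sample_tips)

print(arithmetic_mean, library_mean)
```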
Arithmetic Mean using Python
In the earlier note, we worked with the tips dataset. In this note, we use the same dataset to calculate the arithmetic mean of the tips the waiter received from customers.
import seaborn as sns

# Load an example dataset
tips_data = sns.load_dataset("tips")

# Print the information about the dataset
# (info() prints directly, so it is not wrapped in print())
tips_data.info()

# Calculate the mean of the tips received from 244 customers
print(tips_data['tip'].mean())
The mean tip is about 3.00 dollars (2.998 before rounding). A histogram also helps us visualize and understand the data. Using matplotlib with the seaborn theme, we drew a histogram to understand the tipping pattern. We used the following Python code to draw it.
import math
from matplotlib import pyplot as plt
import seaborn as sns

sns.set_theme()

# Load an example dataset
tips_data = sns.load_dataset("tips")

# Range is equal to highest number - lowest number.
# Rounded up, it is used as the number of bins.
max_range = tips_data['tip'].max() - tips_data['tip'].min()
bins_size = math.ceil(max_range)

# Set the dimensions of the figure
plt.figure(figsize=(10, 4), dpi=100)

# Plot the histogram
plt.hist(tips_data['tip'], color='green', edgecolor='white', bins=bins_size)

# Set the title
plt.title('Histogram for tips received')

# Set the x and y axis labels
plt.xlabel('Tips')
plt.ylabel('Total number of tips received from customer')
plt.show()
The histogram shows that most customers gave tips in the range of roughly 2 to 4 dollars.
Arithmetic Mean (grouped data)
When the observations of a variable are categorized into class intervals, the data is called grouped data. The data is divided into intervals of suitable width, and each class interval has a lower and an upper limit.
To handle grouped data, we need to create a frequency table (crosstabs in Python) with the midpoint, absolute frequency, and relative frequency of each class interval.
- Midpoint is computed by summing the lower and upper values of a class interval and dividing by two.
- Absolute frequency is computed by counting the total number of data points that lie in that class interval.
- Relative frequency is computed by dividing the absolute frequency by the total number of data points.
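The three quantities above can be sketched in plain Python on a small example. The bill amounts in `bills` and the interval edges in `edges` are hypothetical values chosen for illustration, and the intervals are treated as right-closed, which is also the default convention of `pandas.cut`.

```python
# Hypothetical bill amounts and class-interval edges (illustrative only)
bills = [3.0, 7.5, 8.0, 11.0, 13.5, 14.0, 16.0, 19.0]
edges = [0, 6, 12, 18, 24]

rows = []
for lower, upper in zip(edges[:-1], edges[1:]):
    # Midpoint: (lower + upper) / 2
    midpoint = (lower + upper) / 2
    # Absolute frequency: data points in the right-closed interval (lower, upper]
    absolute = sum(1 for b in bills if lower < b <= upper)
    # Relative frequency: absolute frequency / total number of data points
    relative = absolute / len(bills)
    rows.append((lower, upper, midpoint, absolute, relative))

for row in rows:
    print(row)
```

The relative frequencies always sum to one, which is a handy check that every data point landed in exactly one interval.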
In the example code below, we use the total_bill variable of the tips dataset, derive the class intervals, and create a frequency table by calculating absolute and relative frequencies. The class intervals sort the bill amounts into categories, and the absolute frequency tells how many customers had a bill amount in each category.
Frequency Distribution Table using Python
import math
import pandas as pd
import seaborn as sns

# Load tips dataset from seaborn package
tips_data = sns.load_dataset('tips')

max_bill_amount = int(math.ceil(tips_data['total_bill'].max()))
interval_len = int(math.ceil(max_bill_amount / 10))

# Create the class intervals automatically
total_bins = list(range(0, max_bill_amount + interval_len, interval_len))
total_bills_groupby_interval = pd.cut(x=tips_data['total_bill'], bins=total_bins)

# Calculate absolute frequency (observed=False keeps empty intervals too)
absolute_frequency_table = tips_data.groupby(
    total_bills_groupby_interval, observed=False)['total_bill'].count()

# Rename headers and put into a dataframe
frequency_table = pd.DataFrame({
    'Class Interval': absolute_frequency_table.index,
    'Absolute Frequency': absolute_frequency_table.values})

# Calculate relative frequency
frequency_table['Relative Frequency'] = (
    frequency_table['Absolute Frequency'] / frequency_table['Absolute Frequency'].sum())

print(frequency_table)
Conclusion: From the frequency table, we can conclude, for example, that 87 customers paid a total bill in the range of 12 to 18 dollars, and so on for the other intervals.
Weighted Arithmetic Mean using Python
import math
import pandas as pd
import seaborn as sns

# Load tips dataset from seaborn package
tips_data = sns.load_dataset('tips')

max_bill_amount = int(math.ceil(tips_data['total_bill'].max()))
interval_len = int(math.ceil(max_bill_amount / 10))

# Create the class intervals automatically
total_bins = list(range(0, max_bill_amount + interval_len, interval_len))
total_bills_groupby_interval = pd.cut(x=tips_data['total_bill'], bins=total_bins)

# Calculate absolute frequency (observed=False keeps empty intervals too)
absolute_frequency_table = tips_data.groupby(
    total_bills_groupby_interval, observed=False)['total_bill'].count()

# Rename headers and put into a dataframe
frequency_table = pd.DataFrame({
    'Class Interval': absolute_frequency_table.index,
    'Absolute Frequency': absolute_frequency_table.values})

# Calculate relative frequency
frequency_table['Relative Frequency'] = (
    frequency_table['Absolute Frequency'] / frequency_table['Absolute Frequency'].sum())

# Find the midpoint value of each interval
left = frequency_table['Class Interval'].apply(lambda x: x.left).astype(float)
right = frequency_table['Class Interval'].apply(lambda x: x.right).astype(float)
frequency_table['Mid Point'] = (left + right) / 2

# Weighted arithmetic mean: sum(midpoint * absolute frequency) / number of observations
frequency_table['Absolute Freq * Mid Point'] = (
    frequency_table['Mid Point'] * frequency_table['Absolute Frequency'])
weighted_mean_of_total_bills = (
    frequency_table['Absolute Freq * Mid Point'].sum() / tips_data.shape[0])
print(weighted_mean_of_total_bills)
In the grouped-data section above, we used class intervals to divide the total bill amounts paid by customers into intervals such as 0-6, 6-12, 12-18, and so on. From the frequency distribution table, we can read the absolute frequencies, which show the interval that contains the largest number of customers. In our example, it was the 12-18 dollar interval, with 87 customers. The weighted mean value is 19.81.
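The arithmetic behind the weighted mean is the sum of midpoint times absolute frequency, divided by the total number of observations. A minimal sketch on a small frequency table shows the calculation, and also that using relative frequencies as weights gives the same result. The midpoints and frequencies in `table` are hypothetical and are not the values from the tips dataset.

```python
# Hypothetical frequency table: (midpoint, absolute frequency) per class interval
table = [(3.0, 1), (9.0, 3), (15.0, 3), (21.0, 1)]

# Total number of observations
total = sum(freq for _, freq in table)

# Weighted mean using absolute frequencies
grouped_mean = sum(mid * freq for mid, freq in table) / total

# Equivalent form using relative frequencies as weights
grouped_mean_rel = sum(mid * (freq / total) for mid, freq in table)

print(grouped_mean, grouped_mean_rel)
```

Because every observation is replaced by the midpoint of its interval, the grouped mean is an approximation of the mean computed from the individual data points.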
A frequency distribution can also be partitioned to get an idea of how the values are concentrated across the whole distribution. There are several ways to partition a distribution and interpret the concentration of its values.
References
- Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.