Descriptive Statistics – Graphics and Plots

In the previous note, Descriptive Statistics – Frequency Distribution, we covered the different forms of frequency distribution: absolute, relative, and cumulative. A frequency distribution is one way to prepare data for graphical and analytical tools. In this note, we build on that with graphical tools that are essential for a preliminary visual analysis of data and of the relationships between variables, using various kinds of plots.

Graphical Tools

Graphics summarize the information contained in the data. For example, a person’s mood is conveyed far more quickly by a smile than by several sentences describing it.

Graphical tools convey the information hidden inside the data compactly, and a well-chosen plot supports better inference and analysis. There are various types of graphical tools, such as:

  • 2D and 3D plots
  • Scatter plot
  • Pie plot
  • Histogram
  • Bar plot
  • Stem and leaf plot
  • Box plot

There are many more tools, particularly with the advent of programming languages such as Python and R. These graphics have become very popular because they can be created with little effort.

Tips Dataset

We will use the tips dataset to explore the graphical tools (functions) available in the Python programming language. Before going into the graphical tools, we will first understand the dataset.

import pandas as pd

# Load the data from the Kaggle input directory.
# index_col=0 uses the first column of the CSV as the row index.
# na_values=["??","????"] treats these strings as missing values (NaN).
tips_data = pd.read_csv("/kaggle/input/tips-dataset-for-beginners/tips.csv", index_col=0, na_values=["??", "????"])
tips_data

About this dataset: One waiter recorded information about each tip he received over a few months of working in one restaurant. In all, he recorded 244 tips. This dataset is often used to train machine learning models that predict tip amounts. However, it is also very suitable for data visualization because it contains both quantitative and qualitative variables.

About tips variables

Bar diagram

A bar diagram visualizes the absolute or relative frequencies of the observed values of a variable, with exactly one bar per category. The height of each bar, shown on the y-axis, is the absolute or relative frequency of that category. The width of the bars is arbitrary and carries no information.
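As a minimal sketch of the two quantities a bar diagram can show, the snippet below computes absolute and relative frequencies by hand. The `days` list is a small hypothetical sample, not the real tips data, and the variable names are chosen only for illustration.

```python
from collections import Counter

# Hypothetical sample of a categorical 'day' variable (not the real tips data).
days = ["Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun", "Thur"]

# Absolute frequency: how many times each category occurs.
absolute_freq = Counter(days)

# Relative frequency: absolute frequency divided by the number of observations.
n = len(days)
relative_freq = {day: count / n for day, count in absolute_freq.items()}

# Either dictionary could supply the bar heights of a bar diagram.
for day, count in absolute_freq.most_common():
    print(f"{day}: absolute = {count}, relative = {relative_freq[day]:.3f}")
```

The relative frequencies always sum to 1, so a bar diagram built from them has the same shape as one built from the absolute frequencies; only the y-axis scale differs.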

Creation of Bar plot using Python Matplotlib

We will use the bar function from the Python matplotlib library; its syntax is as follows:

plt.bar(x, height, width, bottom, align)

In the code below, we use the tips dataset to draw a bar plot of the absolute frequency of the records for each day, that is, how many dining visits the waiter recorded on each day.

import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Count the records for each day; the result is a pandas Series indexed by day.
day_counts = tips_data['day'].value_counts()

# Extract the category labels and their counts from the Series.
tips_days = list(day_counts.index)
tips_days_count = day_counts.values.tolist()

# Set the dimensions of the figure.
plt.figure(figsize=(10, 4), dpi=100)

# Create the bar plot.
plt.bar(tips_days, tips_days_count, color='orange', width=0.2)

# Set the labels and title.
plt.xlabel("Day")
plt.ylabel("Number of dining visits")
plt.title("Day-wise bar plot of dining visits")
plt.show()
Tips dataset - Bar Plot

Observations: The bar plot shows that most dining visits fall on Saturday, followed by Sunday, Thursday, and Friday.

Pie diagram

The pie chart is a circle partitioned into segments, where each segment represents a category; it is used to visualize absolute and relative frequencies. The size of each segment is determined by its angle, which depends on the relative frequency of the category: angle = relative frequency * 360 degrees.
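The angle of a segment is the relative frequency of its category multiplied by 360 degrees. A minimal sketch of that arithmetic, using hypothetical counts rather than the real tips data:

```python
# Hypothetical absolute frequencies per category (not the real tips data).
counts = {"Sat": 3, "Sun": 2, "Thur": 2, "Fri": 1}

total = sum(counts.values())

# Angle of each pie segment = relative frequency * 360 degrees.
angles = {day: count / total * 360 for day, count in counts.items()}

for day, angle in angles.items():
    print(f"{day}: {angle:.1f} degrees")
```

Because the relative frequencies sum to 1, the angles sum to 360 degrees and the segments fill the whole circle.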

Using the same dataset, we will create a pie chart using the matplotlib library. 

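# Pie chart using Matplotlib
from matplotlib import pyplot as plt

# Count the records for each day.
day_counts = tips_data['day'].value_counts()

fig = plt.figure()
# The axes rectangle is [left, bottom, width, height] in figure fractions;
# [0, 0, 1, 1] fills the whole figure.
ax = fig.add_axes([0, 0, 1, 1])
ax.axis('equal')  # equal aspect ratio so the pie is drawn as a circle

tips_days = list(day_counts.index)
tips_days_count = day_counts.values.tolist()

# autopct prints each segment's share as a percentage with one decimal place.
ax.pie(tips_days_count, labels=tips_days, autopct='%1.1f%%')
plt.show()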
Pie chart – Day-wise share of dining visits

Observations: The pie plot shows that 31.7% of dining visits fall on Saturday, 31.1% on Sunday, and 24.4% on Thursday, while very few fall on Friday.

Limitation of Bar and Pie diagrams

Pie plots and bar plots are well suited to categorical data, where each category has a frequency. With either plot, we can visually compare the frequencies of the categories by looking at the segments or bars. However, neither plot can handle continuous data. For the visual analysis of continuous data, we can use histograms or kernel density plots.

Histogram

A histogram is based on grouping the data into classes (bins) and plotting one bar per class. The area of each bar ( = height * width) is proportional to the frequency (or relative frequency) of its class, so the widths of the bars need not be the same. The histogram is mainly used for continuous data.

Difference between Pie plot, Bar plot, and Histogram

The histogram serves the same purpose as a bar diagram or a pie chart. The difference is that bar diagrams and pie diagrams are meant for categorical variables, where the values are labels or numbers that represent categories, whereas the histogram is for continuous data. It first groups the data into classes (or bins) and then plots a bar for each class.

In a bar plot, the height of a bar is proportional to the frequency or relative frequency, and the width is immaterial. This is not true for histograms. In a histogram, the frequency is proportional to the area of the bar, that is, the height of the bar multiplied by its width.
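The area rule matters most when the classes have unequal widths. The sketch below, using hypothetical tip amounts and hand-picked class boundaries (both are assumptions for illustration), sets each bar height to relative frequency divided by width, so that the area of each bar equals the relative frequency of its class.

```python
# Hypothetical tip amounts in dollars (not taken from the tips dataset).
tips = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 9.0]

# Class boundaries with unequal widths: [1, 2), [2, 3), [3, 5), [5, 10].
edges = [1.0, 2.0, 3.0, 5.0, 10.0]

def bin_counts(values, edges):
    """Count values per class; classes are left-closed, the last one is also right-closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

counts = bin_counts(tips, edges)
n = len(tips)
widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Height = relative frequency / width, so area = height * width = relative frequency.
heights = [c / n / w for c, w in zip(counts, widths)]
areas = [h * w for h, w in zip(heights, widths)]

for i, (c, h, a) in enumerate(zip(counts, heights, areas)):
    print(f"[{edges[i]}, {edges[i + 1]}): count = {c}, height = {h:.2f}, area = {a:.2f}")
```

Note that the class [3, 5) holds as many observations as [2, 3) but is twice as wide, so its bar is half as tall. Plotting raw counts as heights would overstate the wider classes.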

Now we will learn how to draw a histogram in Python using the matplotlib library.

import math
from matplotlib import pyplot as plt

# Range = highest value - lowest value
max_range = tips_data['tip'].max() - tips_data['tip'].min()
# Number of bins: half the range, rounded up
num_bins = math.ceil(max_range / 2)

# Set the dimensions of the figure
plt.figure(figsize=(10, 4), dpi=100)

# Plot the histogram
plt.hist(tips_data['tip'], color='blue', edgecolor='white', bins=num_bins)

# Set the title
plt.title('Histogram for tips received')

# Set the x and y axis labels
plt.xlabel('Tips')
plt.ylabel('Number of tips received from customers')
plt.show()
Histogram: Total number of tips received from customers, max bins = 5

Observations: The histogram shows that the lowest class, tips of up to roughly 3 dollars, is the most common: almost 120 customers tipped in this range.

Limitation of Histogram

In the histogram, the continuous data is artificially categorized, and the choice of width of the class interval or the number of bins is crucial in the construction of the histogram.

The limitation of the histogram lies in deciding the number of bins or classes. In the above histogram, we chose five bins, and our observation was that around 120 customers gave tips of up to roughly 3 dollars. However, if we change the number of bins from five to nine, our observation and analysis change completely.

import math
from matplotlib import pyplot as plt

# Range = highest value - lowest value
max_range = tips_data['tip'].max() - tips_data['tip'].min()
# Number of bins: the full range, rounded up
num_bins = math.ceil(max_range)

# Set the dimensions of the figure
plt.figure(figsize=(10, 4), dpi=100)

# Plot the histogram
plt.hist(tips_data['tip'], color='blue', edgecolor='white', bins=num_bins)

# Set the title
plt.title('Histogram for tips received')

# Set the x and y axis labels
plt.xlabel('Tips')
plt.ylabel('Number of tips received from customers')
plt.show()
Histogram : Total number of tips received from customers, max bins = 9

Looking at this histogram, we can say more precisely that the largest number of customers paid tips in the range of 2-3 dollars. This shows that the choice of the number of bins matters a great deal for the analysis of the data.
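The same effect can be reproduced without any plotting. The sketch below counts a small hypothetical sample of tips into five and then nine equal-width bins; the data and the helper `equal_width_counts` are assumptions made for illustration and are not part of matplotlib.

```python
def equal_width_counts(values, num_bins):
    """Count values into num_bins equal-width classes spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last class, hence the min().
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical tip amounts in dollars (not taken from the tips dataset).
tips = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 6.5, 10.0]

five_bins = equal_width_counts(tips, 5)
nine_bins = equal_width_counts(tips, 9)

print("5 bins:", five_bins)
print("9 bins:", nine_bins)
```

With five bins the first class (1 to 2.8 dollars) looks like the clear mode, while with nine bins the peak moves to the 2-3 and 3-4 dollar classes. The data did not change; only the binning did.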

Usually, we are given the frequency distribution of the data along with the number of bins or classes. Based on the bins, we create the class intervals and set the height of each bar from the frequencies of the observations that fall within that class interval. However, when the number of observations is large and the number of bins is poorly chosen, the analysis based on the histogram will not be precise. To overcome this limitation of histograms, there is another plot called the kernel density plot.

Kernel Density Plots

A kernel density plot is like a smoothed histogram for visualizing the distribution of data over a continuous interval or time period. It is based on a kernel density estimate, which uses kernel smoothing to smooth out the noise. The smoothness is controlled by a parameter called the bandwidth. The peaks of a kernel density plot show where values are concentrated over the interval.

Kernel density plots are better than histograms for determining the shape of a distribution because they do not depend on the number of bins or bars, although they do depend on the bandwidth. They are well suited to analyzing the distribution of continuous data in large datasets.

Kernel Density Estimates

A kernel density plot is produced by the function below:

\widehat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big), h > 0

  • n : Sample size
  • h : Bandwidth
  • K : Kernel function
  • x_i : The i-th observation in the data
  • x : Point at which the density is estimated
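The estimator can be evaluated directly from its definition. The following is a minimal sketch that assumes a Gaussian (normal) kernel for K; here `data` holds the observations x_i and `x` is the point at which the density is estimated. The sample values and the bandwidth are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical tip amounts in dollars (not taken from the tips dataset).
data = [1.0, 2.0, 2.5, 3.0, 5.0]
h = 0.5  # bandwidth, chosen arbitrarily for illustration

# Evaluate the estimate on a small grid of points.
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print(f"f({x}) = {kde(x, data, h):.4f}")
```

A larger `h` spreads each observation's contribution over a wider range and gives a smoother curve, while a smaller `h` follows the individual observations more closely.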

Different choices of K give different estimates. The kernel functions are not arbitrarily defined: they satisfy the conditions of a probability density function for a continuous random variable, which is what allows the estimate to be used to determine the probabilities of events. Several density functions can serve this role, and a few are listed below:

  • Normal density function
  • Gamma density function
  • Chi-square distribution
  • t-distribution
  • F-distribution

Different kernel functions usually give different types of plots. A kernel density plot is constructed from the values of the kernel function, which are obtained from the given set of data. The overall goal in selecting a kernel function is to approximate the actual frequency distribution of the data.

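from matplotlib import pyplot as plt

# Draw the kernel density estimate of the 'tip' column.
# pandas' plot.kde() uses a Gaussian kernel and requires SciPy to be installed.
tips_data['tip'].plot.kde()
plt.show()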
Kernel Density Plots

Stem-and-Leaf Plots

A stem-and-leaf plot shows the absolute frequency in different classes, much like a frequency distribution table or a histogram. The stem-and-leaf plot of a quantitative variable is a textual graph that presents the data according to their most significant numeric digits, and it is best suited to small datasets.

It is a tabular presentation where each data value is split into a stem (the first digit or digits) and a leaf (usually the last digit).
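A minimal sketch of how such a plot can be built for two-digit integers, taking the tens digit as the stem and the units digit as the leaf. The `bills` values are hypothetical and the helper `stem_and_leaf` is written only for this illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical bill totals rounded to whole dollars (not the real tips data).
bills = [12, 15, 21, 24, 24, 30, 38, 17, 45]

for stem, leaves in stem_and_leaf(bills).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one class, and the number of leaves in a row is the absolute frequency of that class, which is why the plot reads like a histogram turned on its side. This sketch does not print rows for stems that have no observations.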

References

  1. Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
