Python for Data Science – Data Visualization – Toyota Dataset

In this note, we will learn to create basic plots using the matplotlib and seaborn libraries on the Toyota dataset. The basic plots include the scatter plot, histogram, bar plot, box and whiskers plot, and pairwise plot. Before going into the various kinds of plots, let us first understand what data visualization is.

Data visualization allows us to interpret data quickly and to adjust different variables to see their effect. Its main advantages are that graphs make patterns visible and that extreme values, which may be anomalies, stand out at a glance. So if we want to interpret the data easily, data visualization is the way to go. However, data visualization has certain limitations, which we will see in the later sections.

Popular plotting libraries in Python

Python offers several plotting libraries with different strengths. The most popular ones are listed below, although this note focuses only on the Matplotlib and Seaborn libraries.

  • Matplotlib: It is widely used to create 2D graphs and plots.
  • Pandas visualization: It is an easy-to-use interface built on top of the matplotlib library.
  • Seaborn: It provides a high-level interface for drawing attractive and informative statistical graphics, and it is also built on top of the matplotlib library. Whatever graphs and plots we can create using the seaborn library can also be created using the matplotlib library, usually with more code.
  • ggplot: It is used for advanced graphics and is based on R’s ggplot2, which implements the grammar of graphics. R is another programming language used for analytics.
  • Plotly: It is used to create interactive plots.

Matplotlib

Matplotlib is a 2D plotting library that produces good-quality figures. Although it originated as an emulation of the MATLAB graphics commands, it is independent of MATLAB. It makes heavy use of NumPy and other extension code to provide good performance even for large arrays.

Seaborn

Seaborn is a Python data visualization library based on the matplotlib library. It provides a high-level interface for drawing attractive and informative statistical graphics. It also adds conveniences on top of matplotlib, such as built-in themes and plotting functions that work directly on data frame columns.

We will draw various plots using both libraries and compare the benefits of using one over the other. We will also analyze and visualize the Toyota dataset to improve our skills in reading and interpreting plots. This dataset can be downloaded directly from Kaggle.com.

Importing data for Data Visualization

The first task is to download the dataset and load it into a pandas data frame. The code below does this. The entire Jupyter notebook can be downloaded from, or executed directly on, kaggle.com.

# import the necessary libraries

# Pandas library for data frames
import pandas as pd

# numpy library to do numerical operations
import numpy as np

# matplotlib library to do visualization
import matplotlib.pyplot as plt

# seaborn library for statistical plots (used as sns in the later sections)
import seaborn as sns

import os

# Set the working directory
os.chdir("/notepub/eda/")

# Importing data.
# index_col=0 uses the first column of the file as the row index.
# na_values=["??","????"] treats these placeholder strings as missing values (NaN).

cars_data = pd.read_csv("Toyota.csv",index_col=0,na_values=["??","????"])
Python for Data Science – EDA – Toyota Dataset
# Remove missing values from the dataframe 
cars_data.dropna(axis = 0, inplace=True)
Data Visualization – Removal of all rows contains NaN
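
The effect of na_values and dropna can be checked on a tiny in-memory file. The sketch below is only an illustration: the CSV text is invented (it is not an excerpt of Toyota.csv) and the variable names are hypothetical.

```python
import io

import pandas as pd

# Invented miniature file; the rows are for illustration only.
csv_text = """,Price,Age,KM,FuelType
0,13500,23,46986,Diesel
1,13750,23,72937,Diesel
2,13950,24,??,Diesel
3,14950,26,48000,????
4,13750,30,38500,Petrol
"""

# "??" and "????" are read as NaN instead of as ordinary strings.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "????"])

# Count the missing values per column before dropping anything.
missing_per_column = sample.isnull().sum()

# axis=0 drops every row that contains at least one NaN.
cleaned = sample.dropna(axis=0)

print(missing_per_column)
print(cleaned.shape)
```

Counting the missing values first shows how many rows dropna is about to discard, which is worth knowing before removing data.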

Scatter Plot

A scatter plot is a set of points that represents the values obtained for two different variables plotted on horizontal and vertical axes. A Scatter plot is used mainly to convey the relationship between two numerical variables, and it is also called correlation plots as it shows how two variables are correlated. The correlation can be positive, negative, or no correlation and all this information can be deduced just by looking at the patterns on the scatter plot.
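
The direction and strength of such a relationship can also be summarized in one number, the Pearson correlation coefficient. The sketch below is only an illustration: the ages and prices are invented rather than taken from Toyota.csv, and pearson_r is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented values: age in months and price in euros.
ages = [10, 20, 30, 40, 50, 60]
prices = [20000, 17500, 15000, 13000, 11000, 9500]

r = pearson_r(ages, prices)
print(round(r, 3))
```

A value close to -1 indicates a strong negative correlation, a value close to +1 a strong positive one, and a value close to 0 no linear correlation.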

Scatter Plot using the matplotlib library

We will create a scatter plot of two variables, Age and Price, and look for any correlation between them. The title and axis labels are set with the plt.title, plt.xlabel, and plt.ylabel functions.

# Create scatter plot using two variables, Age and Price.
# c= 'blue' is the color for the scatter plot
plt.scatter(cars_data['Age'],cars_data['Price'],c='blue')

# To set the title
plt.title('Scatter plot of Price vs Age of the Cars')

# To set the x and y axis labels.
plt.xlabel('Age (months)')
plt.ylabel('Price (Euros)')

# To show the scatter plot
plt.show()
Scatter plot of Price vs Age of the Cars

Scatter Plot using the Seaborn library

Scatter plot of Price vs. Age with default arguments. By default, fit_reg=True, so regplot fits a linear regression of y on x and draws the fitted line over the points. This is why the function is called a regression plot.

# Scatter plot using seaborn library
# Scatter plot of Price vs Age with default arguments

# Setting theme to the background of the plot
# Theme: Dark shade with grid

sns.set(style="darkgrid")

# regplot stands for regression plot
# Set the variables for the x and y axes
# Age vs. Price of the car

sns.regplot(x=cars_data['Age'], y=cars_data['Price'])
Scatter Plot – Price vs. Age
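
The line that regplot draws by default is, in essence, a straight line fitted by least squares. A minimal sketch of that calculation follows, using invented and perfectly linear data so that the result is easy to verify. fit_line is a hypothetical helper, not part of seaborn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented values: the price drops by exactly 200 euros per month of age.
ages = [10, 20, 30, 40, 50]
prices = [20000, 18000, 16000, 14000, 12000]

slope, intercept = fit_line(ages, prices)
print(slope, intercept)
```

A negative slope corresponds to a line that falls from left to right, which is the pattern to look for in the Price vs. Age plot.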

The regression line can be removed from the scatter plot by setting fit_reg=False, and the code then looks as follows:

sns.regplot(x=cars_data['Age'], y=cars_data['Price'], fit_reg=False)

The marker can be customized through the marker parameter, for example marker="*", and the code then looks as follows:

sns.regplot(x=cars_data['Age'], y=cars_data['Price'], fit_reg=False, marker="*")

Scatter plot of Price vs. Age by FuelType: Here we add one more variable, FuelType, to the scatter plot. We will analyze how the price changes with the car’s age for the different fuel types.

To do so, we use the hue parameter, which brings in another variable and shows the fuel type categories in different colors.

# lmplot is a function from seaborn library
# It combines regression plot and facetgrid
# It is useful when we want to plot a scatter plot with conditional subsets of data
# or by including another variable into the picture

# fit_reg = False; we don't want regression fit line

# hue = 'FuelType'; points are colored by the fuel type of the car
# legend = True; adds a legend showing which color
# represents which category

# palette = "Set1"; one of the predefined color palettes,
# used to color the data points by fuel type

sns.lmplot(x='Age', y='Price', data=cars_data, fit_reg=False, hue='FuelType', legend=True, palette="Set1")
Scatter Plot using lmplot

It is the same scatter plot, but now we can easily tell the data points apart by fuel type. Red represents Diesel, blue represents Petrol, and green represents CNG. From the colors, we can see that there are far more Petrol cars than cars of the other fuel types.

Similarly, we can also customize the appearance of the markers through their transparency, shape, and size.

Observations: The scatter plot shows that the price of the car decreases as the age of the car increases.

Histogram

A histogram is a graphical representation of data using bars of different heights. It groups numbers into ranges (bins), and the height of each bar shows the frequency of its bin. Histograms are used to represent the frequency distribution of numerical variables.
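
The grouping into ranges can be sketched without any plotting library. The helper below is hypothetical and the kilometre values are invented. It assumes equal-width bins in which every bin includes its left edge and only the last bin also includes its right edge, and it assumes the values are not all identical.

```python
def equal_width_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins  # assumes highest > lowest
    counts = [0] * bins
    for value in values:
        # The maximum would land in bin number `bins`, so clamp it into the last bin.
        position = min(int((value - lowest) / width), bins - 1)
        counts[position] += 1
    edges = [lowest + i * width for i in range(bins + 1)]
    return counts, edges


# Invented kilometre readings, for illustration only.
km = [0, 12000, 35000, 48000, 52000, 61000, 75000, 90000, 130000, 200000]

counts, edges = equal_width_counts(km, bins=5)
print(counts)
print(edges)
```

The counts are the bar heights and the edges are the bin boundaries along the x axis.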

We will create a histogram using the matplotlib library. The hist function takes the input data as its first argument, followed here by the bar color (color), the color of the edges separating the bars (edgecolor), and the number of bins (bins).

plt.hist(cars_data['KM'], color = 'blue', edgecolor = 'white', bins = 5)

# To set the title
plt.title('Histogram of Kilometer')

# To set the x and y axis labels.
plt.xlabel('Kilometer')
plt.ylabel('Frequency')

plt.show()
Histogram of Kilometre

We can also draw histograms with functions from the seaborn library. The simplest way is to pass just a column of data. However, several other parameters let us customize the histogram.

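# Histogram with the default settings (a kernel density estimate is overlaid)
# Note: distplot is deprecated since seaborn 0.11; histplot/displot replace it
sns.distplot(cars_data['KM'])
# Histogram without the kernel density estimate, using 5 bins
sns.distplot(cars_data['KM'], kde = False, bins = 5)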
Histogram of KM using distplot

Observations: The frequency distribution of kilometers shows that most cars have traveled between 50,000 and 100,000 km, and only a few cars have traveled farther.

Bar Plot

A bar plot presents categorical data with rectangular bars whose lengths are proportional to the counts they represent. Whenever we have categorical data and want the frequency of each category in a variable, we use a bar plot.

The bar plot is similar to the histogram, with two differences. A histogram has no space between its bars because it covers a continuous range, whereas a bar plot has gaps between its bars because it shows the frequencies of separate categories. In addition, the bar plot is used for categorical variables, and the histogram is used for numerical variables.

A bar plot therefore represents the frequency distribution of a categorical variable and makes it easy to compare different groups.

counts = [979,120,12]
fuelType = ("Petrol","Diesel","CNG")
index = np.arange(len(fuelType))

# index = X axis
# counts = Height of the bars

plt.bar(index, counts, color=['red', 'blue', 'cyan'])

# Title and labels.

plt.title("Bar plot of fuel types")
plt.xlabel("Fuel Types")
plt.ylabel("Frequency")

# index - Set the location of the xticks
# fuelType - Set the labels of the xticks

plt.xticks(index,fuelType,rotation = 90)

# Display the bar plot

plt.show()
Bar plot – Toyota Dataset
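
In the matplotlib example above, the counts are typed in by hand. They could instead be derived from the data. A minimal sketch with the standard library follows, using an invented list of fuel types rather than the real column.

```python
from collections import Counter

# Invented sample of a FuelType column, for illustration only.
fuel_types = ["Petrol", "Diesel", "Petrol", "CNG", "Petrol", "Diesel", "Petrol"]

# Counter tallies how often each category occurs.
frequency = Counter(fuel_types)

# most_common() returns the categories ordered from most to least frequent.
ordered = frequency.most_common()
labels = [label for label, _ in ordered]
counts = [count for _, count in ordered]

print(labels)
print(counts)
```

With the data frame itself, cars_data['FuelType'].value_counts() would give the same kind of tally, and its index and values could be passed to plt.bar in place of the hand-typed lists.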

Generating a bar plot with seaborn is much easier than with matplotlib, because countplot counts the categories for us. The frequency distribution of the cars’ fuel types is plotted as follows:

# Bar plot generation using countplot function
sns.countplot(x="FuelType",data=cars_data)
Bar plot – Fuel Type

Grouped Bar Plot: We will create a grouped bar plot using the FuelType and Automatic variables. It displays the frequency distribution of the cars’ fuel types, split by whether the gearbox is automatic or manual. This way, we can analyze the data for a combination of variables of interest.

# Grouped bar plot of FuelType and Automatic
sns.countplot(x="FuelType", data=cars_data, hue="Automatic")
Grouped bar plot of FuelType and Automatic

Observations: Bar plot of fuel type shows that most of the cars have petrol as fuel type.

Box and Whiskers plot

A box and whiskers plot summarizes a numerical variable through its five-number summary: the minimum, the maximum, and the three quartiles. It takes its name from the box drawn between the first and third quartiles and from the whiskers, the lines that extend from the box and end in short horizontal caps. Its parts are as follows.

  • The lowest horizontal line is the minimal whisker. It marks the minimum value, excluding outliers.
  • The highest horizontal line is the maximal whisker. It marks the maximum value, again excluding outliers.
  • The lower edge of the box is the first quartile (25th percentile), the line inside the box is the median (50th percentile), and the upper edge of the box is the third quartile (75th percentile).

Points above the maximal whisker or below the minimal whisker are treated as outliers. Outliers are extreme values that deviate from the other observations, and they may indicate variability in measurement or a genuine novelty.

The advantage of the box and whiskers plots is that we can easily identify the outliers of any numerical variables.
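
The rule that separates whiskers from outliers is not spelled out above. The sketch below assumes the common 1.5 × IQR convention, which is also the default (whis=1.5) in matplotlib and seaborn box plots. The prices are invented and the variable names are hypothetical.

```python
from statistics import quantiles

# Invented prices in euros, for illustration only (not taken from Toyota.csv).
prices = sorted([5000, 7500, 8000, 9000, 10000, 11000, 12000, 13000, 25000])

# method="inclusive" interpolates linearly between data points and treats
# the sample minimum and maximum as the 0th and 100th percentiles.
q1, median, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR away from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [p for p in prices if lower_fence <= p <= upper_fence]
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

# The whiskers end at the most extreme points that are still inside the fences.
whisker_low, whisker_high = inside[0], inside[-1]

print("quartiles:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data points, not at the fences themselves, which is why the maximum shown by a box plot can be lower than the largest value in the data.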

sns.boxplot(y=cars_data["Price"])
Box and Whiskers Plot – Price

Here, the box plot of Price lets us read the five-number summary visually.

  • The minimum price of the car is about 5000 euros.
  • The maximum price of the car (the upper whisker) is about 16000-17000 euros.
  • The first, second, and third quartiles are about 8000, 10000, and 12000 euros.

Box and Whiskers plot for numerical vs. categorical variable

A box and whiskers plot is also very useful for checking the relationship between one numerical and one categorical variable.

For example, we may want to see how the price varies with another variable. In the previous example, we only looked at the distribution of Price and used the plot to detect outliers. In this case, we will check the relationship between two variables: the price of the cars and their fuel type. Price is a numerical variable, and FuelType is a categorical variable.

sns.boxplot(x = cars_data['FuelType'], y = cars_data["Price"])
Box and Whiskers Plot – Price vs. Fuel Type

Observations:

The price varies across the fuel types. The middle line of each box marks the median: the median price is highest for Petrol cars and lower for Diesel and CNG cars. Both the maximum and the minimum price belong to Diesel cars.
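
Medians read off a plot can be checked numerically by grouping the prices per category. A minimal sketch with the standard library follows. The (fuel type, price) pairs are invented, chosen only to mimic the pattern described here, and the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Invented (fuel type, price in euros) pairs, for illustration only.
rows = [
    ("Petrol", 9500), ("Petrol", 10500), ("Petrol", 12000),
    ("Diesel", 7000), ("Diesel", 9000), ("Diesel", 21000),
    ("CNG", 9000), ("CNG", 9800),
]

# Collect the prices of each fuel type in its own list.
prices_by_fuel = defaultdict(list)
for fuel_type, price in rows:
    prices_by_fuel[fuel_type].append(price)

# One median per category, i.e. the middle line of each box.
median_by_fuel = {fuel: median(prices) for fuel, prices in prices_by_fuel.items()}
print(median_by_fuel)
```

With the data frame itself, cars_data.groupby("FuelType")["Price"].median() would give the same kind of summary in one line.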

Grouped Box and Whiskers plot

A grouped box and whiskers plot adds one more variable, Automatic, to the plot of Price by FuelType. We will see the relationship using a grouped box and whiskers plot of Price vs. FuelType and Automatic.

sns.boxplot(x = "FuelType", y = "Price", hue = "Automatic", data = cars_data)
Grouped box with whiskers plot of Price vs. FuelType and Automatic

Whenever we want a grouped box plot, we pass the additional variable through the hue parameter. In this case, Automatic has two values, zero and one: zero represents a manual transmission, and one represents an automatic transmission.

Box-whiskers plot and Histogram

In this section, we will plot box-whiskers plot and histogram on the same window, and we will analyze and see the advantage of having both on the same plot. For that, we need to split the plotting window into 2 parts. The upper part will display the box-whiskers or five-number summary, and the lower part will display the histogram. It is done using the subplots function of the matplotlib library. 

f,(ax_box, ax_hist) = plt.subplots(2,gridspec_kw={"height_ratios": (.15, .85)})

The first argument splits the figure into 2 rows, and gridspec_kw sets the height ratio of those rows (15% for the top row and 85% for the bottom row). The two axes returned by subplots are stored in ax_box and ax_hist, which are used to draw the box-whiskers plot in the first row and the histogram in the second row.

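# Draw both plots on the shared figure
# x= keeps the box horizontal so that it lines up with the histogram below
sns.boxplot(x=cars_data['Price'], ax=ax_box)
# distplot is deprecated since seaborn 0.11; sns.histplot is the suggested replacement
sns.distplot(cars_data['Price'], ax=ax_hist, kde=False)
plt.show()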
Box-whiskers plot and Histogram

Observations: Now we can see both the frequency distribution of a continuous variable and its five-number summary. Seeing both charts together, we can easily tell where the outliers are and what the median is. This helps us to understand the variable better.

Pairwise Plots

A pairwise plot shows the pairwise relationships in a dataset: scatter plots for the joint relationships and histograms for the univariate distributions. Using the snippet below, we can create the pairwise plot.

# hue must match the column name exactly: FuelType
sns.pairplot(cars_data,kind="scatter",hue="FuelType")
Pairwise Plots

This gives all possible pairwise relationships among the numerical variables. The plots on the diagonal pair a variable with itself, so they show that variable’s own distribution instead, while the off-diagonal plots are scatter plots of two different variables.
