In this note, we will learn to create basic plots using the matplotlib and seaborn libraries on the Toyota dataset. The basic plots include a scatter plot, histogram, bar plot, box and whiskers plot, and pairwise plots. Before going into the various kinds of plots, let us first understand what data visualization is.
Data visualization allows us to quickly interpret data and to adjust different variables to see their effect. Its main advantages are that patterns become visible in the graphs and that extreme values, which could be anomalies, stand out. So if we want to interpret the data easily, data visualization is the way to go. However, data visualization has certain limitations, which we will see in the later sections.
Popular plotting libraries in Python
Python offers multiple graphing libraries with diverse features. However, we will focus only on the Matplotlib and Seaborn libraries for data visualization.
- Matplotlib: It is widely used to create 2D graphs and plots.
- Pandas visualization: It is an easy-to-use interface, and it is built on top of the matplotlib library.
- Seaborn: It provides a high-level interface for drawing attractive and informative statistical graphics, and it is also built on top of the matplotlib library. However, whatever graphs and plots we can create using the seaborn library can also be created using the matplotlib library.
- ggplot: It is used for advanced graphics and is based entirely on R’s ggplot2. R is another programming language used for analytics, and ggplot2 follows the grammar of graphics.
- Plotly: It is used to create interactive plots.
Matplotlib
Matplotlib is a 2D plotting library that produces good-quality figures. Although it has its origins in emulating the MATLAB graphics commands, it is independent of MATLAB. It makes heavy use of NumPy and other extension code to provide good performance even for large arrays.
Seaborn
Seaborn is a Python data visualization library based on the matplotlib library. It provides a high-level interface for drawing attractive and informative statistical graphics. It also adds features on top of matplotlib, such as built-in themes and plots that fit and draw statistical summaries.
We will draw various plots using different libraries and analyze the benefits of using one library over another. We will also analyze and visualize the Toyota data set and improve our skills in reading and analyzing plots. This dataset can be downloaded directly from Kaggle.com.
Importing data for Data Visualization
The first and most important task is to download the dataset and load it into a pandas DataFrame. The code below does this. The entire Jupyter notebook can be downloaded or executed directly on kaggle.com.
# Import the necessary libraries
# pandas library for data frames
import pandas as pd
# numpy library for numerical operations
import numpy as np
# matplotlib library for visualization
import matplotlib.pyplot as plt
# seaborn library for statistical visualization (used as sns in the later sections)
import seaborn as sns
import os
# Set the working directory
os.chdir("/notepub/eda/")
# Import the data.
# index_col=0 uses the first column of the file as the index.
# na_values=["??", "????"] treats these strings as missing values (NaN).
cars_data = pd.read_csv("Toyota.csv", index_col=0, na_values=["??", "????"])
# Remove rows with missing values from the dataframe
cars_data.dropna(axis=0, inplace=True)
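To see what the na_values argument and dropna actually do, here is a minimal sketch on a tiny inline table. The rows are made up for illustration and are not taken from Toyota.csv; only the column names mirror the ones used in this note.

```python
import io
import pandas as pd

# Made-up stand-in for Toyota.csv; the values are illustrative only.
csv_text = """,Price,Age,KM,FuelType
0,13500,23,46986,Diesel
1,13750,23,72937,Diesel
2,13950,24,??,Petrol
3,14950,26,48000,????
4,13750,30,38500,Petrol
"""

# "??" and "????" are read as NaN because they are listed in na_values.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "????"])

# Count the missing values per column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column.
clean = sample.dropna(axis=0)
print(clean.shape)
```

Checking the missing-value counts before calling dropna shows how many rows the cleaning step is about to discard.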
Scatter Plot
A scatter plot is a set of points that represents the values obtained for two different variables plotted on horizontal and vertical axes. A scatter plot is used mainly to convey the relationship between two numerical variables. It is also called a correlation plot, as it shows how two variables are correlated. The correlation can be positive, negative, or absent, and all this information can be deduced just by looking at the patterns on the scatter plot.
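The direction of a correlation that we read off a scatter plot can also be expressed as a number. The sketch below computes the Pearson correlation coefficient with NumPy on made-up age and price values (not the Toyota data); a value near -1 corresponds to a falling point cloud and a value near +1 to a rising one.

```python
import numpy as np

# Made-up values for illustration only.
age = np.array([10, 20, 30, 40, 50, 60])
price_falling = np.array([20000, 18000, 15500, 14000, 11000, 9500])
price_rising = price_falling[::-1]  # same prices in reverse order

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# correlation between the two inputs.
r_neg = np.corrcoef(age, price_falling)[0, 1]
r_pos = np.corrcoef(age, price_rising)[0, 1]

print(f"falling prices: r = {r_neg:.3f}")
print(f"rising prices:  r = {r_pos:.3f}")
```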
Scatter Plot using the matplotlib library
We will create a scatter plot between two variables, Age and Price, and analyze any correlation. To set the title and labels in the scatter plot, we use matplotlib's predefined functions.
# Create a scatter plot using two variables, Age and Price.
# c='blue' is the color of the points
plt.scatter(cars_data['Age'], cars_data['Price'], c='blue')
# Set the title
plt.title('Scatter plot of Price vs Age of the Cars')
# Set the x and y axis labels
plt.xlabel('Age (months)')
plt.ylabel('Price (Euros)')
# Show the scatter plot
plt.show()
Scatter Plot using the Seaborn library
The scatter plot of Price vs. Age below uses default arguments. By default, fit_reg = True, so regplot estimates the coefficient of x and plots a regression model relating the x and y variables. This is why the function is called a regression plot.
# Scatter plot using the seaborn library
# Scatter plot of Price vs. Age with default arguments
# Set the theme for the background of the plot
# Theme: dark shade with grid
sns.set(style="darkgrid")
# regplot stands for regression plot
# Set the variables for the x and y axes
# Age vs. price of the car
sns.regplot(x=cars_data['Age'], y=cars_data['Price'])
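With default arguments, the line drawn by regplot is a straight-line least-squares fit. A rough way to see the coefficient being estimated is to fit the same kind of line with NumPy. The values below are made up and lie exactly on a line, so this is only a sketch of the idea, not a reproduction of seaborn's internals.

```python
import numpy as np

# Made-up ages (months) and prices (euros) lying exactly on a straight line.
age = np.array([10, 20, 30, 40, 50])
price = np.array([18500, 17000, 15500, 14000, 12500])

# Degree-1 least-squares fit: price is approximately slope * age + intercept.
slope, intercept = np.polyfit(age, price, 1)
print(f"slope = {slope:.1f} euros per month, intercept = {intercept:.1f} euros")
```

A negative slope means the fitted price drops as the age grows, which is the pattern the regression line makes visible on the plot.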
The regression line can be removed from the scatter plot by setting fit_reg=False, and the entire code then looks as follows:
sns.regplot(x=cars_data['Age'], y=cars_data['Price'], fit_reg=False)
The marker can be customized by setting the parameter marker="*", and the code looks as follows:
sns.regplot(x=cars_data['Age'], y=cars_data['Price'], fit_reg=False, marker="*")
Scatter plot of Price vs. Age by FuelType: In this setup, we add one more variable, FuelType, to the scatter plot. We will analyze how the price increases or decreases with the car’s age for the different fuel types.
To do this, we use the hue parameter, which takes another variable and shows the fuel type categories in different colors.
# lmplot is a function from the seaborn library
# It combines a regression plot and a FacetGrid
# It is useful when we want a scatter plot of conditional subsets of the data,
# or when we want to bring another variable into the picture
# fit_reg=False: we don't want the regression fit line
# hue='FuelType': points are colored by the fuel type of the car
# legend=True: shows which color represents which category
# palette="Set1": one of the predefined color palettes,
# used to color the data points by fuel type
sns.lmplot(x='Age', y='Price', data=cars_data, fit_reg=False, hue='FuelType', legend=True, palette="Set1")
It is the same scatter plot. However, now we can easily differentiate the data points by the categories of the fuel type. Red represents Diesel, blue represents Petrol, and green represents CNG. From the colors, we can easily see that there are more Petrol cars than cars of the other fuel types.
Similarly, we can also customize the appearance of the markers using transparency, shape, and size.
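As a minimal sketch of such customization, the matplotlib scatter function accepts alpha (transparency), marker (shape), and s (size). The data is made up, and the Agg backend line is only an assumption so that the sketch also runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up ages (months) and prices (euros) for illustration only.
age = [12, 24, 36, 48, 60, 72]
price = [19000, 17500, 15000, 13000, 11500, 9000]

fig, ax = plt.subplots()
# alpha: transparency, marker: shape ("^" is a triangle), s: marker area in points^2
points = ax.scatter(age, price, c="blue", alpha=0.5, marker="^", s=80)
ax.set_title("Scatter plot with customized markers")
ax.set_xlabel("Age (months)")
ax.set_ylabel("Price (Euros)")
# In a notebook or an interactive session, call plt.show() here.
```

Transparency is especially helpful when many points overlap, because dense regions then appear darker.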
Observations: In the scatter plot, we can analyze that the price of the car decreases as the age of the car increases.
Histogram
A histogram is a graphical representation of data using bars of different heights. It groups numbers into ranges, and the height of each bar depicts the frequency of each range or bin. Histograms are used to represent the frequency distribution of numerical variables.
We will create a histogram using the matplotlib library. The hist function takes the input data as its first argument, followed by the bar color (color), the color of the separation between bins (edgecolor), and the number of bins (bins).
plt.hist(cars_data['KM'], color='blue', edgecolor='white', bins=5)
# Set the title
plt.title('Histogram of Kilometer')
# Set the x and y axis labels
plt.xlabel('Kilometer')
plt.ylabel('Frequency')
plt.show()
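To make the role of the bins argument concrete, the sketch below computes the bin edges and counts directly with numpy.histogram, which performs the same kind of equal-width binning that a histogram plot displays. The kilometer values are made up for illustration.

```python
import numpy as np

# Made-up kilometer readings for illustration only.
km = np.array([5000, 12000, 30000, 45000, 47000,
               52000, 61000, 75000, 98000, 105000])

# bins=5 splits the range [min, max] into 5 equal-width intervals.
counts, edges = np.histogram(km, bins=5)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>8.0f} to {right:>8.0f}: {count} cars")
```

Here the range from 5000 to 105000 is cut into intervals of width 20000, and the bar heights are the five counts.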
We can also draw histograms using functions provided by the seaborn library. The default way to generate a histogram is just by passing the data in a column. However, there are various other parameters through which the histogram can be customized.
# Histogram with the default kernel density estimate
# Note: distplot is deprecated since seaborn 0.11; histplot and displot replace it
sns.distplot(cars_data['KM'])
# Histogram without the kernel density estimate (kde=False) and with 5 bins
sns.distplot(cars_data['KM'], kde=False, bins=5)
Observations: The frequency distribution of the kilometers shows that most cars have traveled between 5000 and 100000 km, and only a few cars have traveled farther.
Bar Plot
A bar plot presents categorical data with rectangular bars whose lengths are proportional to the counts they represent. Whenever we have categorical data and want the frequency of each category in a variable, we use a bar plot.
The bar plot is similar to the histogram. However, a histogram has no space between its bars, as it measures a continuous range, whereas a bar plot measures frequencies of categories, so there is space between its bars. Another difference is that the bar plot is used for categorical variables, and the histogram is used for numerical variables.
A bar plot is used to represent the frequency distribution of categorical variables. A bar plot makes it easy to compare sets of data between different groups.
counts = [979, 120, 12]
fuelType = ("Petrol", "Diesel", "CNG")
index = np.arange(len(fuelType))
# index = positions on the x axis
# counts = heights of the bars
plt.bar(index, counts, color=['red', 'blue', 'cyan'])
# Title and labels
plt.title("Bar plot of fuel types")
plt.xlabel("Fuel Types")
plt.ylabel("Frequency")
# index - sets the location of the xticks
# fuelType - sets the labels of the xticks
plt.xticks(index, fuelType, rotation=90)
# Display the bar plot
plt.show()
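The counts in the code above are typed in by hand. As a sketch of how such counts can be derived from the data instead, pandas value_counts returns the frequency of each category. The short column below is made up and only stands in for the FuelType column.

```python
import pandas as pd

# Made-up fuel types standing in for cars_data['FuelType'].
fuel = pd.Series(
    ["Petrol", "Petrol", "Diesel", "Petrol", "CNG", "Diesel", "Petrol"],
    name="FuelType",
)

# value_counts sorts the categories from most to least frequent.
frequency = fuel.value_counts()
labels = frequency.index.tolist()   # tick labels for the bars
heights = frequency.tolist()        # heights of the bars
print(labels, heights)
```

The two lists can then be passed to plt.bar and plt.xticks in place of the hard-coded values.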
Generating a bar plot with the seaborn library functions is much easier than with the matplotlib library functions. The frequency distribution of the fuel type of the cars is plotted as follows:
# Bar plot generation using the countplot function
sns.countplot(x="FuelType", data=cars_data)
Grouped Bar Plot: We will see how to create a grouped bar plot using the FuelType and Automatic variables. It displays the frequency distribution of the car’s fuel type, split by whether the car’s gearbox is automatic or manual. This way, we can analyze data with the combination of multiple variables of our interest.
# Grouped bar plot of FuelType and Automatic
sns.countplot(x="FuelType", data=cars_data, hue="Automatic")
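The numbers behind a grouped bar plot are a two-way frequency table. The sketch below builds such a table with pandas crosstab on a few made-up rows; the column names follow this note, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; 0 = manual, 1 = automatic.
cars = pd.DataFrame({
    "FuelType": ["Petrol", "Petrol", "Petrol", "Diesel", "Diesel", "CNG"],
    "Automatic": [0, 1, 0, 0, 0, 0],
})

# Rows are fuel types, columns are gearbox types, cells are counts.
table = pd.crosstab(cars["FuelType"], cars["Automatic"])
print(table)
```

Each cell of the table corresponds to the height of one bar in the grouped plot.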
Observations: Bar plot of fuel type shows that most of the cars have petrol as fuel type.
Box and Whiskers plot
A box and whiskers plot is used for analyzing data through its five-number summary. The five-number summary includes the minimum, the maximum, and the three quartiles. It is called a box and whiskers plot because it consists of a box and of whiskers that extend from the box to two horizontal end lines. We will explain it using the diagram below.
- The lower extreme horizontal line is called the minimal whisker. It represents the minimum value, excluding the outliers.
- The higher extreme horizontal line is called the maximal whisker. It represents the maximum value, also excluding the outliers.
- The lowest horizontal line of the box represents the first quartile, that is, the 25th percentile, and the middle line represents the 50th percentile, which is called the median. The upper horizontal line represents the 75th percentile, or third quartile.
The points above the maximal whisker and below the minimal whisker are considered outliers. Outliers are extreme values that deviate from the other observations in the data, and they may indicate variability in measurement or a novelty.
The advantage of the box and whiskers plots is that we can easily identify the outliers of any numerical variables.
sns.boxplot(y=cars_data["Price"])
In this case, we will use the box plot of Price to interpret the five-number summary visually.
- The minimum value of the price of the car is 5000 euros.
- The maximum value of the price of the car is 16000-17000 euros.
- The quartile values are 8000, 10000, and 12000 euros for the 1st, 2nd, and 3rd quartiles.
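The values read off a box plot can also be computed directly. The sketch below uses made-up prices and assumes the common convention, which is also the default in matplotlib and seaborn, that the whiskers reach to the most extreme points within 1.5 times the interquartile range of the box; anything beyond that is drawn as an outlier.

```python
import numpy as np

# Made-up prices in euros for illustration only.
price = np.array([5000, 8000, 9000, 10000, 11000, 12000, 13000, 30000])

# The three quartiles (25th, 50th, and 75th percentiles).
q1, median, q3 = np.percentile(price, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker rule: 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = price[(price >= lower_limit) & (price <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = price[(price < lower_limit) | (price > upper_limit)]

print("quartiles:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

The whisker ends are the minimum and maximum after excluding the outliers, which matches how the two extreme horizontal lines are described above.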
Box and Whiskers plot for numerical vs. categorical variable
The box and whiskers plot for a numerical vs. a categorical variable is very useful when we want to check the relationship between one numerical and one categorical variable.
An example is how the price varies with respect to another variable. In the previous example, we just looked at the distribution and tried to detect the outliers using the plot. In this case, however, we will check the relationship between two variables, namely the price of the cars for the various fuel types. The price of the car is a numerical variable, and the fuel type is a categorical variable.
sns.boxplot(x = cars_data['FuelType'], y = cars_data["Price"])
Observations:
The price varies for the different fuel types of cars. The middle line of each box represents the median, and the median price of the cars is really high when the car’s fuel type is Petrol. The median value is really low when the fuel type is either Diesel or CNG. The maximum price of the car is for the Diesel fuel type, and the minimum price of the car is also for the Diesel fuel type.
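The per-category values that such a box plot displays can be tabulated with a pandas groupby. The rows below are made up for illustration and do not reproduce the Toyota data or the observations above; they only show the mechanics.

```python
import pandas as pd

# Made-up rows for illustration only.
cars = pd.DataFrame({
    "FuelType": ["Petrol", "Petrol", "Petrol", "Diesel", "Diesel", "Diesel", "CNG", "CNG"],
    "Price": [9000, 10000, 11000, 7000, 8000, 20000, 8500, 9500],
})

# One row per fuel type with the minimum, median, and maximum price.
summary = cars.groupby("FuelType")["Price"].agg(["min", "median", "max"])
print(summary)
```

Comparing the median column across rows is the numerical counterpart of comparing the middle lines of the boxes.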
Grouped Box and Whiskers plot
The grouped box and whiskers plot shows the prices by fuel type while including one more variable, Automatic. We will see the relationship using a grouped box and whiskers plot of Price vs. FuelType and Automatic.
sns.boxplot(x="FuelType", y="Price", hue="Automatic", data=cars_data)
Whenever we want a grouped box plot, we pass the additional variable through the hue parameter. In this case, the Automatic variable has two values, zero and one. Zero represents manual, and one represents automatic transmission.
Box-whiskers plot and Histogram
In this section, we will plot box-whiskers plot and histogram on the same window, and we will analyze and see the advantage of having both on the same plot. For that, we need to split the plotting window into 2 parts. The upper part will display the box-whiskers or five-number summary, and the lower part will display the histogram. It is done using the subplots function of the matplotlib library.
f,(ax_box, ax_hist) = plt.subplots(2,gridspec_kw={"height_ratios": (.15, .85)})
The first parameter tells subplots to split the window into 2 rows, and the second parameter gives the row ratios. This way, it splits the window into two parts row-wise and takes the specification of the grid through gridspec_kw. The axes returned by subplots are saved in the ax_box and ax_hist variables. They will be used to plot the box-whiskers on the first row and the histogram on the second row.
# Create two plots
# x= draws the box horizontally so that it lines up with the histogram's x axis
sns.boxplot(x=cars_data['Price'], ax=ax_box)
sns.distplot(cars_data['Price'], ax=ax_hist, kde=False)
Observations: Now, we can see both the frequency distribution of any continuous variable as well as the five-number summary. And seeing both the charts together, we can easily say where the outliers are and what the median is. This helps us to understand the parameters in a better way.
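For comparison, the same two-row layout can be sketched with matplotlib alone. The prices below are random numbers generated for illustration, sharex=True is an added assumption so that both rows use the same price axis, and the Agg backend line is only there so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Randomly generated prices for illustration only.
rng = np.random.default_rng(0)
price = rng.normal(10000, 2000, size=200)

# Two rows: 15% of the height for the box plot, 85% for the histogram.
fig, (ax_box, ax_hist) = plt.subplots(
    2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}
)

# Horizontal box plot on top, histogram below.
ax_box.boxplot(price, vert=False)
ax_hist.hist(price, bins=20, edgecolor="white")
ax_hist.set_xlabel("Price (Euros)")
ax_hist.set_ylabel("Frequency")

# The heights of the two axes follow the requested ratio.
box_height = ax_box.get_position().height
hist_height = ax_hist.get_position().height
print(round(box_height / hist_height, 3))
# In a notebook or an interactive session, call plt.show() here.
```

Sharing the x axis keeps the box, the whiskers, and the outliers vertically aligned with the bins they fall into.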
Pairwise Plots
Pairwise plots are used to plot pairwise relationships in a dataset. Mainly, we create scatter plots for joint relationships and histograms for univariate distributions. Using the snippet below, we can create the pairwise plot.
sns.pairplot(cars_data, kind="scatter", hue="FuelType")
Basically, we get all possible relationships between all the variables. The diagonal plots, however, are drawn against the same variable, so they show the distribution of that single variable, like a histogram, while the other plots are drawn against different variables.
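The structure of the pairwise grid can be enumerated in a few lines. The sketch below uses three column names from this note as an assumed subset of the numerical variables and lists which cells lie on the diagonal and which compare two different variables.

```python
import itertools

# Assumed subset of numerical columns, for illustration only.
columns = ["Price", "Age", "KM"]

# Every (row variable, column variable) cell of the grid.
grid = list(itertools.product(columns, repeat=2))

# Diagonal cells pair a variable with itself (univariate distribution);
# all other cells pair two different variables (scatter plot).
diagonal = [(row, col) for row, col in grid if row == col]
off_diagonal = [(row, col) for row, col in grid if row != col]

print(len(grid), "cells in total")
print("diagonal:", diagonal)
print("off-diagonal:", off_diagonal)
```

With n variables the grid has n * n cells, of which n are on the diagonal, so the number of plots grows quickly as columns are added.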
References
- NPTEL lectures on Introduction to Python for Data Science, IIT Madras.