In this note, we will learn how to use the Python seaborn library to draw scatter plots, and how to customize them through the function's parameters so that the same data can be displayed in different ways for analysis.
We use the dm_office_sales.csv dataset, which contains both categorical and continuous variables (features) and has no missing data.
Importing data for Data Visualization
The first task is to download the dataset and load it into a pandas DataFrame. The code below does this. The entire Jupyter notebook can be downloaded or executed directly on kaggle.com.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
# Base library for the seaborn library
import matplotlib.pyplot as plt
# For all kinds of plots
import seaborn as sns

# Importing data.
# index_col=0 uses the first column of the file as the index.
# na_values=["??", "????"] treats these strings as missing values (NaN).
# This dataset contains both numerical and categorical variables, which helps us learn the uses of the seaborn library.
office_data = pd.read_csv("/kaggle/input/dm-office-sales/dm_office_sales.csv", index_col=0, na_values=["??", "????"])
office_data
To check for missing values and other dataset information, we use the following code snippet. This dataset has no missing data; if it did, we could use different approaches to fill it in.
# These functions give the summary of missing data.
# office_data.isna().sum()
# office_data.isnull().sum()
# For an executive summary about the dataset, we can also use the following code.
# office_data.head()
office_data.info()
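Since this dataset has no gaps, the following is only a sketch of two common filling approaches on a small made-up DataFrame. The name `toy` and all of its values are hypothetical and are not taken from dm_office_sales.csv. It fills a numerical column with its median and a categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from dm_office_sales.csv.
toy = pd.DataFrame({
    "salary": [50000.0, np.nan, 70000.0, 90000.0],
    "level of education": ["associate's degree", "high school", None, "high school"],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Numerical column: fill with the median of the observed values.
toy["salary"] = toy["salary"].fillna(toy["salary"].median())

# Categorical column: fill with the most frequent category (the mode).
most_common = toy["level of education"].mode()[0]
toy["level of education"] = toy["level of education"].fillna(most_common)

# Count again to confirm that nothing is missing anymore.
missing_after = toy.isna().sum()
print(missing_before, missing_after, sep="\n")
```

Which approach is appropriate depends on the data; the median and the mode are simply two frequent defaults.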
This dataset has four numerical variables: training level, work experience, salary, and sales. In contrast, the level of education is a categorical variable. Using scatter plots, we will learn how to draw, analyze, and infer the relationships between different variables (features).
We can use a few of the matplotlib functions to customize the scatter plots, and these are as follows.
# To set the figure size and clarity
plt.figure(figsize=(10,3),dpi=200)
# To save the figure
plt.savefig("figure-name.png")
# To display the figure (terminals only)
plt.show()
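The order of these calls matters in a script: saving should normally come before showing. The sketch below illustrates this with placeholder points; the output path is a hypothetical temporary location, and the Agg backend is chosen only so that the example runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.mkdtemp(), "figure-name.png")

# Create the figure first so that the size and dpi apply to what is drawn next.
fig = plt.figure(figsize=(10, 3), dpi=200)
plt.plot([1, 2, 3], [2, 4, 6], "o")  # placeholder points

# Save before calling plt.show(): with interactive backends the figure may be
# closed once its window is dismissed, which can leave an empty image on disk.
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)
```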
Scatter Plot using the Seaborn library
We will generate a scatter plot between sales and salary using the scatterplot() function.
# Set the figure size
plt.figure(figsize=(10,3),dpi=200)
# Generate a scatter plot between salary and sales from the dataset
sns.scatterplot(x='salary', y='sales',data=office_data)
Observations: We can clearly see that sales increase with salary. Most salaries lie in the range of 60000 to 120000, and most sales in the range of 200000 to 600000 units.
Seaborn scatterplot function parameters
We will look at a few parameters that are used very frequently when drawing scatter plots with the Python seaborn library:
hue Parameter: The hue parameter determines which column in the data frame is used for color encoding. It can be used with both categorical and numerical variables. By adding hue to a two-dimensional plot, we can show a third variable and so extract three-dimensional information.
plt.figure(figsize=(10,3),dpi=200)
# Uses of hue parameter.
sns.scatterplot(x='salary', y='sales',data=office_data, hue="level of education")
Observations: Using hue, we can analyze the relationship between salary, sales, and level of education. As per the plot, the level of education appears to be independent of salary and sales. Most employees with associate degrees draw the highest salaries and deliver the highest sales.
Now, we will analyze the relationship between salary, sales, and training level. The hue parameter is configured for training level, and it gives extra information about whether salary and sales are related to training level or not.
plt.figure(figsize=(10,3),dpi=200)
sns.scatterplot(x='salary', y='sales',data=office_data, hue="training level")
Observations: Using hue, we can analyze the relationship between salary, sales, and training level. As per the plot, the training level is related to salary and sales. Most employees with the highest training level draw the highest salaries and deliver the highest sales.
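The two hue examples differ in how seaborn picks colors: a categorical column gets one distinct color per category, while a numerical column is mapped onto a continuous color ramp by default. The sketch below draws both side by side. The `toy` DataFrame is hypothetical and only mimics the column names of the dataset; the Agg backend is used so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy rows that mimic the column names of the dataset.
toy = pd.DataFrame({
    "salary": [60000, 75000, 90000, 105000, 120000, 80000],
    "sales": [200000, 300000, 380000, 450000, 560000, 310000],
    "level of education": ["high school", "some college", "associate's degree",
                           "high school", "associate's degree", "some college"],
    "training level": [0, 1, 2, 2, 3, 1],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 3))

# Categorical hue: one distinct color per category.
sns.scatterplot(x="salary", y="sales", data=toy, hue="level of education", ax=axes[0])

# Numerical hue: colors follow a continuous ramp by default.
sns.scatterplot(x="salary", y="sales", data=toy, hue="training level", ax=axes[1])

# Collect the legend entries of the categorical plot.
categorical_labels = [t.get_text() for t in axes[0].get_legend().get_texts()]
plt.close(fig)
```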
palette Parameter: It selects the set of colors used to encode the data in the visualization. A palette consists of multiple colors, and these colors are applied to the markers, which represent the data points. The matplotlib colormaps provide a complete range of palettes. The scatter plot in this section uses the Set1 palette. If we provide an invalid palette name, seaborn raises an error; depending on the version, the message may list the available palette names.
With the Set1 palette, the training levels are now shown in different colors than in the previously generated scatter plot, which used the same function arguments but no palette argument.
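The call that produces this plot is not shown in the text. Presumably it is the previous call with palette='Set1' added, and the sketch below makes that assumption. The `toy` DataFrame is hypothetical and stands in for office_data so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for office_data with the same column names.
toy = pd.DataFrame({
    "salary": [60000, 75000, 90000, 105000, 120000, 80000],
    "sales": [200000, 300000, 380000, 450000, 560000, 310000],
    "training level": [0, 1, 2, 2, 3, 1],
})

plt.figure(figsize=(10, 3))
# Same arguments as the earlier hue example, plus a named palette.
ax = sns.scatterplot(x="salary", y="sales", data=toy,
                     hue="training level", palette="Set1")

# Inspect the colors actually assigned to the points.
face_colors = ax.collections[0].get_facecolors()
n_colors_used = len(np.unique(face_colors, axis=0))
print("distinct marker colors:", n_colors_used)
plt.close("all")
```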
style Parameter: It uses a different kind of marker for each category. It is well suited to categorical data with a limited number of categories. The main difference between style and hue is that hue is suitable for any data type, whereas style is suitable only for categorical data.
plt.figure(figsize=(10,3),dpi=200)
sns.scatterplot(x='salary', y='sales',data=office_data, s=200, style="level of education")
hue, style, and palette Parameters: Combining the hue, style, and palette parameters produces a scatter plot that is very easy to read.
plt.figure(figsize=(10,3),dpi=200)
sns.scatterplot(x='salary', y='sales',data=office_data, style="level of education", hue="level of education", palette='Set1')
size Parameter: It varies the marker size based on a variable. It makes the most sense when the column values (variable) are continuous in nature, so that three-dimensional information can be represented on the 2D plot.
plt.figure(figsize=(10,3),dpi=200)
sns.scatterplot(x='salary', y='sales',data=office_data, size="training level")
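By default seaborn chooses the range of marker sizes itself. If the differences are hard to see, the sizes parameter of scatterplot() accepts a (min, max) tuple. The sketch below is one possible setting, not a recommendation; the `toy` DataFrame is hypothetical and only mimics the column names of the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for office_data with the same column names.
toy = pd.DataFrame({
    "salary": [60000, 75000, 90000, 105000, 120000, 80000],
    "sales": [200000, 300000, 380000, 450000, 560000, 310000],
    "training level": [0, 1, 2, 2, 3, 1],
})

plt.figure(figsize=(10, 3))
# size picks the column; sizes sets the smallest and largest marker area.
ax = sns.scatterplot(x="salary", y="sales", data=toy,
                     size="training level", sizes=(20, 200))

# One size value per plotted point.
marker_sizes = ax.collections[0].get_sizes()
print(sorted(set(marker_sizes)))
plt.close("all")
```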
There are other useful parameters:
- s: To set the size of the markers
- alpha: To set the opacity of each marker. Its range is from 0 (fully transparent) to 1 (fully opaque), and the default is 1. With a lower value such as 0.5, overlapping points appear darker, so we can easily see where most of the data points lie compared to the remaining data points.
plt.figure(figsize=(10,3),dpi=200)
sns.scatterplot(x='salary', y='sales',data=office_data, style="level of education", hue="level of education", palette='Set1', alpha=0.9, s=400)
There are a few useful functions that we can use to decorate our plots.
- sns.set_style() : To style plot background
- sns.set_context(): To set up many parameters that will define how seaborn produces plots
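Both functions change matplotlib's global settings, so they affect every plot drawn afterwards. A minimal sketch, assuming the "whitegrid" style and the "talk" context as example choices (seaborn also ships other styles and the contexts "paper", "notebook", and "poster"):

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend so the sketch runs without a display
import seaborn as sns

# Remember the font size in effect before changing anything.
font_size_before = mpl.rcParams["font.size"]

# "whitegrid" gives a white background with grid lines.
sns.set_style("whitegrid")

# "talk" scales fonts and line widths up, e.g. for slides.
sns.set_context("talk")

# Both calls write into matplotlib's rcParams.
grid_enabled = mpl.rcParams["axes.grid"]
font_size_after = mpl.rcParams["font.size"]
print(grid_enabled, font_size_before, font_size_after)
```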
Limitation of Scatter Plots
A scatter plot cannot give the exact correlation between two variables. It only shows visually how one quantity changes with another, so from the plot alone one cannot confidently state the exact relationship between two variables.
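One way to put a number on what the plot only suggests is the Pearson correlation coefficient, which pandas provides through the corr() method. The sketch below uses hypothetical toy values rather than the real dataset, so the printed number says nothing about dm_office_sales.csv.

```python
import pandas as pd

# Hypothetical toy values; not taken from the real dataset.
toy = pd.DataFrame({
    "salary": [60000, 75000, 90000, 105000, 120000],
    "sales": [200000, 300000, 380000, 450000, 560000],
})

# Pearson correlation: +1 is a perfect increasing linear relation,
# 0 means no linear relation, -1 is a perfect decreasing linear relation.
r = toy["salary"].corr(toy["sales"])
print(f"Pearson r = {r:.3f}")
```

Pearson's coefficient only measures linear association, so it complements the scatter plot rather than replacing it.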