Python for Data Science – Exploratory Data Analysis – IRIS Dataset

Exploratory Data Analysis (EDA) is a largely visual method for finding structure in data. It relies on the human eye and brain, which are good at detecting structures and patterns. However, it becomes much harder as the dataset grows. EDA is an essential step before applying machine learning algorithms for prediction.

IRIS Dataset

It is a simple dataset containing flowers from three species of the Iris family: Iris setosa, Iris virginica, and Iris versicolor. Four features are measured in centimeters for each flower: sepal length, sepal width, petal length, and petal width. The choice of features was left to the judgment of the domain expert. There are fifty samples per species, giving 150 records in total.

The objective of using this dataset is to classify a new flower, given its four features, into one of the three classes.

Whenever we do data analysis, we need to keep our primary objective in mind and choose the type of analysis or plotting accordingly. In this case, the problem statement is: given a flower with four features, can we classify it into one of the three categories?

Importing Data

This dataset can be downloaded directly from kaggle.com or other websites. The code below loads the dataset into a pandas DataFrame object, and we will use the same DataFrame object for the analysis.

# import the necessary libraries

# Pandas library for data frames
import pandas as pd

# numpy library to do numerical operations
import numpy as np

# matplotlib library to do visualization
import matplotlib.pyplot as plt

# for all kinds of plots
import seaborn as sns

# os library to change the working directory
import os

# Set the working directory
os.chdir("/notepub/eda/")

# Importing data.
# index_col=0 uses the first column of the file as the row index.
# na_values=["??","????"] treats these markers as missing values (NaN).
iris_data = pd.read_csv("Iris.csv",index_col=0,na_values=["??","????"])

It is assumed that we have downloaded the Iris dataset from the above-mentioned location and saved it on our computer at /notepub/eda/. Using the pandas read_csv function, we load it into a DataFrame object called iris_data.
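To see what the na_values argument does without needing the file on disk, here is a minimal sketch that reads a two-row inline sample. The first row is the sample quoted later in this note; the second row, with a ?? marker, is made up purely for illustration, and the header is assumed to match the column names used in the plotting code.

```python
import io
import pandas as pd

# Inline stand-in for Iris.csv (assumption: same header as the downloaded file).
# Row 2 is a made-up record containing a "??" missing-value marker.
sample_csv = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,??,3.0,1.4,0.2,Iris-setosa\n"
)

sample = pd.read_csv(sample_csv, index_col=0, na_values=["??", "????"])

# Count missing values per column after the markers were converted to NaN
missing_per_column = sample.isna().sum()

print(sample.shape)
print(missing_per_column)
```

Because the marker is converted to NaN while parsing, the SepalLengthCm column stays numeric instead of falling back to strings. The same isna().sum() check can be run on iris_data after loading the real file.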

Data Exploration

Using the below code, we will analyze the dataset and its features. 

iris_data = pd.read_csv("Iris.csv",index_col=0,na_values=["??","????"])
iris_data
Iris dataset – Information

It has four features and the species class for each sample. For example, one sample has a sepal length of 5.1 cm, a sepal width of 3.5 cm, a petal length of 1.4 cm, and a petal width of 0.2 cm, and it belongs to the Iris-setosa class.

To see the shape of the dataset and the number of samples per species, we can use the following code snippet.

# prints the shape of the dataset
iris_data.shape

# print column name
iris_data.columns

# print the number of samples for each species
iris_data["Species"].value_counts()
Iris dataset – individual features count

By looking at the sample count per class, we can determine whether the dataset is balanced. A balanced dataset has a more or less equal number of samples in each class, whereas in an imbalanced dataset the per-class counts vary widely. An example of an imbalanced dataset would be one with 100 Iris-virginica samples, 40 Iris-versicolor samples, and 10 Iris-setosa samples. With an imbalanced dataset, a model's predictions tend to be biased towards the majority class and can be less accurate for the rare classes.

In the Iris dataset, each of the three classes has 50 samples, so we can confidently say that the dataset is balanced.
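The balance check can also be expressed as a number. The sketch below is one possible way to do it, not a standard API: the helper names class_balance and imbalance_ratio are hypothetical, and the label columns are built by hand from the counts mentioned above rather than read from the file.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and the proportion of samples for each class."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})

def imbalance_ratio(labels: pd.Series) -> float:
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())

# The imbalanced example from the text: 100 / 40 / 10 samples
imbalanced = pd.Series(
    ["Iris-virginica"] * 100 + ["Iris-versicolor"] * 40 + ["Iris-setosa"] * 10
)

# The layout of the Iris dataset: 50 samples per species
balanced = pd.Series(
    ["Iris-virginica"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-setosa"] * 50
)

print(class_balance(imbalanced))
print(imbalance_ratio(imbalanced), imbalance_ratio(balanced))
```

With the real data, the same helpers can be called as imbalance_ratio(iris_data["Species"]).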

Bivariate Analysis

Bivariate analysis is one of the simplest ways of quantitative analysis. It involves the analysis of two variables for the purpose of determining the empirical relationship between them. It can be descriptive or inferential. However, in this note, we are focused on the descriptive analysis, primarily using graphical methods to visualize the characteristics of a dataset and the relationship among the variables using different visualization techniques.

Scatter Plot

A scatter plot is a 2D plot that represents the relationship between two variables. We will try different combinations of features and see which combination separates the species best in the 2D plane.

In this case (1), we have used sepal length and sepal width to see whether we can distinguish each species from others.

# Scatterplot using seaborn library
plt.figure(figsize=(10,3),dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm',data=iris_data, hue='Species')
Iris dataset – Sepal Width & Sepal Length

Exploration Result: Using sepal length and width features, we can partially distinguish Setosa flowers from others. 

In this case (2), we have used petal length and petal width to see whether we can distinguish each species from others.

plt.figure(figsize=(10,3),dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm',data=iris_data, hue='Species')
Iris dataset – Petal Width & Petal Length

Exploration Result: Using petal length and petal width features, we can clearly distinguish Setosa flowers from others. 
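The visual impression that Setosa separates cleanly on the petal features can be checked with a one-line rule. This is only a sketch under assumptions: it uses the copy of Iris bundled with scikit-learn instead of Iris.csv (so the column names differ from the ones above), and the 2.5 cm cut-off is a hypothetical value picked by eye from the plot.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
iris = load_iris(as_frame=True)
frame = iris.frame

# True for the rows whose class is setosa
setosa_code = list(iris.target_names).index("setosa")
is_setosa = frame["target"] == setosa_code

# Hypothetical cut-off, chosen by eye from the scatter plot
PETAL_LENGTH_CUTOFF_CM = 2.5
predicted_setosa = frame["petal length (cm)"] < PETAL_LENGTH_CUTOFF_CM

# Fraction of flowers for which the rule agrees with the true label
setosa_rule_accuracy = float((predicted_setosa == is_setosa).mean())
print(setosa_rule_accuracy)
```

A rule this simple only answers the "Setosa or not" question; it says nothing about telling Versicolor and Virginica apart.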

In this case (3), we have used petal length and sepal length to see whether we can distinguish each species from others.

plt.figure(figsize=(10,3),dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='SepalLengthCm', y='PetalLengthCm',data=iris_data, hue='Species')
Iris dataset – Petal Length and Sepal Length

Exploration Result: Using the petal length and sepal length features, we can clearly distinguish Setosa flowers from the others. However, separating Versicolor from Virginica is much more difficult because they overlap. So these two species are not linearly separable in this plane.

Another alternative is to draw a 3D scatter plot using the Plotly library and analyze the relationship among three features at once.

import plotly.express as px

fig = px.scatter_3d(iris_data, x='SepalWidthCm', y='PetalWidthCm', z='PetalLengthCm', color='Species')
fig.show()
Iris dataset – 3D scatter plot

Exploration Result: Using the sepal width, petal width, and petal length features, we can clearly distinguish Setosa flowers from the others. However, separating Versicolor from Virginica is still difficult, even in three dimensions.

We have seen that different combinations of features produce different results. Plotting every combination individually is a tiresome job. Instead, we can use a pair plot to draw all the scatter plots together.

Pair plots

A pair plot generates a scatter plot for every pair of features. Drawing all of these pairwise plots by hand becomes impractical as the number of features grows. With seaborn, we can generate all the pairwise scatter plots easily using the code below.

plt.close()
sns.set_style("whitegrid")
sns.pairplot(iris_data,hue="Species",height=3, diag_kind="kde")
plt.show()
Iris dataset – pair plot

Exploration Result: Petal length and petal width are the most useful features for identifying and differentiating the flower species. In most of the scatter plots, Setosa is linearly separable, while the other two species are harder to separate.
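To get a feel for how quickly the number of pairwise plots grows, the sketch below lists the distinct feature pairs for the four Iris columns and counts the pairs for two wider, purely hypothetical datasets with 10 and 50 features.

```python
from itertools import combinations
from math import comb

# Column names as used in the scatter plot code above
feature_names = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Every unordered pair of distinct features is one scatter plot
feature_pairs = list(combinations(feature_names, 2))
for x_name, y_name in feature_pairs:
    print(x_name, "vs", y_name)

# Number of distinct pairs for n features is n * (n - 1) / 2.
# The sizes 10 and 50 are hypothetical, for comparison only.
pairs_needed = {n: comb(n, 2) for n in (4, 10, 50)}
print(pairs_needed)
```

The pair plot draws each of these pairs twice (once above and once below the diagonal), which is why the 4 x 4 grid has twelve scatter panels plus four diagonal density plots.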

Limitation of Scatter Plots

We have seen 2D scatter plots, 3D scatter plots, and pair plots, and almost all of them share a limitation: with a large number of data points, these plots reveal little structure because most of the markers are concealed by overprinting, which can be significant for multimodal data. To overcome some of these problems, we can use contour density plots, trellis plots, and other techniques.
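Overprinting is already present in a dataset as small as Iris, because the measurements are recorded to one decimal place and several flowers share exactly the same values. The sketch below counts how many markers in the petal length versus petal width plot land on a position that is already occupied. It assumes the copy of Iris bundled with scikit-learn, whose column names differ from those in Iris.csv.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
iris = load_iris(as_frame=True)
petals = iris.data[["petal length (cm)", "petal width (cm)"]]

# A row is "hidden" if an earlier row has exactly the same (length, width) pair,
# because its marker is drawn on top of an existing one
hidden_markers = int(petals.duplicated().sum())

# Number of distinct positions that are actually visible in the plot
distinct_positions = len(petals.drop_duplicates())

print(hidden_markers, distinct_positions)
```

Setting a low alpha value on the markers, or adding a small random jitter, are common ways to make such stacked points visible in a scatter plot.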

Univariate Analysis

Univariate analysis examines a single variable to find the parameters that help us understand it. These parameters are the central tendency and the spread. For the central tendency, we can use the mean, median, or mode, choosing the appropriate one once we understand the data. There is a wide range of tools and techniques to analyze a variable, such as frequency distribution, class interval, histogram, kernel density estimates, box and whisker plots, etc. In the subsequent sections, we will get familiar with various methods to study the statistics of a variable.
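As a minimal sketch of these parameters, the code below computes the central tendency and the spread of a single variable, petal length, with pandas. It assumes the copy of Iris bundled with scikit-learn, so the column name differs from the one in Iris.csv.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv
iris = load_iris(as_frame=True)
petal_length = iris.data["petal length (cm)"]

summary = {
    # Central tendency
    "mean": petal_length.mean(),
    "median": petal_length.median(),
    "mode": petal_length.mode().tolist(),  # a variable can have several modes
    # Spread
    "std": petal_length.std(),
    "range": petal_length.max() - petal_length.min(),
    "iqr": petal_length.quantile(0.75) - petal_length.quantile(0.25),
}

for name, value in summary.items():
    print(name, value)
```

When the mean and the median differ noticeably, the distribution is skewed or has more than one cluster, and the median is usually the safer summary.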

1D Scatter Plot

A 1D scatter plot is used when the data to visualize contains only one-dimensional data points. The example code below draws a 1D scatter plot.

# Plots using 1D scatter plot

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load Iris dataset from seaborn library
iris_data = sns.load_dataset('iris')

# Get the Species based on the setosa, versicolor and virginica
iris_setosa = iris_data.loc[iris_data['species'] == 'setosa']
iris_versicolor = iris_data.loc[iris_data['species'] == 'versicolor']
iris_virginica = iris_data.loc[iris_data['species'] == 'virginica']

plt.figure(figsize=(10,2),dpi=200)

# Plot 1D graph based on the petal length of all the species
# The 'o' format draws unconnected point markers instead of a connecting line
plt.plot(iris_setosa['petal_length'],np.zeros_like(iris_setosa['petal_length']),'o',label='Setosa')
plt.plot(iris_versicolor['petal_length'],np.zeros_like(iris_versicolor['petal_length']),'o',label='Versicolor')
plt.plot(iris_virginica['petal_length'],np.zeros_like(iris_virginica['petal_length']),'o',label='Virginica')


# np.zeros_like() gives us an array of zeros with the same shape as its input
# We usually draw a scatter plot between two variables, but in this case the second variable is all zeros

# Plot customization
plt.legend();
plt.title('1D scatter plot based on different Iris Species')
plt.xlabel("Petal Length");
plt.show();
1D scatter plot based on different Iris Species – Petal Length

Conclusion: Scatter plots are suitable for finding the relationship between two variables. However, plotting a single variable in one dimension for all species together shows where the values of that variable are concentrated. Looking at the plot, we can conclude that the Setosa petal length lies roughly between 1 and 2 cm, Versicolor between about 3 and 5 cm, and Virginica between about 4.5 and 7 cm, so the Versicolor and Virginica ranges overlap.

The significant disadvantage of a 1D scatter plot is that most of the data points overlap, making the plot harder to analyze. The question is: is there a better way to do a univariate analysis?
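The ranges read off the plot can be confirmed numerically. The following sketch groups the petal length by species and reports the minimum and maximum of each group. It assumes the copy of Iris bundled with scikit-learn; the species column is added here for convenience and is not part of that dataset's frame.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the seaborn copy
iris = load_iris(as_frame=True)
frame = iris.frame.copy()

# Map the integer class codes (0, 1, 2) to the species names
frame["species"] = frame["target"].map(dict(enumerate(iris.target_names)))

# Smallest and largest petal length observed for each species
petal_ranges = frame.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(petal_ranges)
```

If the maximum of one species is below the minimum of another, a single threshold on this feature separates them; where the ranges overlap, it cannot.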


CITE THIS AS:
“Python for Data Science – Exploratory Data Analysis – IRIS Dataset” From NotePub.io – Publish & Share Note! https://notepub.io/notes/programming-languages/python-for-data-science/python-for-data-science-exploratory-data-analysis-iris-dataset/

