Exploratory Data Analysis (EDA) is a largely visual approach to finding structure in data. It relies on the human eye and brain, which are good at detecting patterns. This becomes much harder as the dataset grows larger. EDA is an essential step before applying machine learning algorithms to any kind of prediction problem.
IRIS Dataset
It is a simple dataset covering three species of Iris flowers: Iris setosa, Iris virginica, and Iris versicolor. Four features are measured in centimeters for each flower: sepal length, sepal width, petal length, and petal width. The choice of features was a subjective decision by the domain expert. There are fifty samples per species, so the dataset has 150 records in total.
The objective of using this dataset is to classify a new flower, given its four features, into one of the three classes.
Whenever we do data analysis, we need to keep our primary objective in mind and choose the type of analysis or plot accordingly. In this case, the problem statement is: given a flower with four features, can we classify it into one of the three species?
Importing Data
This dataset can be downloaded directly from kaggle.com or other websites. The code below loads the dataset into a pandas DataFrame object, and we will use the same DataFrame for the analysis.
# Import the necessary libraries
# pandas library for data frames
import pandas as pd
# numpy library for numerical operations
import numpy as np
# matplotlib library for visualization
import matplotlib.pyplot as plt
# seaborn library for all kinds of plots
import seaborn as sns
import os

# Set the working directory
os.chdir("/notepub/eda/")

# Import the data.
# index_col=0 uses the first column of the file as the row index.
# na_values=["??", "????"] treats these strings as missing values (NaN).
iris_data = pd.read_csv("Iris.csv", index_col=0, na_values=["??", "????"])
It is assumed that we have downloaded the Iris dataset from the above-mentioned location and saved it on our computer at /notepub/eda/. Using the pandas read_csv function, we load it into a DataFrame object called iris_data.
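To make the two read_csv arguments concrete, here is a minimal sketch that parses a few rows from an in-memory string instead of the real file. The column names follow the Kaggle Iris.csv layout used in this note; the "??" cell is not in the real data and is inserted here purely as an assumption, to show how na_values behaves.

```python
import io
import pandas as pd

# A few rows in the Iris.csv layout. The "??" in row 2 is a made-up
# missing-value marker added for illustration only.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,??,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
"""

# index_col=0 turns the "Id" column into the row index;
# na_values converts the listed strings into NaN while parsing.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "????"])

print(sample.index.name)                    # name of the index column
print(sample["SepalWidthCm"].isna().sum())  # number of missing sepal widths
print(sample.dtypes)                        # column types after parsing
```

Because the "??" marker is converted to NaN during parsing, the SepalWidthCm column can still be read as a numeric column rather than as text.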
Data Exploration
Using the below code, we will analyze the dataset and its features.
iris_data = pd.read_csv("Iris.csv", index_col=0, na_values=["??", "????"])
iris_data
It has four features and the species class for each sample. For example, one sample has a sepal length of 5.1 cm, a sepal width of 3.5 cm, a petal length of 1.4 cm, and a petal width of 0.2 cm, and it belongs to the Iris-setosa class.
To see the shape of the dataset and the number of samples per species, we can use the following code snippet.
# Print the shape of the dataset (rows, columns)
print(iris_data.shape)
# Print the column names
print(iris_data.columns)
# Print the number of samples per species
print(iris_data["Species"].value_counts())
By looking at the per-class sample counts, we can determine whether the dataset is balanced. A balanced dataset has a more or less equal number of samples in each class, whereas in an imbalanced dataset the per-class counts differ widely. An example of an imbalanced dataset would be 100 Iris-virginica samples, 40 Iris-versicolor samples, and 10 Iris-setosa samples. With an imbalanced dataset, predictions tend to be biased toward the majority class and less accurate for the rare classes.
In the Iris dataset, each of the three classes has 50 samples, so we can confidently say that the dataset is balanced.
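The balance check can also be expressed as a single number. The sketch below is one possible way to do it, not a standard API: imbalance_ratio is a hypothetical helper, and the two label columns are built from the counts mentioned in the text rather than loaded from a file.

```python
import pandas as pd


def imbalance_ratio(labels: pd.Series) -> float:
    """Hypothetical helper: largest class count divided by the smallest.

    A value of 1.0 means every class has the same number of samples.
    """
    counts = labels.value_counts()
    return float(counts.max() / counts.min())


# Label columns built from the counts quoted in the text (not real data files).
balanced = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
)
imbalanced = pd.Series(
    ["Iris-virginica"] * 100 + ["Iris-versicolor"] * 40 + ["Iris-setosa"] * 10
)

print(balanced.value_counts(normalize=True))  # share of each class
print(imbalance_ratio(balanced))              # 1.0
print(imbalance_ratio(imbalanced))            # 10.0
```

How large a ratio counts as "imbalanced" depends on the problem; the helper only makes the comparison explicit.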
Bivariate Analysis
Bivariate analysis is one of the simplest ways of quantitative analysis. It involves the analysis of two variables for the purpose of determining the empirical relationship between them. It can be descriptive or inferential. However, in this note, we are focused on the descriptive analysis, primarily using graphical methods to visualize the characteristics of a dataset and the relationship among the variables using different visualization techniques.
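Alongside the plots, a descriptive bivariate summary can be as simple as a correlation coefficient. The following is a small sketch using the Pearson correlation from pandas on six Iris samples (two per species); it is an excerpt, so the numbers will differ from those computed on all 150 rows.

```python
import pandas as pd

# Two samples per species copied from the Iris data (an excerpt, not the full dataset).
excerpt = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "SepalWidthCm":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# Pearson correlation between one pair of variables.
r_petal = excerpt["PetalLengthCm"].corr(excerpt["PetalWidthCm"])

# Pairwise correlations between all numeric columns.
corr_matrix = excerpt.corr()

print(round(r_petal, 3))
print(corr_matrix.round(2))
```

A value close to 1 indicates that the two variables increase together, which is the kind of relationship the scatter plots in the next section show visually.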
Scatter Plot
A scatter plot is a 2D plot that represents the relationship between two variables. We will try different combinations of features and see which pair separates the species best in the 2D plane.
In this case (1), we have used sepal length and sepal width to see whether we can distinguish each species from others.
# Scatter plot using the seaborn library
plt.figure(figsize=(10, 3), dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='SepalLengthCm', y='SepalWidthCm', data=iris_data, hue='Species')
Exploration Result: Using sepal length and width features, we can partially distinguish Setosa flowers from others.
In this case (2), we have used petal length and petal width to see whether we can distinguish each species from others.
plt.figure(figsize=(10, 3), dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='PetalLengthCm', y='PetalWidthCm', data=iris_data, hue='Species')
Exploration Result: Using petal length and petal width features, we can clearly distinguish Setosa flowers from others.
In this case (3), we have used petal length and sepal length to see whether we can distinguish each species from others.
plt.figure(figsize=(10, 3), dpi=200)
sns.set_style("whitegrid")
sns.scatterplot(x='SepalLengthCm', y='PetalLengthCm', data=iris_data, hue='Species')
Exploration Result: Using petal length and sepal length features, we can clearly distinguish Setosa flowers from the others. However, separating Versicolor from Virginica is much more difficult because their points overlap, so these two species are not linearly separable in this plot.
Another alternative is to draw a 3D scatter plot using the Plotly library and analyze the relationship among three features at once.
import plotly.express as px

fig = px.scatter_3d(iris_data, x='SepalWidthCm', y='PetalWidthCm', z='PetalLengthCm', color='Species')
fig.show()
Exploration Result: Using sepal width, petal width, and petal length features, we can clearly distinguish Setosa flowers from the others. However, separating Versicolor from Virginica is still difficult in three dimensions, where the separating boundary would be a plane.
We have seen that different combinations of features produce different results. Plotting each combination individually is a tiresome job. Instead, we can use a pair plot to draw all the scatter plots together.
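To get a sense of how many individual scatter plots "each combination" means, the sketch below enumerates the unordered feature pairs with itertools. The column names are the ones used in this note; n_pairs is a hypothetical helper that simply applies the n(n-1)/2 formula.

```python
from itertools import combinations

features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Every unordered pair of distinct features corresponds to one 2D scatter plot.
pairs = list(combinations(features, 2))
for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
print(len(pairs))  # 6


def n_pairs(n_features: int) -> int:
    """Hypothetical helper: number of unordered feature pairs, n(n-1)/2."""
    return n_features * (n_features - 1) // 2


print(n_pairs(20))  # 190
```

Six plots are manageable for four features, but the count grows quadratically, which is why plotting pairs one at a time does not scale.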
Pair plots
A pair plot generates a scatter plot for every pair of features. However, it becomes impractical when the number of features is large. With only four features, as in this case, we can generate all the pairwise scatter plots easily using the code below.
plt.close()
sns.set_style("whitegrid")
sns.pairplot(iris_data, hue="Species", height=3, diag_kind="kde")
plt.show()
Exploration Result: Petal length and petal width are the most useful features for identifying and differentiating the flower species. In most of the scatter plots, Setosa is linearly separable, while the other two species are somewhat harder to separate.
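As a rough illustration of what "linearly separable" means for Setosa, the sketch below applies a single cut-off on petal length to six Iris samples (two per species). The 2.5 cm threshold is a hypothetical value picked by eye from the plots, and the check only covers this small excerpt, so it is an illustration rather than a validated classifier.

```python
import pandas as pd

# Two samples per species copied from the Iris data (an excerpt, not the full dataset).
excerpt = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# Hypothetical cut-off chosen by eye from the scatter plots.
THRESHOLD_CM = 2.5

# Rule: a flower with a petal shorter than the threshold is predicted to be Setosa.
predicted_setosa = excerpt["PetalLengthCm"] < THRESHOLD_CM
actual_setosa = excerpt["Species"] == "Iris-setosa"

# True when the rule agrees with the labels on every row of the excerpt.
print((predicted_setosa == actual_setosa).all())
```

No comparable single threshold separates Versicolor from Virginica, which matches what the pair plot suggests.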
Limitation of Scatter Plots
We have seen 2D scatter plots, 3D scatter plots, and pair plots, and almost all of them share a limitation: with a large number of data points, these plots reveal little structure because most of the markers are concealed by overprinting, which can be significant for multimodal data. To overcome some of these problems, we can use contour density plots, trellis plots, and other techniques.
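The overprinting problem and one possible remedy can be sketched as follows. The data here is synthetic (two Gaussian clusters generated with NumPy, not the Iris data), and the density is estimated with a plain 2D histogram, which is a simpler stand-in for a kernel density estimate.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic bimodal data, generated only to illustrate overprinting.
rng = np.random.default_rng(0)
n_points = 5000
half = n_points // 2
x = np.concatenate([rng.normal(0.0, 1.0, half), rng.normal(4.0, 1.0, half)])
y = np.concatenate([rng.normal(0.0, 1.0, half), rng.normal(3.0, 1.0, half)])

# Estimate the density by counting points in a 30 x 30 grid of bins.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=30)
x_centers = (x_edges[:-1] + x_edges[1:]) / 2
y_centers = (y_edges[:-1] + y_edges[1:]) / 2

fig, (ax_scatter, ax_contour) = plt.subplots(1, 2, figsize=(10, 4))

# Left: plain scatter plot, where dense regions merge into a solid blob.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_title("Scatter plot (overprinted)")

# Right: filled contours of the bin counts.
# histogram2d indexes counts as [x_bin, y_bin], while contourf expects
# rows to follow y and columns to follow x, hence the transpose.
ax_contour.contourf(x_centers, y_centers, counts.T, levels=10)
ax_contour.set_title("Contour density plot")

fig.tight_layout()
```

Calling plt.show() afterwards displays the figure in an interactive session. In the contour panel the two modes remain visible even though individual markers in the scatter panel overlap.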
Univariate Analysis
Univariate analysis examines a single variable to find the parameters that help us understand it. These parameters are mainly the central tendency and the spread. To measure central tendency, we can use the mean, median, or mode, choosing the appropriate one once we understand the data. There is a wide range of tools and techniques to analyze a variable, such as frequency distributions, class intervals, histograms, kernel density estimates, and box-and-whisker plots. In the subsequent sections, we will get familiar with various methods to study the statistics of a variable.
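The central tendency and spread measures mentioned above can be computed directly with pandas. This is a minimal sketch on the petal lengths of the first ten Iris-setosa samples; it is an excerpt chosen to keep the example self-contained, so the values are not the statistics of the full dataset.

```python
import pandas as pd

# Petal lengths (cm) of the first ten Iris-setosa samples (an excerpt of the data).
petal_length = pd.Series([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# Central tendency
mean_value = petal_length.mean()
median_value = petal_length.median()
mode_value = petal_length.mode().iloc[0]  # mode() returns all modes; take the first

# Spread
std_value = petal_length.std()  # sample standard deviation (ddof=1)
iqr_value = petal_length.quantile(0.75) - petal_length.quantile(0.25)
value_range = petal_length.max() - petal_length.min()

print("mean:", round(mean_value, 3))
print("median:", median_value)
print("mode:", mode_value)
print("std:", round(std_value, 3))
print("IQR:", round(iqr_value, 3))
print("range:", round(value_range, 3))
```

The mean is sensitive to extreme values while the median is not, which is one reason to look at the data before deciding which measure to report.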
1D Scatter Plot
A 1D scatter plot is used when the data to visualize is one-dimensional. The example code below draws a 1D scatter plot.
# Plots using 1D scatter plot
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset from the seaborn library
iris_data = sns.load_dataset('iris')

# Split the data by species: setosa, versicolor and virginica
iris_setosa = iris_data.loc[iris_data['species'] == 'setosa']
iris_versicolor = iris_data.loc[iris_data['species'] == 'versicolor']
iris_virginica = iris_data.loc[iris_data['species'] == 'virginica']

plt.figure(figsize=(10, 2), dpi=200)

# Plot the petal length of each species along a single axis.
# np.zeros_like() returns an array of zeros with the same shape as its input.
# A scatter plot normally needs two variables, so here the second variable is all zeros.
# The 'o' format string draws unconnected markers instead of a line.
plt.plot(iris_setosa['petal_length'], np.zeros_like(iris_setosa['petal_length']), 'o', label='Setosa')
plt.plot(iris_versicolor['petal_length'], np.zeros_like(iris_versicolor['petal_length']), 'o', label='Versicolor')
plt.plot(iris_virginica['petal_length'], np.zeros_like(iris_virginica['petal_length']), 'o', label='Virginica')

# Plot customization
plt.legend()
plt.title('1D scatter plot based on different Iris Species')
plt.xlabel("Petal Length")
plt.show()
Conclusion: Scatter plots are suitable for finding the relationship between two variables. However, plotting a single variable along one axis for all the species together shows how densely its values are distributed. From the plot, we can conclude that the petal length of Setosa is roughly 1 to 2 cm, Versicolor roughly 3 to 5 cm, and Virginica roughly 4.5 to 7 cm.
The significant disadvantage of a 1D scatter plot is that most of the data points overlap, making the plot harder to analyze. This raises the question: is there a better way to do univariate analysis?
CITE THIS AS:
“Python for Data Science – Exploratory Data Analysis – IRIS Dataset” From NotePub.io – Publish & Share Note! https://notepub.io/notes/programming-languages/python-for-data-science/python-for-data-science-exploratory-data-analysis-iris-dataset/