Exploratory Data Analysis (EDA) is an approach to data analysis that uses a variety of techniques to gain maximum insight into a data set, uncover its underlying structure, extract important parameters, and detect outliers and anomalies. Most EDA techniques are graphical, with a few quantitative ones. Graphics and plots are preferred because a data set is usually easier to explore and understand visually than through raw numbers.
In this note, we will look at frequency tables for the categories of a categorical variable, the relationship between categorical variables using two-way tables, and the conversion of a two-way table into joint, marginal, and conditional probabilities. We will also check the relationship between two numerical variables using a measure called correlation.
We will analyze the Toyota data set, which can be downloaded from Kaggle.com, using various Exploratory Data Analysis (EDA) techniques. The first task is to download the dataset and load it into a pandas data frame. The code below does this. The entire Jupyter notebook can be downloaded or executed directly on kaggle.com.
# OS library to change the working directory.
import os

# pandas library to work with data frames.
import pandas as pd

# Change directory to the location where your Toyota.csv dataset is stored.
# Download dataset from https://www.kaggle.com/klkwak/toyotacorollacsv
os.chdir("/notepub/eda/")

# The data is a CSV file, so we read it with the pandas CSV reader.
# index_col=0 uses the first column of the file as the row index.
# na_values=["??", "????"] treats these placeholder strings, which mark
# missing values in this dataset, as NaN.
cars_data = pd.read_csv("Toyota.csv", index_col=0, na_values=["??", "????"])

# To view the dataset, evaluate the data frame in a notebook cell.
cars_data
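To see what index_col and na_values do without downloading the file, here is a minimal sketch that reads a tiny in-memory CSV. The three rows and the name sample_csv are made up for illustration and are not taken from the Toyota dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
sample_csv = io.StringIO(
    ",Price,KM,FuelType\n"
    "0,13500,46986,Diesel\n"
    "1,13750,??,Petrol\n"
    "2,13950,41711,????\n"
)

# Same options as above: first column becomes the index,
# and the "??" / "????" placeholders are read as NaN.
sample = pd.read_csv(sample_csv, index_col=0, na_values=["??", "????"])

# Count the missing values per column after the placeholders are converted.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Because the placeholder in KM becomes NaN, pandas can read that column as numeric rather than as text.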
The dataset contains both numerical and categorical variables, including Price, Age, Kilometer (KM), Fuel Type, Horse Power (HP), Gear Type (Manual or Automatic), CC, Doors, and Weight of the cars.
Next, we create a copy of the original data. All modifications should be made in the copied data frame so that we can later cross-verify against the original. The copy is stored in a new object called cars_data_ins.
cars_data_ins = cars_data.copy()
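As a quick check of why the copy matters, the sketch below modifies a copy and confirms the original is untouched. The two-row data frame and the names original and working are hypothetical; DataFrame.copy() makes a deep copy by default.

```python
import pandas as pd

# Hypothetical two-row frame standing in for cars_data.
original = pd.DataFrame(
    {"Price": [13500, 13750], "FuelType": ["Diesel", "Petrol"]}
)

# copy() defaults to deep=True, so the data is duplicated, not shared.
working = original.copy()

# Change a value in the copy only.
working.loc[0, "Price"] = 0

print(original.loc[0, "Price"], working.loc[0, "Price"])
```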
Frequency Tables
Frequency tables show the relationship between variables in terms of counts. The data frame has several variables, and we cannot check the relationship between all of them at once. In this note, we will use cross-tabulation to analyze the relationship between categorical variables.
For these analyses, we will use pandas.crosstab() to compute a simple cross-tabulation of one, two, or more factors. By default, it computes a frequency table of the factors.
Frequency Table: One-way Table
In this example, we consider only one categorical variable, Fuel type, and get the frequency of each of its categories. The code below generates the frequency table for Fuel type. We pass dropna=True (the default), so records with a missing Fuel type are not counted.
pd.crosstab(index=cars_data_ins['FuelType'],columns='Count',dropna=True)
Observations:
There are three categories of Fuel type: CNG, Diesel, and Petrol. The output shows that only 15 cars run on CNG and 144 on Diesel, while the remaining cars run on Petrol. So, although there are three fuel types, most cars in the dataset are petrol cars.
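The one-way table can be reproduced on a small example. The sketch below uses a made-up six-row column (one value missing) and compares pd.crosstab with Series.value_counts(), which also skips missing values by default. The name toy and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical fuel types; the last record is missing.
toy = pd.DataFrame(
    {"FuelType": ["Petrol", "Petrol", "Diesel", "CNG", "Petrol", None]}
)

# One-way frequency table, as in the text.
one_way = pd.crosstab(index=toy["FuelType"], columns="Count", dropna=True)

# value_counts() gives the same counts as a Series.
counts = toy["FuelType"].value_counts()

print(one_way)
print(counts)
```

Both results count five records in total, because the missing entry is left out.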
Frequency Table: Two-way tables
When we want to check the relationship between two categorical variables, we can go for two-way tables. In this case, we will look at the frequency distribution of gearbox types with respect to different fuel types of cars.
In the dataset, the gearbox type is represented by a categorical variable called Automatic, which takes only two values: zero indicates a manual gearbox and one indicates an automatic gearbox. We will use pandas crosstab to create a two-way table. With dropna=True, only records that have no missing value in either the Automatic or the Fuel Type column are counted.
pd.crosstab(index = cars_data_ins['Automatic'], columns = cars_data_ins['FuelType'],dropna = True)
Observations:
The output shows that only petrol cars come with both manual and automatic gearboxes, whereas CNG and diesel cars have only manual gearboxes. This is an interesting relationship between the Automatic and Fuel Type variables.
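To make the reading of a two-way table concrete, here is a small sketch on eight made-up records. The data frame toy is hypothetical and only mimics the pattern described above, where automatic gearboxes appear for petrol cars only.

```python
import pandas as pd

# Hypothetical records: 0 = manual, 1 = automatic.
toy = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 1, 0, 0],
        "FuelType": [
            "Petrol", "Diesel", "Petrol", "Petrol",
            "CNG", "Petrol", "Petrol", "Diesel",
        ],
    }
)

# Rows are gearbox types, columns are fuel types, cells are counts.
two_way = pd.crosstab(index=toy["Automatic"], columns=toy["FuelType"])

# A zero cell means that combination never occurs in the data.
absent = two_way == 0

print(two_way)
print(absent)
```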
Finding Probability using Two-Way Tables
We will now see how to calculate joint, marginal, and conditional probabilities from a cross-tabulation and use them to analyze the relationship between categorical variables.
Joint Probability using Two-Way Table
The joint probability is the likelihood of two events happening at the same time. The events do not need to be independent. To obtain it, we convert the counts in the table into proportions of the total.
pd.crosstab(index = cars_data_ins['FuelType'], columns = cars_data_ins['Automatic'], normalize = True, dropna = True)
To find the joint probability, we need to add one more parameter, normalize = True, in the crosstab function. This parameter tells the crosstab function to convert all the table values from counts to proportions.
Observations:
We can make the following interpretations for the Fuel type and Gearbox type variables.
- The joint probability that a car has a manual gearbox and CNG Fuel type is 0.01.
- The joint probability that a car has an automatic gearbox and petrol Fuel type is about 0.05.
- The joint probability that a car has a manual gearbox and petrol Fuel type is 0.82.
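The effect of normalize = True can be checked by hand: each joint probability is a cell count divided by the grand total, so the whole table sums to one. The sketch below uses a hypothetical eight-row data frame named toy, not the Toyota data.

```python
import pandas as pd

# Hypothetical records: 0 = manual, 1 = automatic.
toy = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 1, 0, 0],
        "FuelType": [
            "Petrol", "Diesel", "Petrol", "Petrol",
            "CNG", "Petrol", "Petrol", "Diesel",
        ],
    }
)

# Raw counts and the normalized (joint probability) table.
counts = pd.crosstab(toy["FuelType"], toy["Automatic"])
joint = pd.crosstab(toy["FuelType"], toy["Automatic"], normalize=True)

# The same table computed by hand: cell count / grand total.
by_hand = counts / counts.to_numpy().sum()

print(joint)
print("largest difference:", (joint - by_hand).abs().to_numpy().max())
```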
Marginal Probability using Two-Way Table
The marginal probability is the probability of the occurrence of a single event.
pd.crosstab(index = cars_data_ins['FuelType'], columns = cars_data_ins['Automatic'], margins=True, normalize=True, dropna = True)
To find the marginal probability, we need to add two parameters, margins=True and normalize = True, in the crosstab function. These parameters tell the crosstab function to convert all the table values from counts to proportions and to generate row sums and column sums for the table values.
Observations:
- The marginal probability of a car having a manual gearbox (0), regardless of whether the fuel type is CNG, Diesel, or Petrol, is 0.94.
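A marginal probability is simply the sum of the joint probabilities along one row or one column. The sketch below, on a hypothetical eight-row data frame named toy, compares the "All" entries produced by margins=True with sums computed directly from the joint table.

```python
import pandas as pd

# Hypothetical records: 0 = manual, 1 = automatic.
toy = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 1, 0, 0],
        "FuelType": [
            "Petrol", "Diesel", "Petrol", "Petrol",
            "CNG", "Petrol", "Petrol", "Diesel",
        ],
    }
)

# Joint probabilities without margins.
joint = pd.crosstab(toy["FuelType"], toy["Automatic"], normalize=True)

# Marginals by hand: sum over fuel types, and sum over gearbox types.
p_gearbox = joint.sum(axis=0)
p_fuel = joint.sum(axis=1)

# The same marginals appear in the "All" row and column.
with_margins = pd.crosstab(
    toy["FuelType"], toy["Automatic"], margins=True, normalize=True
)

print(with_margins)
print(p_gearbox)
print(p_fuel)
```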
Conditional Probability using Two-Way Table
The conditional probability is the probability of an event (A), given that another event (B) has already occurred. As an example, given the type of gearbox, what is the probability of different fuel types?
pd.crosstab(index = cars_data_ins['Automatic'], columns = cars_data_ins['FuelType'], margins=True, normalize='index', dropna = True)
To find the conditional probability, we need to add two parameters, margins=True and normalize = 'index', in the crosstab function. With normalize = 'index', each cell is divided by its row total, so every row gives the probabilities of the column categories given that row's category.
In the cross-tabulation of the Automatic and Fuel Type variables, all the table values are now probabilities, and because normalize is set to 'index', each row sums to one.
Observations:
- Given that the gearbox type is manual, the probability of a CNG Fuel type is 0.011, of a Diesel Fuel type is 0.11, and of a Petrol Fuel type is 0.87. The takeaway is that the probability of petrol is much higher than that of the other fuel types: a car with a manual gearbox is most likely a petrol car.
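The conditional probabilities can be checked by hand: divide each count by its row total, which is the same as dividing the joint probability by the marginal probability of the row. The sketch below uses a hypothetical eight-row data frame named toy.

```python
import pandas as pd

# Hypothetical records: 0 = manual, 1 = automatic.
toy = pd.DataFrame(
    {
        "Automatic": [0, 0, 0, 1, 0, 1, 0, 0],
        "FuelType": [
            "Petrol", "Diesel", "Petrol", "Petrol",
            "CNG", "Petrol", "Petrol", "Diesel",
        ],
    }
)

counts = pd.crosstab(toy["Automatic"], toy["FuelType"])

# P(FuelType | Automatic): each row is normalized to sum to one.
conditional = pd.crosstab(
    toy["Automatic"], toy["FuelType"], normalize="index"
)

# The same table by hand: cell count / row total.
by_hand = counts.div(counts.sum(axis=1), axis=0)

print(conditional)
print(conditional.sum(axis=1))
```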
Correlation
So far, we have looked at the relationship between two categorical variables using cross-tabulation. Now we will look at the relationship between two numerical variables using a measure called correlation, which expresses the strength of association between two variables. Correlation is not limited to numerical variables, but we will analyze only numerical variables here.
We will understand it using scatter plots:
- In the first plot, there is a positive trend: as one variable increases, the other also increases.
- In the second plot, there is a negative trend: as one variable increases, the other decreases.
- In the third plot, there is no pattern, so we can say there is little or no correlation.
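The three patterns above can also be seen in numbers. The sketch below generates synthetic data with numpy (the variable names, the seed, and the noise level are arbitrary choices for illustration) and computes the correlation of each series with x.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)

trends = pd.DataFrame(
    {
        "x": x,
        "positive": 2 * x + noise,           # rises with x
        "negative": -2 * x + noise,          # falls as x rises
        "unrelated": rng.normal(size=500),   # generated independently of x
    }
)

# Pearson correlation of every column with x.
r_with_x = trends.corr()["x"]
print(r_with_x.round(2))
```

The positive and negative series give values close to +1 and -1, while the unrelated series gives a value close to zero.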
To compute the correlation in Python, the pandas library provides DataFrame.corr(self,method='pearson') to compute the pairwise correlation of columns while excluding NaN values. We use it because we are not going to consider only two variables. Rather, we will consider all the variables at once, and the function computes the correlation for every pair.
By default, it uses the Pearson method to compute the correlation, which checks the strength of the linear association between two numerical variables. If we have ordinal variables, we can instead use rank-based measures such as Kendall’s rank correlation and Spearman’s rank correlation.
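The difference between the Pearson and rank-based methods shows up when a relationship is monotonic but not linear. The sketch below is a made-up example (y is the cube of x) that compares the two through the method parameter of DataFrame.corr.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a line.
curve = pd.DataFrame({"x": range(1, 11)})
curve["y"] = curve["x"] ** 3

# Pearson measures linear association.
pearson_r = curve.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores 1.
spearman_r = curve.corr(method="spearman").loc["x", "y"]

print("Pearson:", round(pearson_r, 3))
print("Spearman:", round(spearman_r, 3))
```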
The Pearson method does not work on categorical data, so we need to exclude all the categorical data. This is done using the following code.
num_data = cars_data_ins.select_dtypes(exclude=[object])
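An alternative sketch, in case the text columns are not stored with the object dtype in your pandas version, is to select the numeric columns directly with include="number". The data frame mixed below is hypothetical.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column.
mixed = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950],
        "Age": [23.0, 23.0, 24.0],
        "FuelType": ["Diesel", "Petrol", "Petrol"],
    }
)

# Keep only integer and float columns.
numeric_only = mixed.select_dtypes(include="number")

print(list(numeric_only.columns))
```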
Now we will calculate the correlation between numerical variables using the following code.
corr_mat = num_data.corr()
corr_mat
Observations:
- The principal diagonal values are one because each variable is compared with itself. For example, the correlation between Price and Price is one.
- Some values are negative with a magnitude above 0.5. This indicates a strong negative correlation between pairs of variables such as Price and Age, and Price and KM. As the age of a car increases, its price decreases. The same holds for Price and KM, though the correlation is somewhat weaker.
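To see where the values in the correlation matrix come from, the sketch below computes the Pearson coefficient for one pair by hand, as the sum of products of deviations from the mean divided by the square root of the product of the summed squared deviations, and compares it with DataFrame.corr(). The five rows are made up and are not taken from the Toyota dataset.

```python
import pandas as pd

# Hypothetical numeric data: older cars are cheaper in this toy sample.
num = pd.DataFrame(
    {
        "Age": [23, 30, 41, 55, 68],
        "Price": [13500, 12950, 11750, 9950, 8250],
        "KM": [46986, 61000, 38500, 75889, 94612],
    }
)

corr_mat = num.corr()

# Pearson r for Age and Price, computed from deviations from the mean.
dx = num["Age"] - num["Age"].mean()
dy = num["Price"] - num["Price"].mean()
r_by_hand = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(corr_mat.round(3))
print("by hand:", round(r_by_hand, 3))
```

The hand-computed value matches the Age and Price entry of the matrix, and the diagonal entries are one.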
References
- NPTEL lectures on Introduction to Python for Data Science, IIT Madras.