We have seen that few places data were missing from the dataset during the exploratory data analysis and visualization, and we removed those rows from our analysis. But on the actual data, when the data size is limited, we need to find a way to fill the missing data in the dataset.
This note will identify missing values and the different approaches or techniques to fill the missing values before data visualization and analysis.
Importing data
The most important task is downloading the dataset and loading it into the python object data frame. The below code does the same task. The entire Jupyter notebook can be download or directly execute from kaggle.com.
# import the necessary libraries # Pandas library for data frames import pandas as pd # numpy library to do numerical operations import numpy as np # matplotlib library to do visualization import matplotlib.pyplot as plt import os # Set the working directory os.chdir("/notepub/eda/") # Importing data. # index_col = 0 means, Set index 0 as first column. # na_values = ["??","????"] replace missing values with NaN values. cars_data = pd.read_csv("Toyota.csv",index_col=0,na_values=["??","????"])
Identifying missing values
In panda dataframes, missing data is represented by NaN (an acronym for Not a Number). So whenever there is a blank cell in the CSV file (Dataset), the panda data frame function automatically reads it as NaN value. All the NaN values will be treated as missing values in the panda dataframes. To check null values in Pandas Dataframes, isnull()
and isna()
are used.
These functions return dataframes of boolean values, which are True for NaN values and False for other values. This is shown in the below diagram.
print(cars_data.isnull())
To count missing values present in the entire dataset, we can use any of these below functions. These functions have a limitation that it only tells how many missing values exist but does not specify the locations such as row number and all.
cars_data.isna().sum() cars_data.isnull().sum()
While analyzing the missing value report, we can see few column variables whose few of the values are missing. For example, Age has 100 missing values, and KM has 15 missing values, and so on. Now we need to analyze further and take corrective actions. There are two types of corrective actions. First, we can get rid of the entire row and the second approach is to fill up those missing values logically.
Suppose multiple data are missing from the same row. It means, from the dataset, if multiple variables values are missing from the same row, then, in that case, there is no logic to fill up the missing values, and the simple approach would be to discard the entire row. However, if data are randomly missing from the entire dataset, we should look at an approach to fill in those missing values.
Identifying missing value row-wise
Now our goal is to find all the rows that have one or more missing values. This can be done using the below code snippet.
# Create backup of original dataframe bk_cars_data = cars_data.copy() # Find the missing values row-wise # cars_data.isnull(): Find NaN values # any(axis=1): Atleast one column value is missing from a particular row # Store those missing value into the missing_cars_data frame. missing_cars_data = cars_data[cars_data.isnull().any(axis=1)] # Print those values missing_cars_data
As a summary, we can see 340 missing values out of 1436 from the various rows in the dataset. This we can see using the below code snippet.
# Check how many missing values are there print(cars_data.shape) print(missing_cars_data.shape) # There are 340 rows with 10 columns and atleast one column values are missing.
Approaches to fill the missing values
There are several ways or approaches to fill the missing values. However, we will learn two ways to fill the missing values for variables type in this note. For numerical variables, missing values are filled based on mean or median, and this is one of the simplest ways to fill the missing values for numerical variables. For categorical variables, missing values are filled based on the class value, which has maximum counts.
Numerical Variables: Imputing missing values
Look at the description to know whether numerical variables should be imputed with mean or median. If there are extreme values, they may cause the mean value to be very high or very low from the actual mean value. So, in that case, we should always go for the median.
To calculate the median, we sort all the values by ascending or descending orders and then take exactly the middle value when the total observation count is odd. However, in the case of even count, we take the middle of two numbers average.
These decisions are made when we have the distribution of the variables. These can be generated using the below code snippet. It mainly generates descriptive statistics summarizing the central tendency, dispersion, and shape of data sets distribution while excluding NaN values from the variables.
# Generation of descriptive statistics cars_data.describe()
Count: It tells the total number of values that exist for a variable, excluding missing values. For example, Price has 1436 values whereas Age has 1336, which means it has 100 missing values, and similarly, we can compare to other variables.
Mean: It is an average mean value, and it is calculated by summing up all the values and divide by the total number of values counts. For example, the mean value of Price is 10730 euros, and similarly, we can analyze for other variables.
Standard Deviation (std): It represents average dispersion from the mean value. A higher value indicates variable values are more dispersed from the mean, which may be because of more outliers. And, lower the value indicates values are more packed, and there are fewer outliers.
Min, 25%, 50%, 75% and Max values are the five-number summary. This we have seen while studying box and whiskers plots. 50% percentage is the median value, and dispersion is calculated based on the median values rather than the mean values.
An outlier can be identified in the data by calculating Inter Quartile Range (IQR). Any data having a value greater than 1.5 or less than 1.5 IQR is treated as an outlier. The presence of an outlier will drag the mean of the sample/population towards it, and in that case, it is not the best measure for describing the sample /population.
Median overcomes this problem by taking the positional middle value when data is in sorted order. However, if no outlier exists, then Mean should be the preferred measure of central tendency as mean represents the data set better as it includes every value, where else median lacks the representatives of data as it only selects the positional middle value.
Imputing missing values of Age
After analyzing the descriptive statistics of Age variable, we considered the mean value to be imputed with missing values as there is not much difference between the mean and the median values.
Imputing missing values of KM
After analyzing the descriptive statistics of the KM variable, we considered the median value to be imputed with missing values as there a huge difference between mean and median because of outliers.
Imputing missing values of HP
After analyzing the descriptive statistics of the HP variable, we considered the mean value to be imputed with missing values as there is not much difference between the mean and the median values.
# To calculate the mean value of Age cars_data['Age'].mean() # Impute NA/NaN values with mean value in Age Variable cars_data['Age'].fillna(cars_data['Age'].mean(),inplace=True) # To calculate the median value of KM cars_data['KM'].median() # Impute NA/NaN values with mean value in Age Variable cars_data['KM'].fillna(cars_data['KM'].median(),inplace=True) # To calculate the mean value of HP cars_data['HP'].mean() # As mean and median is not so far so in this case we will use mean value cars_data['HP'].fillna(cars_data['HP'].mean(),inplace=True) # To generate descriptive statistics cars_data.describe()
In the above diagram, we can see that it shows count values of Price, KM, and HP as 1436 as we have imputed missing values with respective mean or median depend upon the analysis.
Categorical Variables: Imputing missing values
To impute the missing values of categorical type, we will use class maximum count values using the function serier.value_count()
. It returns a series containing counts of unique values. The values will be in descending order so that the first element is the most frequently occurring element while excluding NA/NaN values by default. So for that, we need to understand what are the categories replacing that with the missing values.
Imputing missing values of FuelType
# Imputing categorical missing values cars_data['FuelType'].value_counts() # As it displays result in decresing order # we can directly use index values cars_data['FuelType'].value_counts().index[0] # Impute NA/NaN values with higher count value of Fuel Type Class cars_data['FuelType'].fillna(cars_data['FuelType'].value_counts().index[0], inplace=True) # Check the missing data after filling values cars_data.isnull().sum()
Imputing missing values of MetColor
We will impute all the missing values of MetColor with a model value of the variable MetColoer. It is basically the metallic color of the car, and it has two values, zero represents the non-metallic color, and one represents the car has a metallic color. There are several ways we can fill the model value. In this case, we will use the Mode value of the categorical variable.
# Imputing categorical missing values cars_data['MetColor'].mode() # As it displays result in decresing order # we can directly use index values cars_data['MetColor'].mode()[0] # Impute NA/NaN values with higher count value of Fuel Type Class cars_data['MetColor'].fillna(cars_data['MetColor'].mode()[0], inplace=True) # Check the missing data after filling values cars_data.isnull().sum()
Imputing missing values using Lambda Functions
To fill the NA/NaN value in both numerical and categorical variables at one stretch, we will use a lambda function. This function can be used whenever we want to perform any operations column-wise or row-wise, and even it can be used for any other cases. However, in this case, we will use this function to access or manipulate data column-wise.
Whatever logic we are writing inside the apply function will be applied across all the columns. The lambda function is an anonymous and powerful function. Here x is the argument of an anonymous function. It has two conditions; if values are float, then consider that column as a numerical variable. Otherwise, consider it as a categorical variable and fill the data accordingly.
# Imputing missing values using lambda functions # To fill the NA/NaN values in both numerical and categorical variables at one stretch bk1_cars_data = bk_cars_data.copy() bk1_cars_data = bk1_cars_data.apply(lambda x:x.fillna(x.mean()) if x.dtype =='float' else x.fillna(x.value_counts().index[0])) bk1_cars_data.isnull().sum()
We have filled all the missing data in this dataset, and it became a working dataset, and we can do all sorts of analysis to get some meaningful information.
References
- NPTEL lectures on Introduction to Python for Data Science, IIT Madras.
651 total views, 1 views today