Python for Data Science – Dealing with missing data – Toyota Dataset

We have seen that few places data were missing from the dataset during the exploratory data analysis and visualization, and we removed those rows from our analysis. But on the actual data, when the data size is limited, we need to find a way to fill the missing data in the dataset.

This note will identify missing values and the different approaches or techniques to fill the missing values before data visualization and analysis. 

Importing data

The most important task is downloading the dataset and loading it into the python object data frame. The below code does the same task. The entire Jupyter notebook can be download or directly execute from kaggle.com.

# import the necessary libraries

# Pandas library for data frames
import pandas as pd

# numpy library to do numerical operations
import numpy as np

# matplotlib library to do visualization
import matplotlib.pyplot as plt

import os

# Set the working directory
os.chdir("/notepub/eda/")

# Importing data.
# index_col = 0 means, Set index 0 as first column.
# na_values = ["??","????"] replace missing values with NaN values.

cars_data = pd.read_csv("Toyota.csv",index_col=0,na_values=["??","????"])
Python for Data Science - EDA - Toyota Dataset
Python for Data Science – EDA – Toyota Dataset

Identifying missing values

In panda dataframes, missing data is represented by NaN (an acronym for Not a Number). So whenever there is a blank cell in the CSV file (Dataset), the panda data frame function automatically reads it as NaN value. All the NaN values will be treated as missing values in the panda dataframes. To check null values in Pandas Dataframes, isnull() and isna() are used.

These functions return dataframes of boolean values, which are True for NaN values and False for other values. This is shown in the below diagram.

print(cars_data.isnull())  
Identifying missing values
Identifying missing values

To count missing values present in the entire dataset, we can use any of these below functions. These functions have a limitation that it only tells how many missing values exist but does not specify the locations such as row number and all. 

cars_data.isna().sum()
cars_data.isnull().sum()
Missing Value Count per variable wise
Missing Value Count per variable wise

While analyzing the missing value report, we can see few column variables whose few of the values are missing. For example, Age has 100 missing values, and KM has 15 missing values, and so on. Now we need to analyze further and take corrective actions. There are two types of corrective actions. First, we can get rid of the entire row and the second approach is to fill up those missing values logically.

Suppose multiple data are missing from the same row. It means, from the dataset, if multiple variables values are missing from the same row, then, in that case, there is no logic to fill up the missing values, and the simple approach would be to discard the entire row. However, if data are randomly missing from the entire dataset, we should look at an approach to fill in those missing values. 

Identifying missing value row-wise

Now our goal is to find all the rows that have one or more missing values. This can be done using the below code snippet.

# Create backup of original dataframe
bk_cars_data = cars_data.copy()

# Find the missing values row-wise
# cars_data.isnull(): Find NaN values
# any(axis=1): Atleast one column value is missing from a particular row
# Store those missing value into the missing_cars_data frame.
missing_cars_data = cars_data[cars_data.isnull().any(axis=1)]


# Print those values
missing_cars_data
Rows that have one or more missing values
Rows that have one or more missing values

As a summary, we can see 340 missing values out of 1436 from the various rows in the dataset. This we can see using the below code snippet. 

# Check how many missing values are there

print(cars_data.shape)
print(missing_cars_data.shape)

# There are 340 rows with 10 columns and atleast one column values are missing.

Approaches to fill the missing values

There are several ways or approaches to fill the missing values. However, we will learn two ways to fill the missing values for variables type in this note. For numerical variables, missing values are filled based on mean or median, and this is one of the simplest ways to fill the missing values for numerical variables. For categorical variables, missing values are filled based on the class value, which has maximum counts.

Numerical Variables: Imputing missing values

Look at the description to know whether numerical variables should be imputed with mean or median. If there are extreme values, they may cause the mean value to be very high or very low from the actual mean value. So, in that case, we should always go for the median.

To calculate the median, we sort all the values by ascending or descending orders and then take exactly the middle value when the total observation count is odd. However, in the case of even count, we take the middle of two numbers average.

These decisions are made when we have the distribution of the variables. These can be generated using the below code snippet. It mainly generates descriptive statistics summarizing the central tendency, dispersion, and shape of data sets distribution while excluding NaN values from the variables.

# Generation of descriptive statistics
cars_data.describe()
Descriptive statistics summarizing the central tendency, dispersion, and shape of data sets distribution
Descriptive statistics summarizing the central tendency, dispersion, and shape of data sets distribution

Count: It tells the total number of values that exist for a variable, excluding missing values. For example, Price has 1436 values whereas Age has 1336, which means it has 100 missing values, and similarly, we can compare to other variables.

Mean: It is an average mean value, and it is calculated by summing up all the values and divide by the total number of values counts. For example, the mean value of Price is 10730 euros, and similarly, we can analyze for other variables.

Standard Deviation (std): It represents average dispersion from the mean value. A higher value indicates variable values are more dispersed from the mean, which may be because of more outliers. And, lower the value indicates values are more packed, and there are fewer outliers.

Min, 25%, 50%, 75% and Max values are the five-number summary. This we have seen while studying box and whiskers plots. 50% percentage is the median value, and dispersion is calculated based on the median values rather than the mean values.

An outlier can be identified in the data by calculating Inter Quartile Range (IQR). Any data having a value greater than 1.5 or less than 1.5 IQR is treated as an outlier. The presence of an outlier will drag the mean of the sample/population towards it, and in that case, it is not the best measure for describing the sample /population.

Median overcomes this problem by taking the positional middle value when data is in sorted order. However, if no outlier exists, then Mean should be the preferred measure of central tendency as mean represents the data set better as it includes every value, where else median lacks the representatives of data as it only selects the positional middle value.

Imputing missing values of Age

After analyzing the descriptive statistics of Age variable, we considered the mean value to be imputed with missing values as there is not much difference between the mean and the median values. 

Imputing missing values of KM

After analyzing the descriptive statistics of the KM variable, we considered the median value to be imputed with missing values as there a huge difference between mean and median because of outliers.

Imputing missing values of HP

After analyzing the descriptive statistics of the HP variable, we considered the mean value to be imputed with missing values as there is not much difference between the mean and the median values. 

# To calculate the mean value of Age
cars_data['Age'].mean()

# Impute NA/NaN values with mean value in Age Variable
cars_data['Age'].fillna(cars_data['Age'].mean(),inplace=True)

# To calculate the median value of KM
cars_data['KM'].median()

# Impute NA/NaN values with mean value in Age Variable
cars_data['KM'].fillna(cars_data['KM'].median(),inplace=True)

# To calculate the mean value of HP
cars_data['HP'].mean()

# As mean and median is not so far so in this case we will use mean value
cars_data['HP'].fillna(cars_data['HP'].mean(),inplace=True)


# To generate descriptive statistics
cars_data.describe()
Descriptive statistics after imputing missing numerical values
Descriptive statistics after imputing missing numerical values

In the above diagram, we can see that it shows count values of Price, KM, and HP as 1436 as we have imputed missing values with respective mean or median depend upon the analysis. 

Categorical Variables: Imputing missing values

To impute the missing values of categorical type, we will use class maximum count values using the function serier.value_count(). It returns a series containing counts of unique values. The values will be in descending order so that the first element is the most frequently occurring element while excluding NA/NaN values by default. So for that, we need to understand what are the categories replacing that with the missing values. 

Imputing missing values of FuelType

# Imputing categorical missing values
cars_data['FuelType'].value_counts()

# As it displays result in decresing order
# we can directly use index values
cars_data['FuelType'].value_counts().index[0]

# Impute NA/NaN values with higher count value of Fuel Type Class
cars_data['FuelType'].fillna(cars_data['FuelType'].value_counts().index[0], inplace=True)

# Check the missing data after filling values
cars_data.isnull().sum()

Imputing missing values of MetColor

We will impute all the missing values of MetColor with a model value of the variable MetColoer. It is basically the metallic color of the car, and it has two values, zero represents the non-metallic color, and one represents the car has a metallic color. There are several ways we can fill the model value. In this case, we will use the Mode value of the categorical variable. 

# Imputing categorical missing values
cars_data['MetColor'].mode()

# As it displays result in decresing order
# we can directly use index values
cars_data['MetColor'].mode()[0]

# Impute NA/NaN values with higher count value of Fuel Type Class
cars_data['MetColor'].fillna(cars_data['MetColor'].mode()[0], inplace=True)

# Check the missing data after filling values
cars_data.isnull().sum()

Imputing missing values using Lambda Functions

To fill the NA/NaN value in both numerical and categorical variables at one stretch, we will use a lambda function. This function can be used whenever we want to perform any operations column-wise or row-wise, and even it can be used for any other cases. However, in this case, we will use this function to access or manipulate data column-wise. 

Whatever logic we are writing inside the apply function will be applied across all the columns. The lambda function is an anonymous and powerful function. Here x is the argument of an anonymous function. It has two conditions; if values are float, then consider that column as a numerical variable. Otherwise, consider it as a categorical variable and fill the data accordingly. 

# Imputing missing values using lambda functions
# To fill the NA/NaN values in both numerical and categorical variables at one stretch
bk1_cars_data = bk_cars_data.copy()

bk1_cars_data = bk1_cars_data.apply(lambda x:x.fillna(x.mean()) if x.dtype =='float' else x.fillna(x.value_counts().index[0]))
bk1_cars_data.isnull().sum()
Descriptive statistics after imputing all the missing variables
Descriptive statistics after imputing all the missing variables

We have filled all the missing data in this dataset, and it became a working dataset, and we can do all sorts of analysis to get some meaningful information. 

References

 649 total views,  1 views today

Scroll to Top
Scroll to Top
%d bloggers like this: