Python for Data Science – Data Visualization on Standard Datasets

In this note, we will use the standard Python data-science stack: Matplotlib and Seaborn for data visualization, Pandas for tabular data manipulation, NumPy for numerical operations, and scikit-learn for machine learning algorithms.

We will work on a few standard datasets, most of which ship with the seaborn package (a quick way to check the available names is sketched after the list):

  • Flights
  • Diamonds
  • Iris
  • Titanic
  • Anscombe
  • Digits (available from scikit-learn rather than seaborn)
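
As a quick sanity check, seaborn can report which sample datasets it knows about. This is a minimal sketch using seaborn's get_dataset_names helper; note that it fetches the list from the online seaborn-data repository, so it needs network access.

# List the sample datasets that seaborn can load (requires internet access)
import seaborn as sns

print(sns.get_dataset_names())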

Density Estimation

In density estimation, we are interested in determining an unknown density function f, given only random samples (observations, data points) distributed according to it. In other words, the goal of density estimation is to infer the probability density function (PDF) from observations of a random variable: given a set of random samples, determine which PDF generated them. Density estimation approaches can be broadly classified into two groups: parametric density estimation and non-parametric density estimation.

Parametric Methods

Parametric methods make strict a priori assumptions about the form of the underlying density function. For instance, a parametric approach may assume the random variables follow a Gaussian distribution. Such assumptions significantly simplify the problem, since only the parameters of the chosen family of functions need to be determined. For a normal distribution, density estimation reduces to estimating the mean µ and standard deviation σ from the sample points.
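
For illustration, here is a minimal sketch of parametric density estimation under a Gaussian assumption, using scipy.stats.norm.fit; the sample values are synthetic, made up purely for the example.

# Parametric density estimation: assume a Gaussian family, then fit µ and σ
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.5, size=200)   # synthetic observations

mu, sigma = stats.norm.fit(samples)                  # maximum-likelihood estimates
print(f"estimated mu = {mu:.3f}, sigma = {sigma:.3f}")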

Non-Parametric Methods

Oftentimes it is not possible to make such strict assumptions about the form of the underlying density function. Non-parametric techniques instead make few assumptions about the density function and let the data drive the estimation process more directly.

The simplest density estimator that makes no particular assumptions about the data distribution (i.e., a non-parametric estimator) is the histogram.

Histogram

The simplest form of density estimator is the histogram. Its shape depends heavily on the bin width, and its estimation error is governed by the choice of bins and the number of data points: the error decreases when the bins are chosen well and more data points are available.

Iris Data 

We will load the Iris data from the seaborn library and configure pandas options to display the whole dataset.

# Imports used throughout this note
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# To display every row and column of a DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# To reset all display options back to their defaults
pd.reset_option('^display')

# Import data from the seaborn library
iris_data = sns.load_dataset("iris")
display(iris_data)  # display() is available in Jupyter/IPython

Dataset – Iris

# We will analyse sepal length using a histogram.

iris_sepal_length = iris_data['sepal_length']
iris_sepal_length.head()

# Bin size selection matters, so we will analyse the data using different numbers of bins.

f, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(iris_sepal_length, bins=5, density=True)
ax2.hist(iris_sepal_length, bins=100, density=True)
plt.show()

Iris Dataset – Histograms with different numbers of bins

These two histograms show the same variable, sepal length, but with different numbers of bins: the left one uses 5 bins and the right one uses 100. The differences between the two are clear.

There is a bias-variance tradeoff when choosing the width of the bins: wider bins mean more bias and less variance, narrower bins mean less bias and more variance, and no single bin width optimizes both. Moreover, the histogram's estimation risk decreases rather slowly as the number of data points increases. This motivates estimators that converge more quickly, the best known of which is the kernel density estimator.
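
As a sketch of how this choice can be automated, NumPy's histogram_bin_edges supports several published bin-selection rules (e.g., 'sturges', 'scott', 'fd') that trade off bias and variance differently; this assumes the iris_sepal_length series defined above.

# Compare automatic bin-selection rules on the sepal-length data
import numpy as np

for rule in ['sturges', 'scott', 'fd']:
    edges = np.histogram_bin_edges(iris_sepal_length, bins=rule)
    print(f"{rule}: {len(edges) - 1} bins, width ≈ {edges[1] - edges[0]:.3f}")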

Kernel Density Estimation

In kernel density estimation, the most important parameter to select is the bandwidth of the kernel.
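
To make the bandwidth's role concrete, here is a minimal sketch using scikit-learn's KernelDensity on the sepal-length data from above; the two bandwidth values (0.2 and 1.0) are illustrative choices, not tuned ones.

# Kernel density estimation with two illustrative bandwidths
import numpy as np
from sklearn.neighbors import KernelDensity

x = iris_sepal_length.to_numpy().reshape(-1, 1)           # column vector for sklearn
grid = np.linspace(x.min() - 1, x.max() + 1, 200).reshape(-1, 1)

f, (ax1, ax2) = plt.subplots(1, 2)
for ax, bw in [(ax1, 0.2), (ax2, 1.0)]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x)
    density = np.exp(kde.score_samples(grid))             # score_samples returns log-density
    ax.plot(grid[:, 0], density)
    ax.set_title(f"bandwidth = {bw}")
plt.show()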
