Essentials of Data Science – Probability and Statistical Inference – Quantiles and Tschebyschev’s Inequality

In the previous note on Probability and Statistical Inference, we learned about expectations, moments, skewness, and kurtosis, which measure the central tendency, dispersion, symmetry, and peakedness of a probability distribution, respectively.

Introduction

We define quantiles in terms of the distribution function. The value x_p for which the cumulative distribution function satisfies

F(x_p) = p, \quad 0 < p < 1

is called the p-quantile.

Here, x_p is a value which divides the distribution into two parts: the probability of observing a value to the left of x_p is p, whereas the probability of observing a value to the right of x_p is 1 - p.

For example, the 0.25 quantile x_{0.25} describes the x-value for which the probability of observing x_{0.25} or any smaller value is 0.25.

Figure: Quantiles

The above figure shows the 0.25-quantile (first quartile), the 0.5-quantile (median), and the 0.75-quantile (third quartile) on a cumulative distribution function.
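
To make this concrete, here is a minimal sketch in Python (assuming SciPy is available; the standard normal is chosen purely for illustration): the p-quantile is the inverse of the CDF, F^{-1}(p), which scipy.stats exposes as the .ppf() method.

from scipy import stats

p = 0.25
x_p = stats.norm.ppf(p)        # 0.25-quantile of the standard normal, approx -0.6745

# Check the defining property F(x_p) = p by applying the CDF again.
print(stats.norm.cdf(x_p))     # approx 0.25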

Quantiles

Quantiles are the values that divide a distribution into partitions. These partitions may have equal or unequal widths, and we choose their sizes based on the requirements. For example:

  • 25% Quantile: Splits the data into two parts such that at least 25% of the values are less than or equal to the quantile, and at least 75% of the values are greater than or equal to the quantile.
  • 50% Quantile: Splits the data into two parts such that at least 50% of the values are less than or equal to the quantile, and at least 50% of the values are greater than or equal to the quantile. It is also called the Median.

In general, the (\alpha \times 100)\% quantile is the value which divides the data in proportions of (\alpha \times 100)\% and ((1 - \alpha) \times 100)\%, such that at least (\alpha \times 100)\% of the values are less than or equal to the quantile and at least ((1 - \alpha) \times 100)\% of the values are greater than or equal to the quantile.
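
As a small illustration (sample data chosen arbitrarily), NumPy's np.quantile returns the (\alpha \times 100)\% quantile of a data set:

import numpy as np

data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 15])

q25 = np.quantile(data, 0.25)   # at least 25% of the values are <= q25
q50 = np.quantile(data, 0.50)   # the 50% quantile, i.e., the median

print(q25, q50)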

Quartiles

Quartiles are the values which divide the given data into four equal parts, say, Q_1, Q_2, Q_3, Q_4.

  • Q_1: First quartile, below which 25% of the observations lie.
  • Q_2: Second quartile, below which 50% of the observations lie; it is also called the Median.
  • Q_3: Third quartile, below which 75% of the observations lie.
  • Q_4: Fourth quartile, below which 100% of the observations lie.

We have divided the entire frequency distribution into four equal parts; these partitions are called Quartiles and are special cases of quantiles.

Deciles

Deciles are the values which divide the given data into ten equal parts, say, D_1, D_2, \dots, D_{10}.

  • D_1: First decile, below which 10% of the observations lie.
  • D_2: Second decile, below which 20% of the observations lie.
  • D_3: Third decile, below which 30% of the observations lie.
  • D_{10}: Tenth decile, below which 100% of the observations lie.

Percentiles

Percentiles are the values which divide the given data into one hundred equal parts, say, P_1, P_2, \dots, P_{100}.

  • P_1: 1st percentile, below which 1% of the observations lie.
  • P_2: 2nd percentile, below which 2% of the observations lie.
  • P_3: 3rd percentile, below which 3% of the observations lie.
  • P_{100}: 100th percentile, below which 100% of the observations lie.
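
Quartiles, deciles, and percentiles are thus all quantiles at particular levels. A short sketch (hypothetical data: the integers 1 to 100) using NumPy's np.percentile, which takes the level on a 0-100 scale:

import numpy as np

data = np.arange(1, 101)                            # the integers 1..100

Q1, Q2, Q3 = np.percentile(data, [25, 50, 75])      # quartiles
deciles = np.percentile(data, range(10, 100, 10))   # D_1, ..., D_9
P90 = np.percentile(data, 90)                       # 90th percentile

print(Q1, Q2, Q3)   # 25.75 50.5 75.25 under the default interpolation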

Tschebyschev’s Inequality

If we do not know the distribution of a random variable X but we know its mean \mu and variance \sigma^2, Tschebyschev's inequality still allows us to make statements about the probability that X takes values in a certain interval (which has to be symmetric around the expectation \mu).

In simple words, we know that a random variable is always associated with some probability function such as:

  • Probability mass function for a discrete random variable
  • Probability density function for a continuous random variable

However, we may not know either of these functions and still want to state upper and lower limits for the values of a random variable. In such situations, Tschebyschev's inequality helps us.

  • If the variance is small, then a random variable is unlikely to be far from the mean.

Let X be a random variable with E(X) = \mu and Var(X) = \sigma^2. It holds that:

P(|X - \mu| \geq c) \leq \frac{Var(X)}{c^2}

or

P(|X - \mu| < c) \geq 1 - \frac{Var(X)}{c^2}

Here, the event |X - \mu| < c is equivalent to \mu - c < X < \mu + c, so the second form bounds the probability that X lies in this interval.

It means that the probability that the distance from the mean is larger than or equal to a certain number is at most the variance divided by the square of that number.
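
The inequality holds for any distribution with finite mean and variance. As a quick Monte Carlo sanity check (the exponential distribution here is an arbitrary choice), the empirical probability P(|X - \mu| \geq c) should never exceed \sigma^2 / c^2:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # mu = 2, sigma^2 = 4

mu, var = 2.0, 4.0
for c in (3.0, 4.0, 5.0):
    empirical = np.mean(np.abs(x - mu) >= c)   # empirical P(|X - mu| >= c)
    bound = var / c**2                         # Tschebyschev upper bound
    print(f"c={c}: empirical {empirical:.4f} <= bound {bound:.4f}")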

Example 1:

Consider the continuous random variable "waiting time for the train". Suppose that a train arrives every 20 minutes. Therefore, the waiting time of a particular person is random and can be any time in the interval [0, 20].

Let X be the waiting time (in minutes).

The PDF is

 f(x) = \begin{cases} \frac{1}{20}, & \text{ for } 0 \leq x \leq 20 \\ 0, & \text{ otherwise } \end{cases}

Mean:

\begin{aligned} \mu &= E(X) \\ &= \int_{-\infty}^{\infty} x f(x) dx \\ &=\int_{0}^{20} x  \frac{1}{20} dx \\ &= 10\end{aligned}

Variance:

\begin{aligned} \sigma^2 &= Var(X) \\ &= \int_{-\infty}^{\infty} [x - E(X)]^2 f(x) dx \\ &=\int_{0}^{20} (x - 10)^2 \frac{1}{20} dx \\ &= \frac{100}{3}\end{aligned}
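
These two integrals can be checked numerically, for instance with scipy.integrate.quad (a sketch, using the uniform density 1/20 on [0, 20]):

from scipy.integrate import quad

f = lambda x: 1 / 20                                   # PDF on [0, 20]

mu, _ = quad(lambda x: x * f(x), 0, 20)                # E(X) = 10.0
var, _ = quad(lambda x: (x - mu) ** 2 * f(x), 0, 20)   # Var(X) = 100/3

print(mu, var)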

We can calculate the probability of waiting between 10 - 7 = 3 and 10 + 7 = 17 minutes. Here, 3 is the lower and 17 the upper limit of an interval of values that the random variable X can take. Computing the probability within this range:

\begin{aligned}P(3 < X < 17) &= F(17) - F(3) \\ &= \frac{17}{20} - \frac{3}{20} \\ &= 0.7 \end{aligned}

Tschebyschev’s Inequality: 

Suppose we don’t know the probability function (i.e., probability density function) but we know the mean (\mu = 10) and variance (\sigma^2 = \frac{100}{3}).

P(|X - \mu| \geq c) \leq \frac{\sigma^2}{c^2}

P(|X - 10| < 7) \geq 1 - \frac{\frac{100}{3}}{7^2} \approx 0.32. It shows that the probability must be at least 0.32.

Comparing the results: with knowledge of the probability function we obtain 0.7, without it only the bound 0.32. Knowing the probability density function therefore gives a more precise result.
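
The comparison can be reproduced in a few lines (scipy.stats.uniform parameterizes the uniform distribution by loc and scale):

from scipy import stats

X = stats.uniform(loc=0, scale=20)    # waiting time on [0, 20]

exact = X.cdf(17) - X.cdf(3)          # P(3 < X < 17) = 0.7
bound = 1 - (100 / 3) / 7 ** 2        # Tschebyschev lower bound, approx 0.32

print(exact, bound)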

The exact probability is 0.7, so the approximate bound of 0.32 is relatively poor. One must remember that the inequality is worthwhile only when distributional knowledge is lacking.

References

  1. Essentials of Data Science With R Software – 1: Probability and Statistical Inference, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.

CITE THIS AS:

“Probability and Statistical Inference – Quantiles and Tschebyschev’s Inequality”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/mathematics/statistics/statistical-inference-for-data-science/quantiles-and-tschebyschevs-inequality/

