Essentials of Data Science – Probability and Statistical Inference – Continuous Random Variables

In the previous note on random variables, we saw what a random variable is, the types of random variables based on the sample space, and the difference between discrete and continuous random variables. In this note, we extend the concept of continuous random variables and explore how the probability density function (PDF) and the cumulative distribution function (CDF) relate to them, along with hands-on work in the Python programming language.

In simple words, whenever the sample space of an experiment is the set of real numbers or an interval of real numbers, we model it mathematically with a continuous random variable. To extract meaningful information from a continuous random variable, we use tools such as the probability density function and the cumulative distribution function, which describe the variable statistically. With that information, we can reason about the variable rather than guess.

Introduction

We know that a sample space is continuous if it contains an interval (either finite or infinite) of real numbers, whereas a sample space is discrete if it contains a finite or countably infinite set of outcomes.

In general, to describe X we need to know P(X \in A) for all possible A which are subsets of R, where R represents the real numbers. Suppose we choose A = (-\infty, x] with x \in R. Then we have

\begin{aligned}P(X \in A) &= P(X \in (-\infty, x]) \\  &= P(- \infty < X \leq x) \\ &= P(X \leq x)\end{aligned}

P(X \leq x) is the probability that the random variable X takes a value less than or equal to x. This gives rise to the definition of the cumulative distribution function.

Cumulative Distribution Function (CDF)

The cumulative distribution function, or more simply the distribution function, F of the random variable X is defined for any real number x by:

F(x) = P(X \leq x)

F(x) or F_X(x) is the probability that the random variable X takes on a value that is less than or equal to x.

Properties of Cumulative Distribution Function

Probability values are always positive or zero, never negative. The basic properties of the CDF are as follows:

  • Firstly, the CDF is non-decreasing: if x_2 > x_1, then F(x_2) \geq F(x_1).
  • Secondly, probability values are always bounded between zero and one, so F(x) is never greater than one and never less than zero.

All probabilities concerning X can be computed in terms of its distribution function F. For example:

P(a < X \leq b) = F(b) - F(a)

Similarly, if we want to compute P(a \leq X \leq b), then P(a \leq X \leq b) = F(b) - F(a-), where F(a-) is the left limit of the CDF at a. For a continuous random variable, F(a-) = F(a).

The CDF is mainly used to obtain the probabilities of events involving the random variable. Let us understand this with the help of an example.

Suppose the random variable X has distribution function:

 F(x) = \begin{cases} 1 - exp(-x^2), & x > 0 \\ 0, & x \leq 0 \end{cases}

The probability that X exceeds 1 is found as follows:

\begin{aligned}P(X > 1) &= 1 - P(X \leq 1) \\ &= 1 - F(1) \\ &= 1 - (1 - exp(-1^2)) \\ &=exp(-1) \end{aligned}
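The calculation above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `cdf` and the interval endpoints 0.5 and 1 in the second calculation are illustrative choices, not part of the original example.

```python
import math


def cdf(x: float) -> float:
    """Distribution function F(x) = 1 - exp(-x^2) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-x ** 2) if x > 0 else 0.0


# P(X > 1) = 1 - P(X <= 1) = 1 - F(1)
p_exceeds_one = 1.0 - cdf(1.0)

# P(a < X <= b) = F(b) - F(a), here with the illustrative values a = 0.5, b = 1
p_interval = cdf(1.0) - cdf(0.5)

print(p_exceeds_one)  # approximately 0.3679, i.e. exp(-1)
print(p_interval)
```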

The above solution uses the cumulative probability up to a particular value of x. Next, we discuss how probability is distributed around particular values, rather than accumulated up to a value.

Continuous Random Variable

A continuous random variable is a random variable whose range is an interval (either finite or infinite) of real numbers. Its characteristics are that the number of possible outcomes is uncountably infinite and that it has a continuous distribution function F(x). It follows that the point probabilities are zero, i.e. P(X = x) = 0.

For such a variable, the cumulative distribution function F(x) can be expressed through a density function f such that F(x) = \int_{-\infty}^x f(t)dt

Probability Density Function (PDF)

A random variable X is said to be continuous if there is a function f(x) such that, for all x \in R,

F(x) = \int_{-\infty}^x f(t)dt

holds, where F(x) is the cumulative distribution function of X and f(x) is the probability density function of X. Moreover, \frac{d}{dx} F(x) = f(x) at every x that is a continuity point of f.

Properties of Probability Density Function

For a function f(x) to be a probability density function of a continuous random variable X, it needs to satisfy the following conditions:

  • f(x) \geq 0 for all x \in R
  •  \int_{-\infty}^{\infty} f(x)dx = 1

The probability of a continuous random variable taking a particular value x_0 is always zero. P(X=x_0) = \int_{x_0}^{x_0} f(x)dx = 0

Relationship between PDF and CDF

Let X be a random variable with CDF F(x) = P(X \leq x), where x is a known constant. Suppose x_1 < x_2. Then

P(x_1 \leq X \leq x_2) = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)dx

In other words, F(x_2) - F(x_1) is the area under the density curve f(x) between x_1 and x_2.

As we know, F(x) is the cumulative distribution function of X, and f(x) is the probability density function of X. So, knowing the CDF, we can find the PDF by differentiation:

\frac{d}{dx} F(x) = f(x) at every x that is a continuity point of f.
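The relationship can be checked numerically. The following is a minimal Python sketch that reuses the distribution function F(x) = 1 - exp(-x^2) from the earlier example; the function names `cdf`, `pdf` and `numerical_derivative`, the step size h and the test points are assumptions made for illustration. It compares a central-difference approximation of the derivative of F with the analytic derivative 2x exp(-x^2).

```python
import math


def cdf(x: float) -> float:
    """F(x) = 1 - exp(-x^2) for x > 0, and 0 otherwise (from the earlier example)."""
    return 1.0 - math.exp(-x ** 2) if x > 0 else 0.0


def pdf(x: float) -> float:
    """Analytic derivative of cdf: f(x) = 2x exp(-x^2) for x > 0, and 0 otherwise."""
    return 2.0 * x * math.exp(-x ** 2) if x > 0 else 0.0


def numerical_derivative(func, x: float, h: float = 1e-5) -> float:
    """Central-difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


# At each test point the two values should agree to several decimal places.
for x in (0.5, 1.0, 2.0):
    print(x, numerical_derivative(cdf, x), pdf(x))
```

The central difference is only an approximation, so the two columns agree up to a small numerical error rather than exactly.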

Example 1: Finding the probability of a random variable.

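Suppose the probability density function f(x) of a continuous random variable is given as follows (it is the derivative of the distribution function F(x) = 1 - exp(-x^2) used earlier, so it integrates to one):

 f(x) = \begin{cases} 2x \, exp(-x^2), & x > 0 \\ 0, & x \leq 0 \end{cases}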

Suppose we want to find the probability of a continuous random variable in a range between a and b, which is as follows:

 P(a \leq X \leq b) = \int_{a}^{b} f(x)dx
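The integral can be approximated numerically when no closed form is at hand. The sketch below is a hedged illustration: it assumes the density f(x) = 2x exp(-x^2) for x > 0 (the derivative of the distribution function 1 - exp(-x^2) used earlier), and the helper `trapezoid`, the interval [0.5, 1.5] and the number of sub-intervals are choices made here for illustration. It compares the trapezoidal-rule estimate with F(b) - F(a).

```python
import math


def pdf(x: float) -> float:
    """Assumed density: f(x) = 2x exp(-x^2) for x > 0, and 0 otherwise."""
    return 2.0 * x * math.exp(-x ** 2) if x > 0 else 0.0


def trapezoid(func, a: float, b: float, n: int = 10_000) -> float:
    """Composite trapezoidal rule for the integral of func over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


a, b = 0.5, 1.5  # illustrative interval

# Area under the density between a and b
p_numeric = trapezoid(pdf, a, b)

# Same probability from the CDF: F(b) - F(a) with F(x) = 1 - exp(-x^2)
p_exact = math.exp(-a ** 2) - math.exp(-b ** 2)

print(p_numeric, p_exact)
```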

As shown below, the basic intuition is that the probability is the area under the density curve between a and b.

Figure: Probability density function

This motivates the question of how to find the PDF from the data points of an experiment. An example of such data is the starting salaries of a sample of 10000 students graduating in statistics. Salaries would probably always be whole numbers in this case, so these sample data points can be represented as a discrete random variable, and the appropriate tools would be the probability mass function and the cumulative distribution function.

Example 2: Approximating a probability density function with a histogram

A histogram is an approximation to a probability density function. In the diagram below, for each interval of the histogram, the area of the bar equals the relative frequency of the measurements in the interval.

Figure: Probability Density Function - Histogram

The relative frequency is an estimate of the probability that a measurement falls in the interval. Similarly, the area under f(x) over any interval equals the true probability that a measurement falls in the interval.
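The idea can be illustrated with simulated data. The following is a minimal sketch using only the Python standard library. The choice of an exponential distribution with rate 1 (density exp(-x)), the sample size, the number of bins and the helper name `histogram_density` are all assumptions made for illustration. Each bar height is the relative frequency divided by the bin width, so the bar area equals the relative frequency.

```python
import math
import random


def histogram_density(samples, n_bins):
    """Return (edges, heights) where each bar's area equals its relative frequency."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        # Clamp the index so the largest sample falls into the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(samples)
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, heights


rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = [rng.expovariate(1.0) for _ in range(50_000)]

edges, heights = histogram_density(samples, n_bins=20)
width = edges[1] - edges[0]

# The bar areas add up to one (up to floating-point rounding).
total_area = sum(h * width for h in heights)
print(total_area)

# Compare the first few bar heights with the density exp(-x) at the bin midpoints.
for i in range(3):
    mid = 0.5 * (edges[i] + edges[i + 1])
    print(round(mid, 3), round(heights[i], 3), round(math.exp(-mid), 3))
```

With more samples and narrower bins, the bar heights follow the density curve more closely.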

Example 3: Calculation of PDF and CDF.

Consider the waiting time for a train as a continuous random variable. Suppose that a train arrives every 20 minutes. Therefore, the waiting time of a particular person is random and can be any time contained in the interval [0,20]. We can start describing the required probability density function as

 f(x) = \begin{cases} k, & \text{for } 0 \leq x \leq 20 \\ 0, & \text{otherwise} \end{cases}

where k is an unknown constant. 

The value of k for which f(x) is a PDF is: 

 1 = \int_{0}^{20} f(x)dx = 20k \implies k = \frac{1}{20}
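The same normalisation step can be sketched in Python. This is an illustrative sketch only: the names `shape`, `midpoint_integral` and `f`, the midpoint rule and the grid used for the non-negativity check are assumptions made here. The code computes k as one over the area of the unnormalised shape and then checks both conditions for a PDF.

```python
def shape(x: float) -> float:
    """Unnormalised density: 1 on [0, 20] and 0 elsewhere, so that f = k * shape."""
    return 1.0 if 0.0 <= x <= 20.0 else 0.0


def midpoint_integral(func, a: float, b: float, n: int = 2_000) -> float:
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


# The area of the unnormalised shape is 20, so k = 1 / 20.
area = midpoint_integral(shape, 0.0, 20.0)
k = 1.0 / area


def f(x: float) -> float:
    """Normalised density of the waiting time."""
    return k * shape(x)


# Condition 1: f(x) >= 0, checked on a grid from -5 to 25.
grid = [-5.0 + 0.1 * i for i in range(301)]
non_negative = all(f(x) >= 0.0 for x in grid)

# Condition 2: f integrates to one (f is zero outside [0, 20]).
total = midpoint_integral(f, 0.0, 20.0)

print(k, non_negative, total)
```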

Thus the PDF is 

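 f(x) = \begin{cases} \frac{1}{20}, & \text{for } 0 \leq x \leq 20 \\ 0, & \text{otherwise} \end{cases}

The CDF F(x) corresponding to f(x) is:

 F(x) = \begin{cases} 0, & \text{for } x < 0 \\ \int_{0}^{x} f(t)dt = \int_{0}^{x} \frac{1}{20}dt = \frac{x}{20}, & \text{for } 0 \leq x \leq 20 \\ 1, & \text{for } x > 20 \end{cases}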

Suppose we are interested in calculating the probability of a waiting time between 15 and 20 minutes. 

P(15 \leq X \leq 20) = F(20) - F(15) = \frac{20}{20} - \frac{15}{20} = 0.25
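The waiting-time example can be put into Python as follows. This is a minimal sketch using only the standard library; the function names `pdf` and `cdf`, the seed and the number of simulated waits are illustrative assumptions. The CDF is written out for all real x (0 below the interval, 1 above it), and a small simulation gives a rough check of the exact answer.

```python
import random


def pdf(x: float) -> float:
    """Density of the waiting time: 1/20 on [0, 20] and 0 otherwise."""
    return 1.0 / 20.0 if 0.0 <= x <= 20.0 else 0.0


def cdf(x: float) -> float:
    """CDF of the waiting time: 0 below 0, x/20 on [0, 20], 1 above 20."""
    if x < 0.0:
        return 0.0
    if x > 20.0:
        return 1.0
    return x / 20.0


# Exact probability of waiting between 15 and 20 minutes
p_exact = cdf(20.0) - cdf(15.0)
print(p_exact)  # 0.25

# Rough check by simulation: fraction of simulated waits that fall in [15, 20]
rng = random.Random(1)  # fixed seed so that the run is repeatable
n = 100_000
hits = sum(1 for _ in range(n) if 15.0 <= rng.uniform(0.0, 20.0) <= 20.0)
p_simulated = hits / n
print(p_simulated)  # close to 0.25, but subject to sampling noise
```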

References

  1. Essentials of Data Science With R Software – 1: Probability and Statistical Inference, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.

CITE THIS AS:

“Continuous Random Variables”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/mathematics/statistics/statistical-inference-for-data-science/continuous-random-variables/
