Essentials of Data Science – Probability and Statistical Inference – Probability

In the previous note on Set Theory and Events, we saw how set theory can be used to model different kinds of events and to derive new events using operations such as union, intersection, complement, and difference. This note covers the intuitive notion of probability, how relative frequency helps us understand this intuitive notion and where it falls short, and finally the axiomatic definition of probability.

The probability of an event is classically defined as the number of favourable outcomes divided by the total number of possible outcomes. However, as discussed below, this definition has a conceptual problem.

Intuitive notion of Probability

There is a close connection between the relative frequency and the probability of an event. We will understand this with an example in the section below.

Relative Frequency and Probability of an Event

Suppose an experiment has m possible outcomes or events A_1, A_2, A_3, \cdots, A_m, and the experiment is repeated n times. Now we count how many times each of the possible outcomes has occurred.

The absolute frequency n_i = n(A_i) is the number of times the event A_i, i = 1, 2, 3, \cdots, m, occurs.

The relative frequency f_i = f(A_i) of a random event A_i, with n repetitions of the experiment, is calculated as:  f_i = f(A_i) = \frac{n_i}{n} .

From the descriptive statistics point of view, these are simply the absolute and relative frequencies of the events observed in a random experiment.
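
As a tiny R illustration (using a hand-made vector of observed events as example data, not taken from the note), these frequencies can be computed with the table function:

# Observed events: m = 3 possible outcomes, n = 8 repetitions
events <- c("A1", "A2", "A1", "A3", "A1", "A2", "A1", "A3")
n_i <- table(events)            # absolute frequencies n_i = n(A_i)
f_i <- n_i / length(events)     # relative frequencies f_i = n_i / n
print(n_i)
print(f_i)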

If we assume that

  • the experiment is repeated a large number of times (mathematically, this means that n tends to infinity), and
  • the experimental conditions remain the same (at least approximately) over all the repetitions,

then the relative frequency f(A) converges to a limiting value for A. This limiting value is interpreted as the probability of A and denoted by:

P(A) = \lim_{n \to \infty} \frac{n(A)}{n}

where n(A) denotes the number of times an event A occurs out of n times.

This is what we mean whenever we speak of probability. Saying that an event has a certain probability means that if we repeat the experiment a sufficiently large number of times and compute the relative frequency of the event, this relative frequency will converge to that particular value, the probability of the event.
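
As a minimal R sketch (assuming a fair coin coded as 1 = head, 0 = tail, and a fixed seed of our own choosing for reproducibility), we can watch the running relative frequency of heads settle near 0.5 as the number of tosses grows:

# Simulate tosses of a fair coin: 1 = head, 0 = tail
set.seed(42)  # fixed seed so the run is reproducible
n <- 10000
tosses <- sample(c(0, 1), size = n, replace = TRUE)

# Running relative frequency of heads after each toss
running_freq <- cumsum(tosses) / seq_len(n)

# Relative frequency after 10, 100, 1000, and 10000 tosses
print(running_freq[c(10, 100, 1000, 10000)])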

Suppose a fair coin (meaning the probabilities of occurrence of head and tail are equal) is tossed n = 10 times, and we observe n(A_1) = 3 heads and n(A_2) = 7 tails. Then the relative frequencies in the experiment are:

  • f(A_1) = 3/10 = 0.3
  • f(A_2) = 7/10 = 0.7

When the coin is tossed a large number of times, so that n tends to infinity, both f(A_1) and f(A_2) will approach the limiting value 0.5, which is the probability of getting a head or a tail when tossing a fair coin.

Example 1:

Suppose a fair coin is tossed five times and the following outcomes are observed: {Head, Head, Tail, Head, Tail}. Then the relative frequency of Tail is 2/5 and the relative frequency of Head is 3/5. (Note that these relative frequencies differ from the true probability 1/2 because n = 5 is small.)

We will illustrate the same example using the R programming language. The code uses the sample function, which draws a random sample of a given size (size = 5, 10, 100, and so on) with replacement (replace = TRUE).

# Experiment is repeated 5 times (0 and 1 represent the two faces, say tail and head)
outcomes <- sample(c(0, 1), size = 5, replace = TRUE)
print(outcomes)
print(table(outcomes) / length(outcomes))  # relative frequencies

# Experiment is repeated 10 times
outcomes <- sample(c(0, 1), size = 10, replace = TRUE)
print(outcomes)
# Example output: 1 1 1 0 1 0 1 1 1 1
print(table(outcomes) / length(outcomes))
# Example output:
# outcomes
#   0   1
# 0.2 0.8
 
# Experiment is repeated 100 times
outcomes <- sample(c(0, 1), size = 100, replace = TRUE)
print(outcomes)
print(table(outcomes) / length(outcomes))

After repeating the experiment 100 or more times, the relative frequencies of head and tail start to settle around 0.5.

Example 2:

Suppose we roll a fair six-sided die multiple times and observe the relative frequency of each face. As we know, the probability of getting any particular number from 1 to 6 is 1/6 ≈ 0.167.

# Experiment is repeated 5 times
outcomes <- sample(1:6, size = 5, replace = TRUE)
print(outcomes)
print(table(outcomes) / length(outcomes))

# Experiment is repeated 10 times
outcomes <- sample(1:6, size = 10, replace = TRUE)
print(outcomes)
print(table(outcomes) / length(outcomes))

# Experiment is repeated 100 times
outcomes <- sample(1:6, size = 100, replace = TRUE)
print(outcomes)
print(table(outcomes) / length(outcomes))

# Experiment is repeated 1000 times
outcomes <- sample(1:6, size = 1000, replace = TRUE)
print(table(outcomes) / length(outcomes))
[Figure: Experiment results – Rolling a six-sided die]

Observations: 

  • When we ran the experiment 5 times, we got the faces 2, 3, 5, and 6 with relative frequencies 0.2, 0.2, 0.2, and 0.4, respectively. That is, the face 6 occurred twice, while 2, 3, and 5 each occurred once.
  • When we ran the experiment 10 times, we got the faces 1, 2, 3, 5, and 6 with relative frequencies 0.1, 0.1, 0.3, 0.3, and 0.2, respectively. That is, 3 and 5 each occurred three times, 6 occurred twice, 1 and 2 once each, and 4 did not appear at all.

As we increase the number of repetitions, the relative frequencies move toward 1/6 ≈ 0.167, as can be seen in the diagram above for the experiment with 1000 rolls.

Limitations

Although the above definition is certainly intuitively pleasing, it possesses a serious drawback. How do we know that \frac{n(A)}{n} will converge to some constant limiting value that will be the same for each possible sequence of repetitions of the experiment?

For example, suppose a coin is tossed repeatedly.

  • How do we know that the proportion of heads obtained in the first n tosses will converge to some value as n gets large?
  • Even if it converges to some value, how do we know that, if the experiment is repeatedly performed a second time, we will again obtain the same limiting proportion of heads?

One answer is to state the convergence of \frac{n(A)}{n} to a constant limiting value as an assumption, or axiom, of the system. However, to assume that \frac{n(A)}{n} will necessarily converge to some constant value is a complex assumption: although such a limiting frequency may exist, it is difficult to believe in its existence a priori.

In fact, it would be better to assume a set of simpler axioms about probability and then attempt to prove that such a constant limiting frequency does, in some sense, exist. This is the modern axiomatic approach to probability theory. It works as follows:

  • We assume that for each event A in the sample space \Omega there exists a value P(A), referred to as the probability of A.
  • We then assume that these probabilities satisfy a certain set of axioms that agree with our intuitive notion of probability.

Axiomatic definition of Probability

From a purely mathematical viewpoint, suppose that for each event A of a random experiment having a sample space \Omega there is a number, denoted by P(A), which satisfies the following three axioms:

  • Every random event A has a probability in the interval [0,1], i.e., 0 \le P(A) \le 1. This states that the probability that the outcome of the experiment is contained in A is some number between 0 and 1.
  • The sure event has probability 1, i.e., P(\Omega) = 1. This states that, with probability 1, the outcome will be a member of the sample space \Omega.
  • For any sequence of disjoint or mutually exclusive events A_1, A_2, A_3, \cdots (that is, events for which A_i \cap A_j = \emptyset when i \ne j), P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots. This states that, for any set of mutually exclusive events, the probability that at least one of these events occurs equals the sum of their respective probabilities. This is also called the theorem of additivity of disjoint events.

Then, we call P(A) the probability of the event A. 

Note that if P(A) is taken to be the relative frequency of the event A when a large number of repetitions of the experiment are performed, then P(A) would indeed satisfy the above axioms.
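
As a minimal R sketch (assuming a fair six-sided die with a probability vector p of our own construction, and disjoint events A1 and A2 chosen for illustration), we can check the three axioms on a finite sample space:

# Sample space of a fair six-sided die and its probability vector
omega <- 1:6
p <- rep(1/6, length(omega))

# Axiom 1: every probability lies in [0, 1]
print(all(p >= 0 & p <= 1))

# Axiom 2: the sure event has probability 1
print(sum(p))

# Axiom 3: additivity for disjoint events, e.g. A1 = {1, 2} and A2 = {5, 6}
A1 <- c(1, 2)
A2 <- c(5, 6)
print(sum(p[c(A1, A2)]) == sum(p[A1]) + sum(p[A2]))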

Rules of Probability

These rules are helpful for modeling events and calculating their probabilities.

  • The probability of occurrence of an impossible event \emptyset is zero:

P(\emptyset) = 1 - P(\Omega) =  0

Example 1:

Suppose a box of 30 ice creams contains 6 different flavours, with 5 ice creams of each flavour. Let the event A be defined as A = {“Vanilla flavour”}; then the probability of finding a vanilla flavour ice cream is:

P(A) = 5/30 = 1/6

The probability of the complementary event \bar{A}, i.e., the probability of not finding a vanilla flavour ice cream is:

P(“No Vanilla flavour”) = 1 - P(“Vanilla flavour”) = 1 - 5/30 = 25/30.
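
The same complement-rule calculation as a quick R check (using the numbers from the example above):

# Probability of finding a vanilla flavour ice cream
p_vanilla <- 5 / 30

# Complement rule: probability of NOT finding a vanilla flavour
p_not_vanilla <- 1 - p_vanilla
print(p_not_vanilla)  # 25/30, approximately 0.833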

  • The probability of occurrence of a sure event is one:

P(\Omega) = 1

  • The probability of the complementary event of A, (i.e. \bar{A}) is:

P(\bar{A}) = 1 - P(A)

  • The odds of an event A are defined by:

\frac{P(A)}{P(\bar{A})} = \frac{P(A)}{1 - P(A)}

Thus the odds of an event A tell how much more likely it is that A occurs than that it does not occur.
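
For instance, reusing the vanilla ice cream example above, the odds can be computed in R as a short sketch:

# Odds of event A with P(A) = 5/30 (ice cream example)
p_A <- 5 / 30
odds_A <- p_A / (1 - p_A)
print(odds_A)  # 0.2, i.e., odds of 1 to 5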

  • Additive theorem of Probability:

Let A_1 and A_2 be two events that are not necessarily disjoint. The probability of occurrence of A_1 or A_2 is:

P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)

Here “or” is meant in the statistical sense: either A_1 occurs, A_2 occurs, or both of them occur. If A_1 and A_2 are disjoint events, then P(A_1 \cap A_2) = P(\emptyset) = 0, and the formula reduces to the third axiom.

Example 1:

Suppose 28% of people like sweet snacks, 7% like salty snacks, and 5% like both sweet and salty snacks. The percentage of people who like neither sweet nor salty snacks is obtained as follows:

Let A_1 be the event that a randomly chosen person likes sweet snacks and A_2 be the event that a randomly chosen person likes salty snacks. Note that A_1 and A_2 are not disjoint events, since some people like both.

The probability that a person likes either sweet or salty snacks is P(A_1 \cup A_2):

P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) = 0.28 + 0.07 - 0.05 = 0.30

Thus, 1 - 0.30 = 0.70, so 70% of people like neither sweet nor salty snacks.
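
The same calculation as a quick R sketch (with the percentages from the example above):

# Additive theorem with the snack-preference example
p_sweet <- 0.28
p_salty <- 0.07
p_both  <- 0.05

p_either <- p_sweet + p_salty - p_both  # P(A1 union A2)
print(p_either)      # 0.30
print(1 - p_either)  # 0.70: likes neither sweet nor salty snacks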

  • Sample spaces having equally likely outcomes:

For many experiments, it is natural to assume that each point in the sample space is equally likely to occur. That is, if the sample space \Omega is a finite set, say \Omega = \{1, 2, 3, \cdots, N\}, it is often natural to assume that:

P(\{1\}) = P(\{2\}) = \cdots = P(\{N\}) = p (say)

The sum of all probabilities is P(\Omega) = P(\{1\}) + P(\{2\}) + \cdots + P(\{N\}) = Np. Since P(\Omega) = 1, we get Np = 1, so the probability of each outcome is p = \frac{1}{N}.

If we assume that each outcome of an experiment is equally likely to occur, then the probability of any event A equals the proportion of points in the sample space that are contained in A, i.e., P(A) = \frac{\text{number of outcomes in } A}{N}.
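
As a small R sketch (with an illustrative event A = “the die shows an even number”, our own choice), the probability of an event under equally likely outcomes is just a proportion:

# Equally likely outcomes: P(A) = (number of outcomes in A) / N
omega <- 1:6      # sample space of a fair die
A <- c(2, 4, 6)   # event: the die shows an even number
p_A <- length(A) / length(omega)
print(p_A)        # 0.5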

Thus, to compute probabilities, it is necessary to know in how many different ways given events can occur. For that we need knowledge of the principles of counting, which is our next topic of discussion.

Q&A

From this note, we can get answers to the following questions.

  • How are the probability and the relative frequency of events related to each other?
  • What is the axiomatic definition of Probability?
  • What is the theorem of additivity of disjoint events in Probability?
  • What are equally likely events in Probability?

References

  1. Essentials of Data Science With R Software – 1: Probability and Statistical Inference, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
