In this note series, we have seen how to extract information from data. For example, to measure the central tendency of data, we can use estimates such as the mean, median, or mode.
Measures of variation in Descriptive Statistics
Measures of central tendency give an idea of the location where most of the data is concentrated. However, that is not enough to describe the behaviour of data. The variation, or dispersion, of data around a particular value is another property that characterizes the data.
For example, a few data points may be concentrated near the center, whereas others may lie far from it. It is even possible for two different variables to have the same mean value but different concentrations around the mean. Using graphical tools, we can easily visualize the central tendency and variability of data.
Visualize data with the same mean and different variance using Python
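In the code below, we generate data from normal distributions with mean = 10 and standard deviations of 10 and 100 (that is, variances of 100 and 10,000) for two different variables and draw a scatter plot. The y-coordinate of each point is random noise, used only to separate the points vertically.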
    # Scatter plot of two random variables with the same mean and different variance
    import numpy as np
    from numpy.random import normal
    from matplotlib import pyplot as plt

    n_observations = 500
    # Random vertical positions, used only to spread the points along the y-axis
    x = np.random.randn(n_observations)
    # Note: scale is the standard deviation, so the variances are 10**2 and 100**2
    var_1 = normal(loc=10, scale=10, size=n_observations)
    var_2 = normal(loc=10, scale=100, size=n_observations)

    # Plot
    plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
    plt.scatter(x=var_1, y=x, color='blue', marker='*', label='1st Variable')
    plt.scatter(x=var_2, y=x, color='red', marker='v', label='2nd Variable')

    # Decorate
    plt.title('Scatter plot with same mean and different variance')
    plt.xlabel('X - value')
    plt.ylabel('Y - value')
    plt.legend(loc='best')
    plt.show()
Although the two variables have the same mean value, there is a clear difference in spread. The first variable (blue) has less spread and is more concentrated towards the center, while the second variable (red) is scattered much more widely around the same center. This shows that the central tendency of data alone is not sufficient to understand the behavior of data.
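The difference in spread can also be checked numerically. The following is a minimal sketch, not part of the original code: it redraws the two variables with the same `loc` and `scale` values as above, using a seeded generator (`default_rng(42)`, an arbitrary choice made here for reproducibility), and prints the sample mean and sample standard deviation of each. Note that `scale` in NumPy's `normal` is the standard deviation, not the variance.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)
n_observations = 500

# Same mean (loc=10) for both variables; scale is the standard deviation.
var_1 = rng.normal(loc=10, scale=10, size=n_observations)
var_2 = rng.normal(loc=10, scale=100, size=n_observations)

# Compare the sample mean and the sample standard deviation (ddof=1).
for name, data in (("1st Variable", var_1), ("2nd Variable", var_2)):
    print(f"{name}: mean = {data.mean():.2f}, std = {data.std(ddof=1):.2f}")
```

Both sample means should land near 10, though not exactly on it because of sampling noise, while the standard deviations differ by roughly a factor of ten.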
What about the other way around: different mean values and the same variance? The plot produced by the code below illustrates this case.
Visualize data with different means and the same variance using Python
In the code below, we change the mean values of the two variables (10 and 100), keep the same spread for both (standard deviation 5, that is, variance 25), and draw the scatter plot.
    # Scatter plot of two random variables with different means and the same variance
    import numpy as np
    from numpy.random import normal
    from matplotlib import pyplot as plt

    n_observations = 500
    # Random vertical positions, used only to spread the points along the y-axis
    x = np.random.randn(n_observations)
    # Note: scale is the standard deviation, so both variances are 5**2 = 25
    var_1 = normal(loc=10, scale=5, size=n_observations)
    var_2 = normal(loc=100, scale=5, size=n_observations)

    # Plot
    plt.rcParams.update({'figure.figsize': (10, 4), 'figure.dpi': 100})
    plt.scatter(x=var_1, y=x, color='blue', marker='*', label='1st Variable')
    plt.scatter(x=var_2, y=x, color='red', marker='v', label='2nd Variable')

    # Decorate
    plt.title('Scatter plot with different mean and same variance')
    plt.xlabel('X - value')
    plt.ylabel('Y - value')
    plt.legend(loc='best')
    plt.show()
Sometimes people assume that if two variables have the same variance, their central tendency must also be the same. This is not true. The variation of the data tells us nothing about its central tendency, and vice versa. Thus, knowing only one of the two, the central tendency or the dispersion, is not sufficient to understand the behavior of data. We need both.
In addition, one may argue that spread can be measured around any point, so the mean is not required. That is true. However, spread is usually measured around the mean. There are also statistical advantages to measuring variance around the mean, which we will cover later in this series of notes.
We have seen the spread of data using a graphical tool. The next questions are how to quantify it and what information the measures of variation convey.
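One well-known property that may be among these advantages (the note does not say which ones it has in mind) is that the average squared deviation is smallest when it is measured around the arithmetic mean. The sketch below illustrates this on simulated data; the helper `mean_squared_deviation`, the seed, and the candidate reference points are all hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = rng.normal(loc=10, scale=5, size=500)


def mean_squared_deviation(values, point):
    """Average squared distance of the values from an arbitrary reference point."""
    return np.mean((values - point) ** 2)


sample_mean = data.mean()

# Reference points below, at, and above the sample mean.
candidates = [sample_mean - 5, sample_mean - 1, sample_mean,
              sample_mean + 1, sample_mean + 5]

for c in candidates:
    msd = mean_squared_deviation(data, c)
    print(f"point = {c:6.2f} -> mean squared deviation = {msd:7.2f}")
```

Moving the reference point a distance d away from the mean increases the average squared deviation by exactly d squared, so the value at the mean is the minimum.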
Measures of variation (or dispersion)
Measures of variation, or dispersion, help in measuring the spread and scatteredness of data around a point, preferably the arithmetic mean. Various measures of variation are available:
- Range
- Interquartile Range (IQR)
- Quartile Deviation
- Absolute Mean Deviation
- Variance
- Standard Deviation
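As a preview, each of these measures can be computed with a few lines of NumPy. The following is a minimal sketch on simulated data, assuming the usual textbook definitions (range as maximum minus minimum, quartile deviation as half the IQR, absolute mean deviation taken around the arithmetic mean, and the sample versions of variance and standard deviation with `ddof=1`). The data, seed, and variable names are illustrative and do not come from the original note.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
data = rng.normal(loc=10, scale=5, size=500)

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(data, [25, 75])

data_range = data.max() - data.min()                      # Range
iqr = q3 - q1                                             # Interquartile range
quartile_deviation = iqr / 2                              # Quartile deviation
mean_abs_deviation = np.mean(np.abs(data - data.mean()))  # Absolute mean deviation
variance = data.var(ddof=1)                               # Sample variance
std_dev = data.std(ddof=1)                                # Sample standard deviation

print(f"Range                   : {data_range:.2f}")
print(f"Interquartile range     : {iqr:.2f}")
print(f"Quartile deviation      : {quartile_deviation:.2f}")
print(f"Absolute mean deviation : {mean_abs_deviation:.2f}")
print(f"Variance                : {variance:.2f}")
print(f"Standard deviation      : {std_dev:.2f}")
```

All of these are expressed in the units of the data except the variance, which is in squared units; that is one reason the standard deviation is often reported alongside it.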
References
- Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.