This series of notes on descriptive statistics covers measures for visualizing, understanding, and quantifying the properties of variables. Before diving into the individual measures, we will look at what statistics is and introduce its essential terms.
What is statistics?
Statistics is the art and science of extracting answers from data. It supports decision-making mainly under uncertainty, and at times under certainty as well. Its primary purpose is to help us make good decisions from data. Decisions based on data tend to be more consistent than those based on opinion, so we want to make decisions using models that involve data.
Important terms
As a field of study, statistics provides us with models and methods for making good decisions using data. Therefore, we collect and analyze data to make decisions.
Population
A population is the complete set of all items that interest an investigator. Its size is usually denoted by N, and it can be huge (even infinite). For example:
- All males in this world
- All children in the 6-9 age group
Sample
A sample is an observed subset of the population, that is, the part of the population from which data is actually collected. Its size is usually denoted by n.
Parameter
Numerical description of a population characteristic.
- The mean height of all males in the world
- The population average is a parameter, and it is usually represented by μ.
Statistic
Numerical description of a particular sample characteristic.
- The sample average is a statistic, and it is usually represented by x̄.
Often we cannot collect data from the entire population because it is too large. Instead, we collect data from a sample and analyze it to learn something about the population.
Suppose our task is to find the average height of people in the world. Can we measure every person's height and then take the average? No: the world population is far too large for that to be practical. The alternative is to collect samples from around the world, either at random or according to some criteria, and use them to estimate the average height.
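The height example can be sketched in a few lines of Python. This is only an illustration: the population below is simulated, and its size, mean of 170 cm, and spread of 10 cm are assumptions, not real measurements.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: N simulated heights in cm (assumed values, not real data)
N = 100_000
population = [random.gauss(170, 10) for _ in range(N)]

# Parameter: computed from every member of the population
population_mean = statistics.fmean(population)

# Statistic: computed from a random sample of size n
n = 500
sample = random.sample(population, n)
sample_mean = statistics.fmean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

In practice only the sample mean would be available. The simulation computes both values simply to show that the statistic tends to land close to the parameter without measuring everyone.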
Example 1:
An airline claims that less than 5% of its flights from Delhi airport depart late. From a sample of 100 flights, 6 flights were found to depart late.
- What is the population? Answer: All of the airline's flights departing from Delhi airport (possibly a very large number), of size N.
- What is the sample? Answer: The 100 flights drawn from that population.
- What is the statistic? Answer: 6% of flights were found to depart late.
- Is 6% a parameter or a statistic? Answer: Statistic.
Models in Statistics
There are two types of models in statistics, and these are called descriptive statistics and inferential statistics.
- Descriptive statistics use graphical and numerical procedures to summarize data and to transform data into information.
- Inferential statistics provide the basis for forecasts, predictions, and estimates, and are used to transform information into knowledge.
In short, descriptive statistics gather, sort, and summarize data from samples, while inferential statistics use those summaries to estimate population parameters.
Example 1:
- Based on a survey, 25% of all men dislike football (Inferential Statistics)
- 65% of the seniors at a local high school who are applying to college plan to major in business (Descriptive Statistics)
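The same distinction can be sketched with the flight sample from the airline example (6 late departures out of 100 flights). The descriptive step summarizes the sample. The inferential step shown here is one possible choice, a normal-approximation interval with z = 1.96, which is an assumption of this sketch and is only rough for a sample this small.

```python
import math

# Sample from the airline example: 6 late departures out of 100 flights
n = 100
late = 6

# Descriptive: summarize the sample itself
p_hat = late / n  # sample proportion (a statistic)

# Inferential: estimate a range for the population proportion.
# Assumption: normal approximation with z = 1.96 for roughly 95% coverage.
z = 1.96
std_error = math.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - z * std_error
upper = p_hat + z * std_error

print(f"Sample proportion late: {p_hat:.2%}")
print(f"Approximate 95% interval for the population: {lower:.2%} to {upper:.2%}")
```

Under these assumptions the interval extends both below and above 5%, so this single sample neither confirms nor refutes the airline's claim.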
Difference between Data, Information, and Knowledge
On its own, data is just symbols, numbers, characters, and so on, with no meaning attached. When it is decoded and presented in a structured form, it becomes information. When context is then associated with that information, it becomes knowledge.
Example 1:
- Data: 02101869
- Information: 02/10/1869
- Knowledge: Mahatma Gandhi (Mohandas Karamchand Gandhi) was born on 02/10/1869 in Porbandar, India.
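The progression in this example can be sketched in Python. The day-month-year layout of the digits is an assumption made when decoding, since the raw string alone does not say how it should be read.

```python
from datetime import datetime

raw = "02101869"  # data: a bare string of digits

# Information: decode the digits using an assumed day-month-year layout
birth_date = datetime.strptime(raw, "%d%m%Y").date()
print(birth_date.strftime("%d/%m/%Y"))

# Knowledge: attach context to the decoded date
print(f"Mohandas Karamchand Gandhi was born on {birth_date:%d %B %Y} in Porbandar, India.")
```

Reading the same digits with a month-day-year layout would give a different date, which is why the structure has to be known before data becomes information.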
Descriptive Statistics
Descriptive statistics is the starting point for knowledge discovery based on data. Knowledge can be discovered in various ways, such as looking at pictures or reading books. Similarly, statistics offers various tools for discovering the knowledge contained in data.
Data is a vital source of information. However, extracting information from data is laborious and sometimes requires domain expertise. Apart from domain knowledge, we also need the tools of descriptive statistics to retrieve information from the data.
Role of Statistics
Statistics is a language of data that provides a scientific way to extract hidden information from data and to forecast or predict future values. However, it cannot work miracles, and it cannot change the underlying process or phenomenon.
Let us understand the role of statistics. If we apply the correct statistical tools to correct data, the extracted information will mostly lead us to correct decisions. If we apply the correct statistical tools to incorrect data, the extracted information will certainly lead us to incorrect decisions. And if we apply the wrong statistical tools, then irrespective of whether the data is correct, we will make incorrect decisions based on the extracted information.
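As a small illustration of a correct tool applied to incorrect data, the sketch below computes the mean of a hypothetical set of heights twice: once as recorded correctly, and once with a single assumed data-entry error (170 typed as 1700). The values are made up for the example.

```python
import statistics

# Hypothetical heights in cm
correct_data = [165, 170, 172, 168, 175]
# Same data with one assumed entry error: 170 typed as 1700
incorrect_data = [165, 1700, 172, 168, 175]

mean_correct = statistics.mean(correct_data)
mean_incorrect = statistics.mean(incorrect_data)

print(f"Mean of correct data:   {mean_correct}")
print(f"Mean of incorrect data: {mean_incorrect}")
```

The tool is the same in both cases, yet the second result describes no plausible group of people, so any decision based on it would be wrong.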
Descriptive Statistical Tools
Descriptive statistics offers both graphical and analytical tools. The graphical tools include various types of plots, such as:
- 2D and 3D plots
- Scatter diagram or plot
- Pie diagram
- Histogram
- Bar plot
- Stem and leaf plot
- Box plot
The analytical tools contain various measures such as:
- To measure the central tendency of data using Mean, Median, Mode, Geometric mean, Harmonic mean, Quantiles, etc.
- To measure the dispersion in data using Variance, Standard deviation, Standard error, Mean deviation, Absolute deviation, Range, etc.
- To understand the structure of data using properties such as symmetry, skewness, kurtosis, etc.
- To understand the relationships in data using correlation coefficient, rank correlation, multiple correlation coefficient, partial correlation coefficient, correlation ratio, intraclass correlation, linear regression, non-linear regression, etc.
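A few of the measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. This is a minimal sketch: the scores are hypothetical, and the variance and standard deviation shown are the sample versions (dividing by n - 1).

```python
import statistics

# Hypothetical sample of exam scores (illustrative values only)
scores = [52, 61, 61, 68, 70, 74, 79, 85, 90]

# Central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion (sample versions, dividing by n - 1)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)
value_range = max(scores) - min(scores)

# Quartiles (three cut points splitting the data into four parts)
q1, q2, q3 = statistics.quantiles(scores, n=4)

print(f"mean={mean:.2f}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std_dev={std_dev:.2f}, range={value_range}")
print(f"quartiles: {q1}, {q2}, {q3}")
```

The measures of shape and relationship from the list, such as skewness or correlation, would need further calculation or an additional library and are left out of this sketch.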
Statistical thinking and Methods
We have seen graphical and analytical tools. Now the question is which of them to use. Graphical tools provide visualization, which gives a first-hand impression of the data, whereas analytical tools provide quantitative information.
Mostly, both approaches work together and are inseparable in a system of interconnected processes. But there exists variation in each of the processes. So understanding the extent of variation and reducing it are the keys to success.
The objective of any statistical analysis is to combine the information gained from the tools of descriptive statistics into a meaningful conclusion that reveals the information hidden inside the data. Proper interpretation of the inferences drawn from the data is important.
Role of Data
Data is collected with different objectives, which may include the following:
- To verify the theoretical findings.
- To draw inferences based solely on the collected data.
- To develop statistical models, which can be further used for policy decisions, classification, forecasting, etc.
Q&A
From this introductory note on descriptive statistics, we can get answers to the following questions.
- What is descriptive statistics, and how is it different from inferential statistics?
- Why is it important for a data scientist to learn statistics?
Summary
In this note, we have seen different aspects of descriptive statistics, including the role of data collection, the types of statistical tools, and the role of statistical methods.
References
- Introduction to probability and Statistics, By Prof. G. Srinivasan, IIT Madras.
- Descriptive Statistic, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.