In this note series, we cover the mathematics underlying Data Science, specifically Applied Probability Theory and Statistical Inference. In elementary classes, we have seen some fearsome probability puzzles, which may have created the impression that probability is only about solving puzzles. That is not the approach taken in this note series. We will look at Probability Theory and Statistical Inference from an applied data science perspective, and we will use programming languages to visualize and understand them.
Understanding Data Science
To understand Data Science, we first need to understand what science is: a systematic approach to finding the answer to a query. Data is a reliable source of information, provided that it is captured from the actual phenomenon. Data science, then, is a scientific approach to retrieving information from data.
Examples:
- A scientific way to select a suitable candidate for class representative (CR) could be voting, instead of randomly picking a candidate from the classroom.
- A scientific way to evaluate class performance could be a measure of central tendency, such as the mean or median, of the marks obtained by all the students in the classroom.
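The two examples above can be sketched in a few lines of Python. This is only an illustration: the candidate names, the ballots, the marks, and the helper `elect_representative` are all hypothetical, and the mean and median are just two of several possible measures of central tendency.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical ballots: each entry is the candidate one student voted for.
votes = ["Asha", "Ravi", "Asha", "Meena", "Asha", "Ravi"]

def elect_representative(ballots):
    """Return the candidate with the most votes.

    Ties are resolved in favour of the candidate who appears first
    in the ballots (the behaviour of Counter.most_common).
    """
    tally = Counter(ballots)
    winner, _count = tally.most_common(1)[0]
    return winner

# Hypothetical marks (out of 100) obtained by the students of one class.
marks = [62, 75, 81, 58, 90, 75, 68]

class_mean = mean(marks)      # arithmetic average of the marks
class_median = median(marks)  # middle value of the sorted marks

print("Elected CR:", elect_representative(votes))
print("Mean marks:", round(class_mean, 2))
print("Median marks:", class_median)
```

Both results come from the data itself rather than from a random pick or a personal impression, which is the sense in which the approach is scientific.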
Need for data:
We collect data to quench the thirst for knowledge, and knowledge answers questions such as how, why, where, and who. Data can be collected for various purposes: to verify theoretical findings, to draw inferences solely from the collected data, or to develop statistical models that can then be used for policy decisions, classification, forecasting, and so on.
In the next section, we will see how data and statistics are related to each other. In the subsequent sections, we will see how data, statistics, and computing devices and technologies are related and tightly coupled.
Statistics and Data
Data is an essential source of information, but it cannot speak for itself. To understand what data is telling us, we need a language to communicate with it, and statistics is that language. Statistics is a scientific way of extracting information from data for decision-making and data analysis. However, it cannot perform miracles, and it cannot change the underlying process or phenomenon.
Data analysis is a domain that tells us the following:
- How to collect the data.
- How to analyze it.
- How to draw correct statistical inferences.
- How to decide on the correct statistical tool based on the numerical facts.
When we use the correct statistical tools on correct data, we can make the correct decision. The other combinations, such as using the correct statistical tool on wrong data or the wrong statistical tool on correct data, lead to unreliable and usually incorrect decisions. This follows a simple rule: garbage in, garbage out.
Statistics has its own derived rules, and these rules are framed so that the correct decision, as indicated by the data and based on the information hidden in it, is taken.
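The garbage in, garbage out rule can be illustrated with a small Python sketch. The marks below are hypothetical, as is the helper `find_invalid_marks`; the assumption is that marks are out of 100 and that one value was mistyped as 750 instead of 75. The statistical tool (the mean) is the same in both cases, and only the quality of the data differs.

```python
from statistics import mean

# Hypothetical marks out of 100, recorded correctly.
clean_marks = [62, 75, 81, 58, 90, 75, 68]

# The same marks with one assumed data-entry error: 75 typed as 750.
wrong_marks = [62, 750, 81, 58, 90, 75, 68]

def find_invalid_marks(marks, lowest=0, highest=100):
    """Return the values that fall outside the valid range of marks."""
    return [m for m in marks if m < lowest or m > highest]

clean_mean = mean(clean_marks)
wrong_mean = mean(wrong_marks)

print("Mean of clean data:", round(clean_mean, 2))
print("Mean of wrong data:", round(wrong_mean, 2))
print("Invalid entries:", find_invalid_marks(wrong_marks))
```

The mean of the wrong data is larger than the maximum possible mark, so any decision based on it would be misleading. A simple range check on the data before analysis is one way to catch this kind of error.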
Statistics and Data Science
In the last two decades (as of 2020), computers and computational techniques have developed rapidly. The statistical methods themselves existed earlier, but neither high computing power nor sufficient data was available in those days. Nowadays, data is readily available in petabytes, and cloud computing services and big data technologies are available for computing on it.
Thus, these are the environmental factors that have allowed data science to flourish. They create an opportunity for businesses in various industries to understand their sales, users, and so on, and to grow effectively.
Statistics, Computers and Data Science
Earlier, the emphasis was on theoretical developments in Statistics. Computers then helped in the development of Computational Statistics: where the theory and mathematical analysis become complicated, Computational Statistics supplements them. With computational support, the theoretical developments in statistics gained more relevance and applications. Thus, computation and statistics became two inseparable parts of Data Science.
Once we venture into Computational Statistics, the role and use of computers become crucial. At the same time, various software packages and programming languages have been developed to solve problems in that domain.
The areas of application of statistics have increased. Domains such as artificial intelligence and machine learning, including supervised, unsupervised, and reinforcement learning, are based on statistics, but they rely heavily on computers rather than manual human effort.
Role of Statistics in Data Science
Theoretical developments are essential, and they need to be translated into computational procedures. The computational procedures have their own limitations, and so optimization methods are required.
Statistical, mathematical, and optimization methods have to be applied together to a data set, and for that, data management is required.
All these aspects are implemented logically and systematically, and correct statistical inferences are drawn after using appropriate statistical tools.
Based on the inferences obtained, proper interpretations are made and used for policy formulation, policy prescription, and further applications such as forecasting. So without learning the basic tools of Statistics, it is not possible to learn Data Science. Thus, proper knowledge of all these fields is required to become a good Data Scientist.
References
- Essentials of Data Science With R Software – 1: Probability and Statistical Inference, By Prof. Shalabh, Dept. of Mathematics and Statistics, IIT Kanpur.
CITE THIS AS:
“Probability, Statistical and Data Science” From NotePub.io – Publish & Share Note! https://notepub.io/notes/mathematics/statistics/statistical-inference-for-data-science/data-science/