Data Science – Introduction

This note series studies the courses Data 8 and COMPSCI C8 and collects notes for future reference. Mainly, we will see how computing and statistics are merging and interacting in new ways. The material emphasizes computational and programming skills along with inferential thinking.

Data Science

Data science is about taking large datasets and making them useful or informative, especially for making informed decisions. To do that, we need ideas from computing and statistics, along with domain knowledge that informs what the data really represents.

For example, we can’t analyze the legal domain without understanding the law, so domain knowledge plays a crucial role in data science.

Before applying any learning techniques, we need to really understand what a dataset, with its data points and features, describes. However, data is often messy and comes in different forms such as text, numerical, categorical, image, streams, or combinations of these, so it is not always easy to extract meaningful information from it.

When we combine computing, statistics, and domain knowledge, we get a science of drawing useful conclusions from data, using computation as our primary tool. As a practice, data science has three core activities: exploration, inference, and prediction.

Core Activities

Exploration

  • Exploration is figuring out what patterns exist in the data when we have many observations about some phenomenon, with the aim of learning about the phenomenon itself.
  • Often, instead of just looking at large tables of numbers, we draw data visualizations. It is much easier to interpret a lot of information at once if it is portrayed visually.
  • Once we have found a pattern, we need to perform statistical inference, because some patterns are there just by chance while others reflect an underlying process that is really interesting about the dataset.
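
As a small illustration of the second point, here is a minimal sketch that turns a list of raw observations into a text bar chart using only the Python standard library. The survey data and the helper name `text_bar_chart` are made up for this example; in practice we would more likely use a plotting library for this.

```python
from collections import Counter

# Hypothetical observations: the favorite fruit reported by 12 survey respondents.
observations = [
    "apple", "banana", "apple", "cherry", "apple", "banana",
    "apple", "cherry", "banana", "apple", "apple", "banana",
]

def text_bar_chart(values):
    """Return one line per category, most frequent first, with a bar of '#' marks."""
    counts = Counter(values)
    width = max(len(name) for name in counts)  # pad names so the bars line up
    lines = []
    for name, count in counts.most_common():
        lines.append(f"{name.ljust(width)} | {'#' * count} {count}")
    return lines

chart = text_bar_chart(observations)
for line in chart:
    print(line)
```

Even in this tiny case, the bars make the most common answer stand out faster than scanning the raw list does.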

Inference

  • The goal of statistical inference is to quantify whether the patterns observed during the exploration phase are reliable. It answers the question: if we collect more data, would we see this pattern again or not?
  • The primary tool we have is randomization: by simulating random processes, we can see what kinds of patterns appear just by chance. If the pattern we observed is not the kind of thing that could plausibly appear by chance, we can conclude that it reflects some robust or reliable pattern in the underlying phenomenon.
  • Finally, we perform prediction. This is where we have partial information about something we want to know, and we want to guess about the things we don’t know yet.
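
To make the idea of randomization concrete, here is a minimal sketch under a made-up scenario: suppose we observed 60 heads in 100 coin flips and want to know whether a fair coin could produce that just by chance. The scenario, the function names, and the number of simulations are all assumptions chosen for illustration.

```python
import random

def simulate_heads(num_flips, rng):
    """Count heads in one simulated run of fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(num_flips))

def chance_of_at_least(observed_heads, num_flips, num_simulations=10_000, seed=0):
    """Estimate how often a fair coin gives at least `observed_heads` heads."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    extreme = 0
    for _ in range(num_simulations):
        if simulate_heads(num_flips, rng) >= observed_heads:
            extreme += 1
    return extreme / num_simulations

# Hypothetical observation: 60 heads out of 100 flips.
p_estimate = chance_of_at_least(observed_heads=60, num_flips=100)
print(f"Share of simulations with at least 60 heads: {p_estimate:.3f}")
```

If only a small share of the simulated fair-coin runs reach 60 heads, the observed pattern is unlikely to be chance alone; if a large share do, we have little reason to read anything into it.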

Prediction

  • Prediction is about making informed, quantitative guesses using a discipline called machine learning.
  • Normally, when we write programs, we focus on the particular logic of what the computer should do. However, machine learning is about not programming every detail but instead using the data to make decisions or choices within that program. So when we write a program, for instance, to recognize speech, automatically translate languages, or control a car or a robot, we don’t actually write down all the details of what to do. Instead, we use examples from the world (different datasets) to help computers automatically learn how to behave, and this is a form of prediction.
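
As a rough sketch of learning from examples, the code below guesses a label for a new item by finding the most similar labeled example, a simple technique known as nearest-neighbor classification. It is only one of many possible approaches, chosen here for brevity; the fruit measurements and the `predict` helper are hypothetical.

```python
import math

# Hypothetical labeled examples: (weight in grams, diameter in cm) -> fruit name.
examples = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((160.0, 7.2), "apple"),
    ((10.0, 2.0), "cherry"),
    ((8.0, 1.8), "cherry"),
    ((12.0, 2.2), "cherry"),
]

def predict(features, labeled_examples):
    """Return the label of the labeled example closest to `features`."""
    nearest_features, nearest_label = min(
        labeled_examples,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label

print(predict((155.0, 7.1), examples))  # a heavy, wide fruit
print(predict((9.0, 1.9), examples))    # a light, small fruit
```

Notice that no rule such as "apples weigh more than 100 grams" is written anywhere; the behavior comes entirely from the examples. A more careful version would put the features on a comparable scale before measuring distance.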

These three stages fit together as follows:

  • First, we identify patterns, then we quantify whether those patterns are reliable.
  • Finally, the reliable patterns help us make informed guesses about the information we wish we knew.

Once we can do all that, we are well on our way to being data scientists. However, to do any of these things, it is important to learn how to program a computer, because computing underlies each step of the way, and learning to program is an essential part of participating in this discipline.

For computing, we will use the Python programming language. In addition, many libraries support this work: Matplotlib, Plotly, and Seaborn for data visualization, pandas for data management, NumPy for numerical analysis, and scikit-learn for machine learning algorithms.

This note is published under the CC BY-NC-SA 4.0 license.
