【Information II】Welcome to the World of Data Science!

Hello everyone! In this chapter, we're going to dive into "Data Science." Some of you might be thinking, "I'm not great at math, so this sounds difficult..." but don't worry! Data science isn't just for people who are good at calculations; it's about "using data to solve the problems we face in our daily lives."
For example, familiar things like YouTube video recommendations and the way products are arranged in a convenience store are built on data science. In this note, let's have fun uncovering how these mechanisms work!

1. What is Data Science?

Data science is a field of study that extracts valuable information from vast amounts of data and puts it to use for society. The goal isn't just to collect data, but to analyze it and use the results to decide "what to do next."

Key Point: The Three Pillars of Data Science

To master data science, you generally need three types of skills:
1. Business acumen: The ability to identify what problems need to be solved.
2. Data science skills: Mathematical knowledge, such as statistics.
3. Data engineering skills: The ability to use computers to process data.

💡 Trivia:
Data science is currently being used in sports like professional baseball and soccer too! Teams analyze data to figure out things like "which pitch location is hardest to hit" to build their strategies!

2. The Problem-Solving Steps: The "PPDAC Cycle"

When solving problems using data, trying things at random won't work well. That's where the PPDAC cycle framework comes in. You should definitely remember this!

  • P (Problem): Clearly define what you want to investigate or solve.
  • P (Plan): Make a plan for what data to collect and how to collect it.
  • D (Data): Collect and organize the data based on your plan.
  • A (Analysis): Create graphs or perform calculations to find patterns and features.
  • C (Conclusion): Summarize what you found from the analysis and think about the next steps.

🌟 Study Tip:
Think of it like "cooking" to make it easier to understand!
P: Decide what to make → P: Check the recipe and ingredients → D: Go grocery shopping and prep → A: Cook the meal → C: Taste it and decide what to improve next time!

3. Data Organization and Visualization (EDA)

Data collected as-is can be messy and hard to read. That's why we perform data cleansing to clean it up, and then conduct Exploratory Data Analysis (EDA) using graphs to get a feel for the characteristics of the data.
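
Here is a rough sketch of what data cleansing can look like in Python. The survey answers and the `clean_scores` helper are made up for illustration (they are not from any real dataset or library). The idea is simply to drop blank answers and entries that can't be read as numbers.

```python
def clean_scores(raw_values):
    """Return only the entries that can be read as numbers, as floats."""
    cleaned = []
    for value in raw_values:
        text = value.strip()          # remove stray spaces around the answer
        if text == "":
            continue                  # skip blank answers
        try:
            cleaned.append(float(text))
        except ValueError:
            continue                  # skip entries such as "N/A" or "sixty"
    return cleaned

# Hypothetical raw answers typed in by hand
raw = ["72", " 85 ", "", "N/A", "90", "sixty"]
scores = clean_scores(raw)
print(scores)
```

Whether to throw away bad entries or to go back and fix them depends on the problem you defined, so treat this as one possible approach.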

Representative Values: Finding the "Center" of the Data

These values help us understand what the data looks like as a whole.
Mean (Average): The sum of all values divided by the number of values. \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Median: The value exactly in the middle when you line them up in order. When there is an even number of values, it is the average of the two middle values. It's useful because it's less affected by outliers (extremely large or small values).
Mode: The value that appears most frequently.
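
To see the three representative values side by side, here is a small sketch using Python's standard `statistics` module. The test scores are made-up numbers chosen just for this example.

```python
import statistics

# Hypothetical test scores for five students
scores = [60, 70, 70, 80, 95]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value after sorting
mode_score = statistics.mode(scores)      # most frequent value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```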

Understanding Dispersion (Spread)

Even with the same average, the meaning changes depending on whether the data is tightly packed together or scattered widely.
Variance: The average of the squared deviations from the mean. \( s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
Standard Deviation: The square root of the variance. \( s = \sqrt{s^2} \) It's easy to interpret because it has the same units as the original data.
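
The sketch below compares two made-up classes that share the same average but have very different spreads. It uses `statistics.pvariance` and `statistics.pstdev`, which divide by the number of values and so match the definition above. (The similarly named `statistics.variance` divides by n - 1 instead, so it gives a slightly different number.)

```python
import statistics

# Two hypothetical classes with the same mean score of 70
class_a = [66, 66, 74, 74]   # tightly packed
class_b = [50, 50, 90, 90]   # widely scattered

mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)

var_a = statistics.pvariance(class_a)   # average squared deviation
var_b = statistics.pvariance(class_b)

std_a = statistics.pstdev(class_a)      # square root of the variance
std_b = statistics.pstdev(class_b)

print("class A:", mean_a, var_a, std_a)
print("class B:", mean_b, var_b, std_b)
```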

⚠️ Common Mistake:
It's tempting to think, "If I have the average, I know everything!" but that can be dangerous. For example, in income surveys, if a few ultra-rich individuals skew the average upward, it no longer represents the "average person." In cases like that, it's important to look at the median as well.
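
You can check this effect yourself with a few lines of Python. The income figures below are invented for illustration, not taken from any real survey.

```python
import statistics

# Hypothetical annual incomes for five people (made-up units)
incomes = [300, 350, 400, 450, 500]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

# Add one ultra-rich individual
incomes_with_outlier = incomes + [10000]

mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

With these numbers, the mean jumps from 400 to 2000, while the median only moves from 400 to 425.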

4. Correlation vs. Causation

This is one of the most important points in data science, and one where people often make mistakes!

Correlation: A relationship where B tends to change when A changes. (Example: As temperature rises, ice cream sales increase)
Causation: A relationship where A is the cause that results in B. (Example: It rained, therefore umbrellas were sold)

💡 Point:
Remember: "Just because there is a correlation doesn't necessarily mean there is causation."
For example, suppose there is data showing that "the taller a child is, the higher their test scores." However, that doesn't mean "stretching to get taller will improve your grades," right? In reality, a common factor like "grade level (age)" is likely involved. This is called a spurious correlation.
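
The height-and-score example can be sketched in Python. The `pearson` helper below computes the correlation coefficient, a common measure of correlation that this note does not define: it runs from -1 to 1, and values near 0 mean almost no linear relationship. The student data is entirely made up, and it is deliberately constructed so that height and score are unrelated within each grade.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: grade -> (heights in cm, test scores)
students = {
    1: ([118, 120, 122], [31, 28, 31]),
    3: ([130, 132, 134], [51, 48, 51]),
    5: ([142, 144, 146], [71, 68, 71]),
}

# Correlation when all grades are mixed together
all_heights = [h for heights, _ in students.values() for h in heights]
all_scores = [s for _, scores in students.values() for s in scores]
r_all = pearson(all_heights, all_scores)
print("all grades together:", round(r_all, 2))

# Correlation inside each grade separately
r_by_grade = {}
for grade, (heights, scores) in students.items():
    r_by_grade[grade] = pearson(heights, scores)
    print("grade", grade, ":", round(r_by_grade[grade], 2))
```

Mixed together, height and score look strongly related, but inside each grade the relationship disappears. That is the hint that grade level, not height, is doing the work.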

5. Basics of Machine Learning

In Information II, we also touch on machine learning, where a computer learns patterns from data. This note covers two main types:

① Supervised Learning

A method where we provide "correct answer" data for the computer to learn from.
Regression: Predicting numerical values (e.g., predicting tomorrow's temperature).
Classification: Categorizing data (e.g., determining whether an email is spam or not).
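
To make these two ideas concrete, here is a minimal sketch with made-up data. For regression it fits a straight line by least squares, and for classification it uses the nearest-neighbor rule (copy the label of the most similar known example). These are just two simple methods among many, and the helper names `fit_line` and `classify` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Regression: hypothetical hours studied -> test score ("correct answers")
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept   # predict the score for 6 hours
print("predicted score:", predicted)

def classify(examples, value):
    """Return the label of the labeled example closest to value."""
    nearest = min(examples, key=lambda pair: abs(pair[0] - value))
    return nearest[1]

# Classification: hypothetical (number of links in an email, label)
emails = [(0, "not spam"), (1, "not spam"), (2, "not spam"),
          (8, "spam"), (10, "spam"), (12, "spam")]
label_many_links = classify(emails, 9)
label_few_links = classify(emails, 1)
print(label_many_links, "/", label_few_links)
```

In both cases the computer is given examples with known answers and uses them to respond to a new input, which is what makes this "supervised."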

② Unsupervised Learning

A method where no correct answers are provided, and the computer finds structures within the data itself.
Clustering: Grouping similar items together (e.g., grouping customers based on their buying habits).
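
As a rough sketch of clustering, the code below splits customers into two groups based only on how many purchases they make per month. It is a simplified version of the k-means idea limited to two groups and one number per customer. The data and the `two_means` helper are invented for illustration, and the sketch assumes the list contains at least two different values.

```python
def two_means(values, rounds=10):
    """Split numbers into two groups around two moving centers."""
    centers = [min(values), max(values)]   # start from the two extremes
    groups = [[], []]
    for _ in range(rounds):
        # Step 1: put each value in the group with the nearer center
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Step 2: move each center to the average of its group
        new_centers = [sum(g) / len(g) for g in groups]
        if new_centers == centers:
            break                          # nothing changed, so stop
        centers = new_centers
    return centers, groups

# Hypothetical purchases per month for six customers (no labels given)
purchases = [1, 2, 3, 20, 22, 24]
centers, groups = two_means(purchases)
print("centers:", centers)
print("groups:", groups)
```

Notice that nobody told the computer which customers are "occasional" and which are "frequent." It found the two groups from the numbers alone.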

Summary: Key Takeaways

・Data science is about "solving problems using data."
・It's important to cycle through the PPDAC process (Problem, Plan, Data, Analysis, Conclusion).
・Don't just look at the average; be sure to check the dispersion (standard deviation) and median too.
・Be careful not to confuse correlation with causation.

You might feel overwhelmed by all the terminology at first, but simply asking yourself "Is this data science?" when you see charts in the news or notice app recommendation features will deepen your understanding significantly!
Let's keep up the great work! I'm rooting for you!