【Math I】Data Analysis — Strategy Guide

Hello everyone! Today, we're going to explore "Data Analysis" together.
You might be thinking, "This is math class, so why are we looking at graphs and charts?" In fact, this is the math you meet most often in daily life.
Whether it's weather forecasts, sports stats, or your test average, I'm going to teach you the tricks to interpret the sea of numbers (data) around us!
It might feel like a lot of terminology at first, but don't worry—if we break it down one piece at a time, you'll be fine.

1. Organizing Data: Frequency Distribution Tables and Histograms

Let's start with the basics of organizing messy, scattered numbers.

・Frequency Distribution Table
A table that divides the data into several intervals (classes) and shows how many data points fall into each interval (the frequency).
・Class Value
This is the middle value of a class. For example, for the class "10 or more but less than 20," the class value is \((10+20) \div 2 = 15\).

・Histogram
This is a frequency distribution table drawn as a graph: each class gets a bar whose height is its frequency, and neighboring bars touch with no gaps. It lets you see the "shape" of your data at a glance.

Tip:
In a histogram, the horizontal axis represents the "classes" and the vertical axis represents the "frequency." Keep an eye on where the peak of the graph is!
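
If you'd like to see the table-building idea in action, here is a minimal Python sketch. The function name frequency_table and the quiz scores are made up for this illustration; each class is treated as "lower bound or more but less than upper bound," as in the class value example above.

```python
def frequency_table(data, start, width, num_classes):
    """Count how many values fall into each class [lower, upper)."""
    rows = []
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width
        # "lower or more but less than upper"
        frequency = sum(1 for x in data if lower <= x < upper)
        class_value = (lower + upper) / 2  # middle of the class
        rows.append((lower, upper, class_value, frequency))
    return rows


# Hypothetical quiz scores, invented for this example
scores = [12, 25, 31, 18, 27, 35, 22, 14, 29, 38]

for lower, upper, class_value, frequency in frequency_table(scores, 10, 10, 3):
    # Each '*' stands for one data point, giving a sideways histogram
    print(f"{lower}-{upper} (class value {class_value}): {'*' * frequency}")
```

The row with the most stars is where the peak of the histogram would be.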


2. Finding the Center: Representative Values

When there is a lot of data, we use a single representative value to describe the "center" of the dataset. There are three main ones:

① Mean (Average):
The sum of all data divided by the total number of data points. We represent this with the symbol \(\bar{x}\) (x-bar).
Example: The mean of 1, 2, and 9 is \((1+2+9) \div 3 = 4\)

② Median:
When you sort your data in order of size, this is the value exactly in the middle.
・If the number of data points is odd: The middle value.
・If the number of data points is even: The mean of the two middle values.

③ Mode:
The value that appears most frequently in the dataset.
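
To check your hand calculations, the three representative values can be sketched with Python's standard statistics module. The dataset below is invented for illustration.

```python
import statistics

# Hypothetical dataset, invented for this example
data = [3, 1, 4, 1, 5, 9, 2, 6]

mean = statistics.mean(data)      # sum of all values / number of values
median = statistics.median(data)  # sorts internally, then takes the middle
mode = statistics.mode(data)      # the most frequent value

print("sorted:", sorted(data))
print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Because there are eight data points (an even number), the median here is the mean of the two middle values of the sorted list, 3 and 4.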

Common Mistake:
Many people forget to sort the data when looking for the median! Always remember to arrange them in order (smallest to largest or vice versa) before finding the middle.

Fun Fact:
The "mean" has a weakness: it is easily skewed by extremely large or small numbers. For example, if there is one billionaire in a classroom, the mean income will skyrocket, even if it doesn't reflect the reality of most students. In cases like that, the "median" is often a better representation of the truth.
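
Here is a small sketch of that effect. The income figures are made up, and the units are arbitrary; the point is only to compare how far each representative value moves when one extreme value joins the group.

```python
import statistics

# Hypothetical incomes (arbitrary units), invented for this example
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [10_000]  # one extremely large value joins the group

print("mean before:  ", statistics.mean(incomes))
print("median before:", statistics.median(incomes))
print("mean after:   ", statistics.mean(with_outlier))
print("median after: ", statistics.median(with_outlier))
```

The mean jumps far above what anyone in the original group earns, while the median barely moves.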


3. Spread of Data: Quartiles and Box Plots

Next, let's look at how "spread out" the data is.

・Quartiles
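If you sort the data from smallest to largest and divide it into four equal parts, these are the three dividing values:
1. First Quartile (\(Q_1\)): The median of the lower half of the sorted data.
2. Second Quartile (\(Q_2\)): The median of the entire set.
3. Third Quartile (\(Q_3\)): The median of the upper half of the sorted data.
When the number of data points is odd, leave the middle value (\(Q_2\)) out of both halves.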

・Interquartile Range and Quartile Deviation
Interquartile Range \(= Q_3 - Q_1\) (The span of the middle 50% of the data)
Quartile Deviation \(= \frac{Q_3 - Q_1}{2}\)

・Box Plot
A graph representing five key values: the minimum, \(Q_1\), \(Q_2\), \(Q_3\), and the maximum.
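
The five values of a box plot can be sketched in Python as follows. The function name five_number_summary and the data are hypothetical. This sketch assumes the "median of each half" method, leaving the middle value out of both halves when the number of data points is odd; statistical software often uses other quartile rules, so its answers may differ slightly.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves method.

    Assumes at least two data points.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]      # lower half (middle value excluded if n is odd)
    upper = values[n - half:]  # upper half (middle value excluded if n is odd)
    return (
        values[0],
        statistics.median(lower),
        statistics.median(values),
        statistics.median(upper),
        values[-1],
    )


# Hypothetical dataset, invented for this example
minimum, q1, q2, q3, maximum = five_number_summary([7, 2, 9, 4, 5, 1, 8, 3, 6])
iqr = q3 - q1                       # interquartile range
quartile_deviation = iqr / 2

print(minimum, q1, q2, q3, maximum)
print("IQR:", iqr, "quartile deviation:", quartile_deviation)
```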

Memory Trick:
Think of quartiles as "boundary lines" that split your class into four teams. The longer the "box" in a box plot, the larger the spread of the middle 50% of your data.


4. Variance and Standard Deviation

This is the big highlight of "Data Analysis"! The calculations are a bit complex, but once you understand what they represent, they aren't scary at all.

・Deviation
The value obtained by subtracting the mean from each individual data point: \((x - \bar{x})\).

・Variance (\(s^2\))
This is the "mean of the squared deviations."
Why square them? If you just added the raw deviations, the positive and negative values would cancel each other out to zero. By squaring them, we make everything positive to measure the actual spread.

・Standard Deviation (\(s\))
This is the square root of the variance: \(s = \sqrt{\text{Variance}}\).
Squaring also squares the units (for example, "points" become "points squared"). Taking the square root brings the result back to the same units as the original data.

Summary:
If the variance or standard deviation is large, the data is spread far from the mean (it's scattered).
If the variance or standard deviation is small, the data is clustered around the mean (it's consistent).
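
To see this summary with actual numbers, here is a short sketch comparing two invented datasets that share the same mean. It uses pvariance and pstdev from Python's statistics module, which divide by the number of data points and so match the "mean of the squared deviations" definition above (the module's variance and stdev functions divide by one less than that, which is a different definition).

```python
import statistics

# Two hypothetical datasets with the same mean (50), invented for this example
consistent = [48, 49, 50, 51, 52]
scattered = [10, 30, 50, 70, 90]

for name, data in [("consistent", consistent), ("scattered", scattered)]:
    variance = statistics.pvariance(data)  # mean of squared deviations
    std_dev = statistics.pstdev(data)      # square root of the variance
    print(f"{name}: mean={statistics.mean(data)}, "
          f"variance={variance}, standard deviation={std_dev:.2f}")
```

Both datasets have the same center, but the second one has a far larger variance and standard deviation.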


5. Relationships Between Two Variables: Scatter Plots and Correlation Coefficients

Finally, let's look at the relationship between two sets of data (e.g., math scores vs. English scores).

・Scatter Plot
A graph in which each pair of values is plotted as a single point \((x, y)\), with one variable on the \(x\)-axis and the other on the \(y\)-axis.

・Correlation
1. Positive Correlation: As \(x\) increases, \(y\) also increases (an upward trend).
2. Negative Correlation: As \(x\) increases, \(y\) decreases (a downward trend).
3. No Correlation: The points are scattered with no discernible trend.

・Correlation Coefficient (\(r\))
A number between \(-1\) and \(1\) that expresses both the direction and the strength of the correlation.
・Closer to \(1\): Strong positive correlation.
・Closer to \(-1\): Strong negative correlation.
・Closer to \(0\): Weak or no correlation.
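
This guide does not spell out the formula for \(r\), so the sketch below assumes the standard definition: the covariance of \(x\) and \(y\) (the mean of the products of their deviations) divided by the product of their standard deviations. The function name correlation_coefficient and the data are hypothetical.

```python
import math


def correlation_coefficient(xs, ys):
    """Covariance divided by the product of the standard deviations.

    Assumes both lists have the same length and neither is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean of the products of the paired deviations
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


# Hypothetical study hours and test scores, invented for this example
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation_coefficient(hours, scores)
print(f"r = {r:.3f}")
```

For this made-up data the points rise almost in a straight line, so \(r\) comes out close to \(1\).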

Analogy:
"Study time" and "test scores" usually show a positive correlation. On the other hand, "gaming time" and "test scores" might show a negative correlation (haha!).

Important Takeaway:
The correlation coefficient \(r\) must always fall within the range \(-1 \leqq r \leqq 1\). If you calculate a value like \(1.5\), you know there's a calculation error somewhere!


Final Words: Tips for Solving Data Analysis Problems

The shortcut to acing this chapter isn't "memorizing formulas," but "carefully filling out your tables."
1. Find the mean.
2. Find the deviation for each data point (data - mean).
3. Square the deviations.
4. Find the mean of those squares (= Variance).
5. Take the square root (= Standard Deviation).
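
The five steps above can be sketched in Python like this, printing the same kind of table you would write on paper. The dataset is invented for illustration.

```python
import math

# Hypothetical dataset, invented for this example
data = [2, 4, 4, 6, 9]

mean = sum(data) / len(data)            # Step 1: mean
deviations = [x - mean for x in data]   # Step 2: data - mean
squares = [d ** 2 for d in deviations]  # Step 3: squared deviations
variance = sum(squares) / len(squares)  # Step 4: mean of the squares
std_dev = math.sqrt(variance)           # Step 5: square root

# Print the working table, one row per data point
print(f"{'x':>4} {'x - mean':>10} {'squared':>10}")
for x, d, sq in zip(data, deviations, squares):
    print(f"{x:>4} {d:>10.1f} {sq:>10.1f}")

print(f"mean = {mean}, variance = {variance:.2f}, "
      f"standard deviation = {std_dev:.2f}")
```

A handy check while filling in the table: the deviation column should always add up to zero.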

If you write this process out as a table on your paper, you will significantly reduce calculation errors. It might take more time at first, but once you get used to it, this will become your go-to section for guaranteed points. Keep going!