[Math I] Data Analysis: From Basics to Common Test Preparation
Hello everyone! In this chapter, we will learn about "Data Analysis." The word "math" may make you think of difficult calculations, but this chapter is actually a very practical one: it is about how to interpret collected data.
These concepts appear everywhere in daily life, from news reports and sports statistics to trend analysis on social media. The topic also accounts for a significant portion of the score on the Common Test, so once you grasp the key points, it can become a reliable source of points. Don't write it off just because you "aren't good at calculations." Let's master it together and have fun!
1. Measures of Central Tendency
When a dataset is large, we use a single value to represent the characteristics of the whole dataset. Such a value is called a measure of central tendency. You should mainly remember these three:
① Mean
This is calculated by adding all the values of the data together and dividing by the total number of data points.
\( \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \)
Example: If your test scores are 50, 60, and 70, the mean is (50+60+70) ÷ 3 = 60 points.
② Median
When you arrange the data in order of size, the middle value is the median.
・When the number of data points is odd: The exact middle value.
・When the number of data points is even: The mean of the two middle values.
Tip: The mean is easily "pulled" by extremely large (or small) values, but the median is less affected by such outliers.
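To see the Tip above in action, here is a minimal Python sketch. The helper names `mean` and `median` and the extreme score of 300 are made up for illustration; they do not come from any textbook problem.

```python
def mean(values):
    """Add all the values and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: the exact middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the two middle values


scores = [50, 60, 70]            # the test scores from the mean example
with_outlier = scores + [300]    # hypothetical extreme value, added for illustration

print(mean(scores), median(scores))               # 60.0 60
print(mean(with_outlier), median(with_outlier))   # 120.0 65.0
```

One extreme score doubles the mean (60 → 120), while the median only moves from 60 to 65.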
③ Mode
The value that appears most frequently in the dataset.
Example: If you have shoe sizes (23cm, 24cm, 24cm, 25cm), 24cm is the mode.
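The shoe-size example can be checked with a short Python sketch. The function name `modes` is a hypothetical helper; it returns a list because a dataset can have more than one most frequent value.

```python
from collections import Counter


def modes(values):
    """Return every value that appears the maximum number of times."""
    counts = Counter(values)             # value -> how many times it appears
    highest = max(counts.values())       # the largest frequency
    return [value for value, count in counts.items() if count == highest]


shoe_sizes = [23, 24, 24, 25]   # sizes in cm, from the example above
print(modes(shoe_sizes))        # [24]
```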
【Summary: When to use which measure】
・When you want to find an overall average → Mean
・When you want a "typical" value that is not distorted by extreme values (outliers) → Median
・When you want to know the most popular item (trends, etc.) → Mode
2. Dispersion of Data and Box Plots
This section covers ways to examine how spread out the data is.
① Range and Quartiles
Range: Maximum value - Minimum value
Quartiles: These are the values that divide the data, when arranged in order, into four equal parts.
・First Quartile (\( Q_1 \)): The median of the lower half (the 25% position)
・Second Quartile (\( Q_2 \)): The median of the entire set (the 50% position)
・Third Quartile (\( Q_3 \)): The median of the upper half (the 75% position)
Tip: The Second Quartile is the same thing as the "median"!
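Here is a rough Python sketch of the "median of each half" idea described above. It assumes one common classroom convention: when the number of data points is odd, the middle value is left out of both halves. The function name `quartiles` and the data are hypothetical, and statistical software often uses slightly different rules, so results may not match every tool.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) by the median-of-halves rule; needs at least 2 values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]   # lower half of the sorted data
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [7, 1, 15, 4, 10, 3, 12, 6, 8]   # hypothetical data, 9 values
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)   # 3.5 7 11.0
```

Sorted, the data is 1, 3, 4, 6, 7, 8, 10, 12, 15. The median is 7, the lower half is 1, 3, 4, 6, and the upper half is 8, 10, 12, 15.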
② Interquartile Range and Quartile Deviation
・Interquartile Range (IQR): \( Q_3 - Q_1 \)
・Quartile Deviation: \( \frac{Q_3 - Q_1}{2} \)
These represent how spread out the middle 50% of the data is. They are useful indicators because, unlike the range, they are hardly affected by extreme values.
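The following Python sketch compares the range with the interquartile range when one value becomes extreme. It assumes the median-of-halves rule for quartiles (for an odd count, the middle value is left out of both halves). The names `q1_q3` and `spread` and both datasets are hypothetical.

```python
from statistics import median


def q1_q3(values):
    """Q1 and Q3 as medians of the lower and upper halves; needs at least 2 values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def spread(values):
    """Collect the three measures of spread discussed in this section."""
    q1, q3 = q1_q3(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "quartile_deviation": (q3 - q1) / 2,
    }


data = [1, 3, 4, 6, 7, 8, 10, 12, 15]       # hypothetical data
extreme = [1, 3, 4, 6, 7, 8, 10, 12, 150]   # same data with the maximum made extreme

print(spread(data))      # range 14, IQR 7.5, quartile deviation 3.75
print(spread(extreme))   # range 149, IQR and quartile deviation unchanged
```

The range jumps from 14 to 149, but the IQR stays at 7.5 because the middle 50% of the data has not moved.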
③ Box Plot
This is a visual representation of how data is spread out. It is a very common topic on the Common Test!
・The ends of the box are \( Q_1 \) and \( Q_3 \)
・The line inside the box is \( Q_2 \) (the median)
・The ends of the lines extending to the left and right (whiskers) are the Minimum and Maximum values.
【Common Mistake】
It is easy to mistakenly think that "the longer the box, the more data points it contains," but that is wrong! The length of the box and the whiskers represent the "dispersion (range) of the data," while the number of data points in each section is actually the same (about 25% of the total in each part).
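You can check this common mistake with a small Python sketch. It computes the five values drawn in a box plot, then compares the length of each of the four sections with the number of data points inside it. The data is hypothetical and was picked so that no value sits exactly on a quartile.

```python
from statistics import median

data = sorted([1, 2, 3, 4, 10, 20, 30, 40])   # hypothetical data, 8 values
half = len(data) // 2
q1, q2, q3 = median(data[:half]), median(data), median(data[half:])

# Five-number summary: minimum, Q1, Q2, Q3, maximum.
edges = [data[0], q1, q2, q3, data[-1]]
sections = list(zip(edges, edges[1:]))   # left whisker, two box parts, right whisker

lengths = [right - left for left, right in sections]
# No value lies exactly on a quartile here, so nothing is counted twice.
counts = [sum(left <= x <= right for x in data) for left, right in sections]

print(lengths)   # [1.5, 4.5, 18.0, 15.0]
print(counts)    # [2, 2, 2, 2]
```

The sections differ a lot in length, yet each one holds 2 of the 8 values (25%). A long section means the values in it are spread out, not that there are more of them.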
3. Variance and Standard Deviation (A bit more advanced)
This is a method for calculating "how far away the data points are from the mean." Equations may pop up, but they aren't scary once you understand what they mean!
① Deviation
This is (Each data value) - (Mean). The larger its absolute value, the further the data point is from the mean.
② Variance (\( s^2 \))
The mean of the squared deviations.
\( s^2 = \frac{1}{n} \{ (x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2 \} \)
Why square it?: If you just add the deviations, the positive and negative ones will cancel each other out and result in 0.
③ Standard Deviation (\( s \))
The square root of the variance. \( s = \sqrt{\text{Variance}} \)
Because variance is computed from squared deviations, its units are the square of the original units (points squared, cm squared, etc.). By taking the square root, we bring the units back to the original form (points, cm, etc.).
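The three steps (deviation → variance → standard deviation) can be traced in a few lines of Python. This is only a sketch on hypothetical data. It divides by \( n \), as in the formula above; be aware that some tools divide by \( n - 1 \) by default (for example, Python's `statistics.variance`), so their results can differ.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

n = len(data)
mean = sum(data) / n                             # 5.0
deviations = [x - mean for x in data]            # each value minus the mean
variance = sum(d ** 2 for d in deviations) / n   # mean of the squared deviations
std_dev = sqrt(variance)                         # back to the original units

print(sum(deviations))   # 0.0  (plain deviations cancel out)
print(variance)          # 4.0
print(std_dev)           # 2.0
```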
Trivia: There is another formula for variance.
Variance = (Mean of squares) - (Square of the mean), or in symbols \( s^2 = \overline{x^2} - (\bar{x})^2 \)
Try remembering it by the rhythm: "Mean of squares, minus, square of mean!" This often makes your calculations much easier.
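If you want to convince yourself that both formulas agree, here is a quick Python check on hypothetical data. The variable names are chosen for illustration.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
n = len(data)
mean = sum(data) / n

# Definition: the mean of the squared deviations.
by_definition = sum((x - mean) ** 2 for x in data) / n

# Shortcut: (mean of squares) - (square of the mean).
mean_of_squares = sum(x ** 2 for x in data) / n   # 232 / 8 = 29.0
square_of_mean = mean ** 2                        # 5.0 ** 2 = 25.0
by_shortcut = mean_of_squares - square_of_mean

print(by_definition, by_shortcut)   # 4.0 4.0
```

The shortcut avoids computing each deviation, which is why it often saves time when the mean is not a whole number.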
4. Correlation (The relationship between two sets of data)
This examines whether there is a connection between two variables, such as height and weight.
① Scatter Plots and Correlation
A graph that plots each pair of values from two variables as a single point is called a scatter plot.
・Positive Correlation: The points trend upward (as one variable increases, the other also increases).
・Negative Correlation: The points trend downward (as one variable increases, the other decreases).
・No Correlation: The points are scattered randomly.
② Covariance
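The mean of the products of the deviations of two variables:
\( s_{xy} = \frac{1}{n} \{ (x_1-\bar{x})(y_1-\bar{y}) + (x_2-\bar{x})(y_2-\bar{y}) + \dots + (x_n-\bar{x})(y_n-\bar{y}) \} \)
If positive, it suggests a positive correlation; if negative, it suggests a negative correlation.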
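As a small sketch, covariance can be computed in Python like this. The function name `covariance` and the paired data are hypothetical.

```python
def covariance(xs, ys):
    """Mean of the products of the deviations of two paired variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / n


xs = [1, 2, 3, 4, 5]   # hypothetical data for the first variable
ys = [2, 4, 5, 4, 5]   # hypothetical data for the second variable

print(covariance(xs, ys))                  # positive: y tends to rise with x
print(covariance(xs, list(reversed(ys))))  # negative: y tends to fall as x rises
```

Here the deviation products are 4, 0, 0, 0, 2, so the covariance is 6 ÷ 5 = 1.2. Reversing the order of the second variable flips the sign to -1.2.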
③ Correlation Coefficient (\( r \))
This quantifies the strength of a correlation. It is super important for the Common Test!
\( r = \frac{\text{Covariance}}{\text{Standard deviation of } x \times \text{Standard deviation of } y} \)
Characteristics:
・The value is always between -1 and 1, inclusive (\( -1 \leq r \leq 1 \)).
・Closer to 1: Strong positive correlation (the points lie close to a straight line trending upward).
・Closer to -1: Strong negative correlation (the points lie close to a straight line trending downward).
・Closer to 0: Little or no correlation.
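The formula for \( r \) can be tried out with the Python sketch below. The function name `correlation` and all three datasets are hypothetical. Variance and covariance are divided by \( n \), matching the formulas in this chapter.

```python
from math import sqrt


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)   # not defined if either standard deviation is 0


xs = [1, 2, 3, 4, 5]
scattered = [2, 4, 5, 4, 5]    # rises overall, but not perfectly
on_a_line = [3, 5, 7, 9, 11]   # exactly y = 2x + 1
falling = [11, 9, 7, 5, 3]     # exactly y = -2x + 13

print(round(correlation(xs, scattered), 2))   # 0.77
print(round(correlation(xs, on_a_line), 2))   # 1.0
print(round(correlation(xs, falling), 2))     # -1.0
```

Points that lie exactly on a rising or falling straight line give 1 or -1, while the looser upward trend gives a value in between.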
Warning point!:
Just because there is a "correlation" does not mean there is a "causal relationship" (where one causes the other). For example, "ice cream sales" and "number of drowning accidents" have a positive correlation, but ice cream does not cause accidents (the heat is the cause of both).
Conclusion: Study Advice
Data analysis is a field that tests your "ability to read charts and tables correctly" rather than your raw calculation speed.
Calculating "variance" or "standard deviation" may feel a bit tedious at first. What matters is not just memorizing the formulas but keeping an image in your mind: "Ah, this value represents how spread out the data is."
It might feel difficult in the beginning, but you'll be fine!
As you look at more box plots and scatter plots, you will naturally develop a feel for it. Start with the examples in your textbook and take it one step at a time! I'm rooting for you!