Statistics I - Week 3: Describing Numerical Data

  • Core Idea: This week, we build our toolkit for describing data that consists of numbers. Unlike categorical data, we can now perform arithmetic, which allows us to calculate measures of central tendency (where is the “middle” of the data?) and dispersion (how spread out is the data?).

📚 Table of contents

  1. Fundamental Concepts
  2. Question Pattern Analysis
  3. Detailed Solutions by Pattern
  4. Practice Exercises
  5. Visual Learning: Mermaid Diagrams
  6. Common Pitfalls & Traps
  7. Quick Refresher Handbook

1. Fundamental Concepts

🎯 1.1 Measures of Central Tendency

These statistics describe the “center” or “typical” value of a dataset.

  • Mean (Average): The sum of all values divided by the number of values. It is sensitive to outliers.
  • Median: The middle value of a dataset that has been sorted in ascending order.
    • If n (the number of observations) is odd, the median is the single middle value.
    • If n is even, the median is the average of the two middle values.
    • The median is resistant to outliers.
  • Mode: The value that appears most frequently in the dataset. A dataset can have one mode, more than one mode (multimodal), or no mode.
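
To make the three definitions above concrete, here is a minimal Python sketch using the standard-library `statistics` module. The score lists are hypothetical values chosen for illustration; they are not from the course material.

```python
from statistics import mean, median, multimode

# Hypothetical sample: five exam scores (not from the course material).
scores = [70, 72, 75, 75, 78]

print(mean(scores))       # sum of the values divided by their count
print(median(scores))     # middle value after sorting
print(multimode(scores))  # list of every value tied for most frequent

# Replace the largest score with an extreme value to see which
# statistics move.
with_outlier = [70, 72, 75, 75, 780]

mean_shift = mean(with_outlier) - mean(scores)
median_shift = median(with_outlier) - median(scores)
print(mean_shift, median_shift)
```

The mean moves by more than 100 points while the median does not move at all, which is the sensitivity and resistance described above. Note that `multimode` returns every value when all values are distinct, a case these notes would describe as "no mode".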

💨 1.2 Measures of Dispersion (Spread)

These statistics describe how spread out or variable the data is.

  • Range: The simplest measure of spread: Range = Maximum - Minimum.
  • Variance: The average of the squared differences from the Mean. It measures how far each number in the set is from the average.
    • Sample Variance ($s^2$): When calculating from a sample, we divide by $n-1$ to get a better estimate of the population variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$.
    • Population Variance ($\sigma^2$): When you have data for the entire population, you divide by $n$: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$.
  • Standard Deviation: The square root of the variance. It is the most common measure of spread and is in the same units as the original data.
    • Sample Standard Deviation ($s$): $s = \sqrt{s^2}$
    • Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\sigma^2}$
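
The difference between dividing by n-1 and dividing by n is easiest to see in a small calculation. The sketch below computes each measure by hand and then compares against the standard-library `statistics` functions; the data values are hypothetical.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values for illustration
n = len(data)
x_bar = sum(data) / n

data_range = max(data) - min(data)

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

sample_var = ss / (n - 1)    # sample: divide by n - 1
population_var = ss / n      # population: divide by n
sample_sd = sqrt(sample_var)
population_sd = sqrt(population_var)

print(data_range, sample_var, population_var, sample_sd, population_sd)

# The library functions follow the same two conventions.
print(isclose(sample_var, variance(data)), isclose(population_var, pvariance(data)))
print(isclose(sample_sd, stdev(data)), isclose(population_sd, pstdev(data)))
```

For the same data, the sample variance is always a little larger than the population variance because its divisor is smaller.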

📈 1.3 Measures of Position: Percentiles and Quartiles

These statistics describe the position of a value relative to the rest of the data.

  • Percentiles: The $p$th percentile is a value such that $p$ percent of the observations fall at or below that value.
  • Quartiles: Specific percentiles that divide the data into four equal parts.
    • First Quartile (Q1): The 25th percentile. The median of the lower half of the data.
    • Second Quartile (Q2): The 50th percentile. This is the Median of the entire dataset.
    • Third Quartile (Q3): The 75th percentile. The median of the upper half of the data.
  • Interquartile Range (IQR): The range of the middle 50% of the data, $\text{IQR} = Q_3 - Q_1$. It is resistant to outliers.
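
The "median of each half" definition of the quartiles can be sketched as a small function. The name `quartiles_by_halves` is hypothetical, and the handling of an odd number of observations (leaving the overall median out of both halves) is an assumption, since the notes do not say which convention the course uses.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: when n is odd, the overall median is left out of
    both halves. Other conventions exist and can give other values.
    Needs at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle element when n is odd.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


# Hypothetical data with an odd number of observations.
q1, q2, q3 = quartiles_by_halves([7, 1, 3, 9, 5, 11, 13])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```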

📦 1.4 The Five-Number Summary and Outliers

  • Five-Number Summary: A concise summary of the distribution of numerical data. It consists of:
    1. Minimum
    2. First Quartile (Q1)
    3. Median (Q2)
    4. Third Quartile (Q3)
    5. Maximum
  • Outliers: Observations that fall well above or below the overall pattern of the data. A common rule of thumb is to identify outliers using the IQR:
    • Lower Fence: $Q_1 - 1.5 \times \text{IQR}$
    • Upper Fence: $Q_3 + 1.5 \times \text{IQR}$
    • Any data point that falls outside these fences is considered an outlier.
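
As a rough illustration of the fence rule, the sketch below takes Q1 and Q3 as inputs and flags anything outside the fences. The function names and the data are hypothetical; the quartiles passed in were worked out by hand with the median-of-halves definition.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Return the values lying strictly outside the 1.5 * IQR fences."""
    lower, upper = iqr_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Hypothetical sorted data. Median of halves:
# lower half {10, 12, 13, 15} -> Q1 = 12.5
# upper half {16, 18, 20, 60} -> Q3 = 19
data = [10, 12, 13, 15, 16, 18, 20, 60]

fences = iqr_fences(12.5, 19)
outliers = find_outliers(data, q1=12.5, q3=19)
print(fences, outliers)
```

Here the IQR is 6.5, so the fences sit at 2.75 and 28.75, and only the value 60 is flagged.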

2. Question Pattern Analysis

From the Week_3_Graded_Assignment, the following problem patterns are key.

| Pattern # | Pattern Name | Frequency | Difficulty | Core Skill |
|---|---|---|---|---|
| 1.1 | Calculating Mean with Frequencies | Medium | Easy | Calculating a weighted average where frequencies are given as algebraic expressions. |
| 1.2 | Correcting Mean and Variance | High | Medium | Recalculating the mean and variance after discovering a data entry error. |
| 1.3 | Effects of Transformation on Statistics | Medium | Easy | Understanding how mean and variance change when a constant is added to all data points. |
| 1.4 | Calculating Percentiles and Quartiles | High | Medium | Finding Q1, Q3, median, and IQR from a small, unsorted dataset. |
| 1.5 | Identifying Outliers | Medium | Medium | Using the 1.5 × IQR rule to determine if any data points are outliers. |
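
Pattern 1.3 listed above has no worked solution in this section, so here is a small numerical check of the underlying fact: adding a constant to every observation shifts the mean by that constant and leaves the variance unchanged. The data and the constant `c` are hypothetical.

```python
from statistics import mean, variance

data = [4, 8, 6, 10, 12]  # hypothetical values for illustration
c = 5                     # constant added to every observation

shifted = [x + c for x in data]

# The mean moves by exactly c.
print(mean(data), mean(shifted))

# The deviations from the mean are unchanged, so the variance is too.
print(variance(data), variance(shifted))
```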

3. Detailed Solutions by Pattern

Pattern 1.2: Correcting Mean and Variance

  • Core Skill: Understanding that you can work backward from an incorrect statistic to the incorrect sum, correct the sum, and then recalculate the correct statistic.

Example Problem:

The mean and sample standard deviation of 6 observations are 19 and 9, respectively. Later it is found that one observation, 11, was wrongly recorded as 7. a) What is the correct mean? b) What is the correct sample variance?

TAA in Action:

  1. Triage: Keywords “mean and sample standard deviation”, “wrongly recorded”. This is a correction problem.
  2. Abstract:
    • For the mean: I’ll use the incorrect mean to find the incorrect sum. Then I’ll correct the sum and find the new mean.
    • For the variance: I’ll use the incorrect variance and mean to find the incorrect sum of squares ($\sum x_i^2$). Then I’ll correct the sum of squares and use the new mean to find the new variance.
  3. Act:
    • a) Correcting the Mean:
      1. Find Incorrect Sum: $\sum x_i = n\bar{x} = 6 \times 19 = 114$.
      2. Correct the Sum: $114 - 7 + 11 = 118$.
      3. Calculate Correct Mean: $118 / 6 \approx 19.67$.
    • b) Correcting the Sample Variance:
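      1. Find Incorrect Sum of Squares ($\sum x_i^2$): Start with the sample variance formula: $s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1}$. Given $s = 9$, so $s^2 = 81$. $n = 6$, $\bar{x} = 19$. $81 = \frac{\sum x_i^2 - 6(19)^2}{5}$. $\sum x_i^2 = 405 + 2166 = 2571$.
      2. Correct the Sum of Squares: $2571 - 7^2 + 11^2 = 2571 - 49 + 121 = 2643$.
      3. Calculate Correct Sample Variance: $\frac{2643 - 118^2/6}{5} = \frac{2643 - 2320.67}{5} \approx 64.47$.

Final Answer: a) 19.67, b) 64.47 (Note: rounding the mean to 19.67 before squaring gives about 64.31, so carry the unrounded mean through the calculation).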
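
The correction procedure can be sketched as a short function. The name `correct_mean_and_variance` and its parameters are hypothetical; the logic follows the steps above: recover the sum and the sum of squares, swap the wrong value for the right one, then recompute.

```python
def correct_mean_and_variance(n, wrong_mean, wrong_sd, wrong_value, right_value):
    """Return (corrected mean, corrected sample variance).

    wrong_sd is the sample standard deviation (divisor n - 1)
    computed from the data that contained the recording error.
    """
    # Work backward from the reported statistics.
    wrong_sum = n * wrong_mean
    wrong_sum_sq = (n - 1) * wrong_sd ** 2 + n * wrong_mean ** 2

    # Swap the wrongly recorded value for the correct one.
    right_sum = wrong_sum - wrong_value + right_value
    right_sum_sq = wrong_sum_sq - wrong_value ** 2 + right_value ** 2

    # Recompute without rounding the mean along the way.
    right_mean = right_sum / n
    right_var = (right_sum_sq - right_sum ** 2 / n) / (n - 1)
    return right_mean, right_var


mean_fixed, var_fixed = correct_mean_and_variance(
    n=6, wrong_mean=19, wrong_sd=9, wrong_value=7, right_value=11
)
print(round(mean_fixed, 2), round(var_fixed, 2))
```

Carrying the unrounded mean through gives a variance of about 64.47; rounding the mean to 19.67 before squaring lowers the result slightly, which is the kind of rounding difference the note on the final answer refers to.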


Pattern 1.4 & 1.5: Quartiles, IQR, and Outliers

  • Core Skill: Correctly sorting the data and applying the definitions for Q1, Q3, and the outlier fences.

Example Problem:

For the 10 observations: 39, 46, 44, 30, 73, 96, 91, 115, 112, 89. a) Calculate the Inter Quartile Range (IQR). b) How many outliers are there?

TAA in Action:

  1. Triage: Keywords “IQR”, “outliers”. This requires finding Q1 and Q3.
  2. Abstract: I must first sort the data. Then find the median (Q2), the median of the lower half (Q1), and the median of the upper half (Q3). Then calculate IQR and the outlier fences.
  3. Act:
    • Step 1: Sort the data. 30, 39, 44, 46, 73, 89, 91, 96, 112, 115.
    • Step 2: Find the Quartiles.
      • The dataset has n = 10 observations (even).
      • The Median (Q2) is the average of the 5th and 6th values: (73 + 89) / 2 = 81.
      • The lower half of the data is {30, 39, 44, 46, 73}. The median of this half is the middle value, which is 44. So, Q1 = 44.
      • The upper half of the data is {89, 91, 96, 112, 115}. The median of this half is the middle value, which is 96. So, Q3 = 96.
    • Step 3: Calculate the IQR. IQR = Q3 - Q1 = 96 - 44 = 52.
    • Step 4: Calculate the outlier fences.
      • Lower Fence = Q1 - 1.5 * IQR = 44 - 1.5 * 52 = 44 - 78 = -34.
      • Upper Fence = Q3 + 1.5 * IQR = 96 + 1.5 * 52 = 96 + 78 = 174.
    • Step 5: Identify outliers. The valid range is [-34, 174]. All data points (from 30 to 115) are within this range. Therefore, there are no outliers.

Final Answer: a) The IQR is 52. b) There are 0 outliers.
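
The five steps above can be cross-checked with a few lines of Python. This is a sketch of the median-of-halves approach used in the solution, not a general-purpose quartile routine; the variable names are illustrative.

```python
from statistics import median

observations = [39, 46, 44, 30, 73, 96, 91, 115, 112, 89]

# Step 1: sort.
ordered = sorted(observations)

# Step 2: n = 10 is even, so the halves are the first and last 5 values.
half = len(ordered) // 2
q1 = median(ordered[:half])
q3 = median(ordered[half:])

# Step 3: interquartile range.
iqr = q3 - q1

# Step 4: outlier fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 5: anything outside the fences is an outlier.
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Library routines such as `statistics.quantiles` interpolate between neighbouring values, so their Q1 and Q3 may not match the median-of-halves values used here.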


Memory Palace: Week 3 Concepts

  • Mean vs. Median:

    • Imagine a seesaw. The Mean is the balancing point. If a heavy person (an outlier) sits on one end, you have to move the balancing point to keep it level. The mean is sensitive.
    • The Median is the person sitting exactly in the middle of the line of people. If someone at the end is replaced by a giant, the person in the middle doesn’t move. The median is resistant.
  • Variance and Standard Deviation:

    • Think of Variance as a measure of “total anger” in a group. You measure how angry each person is compared to the average mood (x_i - mean), square it to make it positive and amplify big deviations ((x_i - mean)²), and then find the average anger.
    • The units are weird (“anger squared”). So, you take the square root to get back to the original units. This is the Standard Deviation—a more intuitive measure of the typical deviation from the mean.
  • Quartiles and IQR:

    • Think of a road trip.
    • Q1 is the 25% mark of your journey.
    • Q2 (Median) is the halfway point.
    • Q3 is the 75% mark.
    • The IQR is the distance you travel in the “middle half” of your trip (from the 25% mark to the 75% mark). It ignores the start and end of the journey, which is why it’s not affected by outliers.