Statistics I - Week 4: Association Between Two Variables

  • Core Idea: So far, we’ve described single variables. This week, we learn how to describe the relationship or association between two variables. We want to answer questions like: “As one variable increases, what does the other variable tend to do?” and “How strong is that relationship?”

📚 Table of Contents

  1. Fundamental Concepts
  2. Question Pattern Analysis
  3. Detailed Solutions by Pattern
  4. Practice Exercises
  5. Visual Learning: Mermaid Diagrams
  6. Common Pitfalls & Traps
  7. Quick Refresher Handbook

1. Fundamental Concepts

📊 1.1 Association Between Two Categorical Variables

When both variables are categorical, we use a Contingency Table (also called a two-way table) to show the frequencies for each combination of categories.

  • Example Contingency Table:
| Economic Condition | Bright | Average | Total |
|---|---|---|---|
| Good | 59 | 85 | 144 |
| Poor | 68 | 93 | 161 |
| Total | 127 | 178 | 305 |

From this table, we can calculate various proportions:

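  • Marginal Proportion: The share of the grand total that falls in one category of a single variable, i.e. a row or column total divided by the grand total (e.g., “What proportion of all students are Bright?” $127/305 \approx 0.42$).
  • Conditional Proportion: The share within one specific row or column, i.e. a cell count divided by that row or column total (e.g., “What proportion of students in Good economic conditions are Bright?” $59/144 \approx 0.41$).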
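
As a minimal sketch, both kinds of proportion can be computed directly from the cell counts of the example table. The dictionary layout and the helper names `marginal_proportion` and `conditional_proportion` are illustrative assumptions, not part of the course material.

```python
# Cell counts from the example table (totals are derived, not stored).
counts = {
    "Good": {"Bright": 59, "Average": 85},
    "Poor": {"Bright": 68, "Average": 93},
}

def marginal_proportion(table, column):
    """Column total divided by the grand total."""
    column_total = sum(row[column] for row in table.values())
    grand_total = sum(sum(row.values()) for row in table.values())
    return column_total / grand_total

def conditional_proportion(table, row_name, column):
    """Cell count divided by the total of the row we condition on."""
    row = table[row_name]
    return row[column] / sum(row.values())

print(round(marginal_proportion(counts, "Bright"), 3))             # 127 / 305
print(round(conditional_proportion(counts, "Good", "Bright"), 3))  # 59 / 144
```

The only difference between the two functions is the denominator: the grand total for the marginal proportion, the row total for the conditional one.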

📈 1.2 Association Between Two Numerical Variables

When both variables are numerical, we can visualize and quantify their linear relationship.

  • Scatterplot: The primary visualization tool. Each point on the graph represents one observation. The overall pattern of the points suggests the type and strength of the relationship.

    • Direction: Positive (slopes up), Negative (slopes down), or No direction.
    • Form: Linear (straight line), Curved, or No form.
    • Strength: Strong (points are tightly clustered), Moderate, or Weak (points are widely scattered).
  • Covariance: A numerical measure of the joint variability of two variables. It indicates the direction of the linear relationship.

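    • Formula (Sample Covariance): $\operatorname{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$, where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of observations.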
    • Interpretation:
      • cov(x,y) > 0: Positive linear relationship.
      • cov(x,y) < 0: Negative linear relationship.
      • cov(x,y) ≈ 0: No linear relationship.
    • Drawback: The value of covariance depends on the units of the variables, making it hard to compare the strength of relationships across different datasets.
  • Correlation Coefficient (r): A standardized measure of the strength and direction of the linear relationship between two numerical variables.

    • Formula: $r = \dfrac{\operatorname{cov}(x, y)}{s_x \, s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of x and y.
    • Properties:
      • The value of r is always between -1 and +1.
      • It has no units.
    • Interpretation of r:
      • r = +1: Perfect positive linear relationship.
      • r = -1: Perfect negative linear relationship.
      • r ≈ 0: No linear relationship.
      • Strength:
        • $|r|$ close to 1: Strong
        • $|r|$ at intermediate values: Moderate
        • $|r|$ close to 0: Weak (These are general guidelines; the exact cutoffs vary between sources).
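
The sketch below implements the sample covariance and the correlation coefficient described above in plain Python, and illustrates the drawback of covariance: rescaling one variable (here, converting heights from metres to centimetres) scales the covariance by the same factor but leaves r unchanged. The data values and function names are made up for illustration.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: the square root of cov(x, x)."""
    return sqrt(sample_covariance(xs, xs))

def correlation(xs, ys):
    """r = cov(x, y) / (s_x * s_y); it carries no units."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical data: heights in metres and weights in kilograms.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 58, 69, 75, 88]
heights_cm = [100 * h for h in heights_m]  # same heights, different unit

print(sample_covariance(heights_m, weights_kg))   # units: metre-kilograms
print(sample_covariance(heights_cm, weights_kg))  # 100 times larger
print(correlation(heights_m, weights_kg))
print(correlation(heights_cm, weights_kg))        # same r, up to rounding error
```

Dividing by both standard deviations cancels the units, which is why r can be compared across datasets while covariance cannot.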

⚠️ 1.3 Correlation does NOT imply Causation

This is one of the most important rules in statistics. A strong correlation between two variables does not, on its own, show that one causes the other. A third, unobserved variable (a lurking variable) may be driving both.

  • Example: Ice cream sales and crime rates are positively correlated. This doesn’t mean eating ice cream causes crime. The lurking variable is hot weather, which causes both more people to buy ice cream and more people to be outside, leading to more opportunities for crime.
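
A small simulation can make the lurking-variable idea concrete. In the sketch below, every number is synthetic and every coefficient is an arbitrary assumption: temperature drives both the simulated ice cream sales and the simulated crime counts, and neither affects the other. The two series still come out clearly correlated, and the correlation largely disappears once the temperature effect is subtracted.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Pearson r; the (n - 1) factors cancel, so raw sums are enough."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature is the lurking variable.
temperature = [random.gauss(25, 5) for _ in range(500)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
crime = [1.5 * t + random.gauss(0, 5) for t in temperature]

r_raw = correlation(ice_cream, crime)

# Subtract the temperature effect (known here only because we built the data).
ice_cream_resid = [s - 2.0 * t for s, t in zip(ice_cream, temperature)]
crime_resid = [c - 1.5 * t for c, t in zip(crime, temperature)]
r_adjusted = correlation(ice_cream_resid, crime_resid)

print(f"r between ice cream and crime:       {r_raw:.2f}")
print(f"r after removing temperature effect: {r_adjusted:.2f}")
```

With real data the lurking variable is usually not known in advance, which is why a correlation alone cannot settle the question of causation.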

2. Question Pattern Analysis

From the Week_4_Graded_Assignment, we can identify these patterns.

| Pattern # | Pattern Name | Frequency | Difficulty | Core Skill |
|---|---|---|---|---|
| 2.1 | Calculating Covariance and Correlation | High | Medium | Calculating sample standard deviation, sample covariance, and the correlation coefficient r. |
| 2.2 | Interpreting the Correlation Coefficient r | High | Easy | Describing the strength and direction of a linear relationship based on the value of r. |
| 2.3 | Analyzing Contingency Tables | High | Easy-Medium | Calculating marginal and conditional proportions from a two-way table. |
| 2.4 | Conceptual Understanding of Correlation | Medium | Easy | Answering questions about the properties of correlation (e.g., perfect correlation). |

3. Detailed Solutions by Pattern

Pattern 2.1 & 2.2: Calculating and Interpreting Correlation

  • Core Skill: A step-by-step procedural calculation.

Example Problem:

Given the sales data for OnePlus (X) and BBK Electronics (Y):

X: {6, 2, 1, 1, 2, 1, 6}
Y: {10, 10, 11, 11, 10, 11, 16}

a) Calculate the sample covariance.
b) Calculate the sample standard deviations, $s_x$ and $s_y$.
c) Calculate and interpret the correlation coefficient, r.

TAA in Action:

  1. Triage: Keywords “covariance”, “correlation coefficient”. This is a procedural calculation problem.
  2. Abstract: I need a table to keep track of the quantities the formulas require: $x - \bar{x}$, $y - \bar{y}$, $(x - \bar{x})^2$, $(y - \bar{y})^2$, and $(x - \bar{x})(y - \bar{y})$.
  3. Act:
    • Step 1: Calculate the means.
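      • $\bar{x} = \frac{6 + 2 + 1 + 1 + 2 + 1 + 6}{7} = \frac{19}{7} \approx 2.714$.
      • $\bar{y} = \frac{10 + 10 + 11 + 11 + 10 + 11 + 16}{7} = \frac{79}{7} \approx 11.286$.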
    • Step 2: Create the calculation table.
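| $x$ | $y$ | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})^2$ | $(y - \bar{y})^2$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|---|---|---|
| 6 | 10 | 3.286 | -1.286 | 10.80 | 1.65 | -4.22 |
| 2 | 10 | -0.714 | -1.286 | 0.51 | 1.65 | 0.92 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 2 | 10 | -0.714 | -1.286 | 0.51 | 1.65 | 0.92 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 6 | 16 | 3.286 | 4.714 | 10.80 | 22.22 | 15.49 |
| Sum | | | | 31.44 | 27.41 | 14.57 |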
    • Step 3: Calculate the statistics.
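      • a) Sample Covariance ($\operatorname{cov}(x, y)$): $\frac{14.57}{7 - 1} \approx 2.43$.
      • b) Sample Variances and Standard Deviations: $s_x^2 = \frac{31.44}{6} \approx 5.24$, so $s_x \approx 2.29$. $s_y^2 = \frac{27.41}{6} \approx 4.57$, so $s_y \approx 2.14$.
      • c) Correlation Coefficient (r): $r = \frac{2.43}{2.29 \times 2.14} \approx 0.50$.
    • Step 4: Interpret r. The value $r \approx 0.50$ indicates a moderate, positive linear relationship.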

Final Answer: Covariance is ~2.43, and the correlation coefficient is ~0.50, indicating a moderate positive linear relationship.
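
The hand calculation above can be double-checked with a short script. This is only a sketch that mirrors the columns of the calculation table; the variable names are illustrative. Because it works with unrounded deviations, its column sums can differ from the table's in the last decimal place.

```python
from math import sqrt

x = [6, 2, 1, 1, 2, 1, 6]
y = [10, 10, 11, 11, 10, 11, 16]
n = len(x)

# Step 1: means.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: the three column sums of the calculation table.
sum_dx2 = sum((xi - x_bar) ** 2 for xi in x)
sum_dy2 = sum((yi - y_bar) ** 2 for yi in y)
sum_dxdy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Step 3: sample statistics (divide by n - 1, not n).
cov_xy = sum_dxdy / (n - 1)
s_x = sqrt(sum_dx2 / (n - 1))
s_y = sqrt(sum_dy2 / (n - 1))
r = cov_xy / (s_x * s_y)

print(f"cov(x, y) = {cov_xy:.2f}")
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}")
print(f"r = {r:.2f}")
```

Rounded to two decimals, the script's results agree with the hand-computed covariance, standard deviations, and correlation coefficient.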


Pattern 2.3: Analyzing Contingency Tables

  • Core Skill: Reading the correct numbers from the table and using the correct total for the denominator.

Example Problem:

Using the table:

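| Economic Condition | Bright | Average | Dull | Borderline | Total |
|---|---|---|---|---|---|
| Good | 59 | 85 | 84 | 149 | 377 |
| Poor | 68 | 93 | 83 | 104 | 348 |
| Total | 127 | 178 | 167 | 253 | 725 |
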
a) What proportion of total students are dull?
b) What proportion of students of good economic conditions are borderline?

TAA in Action:

  1. Triage: This is a contingency table problem asking for proportions.
  2. Abstract: I must carefully identify the numerator (the part) and the denominator (the whole) for each question.
  3. Act:
    • a) Proportion of total students who are dull:
      • Numerator = Total number of dull students = 167.
      • Denominator = Grand total number of students = 725.
      • Proportion = $\frac{167}{725} \approx 0.23$.
    • b) Proportion of students in Good economic conditions who are borderline: This is a conditional proportion. Our “whole” is now just the “Good” row.
      • Numerator = Number of students who are both Good and Borderline = 149.
      • Denominator = Total number of students in Good condition = 377.
      • Proportion = $\frac{149}{377} \approx 0.40$.

Final Answer: a) ~0.23, b) ~0.40.
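
The same numerator-and-denominator discipline can be sketched in code. The layout below stores only the cell counts from the table and derives every total, so each proportion states its own denominator explicitly. Names such as `row_total` and `column_total` are illustrative.

```python
# Cell counts from the table; row, column, and grand totals are derived.
counts = {
    "Good": {"Bright": 59, "Average": 85, "Dull": 84, "Borderline": 149},
    "Poor": {"Bright": 68, "Average": 93, "Dull": 83, "Borderline": 104},
}

def row_total(table, row_name):
    return sum(table[row_name].values())

def column_total(table, column):
    return sum(row[column] for row in table.values())

def grand_total(table):
    return sum(row_total(table, name) for name in table)

# a) Marginal proportion: dull students out of all students.
prop_dull = column_total(counts, "Dull") / grand_total(counts)

# b) Conditional proportion: borderline students within the Good row only.
prop_borderline_given_good = counts["Good"]["Borderline"] / row_total(counts, "Good")

print(f"a) {prop_dull:.2f}")
print(f"b) {prop_borderline_given_good:.2f}")

# Conditioning on a row: its proportions form a distribution that sums to 1.
good_profile = {k: v / row_total(counts, "Good") for k, v in counts["Good"].items()}
print(good_profile)
```

Deriving the totals rather than typing them in also acts as a consistency check against the Total row and column of the table.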


Memory Palace: Week 4 Concepts

  • Covariance: Imagine two friends, X and Y. You measure their mood (x - mean) every hour.

    • When X is happy, Y is also happy → (positive) × (positive) = positive product.
    • When X is sad, Y is also sad → (negative) × (negative) = positive product.
    • When X is happy, Y is sad → (positive) × (negative) = negative product.
    • Covariance is (roughly) the average of these products: their sum divided by n - 1. If it’s positive, they tend to be in the same mood. If it’s negative, they tend to be in opposite moods.
  • Correlation (The Translator): Covariance is hard to interpret because its units are awkward (like “mood-squared”). Correlation is a helpful translator who takes the covariance number and says, “Okay, on a simple scale of -1 to +1, here’s how strong that relationship is.” It standardizes the covariance.

  • Correlation vs. Causation (The Rooster): Every morning, the rooster crows, and then the sun rises.

    • Correlation: The two events are perfectly correlated.
    • Causation: Does the rooster cause the sun to rise? No. It’s a classic error. Never assume causation from correlation alone.
  • Marginal vs. Conditional Proportion:

    • Marginal: You are standing at the margin of the whole table, looking at a “Grand Total”.
    • Conditional: You have put on blinders and are looking only at a single row or column. Your world is now conditioned on that one category.