Statistics I - Week 4: Association Between Two Variables
- Core Idea: So far, we've described single variables. This week, we learn how to describe the relationship, or association, between two variables. We want to answer questions like: "As one variable increases, what does the other variable tend to do?" and "How strong is that relationship?"
Table of Contents
- Fundamental Concepts
- Question Pattern Analysis
- Detailed Solutions by Pattern
- Practice Exercises
- Visual Learning: Mermaid Diagrams
- Common Pitfalls & Traps
- Quick Refresher Handbook
1. Fundamental Concepts
1.1 Association Between Two Categorical Variables
When both variables are categorical, we use a Contingency Table (also called a two-way table) to show the frequencies for each combination of categories.
- Example Contingency Table:
| Economic Condition | Bright | Average | Total |
|---|---|---|---|
| Good | 59 | 85 | 144 |
| Poor | 68 | 93 | 161 |
| Total | 127 | 178 | 305 |
From this table, we can calculate various proportions:
- Marginal Proportion: A row or column total divided by the grand total (e.g., "What proportion of all students are Bright?" $127/305 \approx 0.42$).
- Conditional Proportion: A cell count divided by the total of one specific row or column (e.g., "What proportion of students in Good economic conditions are Bright?" $59/144 \approx 0.41$).
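The two kinds of proportion can be checked with a few lines of Python. This is a minimal sketch using the counts from the example table; the names `table` and `bright_given` are illustrative choices, not part of the course material.

```python
# Counts copied from the example contingency table.
table = {
    "Good": {"Bright": 59, "Average": 85},
    "Poor": {"Bright": 68, "Average": 93},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal proportion: a column total divided by the grand total.
bright_total = sum(row["Bright"] for row in table.values())
marginal_bright = bright_total / grand_total


def bright_given(condition):
    """Conditional proportion: one cell divided by its own row total."""
    row = table[condition]
    return row["Bright"] / sum(row.values())


print(f"P(Bright)        = {marginal_bright:.3f}")
print(f"P(Bright | Good) = {bright_given('Good'):.3f}")
print(f"P(Bright | Poor) = {bright_given('Poor'):.3f}")
```

Comparing the conditional proportions across rows is one common way to judge association: here they are close to each other and to the marginal proportion, which suggests that economic condition and being Bright are only weakly associated in this table.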
1.2 Association Between Two Numerical Variables
When both variables are numerical, we can visualize and quantify their linear relationship.
- Scatterplot: The primary visualization tool. Each point on the graph represents one observation. The overall pattern of the points suggests the type and strength of the relationship.
  - Direction: Positive (slopes up), Negative (slopes down), or No direction.
  - Form: Linear (straight line), Curved, or No form.
  - Strength: Strong (points are tightly clustered), Moderate, or Weak (points are widely scattered).
- Covariance: A numerical measure of the joint variability of two variables. Its sign indicates the direction of the linear relationship.
  - Formula (Sample Covariance): $\operatorname{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
  - Interpretation:
    - cov(x,y) > 0: Positive linear relationship.
    - cov(x,y) < 0: Negative linear relationship.
    - cov(x,y) ≈ 0: No linear relationship.
  - Drawback: The value of covariance depends on the units of the variables, making it hard to compare the strength of relationships across different datasets.
- Correlation Coefficient (r): A standardized measure of the strength and direction of the linear relationship between two numerical variables.
  - Formula: $r = \dfrac{\operatorname{cov}(x, y)}{s_x \, s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of x and y.
  - Properties:
    - The value of r is always between -1 and +1.
    - It has no units.
  - Interpretation of r:
    - r = +1: Perfect positive linear relationship.
    - r = -1: Perfect negative linear relationship.
    - r ≈ 0: No linear relationship.
    - Strength (judged by $|r|$):
      - $|r|$ close to 1: Strong
      - $|r|$ in the middle of the range: Moderate
      - $|r|$ close to 0: Weak (These are general guidelines and can vary).
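The covariance and correlation formulas can be sketched in plain Python. The data below is hypothetical and chosen only for illustration, and the function names (`sample_cov`, `sample_sd`, `corr`) are this sketch's own. It also shows the drawback of covariance: rescaling x by 100 (a change of units) multiplies the covariance by 100 but leaves r unchanged.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance: sum of deviation products divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def sample_sd(x):
    """Sample standard deviation: square root of cov(x, x)."""
    return math.sqrt(sample_cov(x, x))


def corr(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))


# Hypothetical illustration data (not from the assignment).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * v for v in x]  # the same x values expressed in a smaller unit

cov_xy = sample_cov(x, y)
cov_scaled = sample_cov(x_scaled, y)
r_xy = corr(x, y)
r_scaled = corr(x_scaled, y)

print(cov_xy, cov_scaled)                   # covariance changes with the units
print(round(r_xy, 4), round(r_scaled, 4))   # r does not
```

For this data the covariance goes from 1.5 to 150 while r stays at about 0.77, which is why r is the better tool for comparing strength across datasets.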
⚠️ 1.3 Correlation does NOT imply Causation
This is one of the most important rules in statistics. A strong correlation between two variables does not mean that one causes the other. A third, unobserved variable (a lurking variable) may be driving both.
- Example: Ice cream sales and crime rates are positively correlated. This doesn't mean eating ice cream causes crime. The lurking variable is hot weather, which leads more people to buy ice cream and more people to be outside, creating more opportunities for crime.
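A small simulation can make the lurking-variable idea concrete. Everything below is simulated, not real data: the coefficients, noise levels, and variable names are assumptions picked for illustration. Temperature feeds into both outcomes, and neither outcome appears in the other's formula, yet the two end up strongly correlated.

```python
import math
import random


def pearson_r(x, y):
    """Correlation coefficient; the (n - 1) factors cancel, so raw sums suffice."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated lurking variable: daily temperature over 200 days.
temperature = [rng.uniform(10, 35) for _ in range(200)]

# Each outcome depends on temperature plus its own independent noise.
ice_cream_sales = [5 * t + rng.gauss(0, 10) for t in temperature]
crime_reports = [2 * t + rng.gauss(0, 5) for t in temperature]

r = pearson_r(ice_cream_sales, crime_reports)
print(f"r(ice cream, crime) = {r:.2f}")
```

The printed r is strongly positive even though, by construction, ice cream sales have no effect on crime reports in this model.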
2. Question Pattern Analysis
From the Week_4_Graded_Assignment, we can identify these patterns.
| Pattern # | Pattern Name | Frequency | Difficulty | Core Skill |
|---|---|---|---|---|
| 2.1 | Calculating Covariance and Correlation | High | Medium | Calculating sample standard deviation, sample covariance, and the correlation coefficient r. |
| 2.2 | Interpreting the Correlation Coefficient r | High | Easy | Describing the strength and direction of a linear relationship based on the value of r. |
| 2.3 | Analyzing Contingency Tables | High | Easy-Medium | Calculating marginal and conditional proportions from a two-way table. |
| 2.4 | Conceptual Understanding of Correlation | Medium | Easy | Answering questions about the properties of correlation (e.g., perfect correlation). |
3. Detailed Solutions by Pattern
Pattern 2.1 & 2.2: Calculating and Interpreting Correlation
- Core Skill: A step-by-step procedural calculation.
Example Problem:
Given the sales data for OnePlus (X) and BBK Electronics (Y):
- X: {6, 2, 1, 1, 2, 1, 6}
- Y: {10, 10, 11, 11, 10, 11, 16}

a) Calculate the sample covariance. b) Calculate the sample standard deviations, $s_x$ and $s_y$. c) Calculate and interpret the correlation coefficient, r.
TAA in Action:
- Triage: Keywords "covariance", "correlation coefficient". This is a procedural calculation problem.
- Abstract: I need a table to keep track of the quantities the formulas require: $x_i - \bar{x}$, $y_i - \bar{y}$, $(x_i - \bar{x})^2$, $(y_i - \bar{y})^2$, and $(x_i - \bar{x})(y_i - \bar{y})$.
- Act:
  - Step 1: Calculate the means.
    - $\bar{x} = 19/7 \approx 2.714$.
    - $\bar{y} = 79/7 \approx 11.286$.
  - Step 2: Create the calculation table.
| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})^2$ | $(y_i - \bar{y})^2$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|---|---|---|
| 6 | 10 | 3.286 | -1.286 | 10.80 | 1.65 | -4.22 |
| 2 | 10 | -0.714 | -1.286 | 0.51 | 1.65 | 0.92 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 2 | 10 | -0.714 | -1.286 | 0.51 | 1.65 | 0.92 |
| 1 | 11 | -1.714 | -0.286 | 2.94 | 0.08 | 0.49 |
| 6 | 16 | 3.286 | 4.714 | 10.80 | 22.22 | 15.49 |
| Sum | | | | 31.44 | 27.41 | 14.57 |
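  - Step 3: Calculate the statistics (here $n - 1 = 6$).
    - a) Sample Covariance: $\operatorname{cov}(x, y) = 14.57 / 6 \approx 2.43$.
    - b) Sample Variances and Standard Deviations: $s_x^2 = 31.44 / 6 \approx 5.24$, so $s_x \approx 2.29$. $s_y^2 = 27.41 / 6 \approx 4.57$, so $s_y \approx 2.14$.
    - c) Correlation Coefficient (r): $r = \dfrac{2.43}{2.29 \times 2.14} \approx 0.50$.
  - Step 4: Interpret r. The value $r \approx 0.50$ indicates a moderate, positive linear relationship.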
Final Answer: Covariance is ~2.43, and the correlation coefficient is ~0.50, indicating a moderate positive linear relationship.
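The hand calculation can be cross-checked with a short script. This is a sketch that follows the same steps as the table (means, column sums, then the three statistics); the variable names are illustrative.

```python
import math

# Sales data from the example problem.
x = [6, 2, 1, 1, 2, 1, 6]
y = [10, 10, 11, 11, 10, 11, 16]
n = len(x)

# Step 1: means.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: column sums of the calculation table.
sum_dx2 = sum((a - x_bar) ** 2 for a in x)
sum_dy2 = sum((b - y_bar) ** 2 for b in y)
sum_dxdy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Step 3: sample statistics, dividing by n - 1.
cov_xy = sum_dxdy / (n - 1)
s_x = math.sqrt(sum_dx2 / (n - 1))
s_y = math.sqrt(sum_dy2 / (n - 1))
r = cov_xy / (s_x * s_y)

print(f"cov = {cov_xy:.2f}, s_x = {s_x:.2f}, s_y = {s_y:.2f}, r = {r:.2f}")
# prints: cov = 2.43, s_x = 2.29, s_y = 2.14, r = 0.50
```

The unrounded column sums are about 31.43 and 27.43, slightly different from the table's 31.44 and 27.41, because the table adds up entries that were already rounded. The final answers agree to two decimal places.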
Pattern 2.3: Analyzing Contingency Tables
- Core Skill: Reading the correct numbers from the table and using the correct total for the denominator.
Example Problem:
Using the table:
| Eco Cond | Bright | Average | Dull | Borderline | Total |
|---|---|---|---|---|---|
| Good | 59 | 85 | 84 | 149 | 377 |
| Poor | 68 | 93 | 83 | 104 | 348 |
| Total | 127 | 178 | 167 | 253 | 725 |

a) What proportion of total students are dull?
b) What proportion of students of good economic conditions are borderline?
TAA in Action:
- Triage: This is a contingency table problem asking for proportions.
- Abstract: I must carefully identify the numerator (the part) and the denominator (the whole) for each question.
- Act:
  - a) Proportion of total students who are dull: This is a marginal proportion.
    - Numerator = Total number of dull students = 167.
    - Denominator = Grand total number of students = 725.
    - Proportion = $167/725 \approx 0.23$.
  - b) Proportion of students in Good economic conditions who are borderline: This is a conditional proportion. Our "whole" is now just the "Good" row.
    - Numerator = Number of students who are both Good and Borderline = 149.
    - Denominator = Total number of students in Good condition = 377.
    - Proportion = $149/377 \approx 0.40$.
Final Answer: a) ~0.23, b) ~0.40.
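The same numerator and denominator logic can be wrapped in two small helper functions. This is a sketch: `marginal` and `conditional` are hypothetical helper names, and the counts are copied from the table in the problem.

```python
columns = ["Bright", "Average", "Dull", "Borderline"]
counts = {
    "Good": [59, 85, 84, 149],
    "Poor": [68, 93, 83, 104],
}


def marginal(column):
    """Column total divided by the grand total."""
    j = columns.index(column)
    column_total = sum(row[j] for row in counts.values())
    grand_total = sum(sum(row) for row in counts.values())
    return column_total / grand_total


def conditional(column, given_row):
    """Cell count divided by the total of the conditioning row."""
    row = counts[given_row]
    return row[columns.index(column)] / sum(row)


p_dull = marginal("Dull")
p_borderline_given_good = conditional("Borderline", "Good")

print(f"a) {p_dull:.2f}")                    # prints: a) 0.23
print(f"b) {p_borderline_given_good:.2f}")   # prints: b) 0.40
```

The only difference between the two functions is the denominator: the grand total for a marginal proportion, and a single row total for a conditional one.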
Memory Palace: Week 4 Concepts
- Covariance: Imagine two friends, X and Y. You measure each one's mood relative to their own average (`x - mean`) every hour.
  - When X is happy, Y is also happy → positive product.
  - When X is sad, Y is also sad → positive product.
  - When X is happy, Y is sad → negative product.
  - Covariance is (roughly) the average of these products. If it's positive, they tend to be in the same mood. If it's negative, they tend to be in opposite moods.
- Correlation (The Translator): Covariance is hard to understand because its units are weird (like "mood-squared"). Correlation is a helpful translator who takes the covariance number and says, "Okay, on a simple scale of -1 to +1, here's how strong that relationship is." It standardizes the covariance.
- Correlation vs. Causation (The Rooster): Every morning, the rooster crows, and then the sun rises.
  - Correlation: The two events are perfectly correlated.
  - Causation: Does the rooster cause the sun to rise? No. It's a classic error. Never assume causation from correlation alone.
- Marginal vs. Conditional Proportion:
  - Marginal: You are standing at the margin of the whole table, looking at a "Grand Total".
  - Conditional: You have put on blinders and are looking only at a single row or column. Your world is now conditioned on that one category.