Stats 1 Week 1: Data Collection & Classification
0. Prerequisites
NOTE
What you need to know:
- Basic English: Understanding terms like “Category”, “Order”, “Numerical”.
- Logic: Ability to distinguish between a “Whole” (Population) and a “Part” (Sample).
Quick Refresher
- Data: Information collected for analysis.
- Variable: A characteristic that changes (e.g., Height, Color).
- Constant: A characteristic that stays the same.
1. Core Concepts
1.1 Population vs Sample
- Population: The entire group you want to study. (Keyword: “All”, “Every”).
- Sample: A subset of the population used to collect data. (Keyword: “Selected”, “Surveyed”).
- Descriptive Statistics: Summarizing data (Mean, Median, Graphs). “What happened?”
- Inferential Statistics: Using sample data to make guesses about the population. “What might happen?”
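As a minimal sketch of these four ideas, the snippet below builds a hypothetical population of 10,000 heights, draws a random sample of 100, and compares the two means. The data is synthetic, and the names `population` and `sample` and all numbers are assumptions made for illustration, not course data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of 10,000 people (synthetic values).
population = [random.gauss(165, 10) for _ in range(10_000)]

# Sample: a randomly selected subset of the population.
sample = random.sample(population, 100)

# Descriptive statistics: summarize the data we actually collected.
sample_mean = statistics.mean(sample)

# Inferential statistics: treat the sample mean as an estimate of the
# population mean, which is usually unknown in practice.
population_mean = statistics.mean(population)

print(f"Sample mean (n=100):       {sample_mean:.2f}")
print(f"Population mean (N=10000): {population_mean:.2f}")
```

In a real study only the sample mean would be available; the population mean is computed here only because the population is simulated.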
1.2 Types of Variables
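- Categorical (Qualitative): Describes qualities. Arithmetic on the values is not meaningful (you can count how many fall in each category, but not add or average them).
- Examples: Color, Gender, Yes/No.
- Numerical (Quantitative): Describes quantities. Arithmetic on the values is meaningful.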
- Discrete: Countable jumps (0, 1, 2). No “in-between”. (e.g., Number of kids).
- Continuous: Can take any value within a range; measured rather than counted. (e.g., Height, Weight, Time).
1.3 Scales of Measurement (NOIR)
- Nominal: Names only. No order. (Red, Blue, Green).
- Ordinal: Names + Order. (Rank 1, 2, 3; S, M, L; Good, Better, Best).
- Note: Difference between ranks is unknown.
- Interval: Order + Equal Intervals. No True Zero. (Temp in Celsius/Fahrenheit).
- Test: “Is 0 degrees ‘no heat’?” No.
- Math: Addition/Subtraction OK. Ratio NOT OK (e.g., 20°C is not “twice as hot” as 10°C).
- Ratio: Order + Equal Intervals + True Zero. (Height, Weight, Money).
- Test: “Is 0 cm ‘no height’?” Yes.
- Math: All operations OK. (e.g., 10 kg is twice as heavy as 5 kg).
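One way to see why ratios fail on an interval scale is to change the unit: a meaningful ratio should not depend on the unit chosen. The sketch below uses illustrative values (10 and 20 degrees Celsius, 5 and 10 kg) and two small helper functions that are assumptions for this example.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin."""
    return c + 273.15

def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate factor)."""
    return kg * 2.20462

# Interval scale (Celsius): the ratio of two readings depends on the unit.
c_low, c_high = 10.0, 20.0
ratio_celsius = c_high / c_low                                        # 2.0
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)  # about 1.035

# Differences, however, survive the change of unit.
diff_celsius = c_high - c_low
diff_kelvin = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratio scale (weight): the ratio is the same in any unit.
kg_light, kg_heavy = 5.0, 10.0
ratio_kg = kg_heavy / kg_light
ratio_lb = kg_to_lb(kg_heavy) / kg_to_lb(kg_light)

print(f"Celsius ratio: {ratio_celsius:.3f}, Kelvin ratio: {ratio_kelvin:.3f}")
print(f"Celsius diff:  {diff_celsius:.1f}, Kelvin diff:  {diff_kelvin:.1f}")
print(f"kg ratio:      {ratio_kg:.3f}, lb ratio:     {ratio_lb:.3f}")
```

The Celsius ratio of 2.0 disappears once the same temperatures are expressed in Kelvin, while the weight ratio stays at 2.0 in both kilograms and pounds.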
1.4 Data Types
- Cross-Sectional: Data collected at a single point in time across many subjects. (e.g., Census 2020).
- Time Series: Data collected for one subject over many time points. (e.g., Stock price of Apple over a year).
- Structured: Organized in tables (Rows/Cols).
- Unstructured: Text, Images, Video.
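The difference between cross-sectional and time series data can be sketched with two small structured tables (lists of rows). All city names and figures below are hypothetical, and `describe` is an illustrative helper rather than a standard function.

```python
# Cross-sectional: many subjects, one point in time (hypothetical values).
cross_sectional = [
    {"city": "A", "year": 2020, "population": 1_200_000},
    {"city": "B", "year": 2020, "population": 850_000},
    {"city": "C", "year": 2020, "population": 430_000},
]

# Time series: one subject, many points in time (hypothetical values).
time_series = [
    {"city": "A", "year": 2018, "population": 1_150_000},
    {"city": "A", "year": 2019, "population": 1_175_000},
    {"city": "A", "year": 2020, "population": 1_200_000},
]

def describe(rows, subject_key, time_key):
    """Label a table by how many distinct subjects and time points it has."""
    subjects = {row[subject_key] for row in rows}
    times = {row[time_key] for row in rows}
    if len(times) == 1 and len(subjects) > 1:
        return "cross-sectional"
    if len(subjects) == 1 and len(times) > 1:
        return "time series"
    return "neither"

print(describe(cross_sectional, "city", "year"))  # cross-sectional
print(describe(time_series, "city", "year"))      # time series
```

Both tables are structured data: every row has the same named columns. A free-text review or an image would have no such fixed layout.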
2. Pattern Analysis & Goated Solutions
Pattern 1: Identifying Sample vs Population
Context: “An analyst surveyed 4 IITs to know the status of all engineering institutes.”
TIP
Mental Algorithm:
- Identify the Goal: Who do they want to know about? → Population.
- Identify the Action: Who did they actually talk to? → Sample.
Example (Detailed Solution)
Problem: “To study the health of all Indians, 1000 people from Mumbai were tested.”
Solution:
- Goal: “All Indians”. This is the Population.
- Action: “1000 people from Mumbai”. This is the Sample.
Answer: Sample = 1000 Mumbai residents; Population = All Indians.
Pattern 2: Classifying Variables (Discrete vs Continuous)
Context: “Is ‘Shoe Size’ discrete or continuous?”
TIP
Mental Algorithm: Ask: “Can I have half of this?” or “Can I zoom in infinitely?”
- If you count it (1, 2, 3) → Discrete.
- If you measure it (1.5, 1.55, 1.559) → Continuous.
- Trap: Money is usually treated as Continuous (Ratio scale) in statistics, although it is technically Discrete at the level of the smallest currency unit (e.g., cents). In this course, “Price” is usually Continuous/Ratio.
Example (Detailed Solution)
Problem: Classify “Number of assignments submitted”.
Solution:
- Test: Can you submit 2.5 assignments? No.
- Result: You count them (0, 1, 2…).
Answer: Numerical, Discrete.
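The “can I have half of this?” test can be roughly approximated on recorded data by checking whether every value is a whole number. This is only a heuristic sketch: `looks_discrete` is a hypothetical helper, the data is made up, and a continuous variable recorded to the nearest whole unit (such as height in whole centimetres) would fool it, so the conceptual test still decides.

```python
def looks_discrete(values):
    """Heuristic: True if every recorded value is a whole number."""
    return all(float(v).is_integer() for v in values)

assignments_submitted = [0, 1, 2, 2, 3]   # counted (made-up data)
heights_cm = [160.5, 172.25, 181.0]       # measured (made-up data)

print(looks_discrete(assignments_submitted))  # True
print(looks_discrete(heights_cm))             # False
```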
Pattern 3: Scales of Measurement (The “Zero” Test)
Context: “What scale is ‘Temperature in Kelvin’ vs ‘Celsius’?”
TIP
Mental Algorithm:
- Is it a Name? → Nominal.
- Is there Order? → Ordinal.
- Does 0 mean ‘Nothing’?
- No (0 is just a point) → Interval.
- Yes (0 means absence) → Ratio.
Example (Detailed Solution)
Problem: Identify scale for “Year of Birth” (e.g., 1990, 2000).
Solution:
- Order?: Yes, 2000 is after 1990.
- Zero?: Does “Year 0” mean “No Time”? No, it’s just a reference point.
- Result: Interval.
Answer: Interval Scale.
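The decision steps for this pattern can be written as a small function. This is a sketch, not an official classifier: `measurement_scale` and its three yes/no parameters are names chosen for illustration, and the answers passed in still come from your own judgment about the variable.

```python
def measurement_scale(has_order, equal_intervals, true_zero):
    """Return the NOIR scale implied by three yes/no answers."""
    if not has_order:
        return "Nominal"    # names only
    if not equal_intervals:
        return "Ordinal"    # ordered, but gaps between ranks are unknown
    if not true_zero:
        return "Interval"   # equal gaps, but 0 is just a reference point
    return "Ratio"          # equal gaps and 0 means absence

print(measurement_scale(False, False, False))  # Zip Code -> Nominal
print(measurement_scale(True, False, False))   # Star Rating -> Ordinal
print(measurement_scale(True, True, False))    # Year of Birth -> Interval
print(measurement_scale(True, True, True))     # Temperature in Kelvin -> Ratio
```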
3. Practice Exercises
- Classify: “Zip Code”.
- Hint: Looks like a number, but acts like a Name. Nominal.
- Scale: “Star Rating (1-5)”.
- Hint: Order matters. Ordinal.
- Type: “Daily temperature recorded for 1 month”.
- Hint: Over time. Time Series.
🧠 Level Up: Advanced Practice
Question 1: The Sample vs Population Trap
Problem: An analyst surveys 4 randomly selected IITs to study placements across all engineering institutes in India.
- Sample: The 4 selected IITs.
- Population: All engineering institutes in India.
Trap: Don’t confuse the “Target Population” (all institutes) with the “Sampling Frame” (the list of IITs). If the analyst selected only from IITs but wants to infer about all institutes, the result suffers from selection bias. Strictly speaking, though, the population of interest is still “All engineering institutes”.
Question 2: Tricky Scales of Measurement
Identify the Scale:
- Stock Price: Ratio (Money has a true zero).
- Soccer Positions (Forward, Midfielder): Usually Nominal, since positions are categories with no inherent rank. However, source Q10 accepts “Ordinal”, possibly because positions can be ordered by place on the field from back to front (Defender < Midfielder < Forward). Accepted Answer: Ordinal.
- Education Level: Ordinal (High School < Bachelor < Master).
- Influence Score (Reshares × Reach): Ratio (Calculated from ratio variables, has true zero).
Question 3: Structured vs Unstructured
Problem: A table of “Crop Type”, “Fertilizer Amount”, “Yield”.
Answer: Structured Data. (It’s in a table with rows/columns).
Contrast: Text reviews of a movie are Unstructured.