Stats 1 Week 1: Data Collection & Classification

0. Prerequisites

NOTE

What you need to know:

  • Basic English: Understanding terms like “Category”, “Order”, “Numerical”.
  • Logic: Ability to distinguish between a “Whole” (Population) and a “Part” (Sample).

Quick Refresher

  • Data: Information collected for analysis.
  • Variable: A characteristic that changes (e.g., Height, Color).
  • Constant: A characteristic that stays the same.

1. Core Concepts

1.1 Population vs Sample

  • Population: The entire group you want to study. (Keywords: “All”, “Every”).
  • Sample: A subset of the population from which data is actually collected. (Keywords: “Selected”, “Surveyed”).
  • Descriptive Statistics: Summarizing data (Mean, Median, Graphs). “What happened?”
  • Inferential Statistics: Using sample data to make guesses about the population. “What might happen?”
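
The distinction between population, sample, descriptive, and inferential statistics can be sketched in a few lines of Python. This is a minimal illustration, not a course-provided method: the population is synthetic (random heights generated with assumed parameters), and the names `population` and `sample` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 people (synthetic data).
population = [random.gauss(165, 10) for _ in range(10_000)]

# Sample: a randomly selected subset of the population.
sample = random.sample(population, 100)

# Descriptive statistics: summarize the data that was actually collected.
sample_mean = statistics.mean(sample)

# Inferential statistics: treat the sample mean as an estimate of the
# population mean, which is normally unknown in practice.
population_mean = statistics.mean(population)

print(f"sample mean     = {sample_mean:.1f}")
print(f"population mean = {population_mean:.1f}")
```

In a real study only `sample` would be available; `population_mean` is computed here only to show that the estimate lands close to it.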

1.2 Types of Variables

  1. Categorical (Qualitative): Describes qualities. Arithmetic on the values is not meaningful; you can only count how often each category occurs.
    • Examples: Color, Gender, Yes/No.
  2. Numerical (Quantitative): Describes quantities. Arithmetic on the values makes sense.
    • Discrete: Countable jumps (0, 1, 2). No “in-between”. (e.g., Number of kids).
    • Continuous: Can take any value within a range and is measured rather than counted. (e.g., Height, Weight, Time).

1.3 Scales of Measurement (NOIR)

  1. Nominal: Names only. No order. (Red, Blue, Green).
  2. Ordinal: Names + Order. (Rank 1, 2, 3; S, M, L; Good, Better, Best).
    • Note: Difference between ranks is unknown.
  3. Interval: Order + Equal Intervals. No True Zero. (Temp in Celsius/Fahrenheit).
    • Test: “Is 0 degrees ‘no heat’?” No.
    • Math: Addition/Subtraction OK. Ratio NOT OK (e.g., 20°C is not “twice as hot” as 10°C).
  4. Ratio: Order + Equal Intervals + True Zero. (Height, Weight, Money).
    • Test: “Is 0 cm ‘no height’?” Yes.
    • Math: All operations OK. (e.g., 20 kg is twice as heavy as 10 kg).
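
Why ratios fail on an interval scale can be checked numerically. The sketch below uses illustrative values (10 and 20, chosen here, not taken from the course material) and standard unit conversions: a ratio of Celsius readings changes when the unit changes, while a ratio of weights does not.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


low_c, high_c = 10.0, 20.0  # illustrative temperatures in Celsius

# Differences are meaningful on an interval scale.
diff_c = high_c - low_c

# Ratios are not: the "ratio" depends on where the scale puts its zero.
ratio_c = high_c / low_c
ratio_f = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Weight is on a ratio scale: changing the unit keeps the ratio intact.
ratio_kg = 20.0 / 10.0
ratio_g = (20.0 * 1000) / (10.0 * 1000)

print(ratio_c, ratio_f, ratio_k)  # three different values
print(ratio_kg, ratio_g)          # the same value twice
```

Kelvin has a true zero, so `ratio_k` is the physically meaningful one: 20°C is only about 3.5% hotter than 10°C in absolute terms.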

1.4 Data Types

  • Cross-Sectional: Data collected at a single point in time across many subjects. (e.g., Census 2020).
  • Time Series: Data collected for one subject over many time points. (e.g., Stock price of Apple over a year).
  • Structured: Organized in tables (Rows/Cols).
  • Unstructured: Text, Images, Video.
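
The cross-sectional vs time-series distinction comes down to what varies in the data: subjects or time points. A rough sketch, assuming each record is a hypothetical `(subject, time, value)` tuple and that all values below are made up for illustration:

```python
def data_layout(records):
    """Classify (subject, time, value) records by what varies across them."""
    subjects = {subject for subject, _, _ in records}
    times = {time for _, time, _ in records}
    if len(subjects) > 1 and len(times) == 1:
        return "cross-sectional"  # many subjects, one point in time
    if len(subjects) == 1 and len(times) > 1:
        return "time series"      # one subject, many points in time
    return "other"


# Made-up numbers, for illustration only.
census_like = [("Household A", 2020, 4), ("Household B", 2020, 2), ("Household C", 2020, 5)]
stock_like = [("Apple", "Jan", 150.0), ("Apple", "Feb", 155.5), ("Apple", "Mar", 149.2)]

print(data_layout(census_like))  # cross-sectional
print(data_layout(stock_like))   # time series
```

Both lists are also examples of Structured data, since each record fits a fixed row/column layout.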

2. Pattern Analysis & Goated Solutions

Pattern 1: Identifying Sample vs Population

Context: “An analyst surveyed 4 IITs to know the status of all engineering institutes.”

TIP

Mental Algorithm:

  1. Identify the Goal: Who do they want to know about? That group is the Population.
  2. Identify the Action: Who did they actually talk to? That group is the Sample.

Example (Detailed Solution)

Problem: “To study the health of all Indians, 1000 people from Mumbai were tested.”

Solution:

  1. Goal: “All Indians”. This is the Population.
  2. Action: “1000 people from Mumbai”. This is the Sample.

Answer: Sample = 1000 Mumbai residents; Population = All Indians.

Pattern 2: Classifying Variables (Discrete vs Continuous)

Context: “Is ‘Shoe Size’ discrete or continuous?”

TIP

Mental Algorithm: Ask: “Can I have half of this?” or “Can I zoom in infinitely?”

  • If you count it (1, 2, 3), it is Discrete.
  • If you measure it (1.5, 1.55, 1.559), it is Continuous.
  • Trap: Money is usually treated as Continuous (on a Ratio scale) in stats, even though strictly it moves in Discrete steps (cents). In this course, “Price” is usually Continuous/Ratio.

Example (Detailed Solution)

Problem: Classify “Number of assignments submitted”.

Solution:

  1. Test: Can you submit 2.5 assignments? No.
  2. Result: You count them (0, 1, 2…).

Answer: Numerical, Discrete.
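
The “count vs measure” test from this pattern can be approximated in code. The function below is a rough heuristic written for illustration (the name `numeric_kind` is hypothetical): it only inspects recorded values, so it cannot replace the reasoning about how the variable is produced.

```python
def numeric_kind(values):
    """Heuristic: whole-number values suggest a counted (discrete) variable."""
    if all(float(v).is_integer() for v in values):
        return "discrete"
    return "continuous"


assignments_submitted = [0, 1, 2, 3, 2]       # counted
measured_heights_m = [1.5, 1.55, 1.559]       # measured

print(numeric_kind(assignments_submitted))  # discrete
print(numeric_kind(measured_heights_m))     # continuous
```

The heuristic has known blind spots: heights rounded to whole centimetres would look discrete, and shoe sizes such as 7.5 would look continuous even though they come in fixed steps. The question “can I zoom in infinitely?” still has to be answered about the variable itself.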

Pattern 3: Scales of Measurement (The “Zero” Test)

Context: “What scale is ‘Temperature in Kelvin’ vs ‘Celsius’?”

TIP

Mental Algorithm:

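  1. Is it just a Name, with no order? Then it is Nominal.
  2. Is there Order, but the gaps between values are unknown? Then it is Ordinal.
  3. If the intervals are equal, does 0 mean ‘Nothing’?
    • No (0 is just a reference point): Interval.
    • Yes (0 means absence): Ratio.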
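
The checks above, combined with the equal-intervals criterion from section 1.3, can be written as a small decision function. This is a sketch with hypothetical parameter names; the yes/no answers still have to come from reasoning about the variable.

```python
def scale_of(has_order, equal_intervals=False, true_zero=False):
    """Map the yes/no answers of the NOIR checks to a measurement scale."""
    if not has_order:
        return "Nominal"
    if not equal_intervals:
        return "Ordinal"
    return "Ratio" if true_zero else "Interval"


print(scale_of(has_order=False))                                         # Nominal
print(scale_of(has_order=True))                                          # Ordinal
print(scale_of(has_order=True, equal_intervals=True))                    # Interval
print(scale_of(has_order=True, equal_intervals=True, true_zero=True))    # Ratio
```

For the Context question: Celsius has order and equal intervals but an arbitrary zero, so it is Interval, while Kelvin's zero means absence of thermal energy, so it is Ratio.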

Example (Detailed Solution)

Problem: Identify the scale for “Year of Birth” (e.g., 1990, 2000).

Solution:

  1. Order?: Yes, 2000 is after 1990.
  2. Zero?: Does “Year 0” mean “No Time”? No, it’s just a reference point.
  3. Result: Interval.

Answer: Interval Scale.

3. Practice Exercises

  1. Classify: “Zip Code”.
    • Hint: Looks like a number, but acts like a Name. Answer: Nominal.
  2. Scale: “Star Rating (1-5)”.
    • Hint: Order matters. Answer: Ordinal.
  3. Type: “Daily temperature recorded for 1 month”.
    • Hint: Over time. Answer: Time Series.

🧠 Level Up: Advanced Practice

Question 1: The Sample vs Population Trap

Problem: An analyst surveys 4 randomly selected IITs to study placements across all engineering institutes in India.

  • Sample: The 4 selected IITs.
  • Population: All engineering institutes in India.

Trap: Don’t confuse the “Target Population” (All institutes) with the “Sampling Frame” (List of IITs). If the analyst only selected from IITs but wants to infer about all institutes, there is a Selection Bias. But strictly speaking, the population of interest is “All engineering institutes”.

Question 2: Tricky Scales of Measurement

Identify the Scale:

  1. Stock Price: Ratio (Money has a true zero).
  2. Soccer Positions (Forward, Midfielder): Usually Nominal, since positions are labels with no inherent ranking. However, the source (Q10) gives “Ordinal”, possibly because positions can be ordered by location on the field, back to front (Defender < Mid < Forward). Accepted Answer: Ordinal.
  3. Education Level: Ordinal (High School < Bachelor < Master).
  4. Influence Score (Reshares × Reach): Ratio (Calculated from ratio variables, has true zero).

Question 3: Structured vs Unstructured

Problem: A table of “Crop Type”, “Fertilizer Amount”, “Yield”.

Answer: Structured Data. (It’s in a table with rows/columns).

Contrast: Text reviews of a movie are Unstructured.