Master Guide of Examples: Statistics I (Weeks 1-4)
This document provides a concrete example for every identified question pattern from the first four weeks of Statistics I, complete with a step-by-step solution.
Week 1: Introduction to Statistics & Data
Pattern 1.1 & 1.2: Population/Sample and Inference
Example:
A researcher wants to know the average daily screen time of all university students in India. They survey 1,000 students from 5 different universities and find the average is 4.2 hours per day. Their final report concludes, “The average daily screen time of Indian university students is likely over 4 hours.” a) Identify the sample and population. b) Is the conclusion “likely over 4 hours” an example of descriptive or inferential statistics?
Solution:
* **a) Sample and Population:**
    * **Population:** All university students in India. This is the entire group the researcher is interested in.
    * **Sample:** The 1,000 students surveyed from the 5 universities. This is the subset of the population from which data was actually collected.
* **b) Descriptive or Inferential:**
    * The statement "The average daily screen time of Indian university students is likely over 4 hours" is a generalization about the entire population based on the sample's data. It is an inference, not just a statement of fact about the sample.
    * **Answer:** This is an example of **Inferential Statistics**. (A descriptive statement would have been: "The average screen time *in our sample of 1,000 students* was 4.2 hours.")

Pattern 1.3 & 1.4: Classifying Variable Types & Scales
Example:
A dataset for a marathon contains the following variables. Classify each one fully (Numerical/Categorical, Discrete/Continuous, Scale of Measurement). a) The bib number assigned to each runner. b) The final position of each runner (1st, 2nd, 3rd, etc.). c) The finishing time of each runner in seconds.
Solution:
* **a) Bib Number:**
    * **Type:** You wouldn't average bib numbers. They are labels. $\rightarrow$ **Categorical**.
    * **Scale:** There is no inherent order to the numbers; bib #100 is not "better" than bib #50. $\rightarrow$ **Nominal**.
* **b) Final Position:**
    * **Type:** While they are numbers, you wouldn't average the ranks. They represent ordered categories. $\rightarrow$ **Categorical**.
    * **Scale:** The order (1st is better than 2nd) is crucial and meaningful. $\rightarrow$ **Ordinal**.
* **c) Finishing Time:**
    * **Type:** You can average finishing times. $\rightarrow$ **Numerical**.
    * **Form:** Time can be measured to fractions of a second (e.g., 9876.54s). $\rightarrow$ **Continuous**.
    * **Scale:** A time of 0 seconds is a true, meaningful zero (it means no time has passed). $\rightarrow$ **Ratio**.

Week 2: Describing Categorical Data
Pattern 2.1 & 2.2: Frequencies and Central Tendency
Example:
A local election has four candidates: A, B, C, and D. The final vote counts are: A: 150, B: 250, C: 250, D: 100. The total number of voters is 750. a) What is the relative frequency of votes for candidate A? b) What is the mode(s) of this dataset? c) Can the median be calculated?
Solution:
* **a) Relative Frequency for A:**
    * Relative Frequency = (Frequency of A) / (Total Voters) = 150 / 750 = 0.20 or 20%.
* **b) Mode(s):**
    * The mode is the category with the highest frequency.
    * Candidates B and C are tied for the highest frequency (250 votes).
    * **Answer:** The data is **bimodal**. The modes are **Candidate B** and **Candidate C**.
* **c) Median:**
    * The candidates are categories on a nominal scale. There is no inherent mathematical order to them.
    * **Answer:** The **median is not defined** for this data.

Pattern 2.3: Choosing the Appropriate Graph
Example:
You are a manager who wants to present the number of customer complaints for each of your five product lines. Your goal is to quickly identify which product line is causing the most problems so you can prioritize fixing it. Which is the most suitable graph? a) Pie Chart b) Bar Chart c) Pareto Chart
Solution:
* **Analysis:**
    * A Pie Chart is good for showing proportions, but less effective for quickly comparing counts and identifying the single largest category.
    * A Bar Chart is a good choice for comparing the counts across the product lines.
    * A **Pareto Chart** is a bar chart where the categories are **sorted from highest frequency to lowest**. This directly addresses the goal of "quickly identifying which product is causing the most problems." The most problematic product will always be the first bar on the left.
* **Answer:** While a Bar Chart is appropriate, the **Pareto Chart** is the most suitable for the specific goal of prioritization.

Week 3: Describing Numerical Data
Pattern 3.2: Correcting Mean and Variance
Example:
The mean score of 10 students is 75. The sample variance is 30. Later, it is discovered that a score of 60 was incorrectly entered as 90. Find the correct mean.
Solution:
1. **Find the Incorrect Sum of Scores:**
    * Sum_incorrect = Mean_incorrect × n = 75 × 10 = 750.
2. **Correct the Sum:**
    * Subtract the wrong value and add the correct value.
    * Sum_correct = Sum_incorrect - (Wrong Value) + (Correct Value)
    * Sum_correct = 750 - 90 + 60 = 720.
3. **Calculate the Correct Mean:**
    * Mean_correct = Sum_correct / n = 720 / 10 = 72.

Final Answer: The correct mean is 72.
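The correction steps above can be sketched in a few lines of Python. The function names `corrected_mean` and `corrected_sample_variance` are hypothetical, chosen only for illustration. The variance correction is not part of the worked solution above; it is included as an assumed extension, since the example also states the sample variance, and it relies on the identity s² = (Σx² − n·mean²) / (n − 1).

```python
def corrected_mean(reported_mean, n, wrong_value, correct_value):
    """Mean after replacing one mis-entered observation."""
    incorrect_sum = reported_mean * n
    correct_sum = incorrect_sum - wrong_value + correct_value
    return correct_sum / n


def corrected_sample_variance(reported_mean, reported_variance, n,
                              wrong_value, correct_value):
    """Sample variance (n - 1 denominator) after the same replacement."""
    # Recover the incorrect sum of squares from the reported mean and variance.
    incorrect_sum_sq = reported_variance * (n - 1) + n * reported_mean ** 2
    # Swap the squared wrong value for the squared correct value.
    correct_sum_sq = incorrect_sum_sq - wrong_value ** 2 + correct_value ** 2
    new_mean = corrected_mean(reported_mean, n, wrong_value, correct_value)
    return (correct_sum_sq - n * new_mean ** 2) / (n - 1)


new_mean = corrected_mean(75, 10, wrong_value=90, correct_value=60)
new_variance = corrected_sample_variance(75, 30, 10, wrong_value=90, correct_value=60)
print(new_mean)      # 72.0
print(new_variance)  # 20.0
```

Under these assumptions the corrected mean matches the answer above (72), and the corrected sample variance works out to 20.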
Pattern 3.3 & 3.4: Quartiles, IQR, and Outliers
Example:
Find the IQR and determine if there are any outliers for the following dataset: {10, 15, 17, 20, 22, 25, 28, 55}.
Solution:
1. **Data is Already Sorted:** The dataset has n=8 observations.
2. **Find the Quartiles:**
    * **Median (Q2):** Average of the two middle values (4th and 5th). Q2 = (20 + 22) / 2 = 21.
    * **Lower Half:** {10, 15, 17, 20}.
    * **Q1:** The median of the lower half. Average of the two middle values. Q1 = (15 + 17) / 2 = 16.
    * **Upper Half:** {22, 25, 28, 55}.
    * **Q3:** The median of the upper half. Average of the two middle values. Q3 = (25 + 28) / 2 = 26.5.
3. **Calculate the IQR:**
    * IQR = Q3 - Q1 = 26.5 - 16 = 10.5.
4. **Calculate the Outlier Fences:**
    * Lower Fence = Q1 - 1.5 * IQR = 16 - 1.5 * 10.5 = 16 - 15.75 = 0.25.
    * Upper Fence = Q3 + 1.5 * IQR = 26.5 + 1.5 * 10.5 = 26.5 + 15.75 = 42.25.
5. **Identify Outliers:**
    * Any value outside the interval [0.25, 42.25] is flagged as an outlier.
    * The data point **55** is greater than the upper fence. No value falls below the lower fence.
    * **Answer:** There is **one outlier** (the value 55).

Final Answer: The IQR is 10.5, and there is one outlier (55).
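The procedure above can be checked with a short Python sketch. It uses the same "median of each half" convention as the solution. The helper name `quartiles_median_of_halves` is hypothetical, and leaving the middle value out of both halves when n is odd is an assumption, since the example only covers an even n.

```python
from statistics import median


def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: for odd n, the middle value is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


data = [10, 15, 17, 20, 22, 25, 28, 55]
q1, q2, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                # 16.0 21.0 26.5
print(iqr)                       # 10.5
print(lower_fence, upper_fence)  # 0.25 42.25
print(outliers)                  # [55]
```

Other quartile conventions, such as the interpolation some libraries use by default, can give slightly different values for Q1 and Q3, so the fences may shift when a different method is used.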
Week 4: Association Between Two Variables
Pattern 4.1 & 4.2: Calculating and Interpreting Correlation
Example:
For three data points (X, Y): (1, 2), (2, 4), (3, 5). You are given: $\bar{x} = 2$, $\bar{y} \approx 3.67$, $s_x = 1$, $s_y \approx 1.53$. The sample covariance is 1.5. Calculate and interpret the correlation coefficient `r`.
Solution:
1. **Recall the Formula:** The correlation coefficient `r` is the covariance divided by the product of the standard deviations.
    * $r = \frac{s_{xy}}{s_x \cdot s_y}$
2. **Plug in the Values:**
    * $r = \frac{1.5}{1 \times 1.53} = \frac{1.5}{1.53} \approx 0.98$
3. **Interpret the Result:**
    * **Sign:** The sign is positive, indicating a **positive association**. As X increases, Y tends to increase.
    * **Magnitude:** The value 0.98 is very close to 1. This indicates a **very strong** linear relationship.

Final Answer: The correlation coefficient is approximately 0.98, which indicates a very strong, positive linear relationship between X and Y.
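As a check on the arithmetic, the same quantities can be computed directly from the three data points. This is a minimal sketch using the sample (n − 1) versions of the covariance and standard deviations. The function name `sample_correlation` is hypothetical.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Return (s_xy, s_x, s_y, r) using n - 1 denominators."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance of X and Y.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations.
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy, s_x, s_y, s_xy / (s_x * s_y)


s_xy, s_x, s_y, r = sample_correlation([1, 2, 3], [2, 4, 5])
print(round(s_xy, 2), round(s_x, 2), round(s_y, 2))  # 1.5 1.0 1.53
print(round(r, 2))                                   # 0.98
```

The computed values agree with the covariance and standard deviations used in the solution.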
Pattern 4.3: Analyzing Contingency Tables
Example:
A survey asks 200 people if they like dogs and/or cats.
| | Likes Cats | Dislikes Cats | Total |
|---|---|---|---|
| Likes Dogs | 70 | 50 | 120 |
| Dislikes Dogs | 20 | 60 | 80 |
| Total | 90 | 110 | 200 |
a) What proportion of all people surveyed like dogs? b) What proportion of cat lovers also like dogs?