How To Use The Empirical Rule
Nov 04, 2025 · 12 min read
The empirical rule, also known as the 68-95-99.7 rule, states that for a normal distribution, almost all observed data fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ). The rule is widely used in statistics for quick estimation and for understanding the spread of data. Applied correctly, it provides fast insight into how your data are distributed, helps identify outliers, and supports probabilistic estimates without complex calculations.
Understanding the Empirical Rule
The empirical rule provides a simple way to estimate the proportion of data that falls within certain intervals around the mean in a normal distribution. It's based on the properties of the normal distribution, a bell-shaped curve that is symmetrical around the mean. The key percentages to remember are:
- 68%: Approximately 68% of the data falls within one standard deviation of the mean (µ ± 1σ).
- 95%: Approximately 95% of the data falls within two standard deviations of the mean (µ ± 2σ).
- 99.7%: Approximately 99.7% of the data falls within three standard deviations of the mean (µ ± 3σ).
This means that almost all data points (99.7%) lie within three standard deviations of the average. Data points outside this range are considered outliers, which are rare and significantly different from the rest of the data.
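The three percentages are not arbitrary: they follow directly from the normal distribution's cumulative distribution function. A quick standard-library check, using the identity P(|Z| ≤ k) = erf(k/√2) for a standard normal variable Z:

```python
import math

def fraction_within(k):
    # For a normal distribution, the probability of landing within k standard
    # deviations of the mean is erf(k / sqrt(2)), independent of mu and sigma.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4%}")
```

Running this prints roughly 68.27%, 95.45%, and 99.73%, confirming that the familiar 68-95-99.7 figures are rounded values.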
Prerequisites
Before diving into how to use the empirical rule, it's essential to ensure that the data you're working with meets certain criteria:
- Normal Distribution: The data must be approximately normally distributed. This can be checked visually using a histogram or quantile-quantile (Q-Q) plot. Statistical tests like the Shapiro-Wilk test can also be used. However, keep in mind that real-world data is rarely perfectly normal, but if it's close enough, the empirical rule can still provide useful approximations.
- Known Mean (µ) and Standard Deviation (σ): You need to know the mean and standard deviation of the dataset. These values are essential for calculating the intervals in the empirical rule.
Steps to Apply the Empirical Rule
Here's a step-by-step guide on how to use the empirical rule:
Step 1: Verify Normality
Before applying the empirical rule, it is crucial to confirm that your data approximates a normal distribution. The most common methods to do this include:
- Visual Inspection:
- Histogram: Create a histogram of your data. A bell-shaped, symmetric distribution suggests normality. Look for the following characteristics:
- A single peak (unimodal)
- Symmetry around the mean
- Gradual tapering tails
- Q-Q Plot: A Q-Q plot (quantile-quantile plot) compares the quantiles of your data against the quantiles of a standard normal distribution. If the data is normally distributed, the points on the Q-Q plot will fall approximately along a straight diagonal line. Deviations from this line indicate departures from normality.
- Statistical Tests:
- Shapiro-Wilk Test: The Shapiro-Wilk test is a formal statistical test for normality. It returns a test statistic and a p-value. If the p-value is greater than a chosen significance level (e.g., 0.05), you fail to reject the null hypothesis that the data is normally distributed.
- Kolmogorov-Smirnov Test: Similar to the Shapiro-Wilk test, the Kolmogorov-Smirnov test assesses the goodness of fit to a normal distribution; note that when the mean and standard deviation are estimated from the sample itself, the Lilliefors variant of this test is the more appropriate choice.
- Anderson-Darling Test: This test is another option for assessing normality and is often more sensitive to deviations in the tails of the distribution than the Kolmogorov-Smirnov test.
Example: Suppose you have a dataset of test scores from a class. You plot a histogram and see that it is approximately bell-shaped and symmetric. You also create a Q-Q plot, and the points fall reasonably close to a straight line. Additionally, you perform a Shapiro-Wilk test and obtain a p-value of 0.10, which is greater than the significance level of 0.05. Based on these observations, you can reasonably assume that the data is approximately normally distributed.
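The formal Shapiro-Wilk test is available in SciPy as `scipy.stats.shapiro`, but a rough symmetry check can be done with the standard library alone. The sketch below (using simulated scores as a stand-in for real data) computes sample skewness, which should be near zero for approximately normal data; it is a quick proxy, not a substitute for the formal tests named above:

```python
import random
import statistics

def sample_skewness(data):
    # Third standardized moment: values near 0 suggest a symmetric
    # distribution, a necessary (not sufficient) condition for normality.
    n = len(data)
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

random.seed(42)                                    # reproducible example
scores = [random.gauss(75, 10) for _ in range(5000)]  # simulated "test scores"
print(f"sample skewness: {sample_skewness(scores):.3f}")  # near 0 here
```

A strongly positive or negative skewness (say, beyond ±0.5 for a large sample) is a warning sign that the empirical rule's percentages will be off.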
Step 2: Calculate the Mean (µ) and Standard Deviation (σ)
Compute the mean and standard deviation of your dataset. These are the two fundamental parameters needed to apply the empirical rule.
- Mean (µ): The mean is the average of all data points in the dataset.
- Formula: µ = (∑ xi) / n
- Where:
- ∑ xi is the sum of all data points.
- n is the number of data points.
- Standard Deviation (σ): The standard deviation measures the spread or dispersion of the data around the mean.
- Formula: σ = √[∑ (xi - µ)² / (n - 1)] (for sample standard deviation)
- Where:
- xi is each individual data point.
- µ is the mean of the dataset.
- n is the number of data points.
Example: Consider a dataset of 10 exam scores: 70, 75, 80, 85, 90, 95, 100, 65, 72, 88.
- Calculate the Mean (µ):
- µ = (70 + 75 + 80 + 85 + 90 + 95 + 100 + 65 + 72 + 88) / 10 = 820 / 10 = 82
- Calculate the Standard Deviation (σ):
- First, find the squared difference of each data point from the mean:
- (70 - 82)² = 144
- (75 - 82)² = 49
- (80 - 82)² = 4
- (85 - 82)² = 9
- (90 - 82)² = 64
- (95 - 82)² = 169
- (100 - 82)² = 324
- (65 - 82)² = 289
- (72 - 82)² = 100
- (88 - 82)² = 36
- Sum the squared differences:
- ∑ (xi - µ)² = 144 + 49 + 4 + 9 + 64 + 169 + 324 + 289 + 100 + 36 = 1188
- Divide by (n - 1):
- 1188 / (10 - 1) = 1188 / 9 = 132
- Take the square root:
- σ = √132 ≈ 11.49
Therefore, for this dataset, the mean (µ) is 82, and the standard deviation (σ) is approximately 11.49.
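As a check, the same mean and sample standard deviation can be computed with Python's `statistics` module:

```python
import statistics

scores = [70, 75, 80, 85, 90, 95, 100, 65, 72, 88]

mu = statistics.fmean(scores)      # arithmetic mean
sigma = statistics.stdev(scores)   # sample standard deviation (divides by n - 1)

print(f"mu = {mu}, sigma = {sigma:.2f}")   # mu = 82.0, sigma = 11.49
```

Note that `statistics.stdev` uses the (n - 1) denominator shown in the formula above; `statistics.pstdev` would give the population version instead.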
Step 3: Determine the Intervals
Using the mean and standard deviation, calculate the intervals based on the empirical rule:
- Interval 1 (µ ± 1σ): This interval contains approximately 68% of the data.
- Lower bound: µ - 1σ
- Upper bound: µ + 1σ
- Interval 2 (µ ± 2σ): This interval contains approximately 95% of the data.
- Lower bound: µ - 2σ
- Upper bound: µ + 2σ
- Interval 3 (µ ± 3σ): This interval contains approximately 99.7% of the data.
- Lower bound: µ - 3σ
- Upper bound: µ + 3σ
Example (Continuing from Previous Example): With a mean (µ) of 82 and a standard deviation (σ) of approximately 11.49, the intervals are:
- Interval 1 (µ ± 1σ):
- Lower bound: 82 - 1(11.49) = 70.51
- Upper bound: 82 + 1(11.49) = 93.49
- Approximately 68% of the exam scores are expected to fall between 70.51 and 93.49.
- Interval 2 (µ ± 2σ):
- Lower bound: 82 - 2(11.49) = 59.02
- Upper bound: 82 + 2(11.49) = 104.98
- Approximately 95% of the exam scores are expected to fall between 59.02 and 104.98.
- Interval 3 (µ ± 3σ):
- Lower bound: 82 - 3(11.49) = 47.53
- Upper bound: 82 + 3(11.49) = 116.47
- Approximately 99.7% of the exam scores are expected to fall between 47.53 and 116.47.
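The interval arithmetic is mechanical, so it is easy to generate all three programmatically. A minimal sketch building on the same score data:

```python
import statistics

scores = [70, 75, 80, 85, 90, 95, 100, 65, 72, 88]
mu = statistics.fmean(scores)
sigma = statistics.stdev(scores)

# Empirical-rule intervals: mu +/- k*sigma for k = 1, 2, 3
intervals = {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}
for k, (lo, hi) in intervals.items():
    print(f"mu +/- {k} sigma: [{lo:.2f}, {hi:.2f}]")
```

This reproduces the bounds above (70.51 to 93.49, 59.02 to 104.98, 47.53 to 116.47, up to rounding).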
Step 4: Interpret the Results
Once you have the intervals, you can interpret the results in the context of your data. Here's how:
- Proportion Estimation: Estimate the proportion of data falling within each interval.
- Outlier Detection: Identify potential outliers—data points falling outside the µ ± 3σ interval.
- Probabilistic Statements: Make probabilistic statements about future observations.
Example (Continuing from Previous Example):
- Proportion Estimation:
- Approximately 68% of students scored between 70.51 and 93.49 on the exam.
- Approximately 95% of students scored between 59.02 and 104.98 on the exam.
- Approximately 99.7% of students scored between 47.53 and 116.47 on the exam.
- Note that exam scores cannot exceed 100, so an upper bound above 100 simply means the entire upper tail of achievable scores is covered.
- Outlier Detection:
- If a student scored below 47.53, their score would be considered an outlier, as it falls outside the range where almost all data points are expected.
- Probabilistic Statements:
- If another student takes the exam, there is approximately a 68% chance their score will be between 70.51 and 93.49, assuming the scores follow the same distribution.
- Similarly, there is approximately a 95% chance their score will be between 59.02 and 104.98.
- And there is approximately a 99.7% chance their score will be between 47.53 and 116.47.
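A short sketch that compares the rule's predictions against the observed sample and flags any points beyond µ ± 3σ:

```python
import statistics

scores = [70, 75, 80, 85, 90, 95, 100, 65, 72, 88]
mu = statistics.fmean(scores)
sigma = statistics.stdev(scores)

def observed_fraction(k):
    # Fraction of the sample actually falling inside mu +/- k*sigma.
    return sum(mu - k * sigma <= x <= mu + k * sigma for x in scores) / len(scores)

# Potential outliers: points more than 3 standard deviations from the mean
outliers = [x for x in scores if abs(x - mu) > 3 * sigma]
print(f"within 1 sigma: {observed_fraction(1):.0%}, outliers: {outliers}")
```

With only 10 scores, the observed fraction within one standard deviation (60%) differs noticeably from the theoretical 68%, which illustrates the sample-size caveat discussed in the limitations below.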
Practical Applications of the Empirical Rule
The empirical rule is a valuable tool across various fields for quickly estimating the distribution of data. Here are several real-world examples:
- Manufacturing Quality Control:
- Application: A manufacturing company produces bolts, and the machine is calibrated to produce bolts with a mean diameter of 10 mm and a standard deviation of 0.1 mm. The company can use the empirical rule to monitor the quality of the bolts.
- Analysis:
- Interval 1 (µ ± 1σ): 10 mm ± 0.1 mm (9.9 mm to 10.1 mm) – Approximately 68% of the bolts should have a diameter within this range.
- Interval 2 (µ ± 2σ): 10 mm ± 0.2 mm (9.8 mm to 10.2 mm) – Approximately 95% of the bolts should have a diameter within this range.
- Interval 3 (µ ± 3σ): 10 mm ± 0.3 mm (9.7 mm to 10.3 mm) – Approximately 99.7% of the bolts should have a diameter within this range.
- Interpretation: If the company finds that a significant number of bolts (more than 0.3%) fall outside the 9.7 mm to 10.3 mm range, it indicates a problem with the manufacturing process that needs to be addressed.
- Finance – Stock Returns:
- Application: Analyzing the daily returns of a stock to understand its volatility. Suppose a stock has an average daily return of 0% (µ = 0) and a standard deviation of 1% (σ = 0.01).
- Analysis:
- Interval 1 (µ ± 1σ): 0% ± 1% (-1% to 1%) – Approximately 68% of the daily returns should fall within this range.
- Interval 2 (µ ± 2σ): 0% ± 2% (-2% to 2%) – Approximately 95% of the daily returns should fall within this range.
- Interval 3 (µ ± 3σ): 0% ± 3% (-3% to 3%) – Approximately 99.7% of the daily returns should fall within this range.
- Interpretation: If the stock return exceeds ±3%, it is a rare event, occurring only about 0.3% of the time, which may indicate a significant market event or specific news affecting the stock.
- Healthcare – Patient Data:
- Application: Monitoring patient blood pressure levels. Suppose the average systolic blood pressure for a group of patients is 120 mmHg (µ = 120) with a standard deviation of 10 mmHg (σ = 10).
- Analysis:
- Interval 1 (µ ± 1σ): 120 mmHg ± 10 mmHg (110 mmHg to 130 mmHg) – Approximately 68% of patients should have blood pressure within this range.
- Interval 2 (µ ± 2σ): 120 mmHg ± 20 mmHg (100 mmHg to 140 mmHg) – Approximately 95% of patients should have blood pressure within this range.
- Interval 3 (µ ± 3σ): 120 mmHg ± 30 mmHg (90 mmHg to 150 mmHg) – Approximately 99.7% of patients should have blood pressure within this range.
- Interpretation: A patient with systolic blood pressure consistently above 150 mmHg or below 90 mmHg would be considered an outlier and may require medical intervention.
- Environmental Science – Rainfall Data:
- Application: Analyzing rainfall data for a region to understand typical rainfall patterns. Suppose the average monthly rainfall is 50 mm (µ = 50) with a standard deviation of 15 mm (σ = 15).
- Analysis:
- Interval 1 (µ ± 1σ): 50 mm ± 15 mm (35 mm to 65 mm) – Approximately 68% of the months should have rainfall within this range.
- Interval 2 (µ ± 2σ): 50 mm ± 30 mm (20 mm to 80 mm) – Approximately 95% of the months should have rainfall within this range.
- Interval 3 (µ ± 3σ): 50 mm ± 45 mm (5 mm to 95 mm) – Approximately 99.7% of the months should have rainfall within this range.
- Interpretation: Months with rainfall below 5 mm or above 95 mm would be considered unusually dry or wet, respectively, and could be indicative of extreme weather events.
- Education – Test Scores:
- Application: Evaluating the distribution of scores in a standardized test. Suppose the average score is 70 (µ = 70) with a standard deviation of 8 (σ = 8).
- Analysis:
- Interval 1 (µ ± 1σ): 70 ± 8 (62 to 78) – Approximately 68% of students should score within this range.
- Interval 2 (µ ± 2σ): 70 ± 16 (54 to 86) – Approximately 95% of students should score within this range.
- Interval 3 (µ ± 3σ): 70 ± 24 (46 to 94) – Approximately 99.7% of students should score within this range.
- Interpretation: A student scoring below 46 or above 94 is an outlier. If a large number of students fall outside these ranges, it may indicate issues with teaching methods or the test's validity.
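As an illustration of the quality-control case above, the sketch below flags bolts outside the 3σ control limits; the measured diameters are invented for the example, while the process parameters (µ = 10 mm, σ = 0.1 mm) come from the calibration scenario:

```python
# Hypothetical measured diameters (mm) from one production run
diameters = [9.98, 10.05, 10.12, 9.91, 10.00, 9.87, 10.31, 10.03, 9.95, 10.08]

mu, sigma = 10.0, 0.1   # calibrated process mean and standard deviation

# Flag any bolt outside the mu +/- 3*sigma limits (9.7 mm to 10.3 mm)
flagged = [d for d in diameters if abs(d - mu) > 3 * sigma]
print(f"bolts outside 3-sigma limits: {flagged}")   # [10.31]
```

A single flagged bolt is expected occasionally (about 0.3% of production); a steady stream of them signals that the process has drifted.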
Limitations of the Empirical Rule
While the empirical rule is a handy tool, it's important to be aware of its limitations:
- Normality Assumption: The empirical rule is only accurate if the data is approximately normally distributed. If the distribution is heavily skewed or has significant departures from normality, the rule's estimations may be unreliable.
- Approximation: The percentages provided (68%, 95%, 99.7%) are approximations. In some cases, the actual percentages may vary slightly, especially if the data isn't perfectly normal.
- Sample Size: The accuracy of the empirical rule improves with larger sample sizes. Smaller samples may not accurately represent the population, leading to less reliable estimations.
- Outliers: While the empirical rule helps identify potential outliers, it doesn't provide a definitive test for outliers. Further investigation and statistical tests may be necessary to confirm whether data points are genuine outliers.
Alternatives to the Empirical Rule
If your data doesn't meet the assumptions of the empirical rule or you need more precise estimations, consider using alternative methods:
- Chebyshev's Inequality: Chebyshev's inequality is a more general rule that applies to any distribution, regardless of its shape. It provides a lower bound on the proportion of data within k standard deviations of the mean.
- Z-Scores: Z-scores standardize data by transforming it into a standard normal distribution (mean = 0, standard deviation = 1). Z-scores can be used to find the exact proportion of data falling within specific intervals using a standard normal distribution table or calculator.
- Percentiles and Quartiles: Percentiles and quartiles divide the data into specific proportions. For example, the 25th percentile (Q1) is the value below which 25% of the data falls, and the 75th percentile (Q3) is the value below which 75% of the data falls. These measures are useful for understanding the distribution of data without assuming normality.
- Non-parametric Tests: Non-parametric tests, such as the Wilcoxon rank-sum test or the Kruskal-Wallis test, are statistical tests that do not assume a specific distribution for the data. These tests are useful when dealing with non-normal data or when the normality assumption is questionable.
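For comparison, Chebyshev's bound is simple enough to compute directly. A minimal sketch contrasting its distribution-free guarantee with the empirical rule's normal-only percentages:

```python
def chebyshev_bound(k):
    # Chebyshev's inequality: for ANY distribution with finite variance,
    # at least 1 - 1/k^2 of the data lies within k standard deviations
    # of the mean (meaningful for k > 1).
    return 1 - 1 / k ** 2

empirical = {2: 0.95, 3: 0.997}   # empirical-rule values for normal data
for k in (2, 3):
    print(f"k={k}: Chebyshev >= {chebyshev_bound(k):.1%}, "
          f"empirical rule ~ {empirical[k]:.1%}")
```

Chebyshev guarantees at least 75% within 2σ and about 88.9% within 3σ, far weaker than the normal-specific 95% and 99.7%, which is the price of making no distributional assumption.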
Conclusion
The empirical rule is a powerful tool for quickly estimating the distribution of data and identifying potential outliers in a normally distributed dataset. By understanding the rule and following the steps outlined in this guide, you can gain valuable insights into your data and make informed decisions. Always remember to verify the normality assumption and be aware of the rule's limitations. When the empirical rule isn't appropriate, consider alternative methods for more accurate and reliable analysis.