Outlier Detection Techniques

Table Of Contents:

  1. What Is An Outlier ?
  2. Characteristics Of An Outlier.
  3. Causes Of Outlier.
  4. Why Handle Outliers?
  5. Outlier Handling Techniques.
  6. Python Examples.

(1) What Is An Outlier?

  • An outlier is a data point that significantly deviates from the rest of the dataset.
  • It is unusually large or small compared to other values and may indicate variability, errors, or rare events.

(2) Example Of An Outlier.

  • Data Set: 150, 160, 162, 158, 170, 165, 175, 169, 180, 300.

  • In this example, the height 300 cm is an outlier because it is much higher than the other values, which are more closely clustered around the 150-180 cm range.

  • The 300 cm value is far outside the typical range for human heights, making it an outlier.

(3) Characteristics Of An Outlier.

  • Extreme Value: Far from the majority of data points.
  • Infrequent Occurrence: Often appears rarely in the dataset.
  • Influential Impact: Can heavily influence statistical calculations like the mean or variance.

(4) Causes Of Outliers.

  • Measurement Errors: Mistakes in data entry or collection.
  • Natural Variability: True but rare events (e.g., extremely high stock prices).
  • Sampling Issues: Including unrelated or incorrect samples.

(5) Why To Handle Outliers.

  • Skew statistical analysis.
  • Mislead machine learning models.
  • Distort visualizations and interpretations.

(6) Outliers Handling Techniques

(1) Identify Outliers

(a) Visual Methods
  • Box plots
  • Scatter plots
(b) Statistical Methods
  • Z-Score: Values with a Z-score > 3 or < -3 are potential outliers.
  • Interquartile Range (IQR): 
    • Outliers: Values < Q1−1.5×IQR  or > Q3+1.5×IQR

(2) Techniques To Handle Outliers

(a) Remove Outliers
  • When: If the outliers are errors or irrelevant.
  • How:
    • Remove rows based on Z-scores or IQR thresholds.
(b) Transform Outliers
  • When: To reduce the impact of outliers on statistical measures.
  • How:
    • Apply log, square root, or Box-Cox transformations.
(c) Cap Outliers(Winsorization)
  • When: To limit the effect of extreme values without removing data.
  • How:
    • Replace outliers with the nearest acceptable value (e.g., Q1−1.5×IQR or Q3+1.5×IQR).
(d) Impute Outliers
  • When: To retain the data while normalizing its effect.
  • How:
    • Replace outliers with mean, median, or a predicted value.
(e) Use Robust Models
  • When: To minimize sensitivity to outliers during analysis or modeling.
  • How:
    • Use models like robust regression or tree-based algorithms that handle outliers naturally.

(7) Box Plot Example.

What Is A Box Plot?
  • A box plot (also known as a box-and-whisker plot) is a graphical representation of the distribution of a dataset.
  • It provides a visual summary of the data’s central tendency, spread, and potential outliers.
Components Of Box Plot.
  • Box:

    • Represents the interquartile range (IQR), which contains the middle 50% of the data.
    • The bottom of the box marks the first quartile (Q1), and the top marks the third quartile (Q3).
    • The line inside the box indicates the median (Q2) of the data.
  • Whiskers:

    • Extend from the box to the smallest and largest data points within a defined range.
    • The range is usually calculated as:
      • Lower Whisker: Extends to the smallest data point within Q1−1.5×IQR.
      • Upper Whisker: Extends to the largest data point within Q3+1.5×IQR.
  • Outliers:

    • Data points outside the whiskers are plotted individually and are considered potential outliers.
Features of a Box Plot:
  • Box:

    • Represents the interquartile range (IQR), which contains the middle 50% of the data.
    • The bottom of the box marks the first quartile (Q1), and the top marks the third quartile (Q3).
    • The line inside the box indicates the median (Q2) of the data.
  • Whiskers:

    • Extend from the box to the smallest and largest data points within a defined range.
    • The range is usually calculated as:
      • Lower Whisker: Extends to the smallest data point within Q1−1.5×IQR.
      • Upper Whisker: Extends to the largest data point within Q3+1.5×IQR.
  • Outliers:

    • Data points outside the whiskers are plotted individually and are considered potential outliers.
Example of a Box Plot:
  • Here is the box plot for the dataset [3,6,7,8,9,10,15,25,50][3, 6, 7, 8, 9, 10, 15, 25, 50].
  • Box:

    • Spans from Q1=7.5 to Q3=15, representing the middle 50% of the data.
  • Median:

    • The red line inside the box represents the median (Q2=9).
  • Whiskers:

    • Lower Whisker: Extends to Q1−1.5×IQR, adjusted to min⁡(data)=3.
    • Upper Whisker: Extends to Q3+1.5×IQR, adjusted to 25.
  • Outlier:

    • The value 50 is an outlier and is marked as an orange dot.

(8) What Is An Scatter Plot?

What Is A Scatter Plot?
  • A scatter plot is a type of data visualization that displays individual data points on a two-dimensional graph.
  • It is commonly used to observe and analyze relationships or patterns between two numerical variables.
  • Each point on the plot represents a single data observation, with the following characteristics:
  • X-axis: Represents one variable.
  • Y-axis: Represents another variable.
  • Points: The position of each point is determined by the values of the two variables for that observation.
Key Features:
  • Correlation:

    • Scatter plots reveal relationships between variables, such as positive, negative, or no correlation.
    • A positive correlation means as one variable increases, the other also increases.
    • A negative correlation means as one variable increases, the other decreases.
    • No correlation indicates no discernible relationship.
  • Outliers:

    • Points that fall far outside the general pattern can be visually identified as outliers.
  • Clusters:

    • Data points can form clusters, which may indicate subgroups or patterns within the data.
Example Of Finding Outlier:
  • Identifying outliers for a single column using a scatter plot involves visualizing the data in a way that highlights deviations from the central tendency.
  • Since a scatter plot typically requires two dimensions, you can plot the index of the data points on the x-axis and the column values on the y-axis.
  • Here’s how you can do it:
  • Steps:
    1. Plot the values of the column against their indices.
    2. Identify statistical measures (e.g., mean, standard deviation, interquartile range) to determine thresholds for outliers.
    3. Mark points that fall outside these thresholds (e.g., 1.5 times the interquartile range or more than 3 standard deviations away from the mean).
# Generating synthetic data for a single column
np.random.seed(42)
data = np.random.normal(50, 10, 100)  # Normal distribution around 50, std 10

# Adding outliers
outliers = np.array([20, 80, 100])
data_with_outliers = np.concatenate([data, outliers])

# Identifying outliers using the 1.5 IQR rule
q1, q3 = np.percentile(data_with_outliers, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outlier_indices = np.where((data_with_outliers < lower_bound) | (data_with_outliers > upper_bound))[0]

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(range(len(data_with_outliers)), data_with_outliers, color='blue', label='Data Points')
plt.scatter(outlier_indices, data_with_outliers[outlier_indices], color='red', label='Outliers', edgecolor='black')
plt.axhline(lower_bound, color='green', linestyle='--', label=f"Lower Bound ({lower_bound:.2f})")
plt.axhline(upper_bound, color='purple', linestyle='--', label=f"Upper Bound ({upper_bound:.2f})")
plt.title("Outlier Detection in a Single Column", fontsize=16)
plt.xlabel("Index", fontsize=12)
plt.ylabel("Value", fontsize=12)
plt.legend(fontsize=10)
plt.grid(alpha=0.4)
plt.show()
  • Here is an example of a scatter plot showing outlier detection for a single column.

    • The blue points represent the data values.
    • The red points are identified as outliers based on the 1.5 IQR rule.
    • The green and purple dashed lines denote the lower and upper bounds, respectively, used for outlier detection.

Leave a Reply

Your email address will not be published. Required fields are marked *