Detecting and treating outliers involves identifying unusual data points that can distort analysis and reduce model accuracy. By using statistical methods, visualizations, and domain knowledge, outliers can be removed, transformed, or corrected to improve data quality and ensure more accurate, stable machine learning results.
Outliers are data points that differ significantly from other observations in a dataset. These extreme values can arise from measurement errors, data entry mistakes, or genuine rare events. In machine learning, outliers can distort statistical analyses, bias model training, and lead to poor generalization on new data.
Understanding whether outliers represent errors or valuable edge cases is crucial for deciding how to handle them.
Univariate outliers: extreme values in a single variable. For example, an age value of 150 years in a customer dataset.
Multivariate outliers: unusual combinations of values across multiple variables. For example, a 25-year-old with 40 years of work experience; each value alone might be normal, but together they're impossible.
Global (point) outliers: individual data points far from the rest of the data.
Contextual outliers: values that are outliers only in a specific context. A temperature of 35°C is normal in summer but an outlier in winter.
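The multivariate case can often be caught with a simple cross-field consistency rule. Below is a minimal sketch with made-up records, assuming (as a hypothetical business rule) that nobody starts working before age 16:

```python
import pandas as pd

# Hypothetical customer records; the last row is impossible:
# a 25-year-old cannot have 40 years of work experience.
people = pd.DataFrame({
    'age': [30, 45, 52, 25],
    'experience': [8, 20, 30, 40]
})

# Flag rows where the implied starting age is below a plausible minimum
implied_start = people['age'] - people['experience']
flags = implied_start < 16
print(people[flags])
```

Neither column contains a univariate outlier here; only the combination of the two fields reveals the bad row.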
The Z-score measures how many standard deviations a value is from the mean.
import pandas as pd
import numpy as np
from scipy import stats
# Sample data with outliers
data = pd.DataFrame({
    'salary': [50000, 52000, 48000, 55000, 51000, 49000,
               200000, 53000, 47000, 54000]
})
# Calculate Z-scores
data['z_score'] = stats.zscore(data['salary'])
print(data)
The zscore() function calculates standardized scores for each value. Values with absolute Z-scores greater than 3 are typically considered outliers.
# Identify outliers using a Z-score threshold.
# With only 10 points, |z| cannot exceed (n - 1) / sqrt(n) ≈ 2.85,
# so the usual cut-off of 3 would miss even the 200000 salary here;
# use a slightly lower threshold for this small sample.
threshold = 2.5
outliers = data[np.abs(data['z_score']) > threshold]
print(f"\nOutliers detected:\n{outliers}")
This code flags observations whose absolute Z-score exceeds the chosen threshold, i.e. values that lie more than that many standard deviations from the mean.
When to Use: Z-score works well for normally distributed data with large sample sizes.
The interquartile range (IQR) is the spread between the 25th percentile (Q1) and the 75th percentile (Q3).
# Calculate IQR
Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
The IQR method defines outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. This is the same method used in box plots.
# Detect outliers
outliers_iqr = data[(data['salary'] < lower_bound) |
                    (data['salary'] > upper_bound)]
print(f"\nIQR Outliers:\n{outliers_iqr}")
This code identifies rows where salary falls outside the calculated bounds.
When to Use: IQR is robust to extreme values and doesn't assume normal distribution.
The median absolute deviation (MAD) is a more robust basis for outlier detection than the standard deviation.
# Calculate the modified Z-score using MAD
def modified_zscore(values):
    median = np.median(values)
    # MAD: median of absolute deviations from the median (assumed non-zero)
    mad = np.median(np.abs(values - median))
    modified_z = 0.6745 * (values - median) / mad
    return modified_z

data['modified_z'] = modified_zscore(data['salary'].values)
# Outliers where |modified Z| > 3.5
outliers_mad = data[np.abs(data['modified_z']) > 3.5]
print(f"MAD Outliers:\n{outliers_mad}")
The modified Z-score uses the median and MAD instead of mean and standard deviation. The constant 0.6745 makes it comparable to standard Z-scores for normal distributions.
When to Use: When data has extreme outliers that would skew mean and standard deviation.
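The 0.6745 constant can be verified directly: for a normal distribution, MAD ≈ 0.6745·σ, where 0.6745 is the 75th percentile of the standard normal (Φ⁻¹(0.75)), so dividing by MAD/0.6745 puts the score on the same scale as an ordinary Z-score. A quick check with SciPy and a large simulated sample:

```python
import numpy as np
from scipy import stats

# Phi^{-1}(0.75): the 75th percentile of the standard normal
c = stats.norm.ppf(0.75)
print(round(c, 4))  # 0.6745

# For a large normal sample, MAD / std should be close to this constant
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
mad = np.median(np.abs(sample - np.median(sample)))
print(round(mad / sample.std(), 3))
```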
import matplotlib.pyplot as plt
# Create box plot
plt.figure(figsize=(8, 6))
plt.boxplot(data['salary'], vert=True)
plt.title('Salary Distribution Box Plot')
plt.ylabel('Salary')
plt.show()
Box plots visually display the median, quartiles, and outliers (points beyond the whiskers). This provides immediate visual identification of extreme values.
# Sample bivariate data
bivariate_data = pd.DataFrame({
    'experience': [2, 3, 4, 5, 6, 7, 8, 15, 9, 10],
    'salary': [40000, 45000, 50000, 55000, 60000,
               65000, 70000, 40000, 75000, 80000]
})
plt.figure(figsize=(8, 6))
plt.scatter(bivariate_data['experience'],
            bivariate_data['salary'])
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Experience vs Salary')
plt.show()
Scatter plots reveal multivariate outliers that might not be detected by univariate methods. The point with 15 years experience but low salary stands out visually.
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(data['salary'], bins=20, edgecolor='black')
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.subplot(1, 2, 2)
plt.hist(data['salary'], bins=20, edgecolor='black', log=True)
plt.title('Salary Distribution (Log Scale)')
plt.xlabel('Salary')
plt.tight_layout()
plt.show()
Histograms show the overall distribution shape and isolated extreme values. Using a log scale can help visualize the relationship between outliers and the main data mass.
Isolation Forest isolates outliers by randomly selecting features and split values.
from sklearn.ensemble import IsolationForest
# Prepare data
X = data[['salary']].values
# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outlier_labels = iso_forest.fit_predict(X)
# Add labels to data
data['isolation_forest'] = outlier_labels
print(data)
# -1 indicates outlier, 1 indicates inlier
Isolation Forest assigns -1 to outliers and 1 to normal points. The contamination parameter specifies the expected proportion of outliers.
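Besides the binary labels, the fitted model also exposes continuous anomaly scores through decision_function, where lower (more negative) values are more anomalous. A minimal self-contained sketch on the same kind of salary data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[50000], [52000], [48000], [55000], [51000],
              [49000], [200000], [53000], [47000], [54000]])

iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(X)

# decision_function: negative scores indicate likely outliers
scores = iso_forest.decision_function(X)
for value, score in zip(X.ravel(), scores):
    print(f"{value:>7}: {score:+.3f}")
```

Scores are useful when you want to rank points by how anomalous they are instead of committing to a hard cut-off up front.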
When to Use: High-dimensional data where traditional methods fail.
LOF compares the local density of a point to its neighbors.
from sklearn.neighbors import LocalOutlierFactor
# Apply LOF
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.1)
outlier_labels = lof.fit_predict(X)
data['lof_label'] = outlier_labels
print(data[['salary', 'lof_label']])
LOF identifies points in sparse regions compared to their neighbors. It's particularly effective for data with varying densities.
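After fitting, LOF also stores a per-point score in its negative_outlier_factor_ attribute: roughly -1 for inliers and much more negative for points in sparse regions. A small self-contained sketch:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[50000], [52000], [48000], [55000], [51000],
              [49000], [200000], [53000], [47000], [54000]])

lof = LocalOutlierFactor(n_neighbors=3, contamination=0.1)
labels = lof.fit_predict(X)

# negative_outlier_factor_: roughly -1 for inliers,
# strongly negative for points far from their neighbors
for value, score in zip(X.ravel(), lof.negative_outlier_factor_):
    print(f"{value:>7}: {score:.2f}")
```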
Simply remove outlier rows from the dataset.
# Remove outliers using IQR bounds
df_cleaned = data[(data['salary'] >= lower_bound) &
                  (data['salary'] <= upper_bound)]
print(f"Original size: {len(data)}")
print(f"After removal: {len(df_cleaned)}")
Removal is straightforward but reduces sample size. Use this when outliers are clearly errors or when you have ample data.
Caution: Don't remove too many observations or you'll lose valuable information.
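One way to enforce that caution in code is a small guard that falls back to capping when a removal rule would drop more than a set fraction of the rows. A minimal sketch, assuming a hypothetical 5% removal budget:

```python
import pandas as pd

df = pd.DataFrame({'salary': [50000, 52000, 48000, 55000, 51000,
                              49000, 200000, 53000, 47000, 54000]})

Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
mask = df['salary'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

removed_frac = 1 - mask.mean()
max_removal = 0.05  # assumed budget: never drop more than 5% of rows
if removed_frac > max_removal:
    print(f"Would drop {removed_frac:.0%} of rows; capping instead")
    cleaned = df.assign(salary=df['salary'].clip(Q1 - 1.5 * IQR,
                                                 Q3 + 1.5 * IQR))
else:
    cleaned = df[mask]
print(len(cleaned))
```

On this sample the rule would drop 10% of the rows, so the guard caps instead of removing.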
Replace outliers with boundary values.
# Cap outliers at IQR bounds
data['salary_capped'] = data['salary'].clip(lower=lower_bound,
                                            upper=upper_bound)
print(data[['salary', 'salary_capped']])
The clip() function replaces values below the lower bound with the lower bound, and values above the upper bound with the upper bound.
# Alternative: Cap at percentiles
lower_cap = data['salary'].quantile(0.05)
upper_cap = data['salary'].quantile(0.95)
data['salary_percentile_capped'] = data['salary'].clip(lower_cap,
                                                       upper_cap)
Percentile capping preserves more of the original distribution while limiting extreme values.
Apply mathematical transformations to reduce outlier impact.
# Log transformation
data['salary_log'] = np.log1p(data['salary'])
# Square root transformation
data['salary_sqrt'] = np.sqrt(data['salary'])
print(data[['salary', 'salary_log', 'salary_sqrt']])
Log and square root transformations compress the range of large values, reducing the relative impact of outliers.
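If a model is trained on the transformed scale, its predictions usually need to be mapped back to the original units: np.expm1 inverts np.log1p, and squaring inverts the square root.

```python
import numpy as np

salary = np.array([50000., 200000.])

# Forward transforms
log_t = np.log1p(salary)
sqrt_t = np.sqrt(salary)

# Inverse transforms recover the original values
back_from_log = np.expm1(log_t)
back_from_sqrt = sqrt_t ** 2
print(np.allclose(back_from_log, salary))   # True
print(np.allclose(back_from_sqrt, salary))  # True
```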
# Visualize transformation effect
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(data['salary'], bins=15)
axes[0].set_title('Original')
axes[1].hist(data['salary_log'], bins=15)
axes[1].set_title('Log Transformed')
axes[2].hist(data['salary_sqrt'], bins=15)
axes[2].set_title('Square Root Transformed')
plt.tight_layout()
plt.show()
This visualization compares the distribution before and after transformation, showing how outliers are pulled closer to the main data.
Convert numerical variables into categorical bins.
# Create salary bins
data['salary_binned'] = pd.cut(data['salary'],
                               bins=[0, 50000, 70000, float('inf')],
                               labels=['Low', 'Medium', 'High'])
print(data[['salary', 'salary_binned']])
Binning groups values into categories, naturally handling outliers by placing them in extreme bins.
Treat outliers like missing values and impute them.
# Replace outliers with median
median_salary = data['salary'].median()
data['salary_imputed'] = data['salary'].apply(
    lambda x: median_salary if x > upper_bound or x < lower_bound else x
)
print(data[['salary', 'salary_imputed']])
This approach replaces outliers with a central tendency measure, maintaining sample size while neutralizing extreme effects.
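The same replacement can be written without a Python-level lambda by using the vectorized Series.where, which keeps values where a condition holds and substitutes elsewhere. A self-contained sketch using the same IQR bounds:

```python
import pandas as pd

df = pd.DataFrame({'salary': [50000, 52000, 48000, 55000, 51000,
                              49000, 200000, 53000, 47000, 54000]})

Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
low, high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

median_salary = df['salary'].median()
in_range = df['salary'].between(low, high)
# Keep in-range values; replace everything else with the median
df['salary_imputed'] = df['salary'].where(in_range, median_salary)
print(df[['salary', 'salary_imputed']])
```

On larger datasets the vectorized form is both faster and easier to read than apply with a conditional lambda.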
class OutlierHandler:
    def __init__(self, method='iqr', threshold=1.5):
        self.method = method
        self.threshold = threshold
        self.bounds_ = {}

    def fit(self, df, columns):
        for col in columns:
            if self.method == 'iqr':
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - self.threshold * IQR
                upper = Q3 + self.threshold * IQR
                self.bounds_[col] = (lower, upper)
        return self

    def transform(self, df, strategy='cap'):
        df_transformed = df.copy()
        for col, (lower, upper) in self.bounds_.items():
            if strategy == 'cap':
                df_transformed[col] = df_transformed[col].clip(lower, upper)
            elif strategy == 'remove':
                mask = (df_transformed[col] >= lower) & (df_transformed[col] <= upper)
                df_transformed = df_transformed[mask]
        return df_transformed

# Usage
handler = OutlierHandler(method='iqr', threshold=1.5)
handler.fit(data, ['salary'])
data_cleaned = handler.transform(data, strategy='cap')
This reusable class fits on training data and applies consistent outlier treatment to new data, preventing data leakage in machine learning pipelines.
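The leakage-prevention point can be illustrated with plain pandas as well: bounds are estimated on the training split only, then applied unchanged to the test split. A sketch with a hypothetical split and the same IQR rule:

```python
import pandas as pd

train = pd.DataFrame({'salary': [50000, 52000, 48000, 55000, 51000,
                                 49000, 200000, 53000, 47000, 54000]})
test = pd.DataFrame({'salary': [46000, 58000, 300000]})

# Fit: compute bounds on the training data only
Q1, Q3 = train['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Transform: apply the training bounds to both splits
train['salary_capped'] = train['salary'].clip(lower, upper)
test['salary_capped'] = test['salary'].clip(lower, upper)
print(test)
```

Recomputing the bounds on the test split would leak information about test-set extremes into the preprocessing step.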
Outlier detection and treatment is essential for building reliable machine learning models. Statistical methods like Z-score and IQR, combined with visualization techniques, help identify extreme values. Treatment strategies range from removal to transformation, each with appropriate use cases. Always consider the context and impact of your outlier handling decisions to ensure your models learn meaningful patterns rather than noise.