Learn how anomaly detection identifies rare or unusual patterns in data. This lesson covers statistical, clustering-based, and machine learning methods used in fraud detection, system monitoring, and security analytics.
Anomaly detection, also known as outlier detection, is an unsupervised machine learning technique that identifies data points deviating significantly from the majority of observations. These unusual patterns often represent critical information—fraudulent transactions, system failures, manufacturing defects, or security breaches.
Unlike classification problems where you have labeled examples of each category, anomaly detection typically works with datasets where anomalies are rare, unlabeled, or previously unknown. This makes it a crucial tool for discovering unexpected patterns and protecting systems from emerging threats.
Understanding the different types of anomalies helps you choose the appropriate detection technique.
Point anomalies are individual data points that are abnormal compared to the rest of the data. This is the simplest and most common type of anomaly.
Example: A single transaction of $50,000 when typical transactions are under $500.
Contextual anomalies are data points that are abnormal only within a specific context. The same value might be normal in one situation but anomalous in another.
Example: A temperature of 35°F is normal in winter but anomalous in summer for the same location.
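As an illustrative sketch (the seasonal temperatures and the threshold are hypothetical), contextual anomalies can be caught by computing z-scores within each context group rather than over the whole dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures (°F): winter near 35, summer near 85
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "season": ["winter"] * 100 + ["summer"] * 100,
    "temp_f": np.concatenate([
        rng.normal(35, 5, 100),  # winter readings
        rng.normal(85, 5, 100),  # summer readings
    ]),
})

# Inject a 35°F summer reading: normal globally, anomalous in context
df.loc[150, "temp_f"] = 35.0

# Z-score within each season instead of across all data
grouped = df.groupby("season")["temp_f"]
df["z_in_context"] = (df["temp_f"] - grouped.transform("mean")) / grouped.transform("std")

contextual_anomalies = df[df["z_in_context"].abs() > 3]
print(contextual_anomalies)
```

A global z-score would miss this point entirely, since 35°F sits well inside the overall temperature range.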
Collective anomalies are groups of data points that together constitute an anomaly, even though individual points might appear normal.
Example: A sequence of small transactions occurring rapidly might indicate card testing fraud, though each transaction amount is normal.
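A minimal sketch of detecting that pattern (the transactions and the 5-minute window are made up for illustration): count transactions in a short rolling time window and flag bursts, even though every individual amount looks normal:

```python
import pandas as pd

# Hypothetical card transactions; each amount alone looks ordinary
tx = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 12:30:00", "2024-01-01 18:45:00",
        # A burst of five small charges within two minutes
        "2024-01-02 03:00:00", "2024-01-02 03:00:20", "2024-01-02 03:00:45",
        "2024-01-02 03:01:10", "2024-01-02 03:01:55",
        "2024-01-02 14:00:00",
    ]),
    "amount": [42.10, 18.99, 55.00, 1.00, 1.00, 1.50, 1.00, 2.00, 37.25],
}).set_index("timestamp")

# Count transactions in a rolling 5-minute window; a high count is the
# collective signal of card testing
tx["count_5min"] = tx["amount"].rolling("5min").count()

burst = tx[tx["count_5min"] >= 4]
print(burst)
```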
Anomaly detection is critical across numerous industries:

- Finance: flagging fraudulent transactions and card abuse
- IT operations: surfacing system failures and performance degradation
- Manufacturing: catching defective units on production lines
- Cybersecurity: identifying breaches, intrusions, and unusual access patterns
Statistical methods provide simple yet effective approaches for detecting anomalies in data with known distributions.
The Z-score measures how many standard deviations a data point is from the mean. Points with extreme Z-scores are considered anomalies.
```python
import numpy as np
import pandas as pd

# Generate sample data with outliers
np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=100)
outliers = np.array([120, 5, 130, -10])
data = np.concatenate([normal_data, outliers])

# Calculate Z-scores
mean = np.mean(data)
std = np.std(data)
z_scores = np.abs((data - mean) / std)

# Identify anomalies (Z-score > 3)
threshold = 3
anomalies = data[z_scores > threshold]

print(f"Mean: {mean:.2f}, Std: {std:.2f}")
print(f"Anomalies detected: {anomalies}")
print(f"Number of anomalies: {len(anomalies)}")
```
The Z-score method flags data points more than 3 standard deviations from the mean as anomalies. This threshold captures approximately 99.7% of normally distributed data as "normal."
The IQR method is more robust to extreme outliers than Z-scores because it uses median-based statistics:
```python
# Calculate IQR bounds
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Define bounds for anomaly detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify anomalies
iqr_anomalies = data[(data < lower_bound) | (data > upper_bound)]

print(f"Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
print(f"Normal range: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"IQR anomalies: {iqr_anomalies}")
```
The IQR method defines anomalies as points falling below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. This approach works well even when data isn't normally distributed.
Isolation Forest is a powerful tree-based algorithm specifically designed for anomaly detection. It works on a simple principle: anomalies are easier to isolate than normal points.
The algorithm builds an ensemble of random trees that isolate observations by:

- Randomly selecting a feature
- Randomly choosing a split value between that feature's minimum and maximum
- Repeating until each point sits alone in its own partition

Anomalies require fewer splits to isolate because they lie in sparse regions of the feature space. The average path length to isolate a point becomes its anomaly score.
```python
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

# Create dataset with anomalies
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
X_anomalies = np.random.uniform(low=-4, high=4, size=(15, 2))
X = np.vstack([X_normal, X_anomalies])

print(f"Dataset shape: {X.shape}")
print("Expected anomalies: 15")
```
We create a dataset with 300 normal points clustered together and 15 randomly scattered anomalies.
```python
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # Expected proportion of anomalies
    random_state=42
)

# Predict: 1 for normal, -1 for anomaly
predictions = iso_forest.fit_predict(X)

# Count results
n_anomalies = (predictions == -1).sum()
n_normal = (predictions == 1).sum()
print(f"Detected anomalies: {n_anomalies}")
print(f"Normal points: {n_normal}")
```
The contamination parameter specifies the expected proportion of anomalies in the dataset, helping the algorithm set an appropriate decision threshold.
```python
# Get anomaly scores (lower = more anomalous)
scores = iso_forest.decision_function(X)
print(f"Score range: [{scores.min():.3f}, {scores.max():.3f}]")
print(f"Mean score: {scores.mean():.3f}")

# Points with lowest scores are most anomalous
most_anomalous_idx = np.argsort(scores)[:5]
print(f"Most anomalous point scores: {scores[most_anomalous_idx]}")
```
The decision function returns anomaly scores where more negative values indicate stronger anomalies. This allows you to rank observations by their degree of abnormality.
Local Outlier Factor detects anomalies by comparing the local density of a point to the local densities of its neighbors. Points with substantially lower density than their neighbors are considered outliers.
LOF calculates:

- The local reachability density of each point, based on the distances to its k nearest neighbors
- The ratio of each neighbor's local density to the point's own density, averaged into the LOF score

A LOF score close to 1 indicates a normal point. Scores significantly greater than 1 indicate anomalies.
```python
from sklearn.neighbors import LocalOutlierFactor

# Create LOF detector
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05
)

# Fit and predict (LOF is primarily for training data)
lof_predictions = lof.fit_predict(X)

# Count anomalies
lof_anomalies = (lof_predictions == -1).sum()
print(f"LOF detected anomalies: {lof_anomalies}")
```
LOF examines 20 nearest neighbors to estimate local density. Points in sparse regions compared to their neighbors receive anomaly labels.
```python
# Get negative outlier factor (more negative = more anomalous)
lof_scores = lof.negative_outlier_factor_
print(f"LOF score range: [{lof_scores.min():.3f}, {lof_scores.max():.3f}]")

# Find most anomalous points
most_anomalous = np.argsort(lof_scores)[:5]
print(f"Most anomalous LOF scores: {lof_scores[most_anomalous]}")
```
The negative_outlier_factor_ attribute provides anomaly scores where more negative values indicate stronger anomalies.
```python
# Compare detection results
both_detected = np.sum((predictions == -1) & (lof_predictions == -1))
iso_only = np.sum((predictions == -1) & (lof_predictions == 1))
lof_only = np.sum((predictions == 1) & (lof_predictions == -1))

print(f"Detected by both: {both_detected}")
print(f"Isolation Forest only: {iso_only}")
print(f"LOF only: {lof_only}")
```
Different algorithms may identify different anomalies. Comparing results helps validate findings and understand the nature of detected outliers.
One-Class SVM learns a boundary around normal data and classifies points outside this boundary as anomalies. It's particularly effective when you have a clean training set of only normal observations.
The algorithm:

- Maps the data into a high-dimensional feature space using a kernel (commonly RBF)
- Finds a maximum-margin boundary separating the data from the origin in that space
- Classifies points that fall outside the learned boundary as anomalies
```python
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Scale features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create One-Class SVM
oc_svm = OneClassSVM(
    kernel='rbf',
    gamma='scale',
    nu=0.05  # Upper bound on fraction of outliers
)

# Fit and predict
svm_predictions = oc_svm.fit_predict(X_scaled)
svm_anomalies = (svm_predictions == -1).sum()
print(f"One-Class SVM detected anomalies: {svm_anomalies}")
```
The nu parameter sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, effectively controlling sensitivity to anomalies.
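To see that effect, here is a self-contained sketch (rebuilding the same synthetic dataset used above) that sweeps nu and counts how many points get flagged:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Rebuild the lesson's dataset: 300 clustered points plus 15 scattered ones
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
np.random.seed(42)
X_anomalies = np.random.uniform(low=-4, high=4, size=(15, 2))
X_scaled = StandardScaler().fit_transform(np.vstack([X_normal, X_anomalies]))

# Larger nu tolerates more training points outside the boundary,
# so more observations are flagged as anomalies
counts = {}
for nu in [0.01, 0.05, 0.10, 0.20]:
    preds = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit_predict(X_scaled)
    counts[nu] = int((preds == -1).sum())
    print(f"nu={nu:.2f}: {counts[nu]} anomalies flagged")
```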
```python
# Get distance from the decision boundary
svm_scores = oc_svm.decision_function(X_scaled)
print(f"SVM score range: [{svm_scores.min():.3f}, {svm_scores.max():.3f}]")

# Negative scores indicate anomalies
anomaly_scores = svm_scores[svm_predictions == -1]
print(f"Anomaly score range: [{anomaly_scores.min():.3f}, {anomaly_scores.max():.3f}]")
```
The decision function returns signed distances to the separating hyperplane, where negative values indicate anomalies.
Let's create a comprehensive comparison of all methods:
```python
# Create ground truth labels (last 15 points are anomalies)
true_labels = np.array([1] * 300 + [-1] * 15)

# Store all predictions
methods = {
    'Isolation Forest': predictions,
    'LOF': lof_predictions,
    'One-Class SVM': svm_predictions
}

# Compare each method
print("Method Comparison:")
print("-" * 50)
for name, preds in methods.items():
    true_positives = np.sum((preds == -1) & (true_labels == -1))
    false_positives = np.sum((preds == -1) & (true_labels == 1))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / 15  # 15 actual anomalies
    print(f"{name}:")
    print(f"  True Positives: {true_positives}/15")
    print(f"  False Positives: {false_positives}")
    print(f"  Precision: {precision:.2%}")
    print(f"  Recall: {recall:.2%}")
    print()
```
This comparison reveals how each method performs at identifying true anomalies while minimizing false alarms.
Combining multiple methods often produces more robust results:
```python
# Create ensemble score (voting)
ensemble_votes = (
    (predictions == -1).astype(int) +
    (lof_predictions == -1).astype(int) +
    (svm_predictions == -1).astype(int)
)

# Anomaly if at least 2 methods agree
ensemble_predictions = np.where(ensemble_votes >= 2, -1, 1)
ensemble_anomalies = (ensemble_predictions == -1).sum()
print(f"Ensemble detected anomalies: {ensemble_anomalies}")

# Check agreement level
for votes in range(4):
    count = (ensemble_votes == votes).sum()
    print(f"Points with {votes} anomaly votes: {count}")
```
Ensemble approaches reduce false positives by requiring multiple methods to agree before flagging a point as anomalous.
Time series data requires special consideration for detecting anomalies:
```python
# Generate time series with anomalies
np.random.seed(42)
time_points = 200
normal_series = np.sin(np.linspace(0, 4*np.pi, time_points)) + \
                np.random.normal(0, 0.1, time_points)

# Inject anomalies
anomaly_indices = [50, 100, 150]
time_series = normal_series.copy()
time_series[anomaly_indices] = [2.5, -2.0, 3.0]

print(f"Time series length: {len(time_series)}")
print(f"Injected anomalies at indices: {anomaly_indices}")

# Use rolling window statistics
window_size = 10
series_df = pd.DataFrame({'value': time_series})

# Calculate rolling mean and std
series_df['rolling_mean'] = series_df['value'].rolling(
    window=window_size, center=True
).mean()
series_df['rolling_std'] = series_df['value'].rolling(
    window=window_size, center=True
).std()

# Calculate Z-score relative to rolling window
series_df['z_score'] = (
    (series_df['value'] - series_df['rolling_mean']) /
    series_df['rolling_std']
)

# Detect anomalies
ts_anomalies = series_df[series_df['z_score'].abs() > 3].index.tolist()
print(f"Detected time series anomalies at: {ts_anomalies}")
```
Rolling window statistics adapt to local patterns, making anomaly detection context-aware for time series data.
| Method | Best For | Strengths | Limitations |
|---|---|---|---|
| Z-Score | Simple univariate data | Fast, interpretable | Assumes normality |
| IQR | Data with unknown distribution | Robust to extremes | Univariate only |
| Isolation Forest | High-dimensional data | Scalable, no density estimation | May struggle with local anomalies |
| LOF | Data with varying densities | Detects local anomalies | Computationally expensive |
| One-Class SVM | Clean training data available | Effective boundary learning | Sensitive to parameters |
```python
from sklearn.preprocessing import RobustScaler

# RobustScaler is less sensitive to outliers
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
print("Use RobustScaler to prevent outliers from affecting scaling")
```
RobustScaler uses median and interquartile range instead of mean and standard deviation, making it more suitable for data containing anomalies.
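A quick check with a made-up toy array confirms this: the fitted `center_` is the median, so one extreme value barely shifts it, while the mean is dragged far off:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy data with one extreme outlier
x = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])

robust = RobustScaler().fit(x)
standard = StandardScaler().fit(x)

# RobustScaler centers on the median; StandardScaler's mean is
# dominated by the single outlier
print(f"Median: {np.median(x):.1f}, RobustScaler center: {robust.center_[0]:.1f}")
print(f"StandardScaler mean: {standard.mean_[0]:.1f}")
```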
```python
# When labels are available, use precision-recall metrics
from sklearn.metrics import average_precision_score

# Convert predictions to scores for ranking
scores = iso_forest.decision_function(X)

# Negate scores so higher = more anomalous
scores_for_pr = -scores

# Calculate average precision
ap = average_precision_score(true_labels == -1, scores_for_pr)
print(f"Average Precision: {ap:.3f}")
```
Average Precision provides a single metric summarizing performance across all thresholds, particularly useful for imbalanced anomaly detection tasks.
```python
# Always scale features for distance-based methods
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Then apply anomaly detection
lof_scaled = LocalOutlierFactor(n_neighbors=20)
predictions_scaled = lof_scaled.fit_predict(X_scaled)
```
Distance-based methods like LOF and One-Class SVM are sensitive to feature scales.
```python
# Test different contamination values
contamination_values = [0.01, 0.05, 0.10, 0.15]
for cont in contamination_values:
    iso = IsolationForest(contamination=cont, random_state=42)
    preds = iso.fit_predict(X)
    n_anomalies = (preds == -1).sum()
    print(f"Contamination {cont}: {n_anomalies} anomalies detected")
```
Incorrect contamination settings lead to too many or too few detections. Start with domain knowledge or use score distributions to set thresholds.
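One alternative to guessing contamination is to fit without it and cut the score distribution directly. A self-contained sketch (rebuilding the lesson's synthetic dataset; the 5% cutoff is an assumption you would replace with domain knowledge):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Rebuild the lesson's dataset: 300 clustered points plus 15 scattered ones
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
np.random.seed(42)
X = np.vstack([X_normal, np.random.uniform(low=-4, high=4, size=(15, 2))])

# Fit without contamination, then derive a threshold from the score
# distribution itself: flag the lowest 5% of scores
iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = iso.score_samples(X)
threshold = np.percentile(scores, 5)
flagged = (scores < threshold).sum()
print(f"Threshold at 5th percentile: {threshold:.3f}, flagged: {flagged}")
```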
For time-dependent data, always consider whether anomalies are contextual. A value might be normal at one time but anomalous at another.