Learn how anomaly detection identifies rare or unusual patterns in data. This lesson covers statistical, clustering-based, and machine learning methods used in fraud detection, system monitoring, and security analytics.
Anomaly detection, also known as outlier detection, is an unsupervised machine learning technique that identifies data points deviating significantly from the majority of observations. These unusual patterns often represent critical information—fraudulent transactions, system failures, manufacturing defects, or security breaches.
Unlike classification problems where you have labeled examples of each category, anomaly detection typically works with datasets where anomalies are rare, unlabeled, or previously unknown. This makes it a crucial tool for discovering unexpected patterns and protecting systems from emerging threats.
Understanding the different types of anomalies helps you choose the appropriate detection technique.
Point anomalies are individual data points that are abnormal compared to the rest of the data. This is the simplest and most common type of anomaly.
Example: A single transaction of $50,000 when typical transactions are under $500.
Contextual anomalies are data points that are abnormal only within a specific context. The same value might be normal in one situation but anomalous in another.
Example: A temperature of 35°F is normal in winter but anomalous in summer for the same location.
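As an illustrative sketch (the seasonal temperatures and the threshold are hypothetical), contextual anomalies can be caught by computing z-scores within each context group rather than over the whole dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures (°F): winter near 35, summer near 85
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "season": ["winter"] * 100 + ["summer"] * 100,
    "temp_f": np.concatenate([
        rng.normal(35, 5, 100),  # winter readings
        rng.normal(85, 5, 100),  # summer readings
    ]),
})

# Inject a 35°F summer reading: normal globally, anomalous in context
df.loc[150, "temp_f"] = 35.0

# Z-score within each season instead of across all data
grouped = df.groupby("season")["temp_f"]
df["z_in_context"] = (df["temp_f"] - grouped.transform("mean")) / grouped.transform("std")

contextual_anomalies = df[df["z_in_context"].abs() > 3]
print(contextual_anomalies)
```

A global z-score would miss this point entirely, since 35°F sits well inside the overall temperature range.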
Collective anomalies are groups of data points that together constitute an anomaly, even though individual points might appear normal.
Example: A sequence of small transactions occurring rapidly might indicate card testing fraud, though each transaction amount is normal.
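A minimal sketch of detecting that pattern (the transactions and the 5-minute window are made up for illustration): count transactions in a short rolling time window and flag bursts, even though every individual amount looks normal:

```python
import pandas as pd

# Hypothetical card transactions; each amount alone looks ordinary
tx = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 12:30:00", "2024-01-01 18:45:00",
        # A burst of five small charges within two minutes
        "2024-01-02 03:00:00", "2024-01-02 03:00:20", "2024-01-02 03:00:45",
        "2024-01-02 03:01:10", "2024-01-02 03:01:55",
        "2024-01-02 14:00:00",
    ]),
    "amount": [42.10, 18.99, 55.00, 1.00, 1.00, 1.50, 1.00, 2.00, 37.25],
}).set_index("timestamp")

# Count transactions in a rolling 5-minute window; a high count is the
# collective signal of card testing
tx["count_5min"] = tx["amount"].rolling("5min").count()

burst = tx[tx["count_5min"] >= 4]
print(burst)
```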
Anomaly detection is critical across numerous industries:

- Finance: flagging fraudulent transactions and card abuse
- IT operations: surfacing system failures and performance degradation
- Manufacturing: catching defective units on production lines
- Cybersecurity: identifying breaches, intrusions, and unusual access patterns
Statistical methods provide simple yet effective approaches for detecting anomalies in data with known distributions.
The Z-score measures how many standard deviations a data point is from the mean. Points with extreme Z-scores are considered anomalies.
```python
import numpy as np
import pandas as pd

# Generate sample data with outliers
np.random.seed(42)
normal_data = np.random.normal(loc=50, scale=10, size=100)
outliers = np.array([120, 5, 130, -10])
data = np.concatenate([normal_data, outliers])

# Calculate Z-scores
mean = np.mean(data)
std = np.std(data)
z_scores = np.abs((data - mean) / std)

# Identify anomalies (Z-score > 3)
threshold = 3
anomalies = data[z_scores > threshold]

print(f"Mean: {mean:.2f}, Std: {std:.2f}")
print(f"Anomalies detected: {anomalies}")
print(f"Number of anomalies: {len(anomalies)}")
```
The Z-score method flags data points more than 3 standard deviations from the mean as anomalies. This threshold captures approximately 99.7% of normally distributed data as "normal."
The IQR method is more robust to extreme outliers than Z-scores because it uses median-based statistics:
```python
# Calculate IQR bounds
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Define bounds for anomaly detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify anomalies
iqr_anomalies = data[(data < lower_bound) | (data > upper_bound)]

print(f"Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
print(f"Normal range: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"IQR anomalies: {iqr_anomalies}")
```
The IQR method defines anomalies as points falling below Q1 - 1.5×IQR or above Q3 + 1.5×IQR. This approach works well even when data isn't normally distributed.
Isolation Forest is a powerful tree-based algorithm specifically designed for anomaly detection. It works on a simple principle: anomalies are easier to isolate than normal points.
The algorithm builds an ensemble of random trees that isolate observations by:

- Randomly selecting a feature
- Randomly choosing a split value between that feature's minimum and maximum
- Repeating until each point sits alone in its own partition

Anomalies require fewer splits to isolate because they lie in sparse regions of the feature space. The average path length to isolate a point becomes its anomaly score.
```python
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs

# Create dataset with anomalies
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
X_anomalies = np.random.uniform(low=-4, high=4, size=(15, 2))
X = np.vstack([X_normal, X_anomalies])

print(f"Dataset shape: {X.shape}")
print("Expected anomalies: 15")
```
We create a dataset with 300 normal points clustered together and 15 randomly scattered anomalies.
```python
# Train Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # Expected proportion of anomalies
    random_state=42
)

# Predict: 1 for normal, -1 for anomaly
predictions = iso_forest.fit_predict(X)

# Count results
n_anomalies = (predictions == -1).sum()
n_normal = (predictions == 1).sum()
print(f"Detected anomalies: {n_anomalies}")
print(f"Normal points: {n_normal}")
```
The contamination parameter specifies the expected proportion of anomalies in the dataset, helping the algorithm set an appropriate decision threshold.
```python
# Get anomaly scores (lower = more anomalous)
scores = iso_forest.decision_function(X)
print(f"Score range: [{scores.min():.3f}, {scores.max():.3f}]")
print(f"Mean score: {scores.mean():.3f}")

# Points with lowest scores are most anomalous
most_anomalous_idx = np.argsort(scores)[:5]
print(f"Most anomalous point scores: {scores[most_anomalous_idx]}")
```
The decision function returns anomaly scores where more negative values indicate stronger anomalies. This allows you to rank observations by their degree of abnormality.
Local Outlier Factor detects anomalies by comparing the local density of a point to the local densities of its neighbors. Points with substantially lower density than their neighbors are considered outliers.
LOF calculates:

- The local reachability density of each point, based on the distances to its k nearest neighbors
- The ratio of each neighbor's local density to the point's own density, averaged into the LOF score

A LOF score close to 1 indicates a normal point. Scores significantly greater than 1 indicate anomalies.
```python
from sklearn.neighbors import LocalOutlierFactor

# Create LOF detector
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05
)

# Fit and predict (LOF is primarily for training data)
lof_predictions = lof.fit_predict(X)

# Count anomalies
lof_anomalies = (lof_predictions == -1).sum()
print(f"LOF detected anomalies: {lof_anomalies}")
```
LOF examines 20 nearest neighbors to estimate local density. Points in sparse regions compared to their neighbors receive anomaly labels.
```python
# Get negative outlier factor (more negative = more anomalous)
lof_scores = lof.negative_outlier_factor_
print(f"LOF score range: [{lof_scores.min():.3f}, {lof_scores.max():.3f}]")

# Find most anomalous points
most_anomalous = np.argsort(lof_scores)[:5]
print(f"Most anomalous LOF scores: {lof_scores[most_anomalous]}")
```
The negative_outlier_factor_ attribute provides anomaly scores where more negative values indicate stronger anomalies.
```python
# Compare detection results
both_detected = np.sum((predictions == -1) & (lof_predictions == -1))
iso_only = np.sum((predictions == -1) & (lof_predictions == 1))
lof_only = np.sum((predictions == 1) & (lof_predictions == -1))

print(f"Detected by both: {both_detected}")
print(f"Isolation Forest only: {iso_only}")
print(f"LOF only: {lof_only}")
```
Different algorithms may identify different anomalies. Comparing results helps validate findings and understand the nature of detected outliers.
One-Class SVM learns a boundary around normal data and classifies points outside this boundary as anomalies. It's particularly effective when you have a clean training set of only normal observations.
The algorithm:

- Maps the data into a high-dimensional feature space using a kernel (commonly RBF)
- Finds a maximum-margin boundary separating the data from the origin in that space
- Classifies points that fall outside the learned boundary as anomalies
```python
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Scale features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create One-Class SVM
oc_svm = OneClassSVM(
    kernel='rbf',
    gamma='scale',
    nu=0.05  # Upper bound on fraction of outliers
)

# Fit and predict
svm_predictions = oc_svm.fit_predict(X_scaled)
svm_anomalies = (svm_predictions == -1).sum()
print(f"One-Class SVM detected anomalies: {svm_anomalies}")
```
The nu parameter sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, effectively controlling sensitivity to anomalies.
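To see that effect, here is a self-contained sketch (rebuilding the same synthetic dataset used above) that sweeps nu and counts how many points get flagged:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Rebuild the lesson's dataset: 300 clustered points plus 15 scattered ones
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
np.random.seed(42)
X_anomalies = np.random.uniform(low=-4, high=4, size=(15, 2))
X_scaled = StandardScaler().fit_transform(np.vstack([X_normal, X_anomalies]))

# Larger nu tolerates more training points outside the boundary,
# so more observations are flagged as anomalies
counts = {}
for nu in [0.01, 0.05, 0.10, 0.20]:
    preds = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit_predict(X_scaled)
    counts[nu] = int((preds == -1).sum())
    print(f"nu={nu:.2f}: {counts[nu]} anomalies flagged")
```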
```python
# Get distance from the decision boundary
svm_scores = oc_svm.decision_function(X_scaled)
print(f"SVM score range: [{svm_scores.min():.3f}, {svm_scores.max():.3f}]")

# Negative scores indicate anomalies
anomaly_scores = svm_scores[svm_predictions == -1]
print(f"Anomaly score range: [{anomaly_scores.min():.3f}, {anomaly_scores.max():.3f}]")
```
The decision function returns signed distances to the separating hyperplane, where negative values indicate anomalies.
Let's create a comprehensive comparison of all methods:
```python
# Create ground truth labels (last 15 points are anomalies)
true_labels = np.array([1] * 300 + [-1] * 15)

# Store all predictions
methods = {
    'Isolation Forest': predictions,
    'LOF': lof_predictions,
    'One-Class SVM': svm_predictions
}

# Compare each method
print("Method Comparison:")
print("-" * 50)
for name, preds in methods.items():
    true_positives = np.sum((preds == -1) & (true_labels == -1))
    false_positives = np.sum((preds == -1) & (true_labels == 1))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / 15  # 15 actual anomalies
    print(f"{name}:")
    print(f"  True Positives: {true_positives}/15")
    print(f"  False Positives: {false_positives}")
    print(f"  Precision: {precision:.2%}")
    print(f"  Recall: {recall:.2%}")
    print()
```
This comparison reveals how each method performs at identifying true anomalies while minimizing false alarms.
Combining multiple methods often produces more robust results:
```python
# Create ensemble score (voting)
ensemble_votes = (
    (predictions == -1).astype(int) +
    (lof_predictions == -1).astype(int) +
    (svm_predictions == -1).astype(int)
)

# Anomaly if at least 2 methods agree
ensemble_predictions = np.where(ensemble_votes >= 2, -1, 1)
ensemble_anomalies = (ensemble_predictions == -1).sum()
print(f"Ensemble detected anomalies: {ensemble_anomalies}")

# Check agreement level
for votes in range(4):
    count = (ensemble_votes == votes).sum()
    print(f"Points with {votes} anomaly votes: {count}")
```
Ensemble approaches reduce false positives by requiring multiple methods to agree before flagging a point as anomalous.
Time series data requires special consideration for detecting anomalies:
```python
# Generate time series with anomalies
np.random.seed(42)
time_points = 200
normal_series = np.sin(np.linspace(0, 4*np.pi, time_points)) + \
                np.random.normal(0, 0.1, time_points)

# Inject anomalies
anomaly_indices = [50, 100, 150]
time_series = normal_series.copy()
time_series[anomaly_indices] = [2.5, -2.0, 3.0]

print(f"Time series length: {len(time_series)}")
print(f"Injected anomalies at indices: {anomaly_indices}")

# Use rolling window statistics
window_size = 10
series_df = pd.DataFrame({'value': time_series})

# Calculate rolling mean and std
series_df['rolling_mean'] = series_df['value'].rolling(
    window=window_size, center=True
).mean()
series_df['rolling_std'] = series_df['value'].rolling(
    window=window_size, center=True
).std()

# Calculate Z-score relative to rolling window
series_df['z_score'] = (
    (series_df['value'] - series_df['rolling_mean']) /
    series_df['rolling_std']
)

# Detect anomalies
ts_anomalies = series_df[series_df['z_score'].abs() > 3].index.tolist()
print(f"Detected time series anomalies at: {ts_anomalies}")
```
Rolling window statistics adapt to local patterns, making anomaly detection context-aware for time series data.
| Method | Best For | Strengths | Limitations |
|---|---|---|---|
| Z-Score | Simple univariate data | Fast, interpretable | Assumes normality |
| IQR | Data with unknown distribution | Robust to extremes | Univariate only |
| Isolation Forest | High-dimensional data | Scalable, no density estimation | May struggle with local anomalies |
| LOF | Data with varying densities | Detects local anomalies | Computationally expensive |
| One-Class SVM | Clean training data available | Effective boundary learning | Sensitive to parameters |
```python
from sklearn.preprocessing import RobustScaler

# RobustScaler is less sensitive to outliers
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
print("Use RobustScaler to prevent outliers from affecting scaling")
```
RobustScaler uses median and interquartile range instead of mean and standard deviation, making it more suitable for data containing anomalies.
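A quick check with a made-up toy array confirms this: the fitted `center_` is the median, so one extreme value barely shifts it, while the mean is dragged far off:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy data with one extreme outlier
x = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])

robust = RobustScaler().fit(x)
standard = StandardScaler().fit(x)

# RobustScaler centers on the median; StandardScaler's mean is
# dominated by the single outlier
print(f"Median: {np.median(x):.1f}, RobustScaler center: {robust.center_[0]:.1f}")
print(f"StandardScaler mean: {standard.mean_[0]:.1f}")
```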
```python
# When labels are available, use precision-recall metrics
from sklearn.metrics import average_precision_score

# Convert predictions to scores for ranking
scores = iso_forest.decision_function(X)

# Negate scores so higher = more anomalous
scores_for_pr = -scores

# Calculate average precision
ap = average_precision_score(true_labels == -1, scores_for_pr)
print(f"Average Precision: {ap:.3f}")
```
Average Precision provides a single metric summarizing performance across all thresholds, particularly useful for imbalanced anomaly detection tasks.
```python
# Always scale features for distance-based methods
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Then apply anomaly detection
lof_scaled = LocalOutlierFactor(n_neighbors=20)
predictions_scaled = lof_scaled.fit_predict(X_scaled)
```
Distance-based methods like LOF and One-Class SVM are sensitive to feature scales.
```python
# Test different contamination values
contamination_values = [0.01, 0.05, 0.10, 0.15]
for cont in contamination_values:
    iso = IsolationForest(contamination=cont, random_state=42)
    preds = iso.fit_predict(X)
    n_anomalies = (preds == -1).sum()
    print(f"Contamination {cont}: {n_anomalies} anomalies detected")
```
Incorrect contamination settings lead to too many or too few detections. Start with domain knowledge or use score distributions to set thresholds.
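One alternative to guessing contamination is to fit without it and cut the score distribution directly. A self-contained sketch (rebuilding the lesson's synthetic dataset; the 5% cutoff is an assumption you would replace with domain knowledge):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Rebuild the lesson's dataset: 300 clustered points plus 15 scattered ones
X_normal, _ = make_blobs(n_samples=300, centers=1,
                         cluster_std=0.5, random_state=42)
np.random.seed(42)
X = np.vstack([X_normal, np.random.uniform(low=-4, high=4, size=(15, 2))])

# Fit without contamination, then derive a threshold from the score
# distribution itself: flag the lowest 5% of scores
iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = iso.score_samples(X)
threshold = np.percentile(scores, 5)
flagged = (scores < threshold).sum()
print(f"Threshold at 5th percentile: {threshold:.3f}, flagged: {flagged}")
```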
For time-dependent data, always consider whether anomalies are contextual. A value might be normal at one time but anomalous at another.