Dimensionality reduction with PCA (Principal Component Analysis) is a technique used to simplify large datasets by converting many features into a smaller set of important components. PCA reduces noise, improves model performance, and speeds up processing while preserving the most meaningful patterns and variability in the data.
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while retaining the most important information. In machine learning, datasets often contain many features, some of which may be redundant or correlated. Reducing dimensions helps remove noise, speed up training, reduce the risk of overfitting, and make the data easier to visualize.
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms correlated features into a set of uncorrelated variables called principal components. These components are ordered by the amount of variance they explain.
Imagine plotting height and weight data in 2D. These variables are correlated—taller people tend to weigh more. PCA finds a new axis (the first principal component) that captures this relationship, pointing in the direction of maximum variance. The second component is perpendicular to the first, capturing remaining variance.
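This intuition can be made concrete with plain NumPy: the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue (the variance along each direction). A minimal sketch on synthetic two-variable data (separate from the running example below):

```python
import numpy as np

# Synthetic "height/weight"-style data: two correlated variables
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=100)])

# Center the data and eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh returns ascending eigenvalues; reverse so PC1 (max variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Share of variance along PC1:", eigenvalues[0] / eigenvalues.sum())
```

Because the two variables are strongly correlated, the first eigenvector captures nearly all the variance, and the second (perpendicular) direction captures the small remainder. Scikit-learn's PCA, used below, performs this same decomposition internally.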
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Create sample data with correlated features
np.random.seed(42)
n_samples = 200
# Generate correlated features
feature_1 = np.random.randn(n_samples)
feature_2 = feature_1 * 0.8 + np.random.randn(n_samples) * 0.3
feature_3 = feature_1 * 0.5 + np.random.randn(n_samples) * 0.5
feature_4 = np.random.randn(n_samples) # Independent feature
data = pd.DataFrame({
    'feature_1': feature_1,
    'feature_2': feature_2,
    'feature_3': feature_3,
    'feature_4': feature_4
})
print("Original Data Shape:", data.shape)
print("\nCorrelation Matrix:")
print(data.corr().round(2))
This creates sample data with some correlated features to demonstrate how PCA handles redundancy.
# Step 1: Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print("\nScaled Data Statistics:")
print(f"Mean: {data_scaled.mean(axis=0).round(4)}")
print(f"Std: {data_scaled.std(axis=0).round(4)}")
PCA is sensitive to feature scales: without standardization, features with larger numeric ranges dominate the principal components.
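To see the effect, compare the variance ratios PCA reports on synthetic data where one feature's scale dwarfs the other's (a quick illustration, separate from the running example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent features on wildly different scales
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(size=200),              # scale ~1
    rng.normal(scale=100.0, size=200)  # scale ~100
])

# Without scaling, PC1 simply tracks the large-scale feature
pca_raw = PCA().fit(raw)
print("Unscaled:", pca_raw.explained_variance_ratio_.round(3))

# After standardization, both features contribute comparably
pca_scaled = PCA().fit(StandardScaler().fit_transform(raw))
print("Scaled:  ", pca_scaled.explained_variance_ratio_.round(3))
```

On the unscaled data PC1 explains essentially all the variance, not because the large-scale feature is more informative, but simply because its numbers are bigger.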
# Step 2: Apply PCA
pca = PCA()
data_pca = pca.fit_transform(data_scaled)
print("\nPCA Results:")
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Cumulative Variance: {np.cumsum(pca.explained_variance_ratio_).round(4)}")
explained_variance_ratio_ shows the proportion of variance captured by each component. The cumulative sum reveals how many components are needed to explain a target percentage of variance.
# Scree plot
plt.figure(figsize=(12, 4))
# Individual explained variance
plt.subplot(1, 2, 1)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
# Cumulative explained variance
plt.subplot(1, 2, 2)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), 'bo-')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance Explained')
plt.legend()
plt.tight_layout()
plt.show()
The scree plot helps determine the optimal number of components. Look for the "elbow" where adding more components provides diminishing returns.
# Method 1: Specify exact number of components
pca_2 = PCA(n_components=2)
data_2d = pca_2.fit_transform(data_scaled)
print(f"Shape after PCA (n=2): {data_2d.shape}")
print(f"Variance explained: {sum(pca_2.explained_variance_ratio_):.2%}")
# Method 2: Specify variance threshold
pca_95 = PCA(n_components=0.95) # Keep 95% of variance
data_95 = pca_95.fit_transform(data_scaled)
print(f"\nComponents for 95% variance: {pca_95.n_components_}")
print(f"Shape: {data_95.shape}")
Setting n_components to a float between 0 and 1 automatically selects enough components to explain that proportion of variance.
# Get loadings (feature contributions to each component)
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(len(pca.components_))],
    index=data.columns
)
print("PCA Loadings (Feature Contributions):")
print(loadings.round(3))
Loadings show how much each original feature contributes to each principal component. Large absolute values indicate strong contributions.
# Visualize loadings (pass figsize to DataFrame.plot, which creates its own figure)
loadings_plot = loadings.iloc[:, :2]  # First 2 components
loadings_plot.plot(kind='bar', figsize=(10, 6))
plt.title('Feature Loadings for First Two Principal Components')
plt.xlabel('Original Features')
plt.ylabel('Loading Value')
plt.legend(title='Component')
plt.axhline(y=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
PCA is commonly used to visualize high-dimensional data in 2D or 3D.
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Standardize and apply PCA
X_scaled = StandardScaler().fit_transform(X)
pca_vis = PCA(n_components=2)
X_pca = pca_vis.fit_transform(X_scaled)
# Create visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
                      cmap='viridis', alpha=0.7, edgecolors='black')
plt.xlabel(f'PC1 ({pca_vis.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_vis.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Iris Dataset - PCA Visualization')
plt.colorbar(scatter, label='Species')
plt.show()
print(f"Total variance explained: {sum(pca_vis.explained_variance_ratio_):.1%}")
This visualization projects the 4-dimensional Iris data to 2D while preserving the most information possible.
from mpl_toolkits.mplot3d import Axes3D
# Apply PCA with 3 components
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)
# Create 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca_3d[:, 0], X_pca_3d[:, 1], X_pca_3d[:, 2],
                     c=y, cmap='viridis', alpha=0.7)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('Iris Dataset - 3D PCA Visualization')
plt.colorbar(scatter, label='Species', shrink=0.5)
plt.show()
print(f"Total variance explained (3D): {sum(pca_3d.explained_variance_ratio_):.1%}")
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Create pipeline with PCA
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Train and evaluate
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Training Accuracy: {train_score:.3f}")
print(f"Test Accuracy: {test_score:.3f}")
# Compare with full features
pipeline_full = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
pipeline_full.fit(X_train, y_train)
full_train = pipeline_full.score(X_train, y_train)
full_test = pipeline_full.score(X_test, y_test)
print(f"\nWithout PCA - Train: {full_train:.3f}, Test: {full_test:.3f}")
print(f"With PCA (2 components) - Train: {train_score:.3f}, Test: {test_score:.3f}")
This comparison shows the trade-off between dimensionality reduction and model performance.
from sklearn.model_selection import cross_val_score
# Test different numbers of components
results = []
for n_comp in range(1, X.shape[1] + 1):
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_comp)),
        ('classifier', LogisticRegression(max_iter=1000))
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    results.append({
        'n_components': n_comp,
        'mean_score': scores.mean(),
        'std_score': scores.std()
    })
results_df = pd.DataFrame(results)
print("Cross-Validation Results:")
print(results_df)
# Plot results
plt.figure(figsize=(8, 5))
plt.errorbar(results_df['n_components'], results_df['mean_score'],
             yerr=results_df['std_score'], marker='o', capsize=5)
plt.xlabel('Number of Components')
plt.ylabel('Cross-Validation Score')
plt.title('PCA Components vs Model Performance')
plt.show()
This analysis helps determine the minimum number of components needed for good model performance.
PCA transformations can be reversed to reconstruct approximations of original data.
# Apply PCA with 2 components
pca_reconstruct = PCA(n_components=2)
X_reduced = pca_reconstruct.fit_transform(X_scaled)
# Reconstruct data
X_reconstructed = pca_reconstruct.inverse_transform(X_reduced)
# Calculate reconstruction error
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Mean Squared Reconstruction Error: {reconstruction_error:.4f}")
# Compare original and reconstructed
comparison = pd.DataFrame({
    'Original': X_scaled[0],
    'Reconstructed': X_reconstructed[0],
    'Difference': X_scaled[0] - X_reconstructed[0]
}, index=iris.feature_names)
print("\nFirst Sample Comparison:")
print(comparison.round(3))
Lower reconstruction error indicates better preservation of original data information.
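Reconstruction error also offers another way to judge how many components to keep. A short sketch, recomputing the standardized Iris data, that tracks the error as components are added:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same standardized Iris data as above
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Error falls as more components are kept, reaching (numerically)
# zero when all four components are retained
errors = []
for k in range(1, X_scaled.shape[1] + 1):
    pca_k = PCA(n_components=k)
    reconstructed = pca_k.inverse_transform(pca_k.fit_transform(X_scaled))
    errors.append(np.mean((X_scaled - reconstructed) ** 2))
    print(f"{k} components: MSE = {errors[-1]:.4f}")
```

The error at each step equals the average variance of the discarded components, so the curve mirrors the cumulative-variance plot from earlier.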
# Kernel PCA for non-linear relationships
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(X_scaled)
print("Kernel PCA applied successfully")
print(f"Transformed shape: {X_kpca.shape}")
Kernel PCA can capture non-linear patterns using different kernel functions.
class PCAAnalyzer:
    def __init__(self, variance_threshold=0.95):
        self.variance_threshold = variance_threshold
        self.scaler = StandardScaler()
        self.pca = None

    def fit(self, X):
        # Scale data
        X_scaled = self.scaler.fit_transform(X)
        # Fit PCA with variance threshold
        self.pca = PCA(n_components=self.variance_threshold)
        self.pca.fit(X_scaled)
        return self

    def transform(self, X):
        X_scaled = self.scaler.transform(X)
        return self.pca.transform(X_scaled)

    def get_summary(self):
        return {
            'n_components': self.pca.n_components_,
            'explained_variance': self.pca.explained_variance_ratio_,
            'total_variance': sum(self.pca.explained_variance_ratio_)
        }

    def plot_variance(self):
        plt.figure(figsize=(10, 4))
        plt.subplot(1, 2, 1)
        plt.bar(range(1, len(self.pca.explained_variance_ratio_) + 1),
                self.pca.explained_variance_ratio_)
        plt.xlabel('Component')
        plt.ylabel('Variance Ratio')
        plt.title('Explained Variance per Component')
        plt.subplot(1, 2, 2)
        plt.plot(range(1, len(self.pca.explained_variance_ratio_) + 1),
                 np.cumsum(self.pca.explained_variance_ratio_), 'bo-')
        plt.axhline(y=self.variance_threshold, color='r', linestyle='--')
        plt.xlabel('Number of Components')
        plt.ylabel('Cumulative Variance')
        plt.title('Cumulative Explained Variance')
        plt.tight_layout()
        plt.show()
# Usage
analyzer = PCAAnalyzer(variance_threshold=0.95)
analyzer.fit(X)
X_transformed = analyzer.transform(X)
print("PCA Summary:")
for key, value in analyzer.get_summary().items():
print(f" {key}: {value}")
analyzer.plot_variance()
This reusable class encapsulates the complete PCA workflow for easy application to new datasets.
Principal Component Analysis is a fundamental dimensionality reduction technique that transforms correlated features into uncorrelated principal components. By capturing maximum variance in fewer dimensions, PCA enables visualization of high-dimensional data, speeds up model training, and combats the curse of dimensionality. Understanding how to choose the number of components, interpret loadings, and integrate PCA into machine learning pipelines is essential for effective data preprocessing. While PCA has limitations with non-linear relationships, it remains one of the most widely used techniques for dimensionality reduction in machine learning.