Dimensionality reduction with PCA (Principal Component Analysis) is a technique used to simplify large datasets by converting many features into a smaller set of important components. PCA reduces noise, improves model performance, and speeds up processing while preserving the most meaningful patterns and variability in the data.
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while retaining the most important information. In machine learning, datasets often contain many features, some of which may be redundant or correlated. Reducing dimensions helps remove noise, speed up training, reduce the risk of overfitting, and make the data easier to visualize.
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms correlated features into a set of uncorrelated variables called principal components. These components are ordered by the amount of variance they explain.
Imagine plotting height and weight data in 2D. These variables are correlated—taller people tend to weigh more. PCA finds a new axis (the first principal component) that captures this relationship, pointing in the direction of maximum variance. The second component is perpendicular to the first, capturing remaining variance.
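This intuition can be made concrete with plain NumPy: the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue (the variance along each direction). A minimal sketch on synthetic two-variable data (separate from the running example below):

```python
import numpy as np

# Synthetic "height/weight"-style data: two correlated variables
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=100)])

# Center the data and eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh returns ascending eigenvalues; reverse so PC1 (max variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Share of variance along PC1:", eigenvalues[0] / eigenvalues.sum())
```

Because the two variables are strongly correlated, the first eigenvector captures nearly all the variance, and the second (perpendicular) direction captures the small remainder. Scikit-learn's PCA, used below, performs this same decomposition internally.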
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Create sample data with correlated features
np.random.seed(42)
n_samples = 200
# Generate correlated features
feature_1 = np.random.randn(n_samples)
feature_2 = feature_1 * 0.8 + np.random.randn(n_samples) * 0.3
feature_3 = feature_1 * 0.5 + np.random.randn(n_samples) * 0.5
feature_4 = np.random.randn(n_samples) # Independent feature
data = pd.DataFrame({
    'feature_1': feature_1,
    'feature_2': feature_2,
    'feature_3': feature_3,
    'feature_4': feature_4
})
print("Original Data Shape:", data.shape)
print("\nCorrelation Matrix:")
print(data.corr().round(2))
This creates sample data with some correlated features to demonstrate how PCA handles redundancy.
# Step 1: Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print("\nScaled Data Statistics:")
print(f"Mean: {data_scaled.mean(axis=0).round(4)}")
print(f"Std: {data_scaled.std(axis=0).round(4)}")
PCA is sensitive to feature scales: without standardization, features with larger numeric ranges dominate the principal components.
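To see the effect, compare the variance ratios PCA reports on synthetic data where one feature's scale dwarfs the other's (a quick illustration, separate from the running example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent features on wildly different scales
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(size=200),              # scale ~1
    rng.normal(scale=100.0, size=200)  # scale ~100
])

# Without scaling, PC1 simply tracks the large-scale feature
pca_raw = PCA().fit(raw)
print("Unscaled:", pca_raw.explained_variance_ratio_.round(3))

# After standardization, both features contribute comparably
pca_scaled = PCA().fit(StandardScaler().fit_transform(raw))
print("Scaled:  ", pca_scaled.explained_variance_ratio_.round(3))
```

On the unscaled data PC1 explains essentially all the variance, not because the large-scale feature is more informative, but simply because its numbers are bigger.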
# Step 2: Apply PCA
pca = PCA()
data_pca = pca.fit_transform(data_scaled)
print("\nPCA Results:")
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Cumulative Variance: {np.cumsum(pca.explained_variance_ratio_).round(4)}")
explained_variance_ratio_ shows the proportion of variance captured by each component. The cumulative sum reveals how many components are needed to explain a target percentage of variance.
# Scree plot
plt.figure(figsize=(12, 4))
# Individual explained variance
plt.subplot(1, 2, 1)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
# Cumulative explained variance
plt.subplot(1, 2, 2)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), 'bo-')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance Explained')
plt.legend()
plt.tight_layout()
plt.show()
The scree plot helps determine the optimal number of components. Look for the "elbow" where adding more components provides diminishing returns.
# Method 1: Specify exact number of components
pca_2 = PCA(n_components=2)
data_2d = pca_2.fit_transform(data_scaled)
print(f"Shape after PCA (n=2): {data_2d.shape}")
print(f"Variance explained: {sum(pca_2.explained_variance_ratio_):.2%}")
# Method 2: Specify variance threshold
pca_95 = PCA(n_components=0.95) # Keep 95% of variance
data_95 = pca_95.fit_transform(data_scaled)
print(f"\nComponents for 95% variance: {pca_95.n_components_}")
print(f"Shape: {data_95.shape}")
Setting n_components to a float between 0 and 1 automatically selects enough components to explain that proportion of variance.
# Get loadings (feature contributions to each component)
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(len(pca.components_))],
    index=data.columns
)
print("PCA Loadings (Feature Contributions):")
print(loadings.round(3))
Loadings show how much each original feature contributes to each principal component. Large absolute values indicate strong contributions.
# Visualize loadings (pass figsize to DataFrame.plot, which creates its own figure)
loadings_plot = loadings.iloc[:, :2]  # First 2 components
loadings_plot.plot(kind='bar', figsize=(10, 6))
plt.title('Feature Loadings for First Two Principal Components')
plt.xlabel('Original Features')
plt.ylabel('Loading Value')
plt.legend(title='Component')
plt.axhline(y=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
PCA is commonly used to visualize high-dimensional data in 2D or 3D.
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Standardize and apply PCA
X_scaled = StandardScaler().fit_transform(X)
pca_vis = PCA(n_components=2)
X_pca = pca_vis.fit_transform(X_scaled)
# Create visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
                      cmap='viridis', alpha=0.7, edgecolors='black')
plt.xlabel(f'PC1 ({pca_vis.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_vis.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Iris Dataset - PCA Visualization')
plt.colorbar(scatter, label='Species')
plt.show()
print(f"Total variance explained: {sum(pca_vis.explained_variance_ratio_):.1%}")
This visualization projects the 4-dimensional Iris data to 2D while preserving the most information possible.
from mpl_toolkits.mplot3d import Axes3D
# Apply PCA with 3 components
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)
# Create 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca_3d[:, 0], X_pca_3d[:, 1], X_pca_3d[:, 2],
                     c=y, cmap='viridis', alpha=0.7)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('Iris Dataset - 3D PCA Visualization')
plt.colorbar(scatter, label='Species', shrink=0.5)
plt.show()
print(f"Total variance explained (3D): {sum(pca_3d.explained_variance_ratio_):.1%}")
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Create pipeline with PCA
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Train and evaluate
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Training Accuracy: {train_score:.3f}")
print(f"Test Accuracy: {test_score:.3f}")
# Compare with full features
pipeline_full = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
pipeline_full.fit(X_train, y_train)
full_train = pipeline_full.score(X_train, y_train)
full_test = pipeline_full.score(X_test, y_test)
print(f"\nWithout PCA - Train: {full_train:.3f}, Test: {full_test:.3f}")
print(f"With PCA (2 components) - Train: {train_score:.3f}, Test: {test_score:.3f}")
This comparison shows the trade-off between dimensionality reduction and model performance.
from sklearn.model_selection import cross_val_score
# Test different numbers of components
results = []
for n_comp in range(1, X.shape[1] + 1):
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_comp)),
        ('classifier', LogisticRegression(max_iter=1000))
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    results.append({
        'n_components': n_comp,
        'mean_score': scores.mean(),
        'std_score': scores.std()
    })
results_df = pd.DataFrame(results)
print("Cross-Validation Results:")
print(results_df)
# Plot results
plt.figure(figsize=(8, 5))
plt.errorbar(results_df['n_components'], results_df['mean_score'],
             yerr=results_df['std_score'], marker='o', capsize=5)
plt.xlabel('Number of Components')
plt.ylabel('Cross-Validation Score')
plt.title('PCA Components vs Model Performance')
plt.show()
This analysis helps determine the minimum number of components needed for good model performance.
PCA transformations can be reversed to reconstruct approximations of original data.
# Apply PCA with 2 components
pca_reconstruct = PCA(n_components=2)
X_reduced = pca_reconstruct.fit_transform(X_scaled)
# Reconstruct data
X_reconstructed = pca_reconstruct.inverse_transform(X_reduced)
# Calculate reconstruction error
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Mean Squared Reconstruction Error: {reconstruction_error:.4f}")
# Compare original and reconstructed
comparison = pd.DataFrame({
    'Original': X_scaled[0],
    'Reconstructed': X_reconstructed[0],
    'Difference': X_scaled[0] - X_reconstructed[0]
}, index=iris.feature_names)
print("\nFirst Sample Comparison:")
print(comparison.round(3))
Lower reconstruction error indicates better preservation of original data information.
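Reconstruction error also offers another way to judge how many components to keep. A short sketch, recomputing the standardized Iris data, that tracks the error as components are added:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same standardized Iris data as above
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Error falls as more components are kept, reaching (numerically)
# zero when all four components are retained
errors = []
for k in range(1, X_scaled.shape[1] + 1):
    pca_k = PCA(n_components=k)
    reconstructed = pca_k.inverse_transform(pca_k.fit_transform(X_scaled))
    errors.append(np.mean((X_scaled - reconstructed) ** 2))
    print(f"{k} components: MSE = {errors[-1]:.4f}")
```

The error at each step equals the average variance of the discarded components, so the curve mirrors the cumulative-variance plot from earlier.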
# Kernel PCA for non-linear relationships
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(X_scaled)
print("Kernel PCA applied successfully")
print(f"Transformed shape: {X_kpca.shape}")
Kernel PCA can capture non-linear patterns using different kernel functions.
class PCAAnalyzer:
    def __init__(self, variance_threshold=0.95):
        self.variance_threshold = variance_threshold
        self.scaler = StandardScaler()
        self.pca = None

    def fit(self, X):
        # Scale data
        X_scaled = self.scaler.fit_transform(X)
        # Fit PCA with variance threshold
        self.pca = PCA(n_components=self.variance_threshold)
        self.pca.fit(X_scaled)
        return self

    def transform(self, X):
        X_scaled = self.scaler.transform(X)
        return self.pca.transform(X_scaled)

    def get_summary(self):
        return {
            'n_components': self.pca.n_components_,
            'explained_variance': self.pca.explained_variance_ratio_,
            'total_variance': sum(self.pca.explained_variance_ratio_)
        }

    def plot_variance(self):
        plt.figure(figsize=(10, 4))
        plt.subplot(1, 2, 1)
        plt.bar(range(1, len(self.pca.explained_variance_ratio_) + 1),
                self.pca.explained_variance_ratio_)
        plt.xlabel('Component')
        plt.ylabel('Variance Ratio')
        plt.title('Explained Variance per Component')
        plt.subplot(1, 2, 2)
        plt.plot(range(1, len(self.pca.explained_variance_ratio_) + 1),
                 np.cumsum(self.pca.explained_variance_ratio_), 'bo-')
        plt.axhline(y=self.variance_threshold, color='r', linestyle='--')
        plt.xlabel('Number of Components')
        plt.ylabel('Cumulative Variance')
        plt.title('Cumulative Explained Variance')
        plt.tight_layout()
        plt.show()
# Usage
analyzer = PCAAnalyzer(variance_threshold=0.95)
analyzer.fit(X)
X_transformed = analyzer.transform(X)
print("PCA Summary:")
for key, value in analyzer.get_summary().items():
print(f" {key}: {value}")
analyzer.plot_variance()
This reusable class encapsulates the complete PCA workflow for easy application to new datasets.
Principal Component Analysis is a fundamental dimensionality reduction technique that transforms correlated features into uncorrelated principal components. By capturing maximum variance in fewer dimensions, PCA enables visualization of high-dimensional data, speeds up model training, and combats the curse of dimensionality. Understanding how to choose the number of components, interpret loadings, and integrate PCA into machine learning pipelines is essential for effective data preprocessing. While PCA has limitations with non-linear relationships, it remains one of the most widely used techniques for dimensionality reduction in machine learning.