Master regularization techniques like Ridge, Lasso, and Elastic Net to reduce overfitting and improve model stability. This lesson explains how these methods handle multicollinearity and enhance regression model performance.
Standard linear regression minimizes the error between predictions and actual values without any constraints on model weights. This can lead to large weight values that cause overfitting, especially with many features or multicollinearity.
Regularization addresses this by adding a penalty term to the cost function that discourages large weights. This technique improves model generalization and produces more robust predictions on unseen data.
Overfitting occurs when a model learns the training data too well, including its noise. Signs of overfitting include:

- High accuracy on training data but noticeably worse accuracy on test data
- Very large coefficient values
- Predictions that change dramatically with small changes in the input data
Regularization adds a penalty for model complexity:

```
Total Cost = Prediction Error + Penalty Term
```
By penalizing large weights, regularization:

- Reduces model variance at the cost of a small increase in bias
- Produces smaller, more stable coefficients
- Mitigates the effects of multicollinearity
- Improves generalization to unseen data
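The total cost above can be sketched numerically. The feature matrix, weights, and alpha below are made-up illustrative values, not data from this lesson; the penalty shown is the Ridge-style (L2) version introduced next:

```python
import numpy as np

# Toy data: 3 samples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
w = np.array([1.5, 1.0])   # candidate weights
alpha = 0.5                # penalty strength

mse = np.mean((X @ w - y) ** 2)          # prediction error
l2_penalty = alpha * np.sum(w ** 2)      # penalty term (L2)
total_cost = mse + l2_penalty

print(f"MSE = {mse}, penalty = {l2_penalty}, total = {total_cost}")
# MSE = 1.5, penalty = 1.625, total = 3.125
```

Larger candidate weights would raise the penalty term, so the minimizer of the total cost is pulled toward smaller weights than the minimizer of the MSE alone.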
Ridge regression adds the sum of squared weights as a penalty term. This is called L2 regularization.
```
Cost = MSE + α × Σ(wᵢ²)
```

Where:

- MSE is the mean squared error between predictions and actual values
- α (alpha) controls the penalty strength: α = 0 gives plain linear regression, larger α means stronger shrinkage
- wᵢ are the model weights (the intercept is not penalized)
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data with multicollinearity
np.random.seed(42)
n_samples = 100

X1 = np.random.randn(n_samples)
X2 = X1 + np.random.randn(n_samples) * 0.1  # Correlated with X1
X3 = np.random.randn(n_samples)
X4 = np.random.randn(n_samples)

X = np.column_stack([X1, X2, X3, X4])
y = 3*X1 + 2*X3 + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
This creates data where features X1 and X2 are highly correlated (multicollinearity), which can cause problems for standard linear regression.
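You can confirm the multicollinearity directly. This sketch regenerates the two correlated columns with the same seed as above and checks their Pearson correlation with `np.corrcoef`:

```python
import numpy as np

np.random.seed(42)
n_samples = 100
X1 = np.random.randn(n_samples)
X2 = X1 + np.random.randn(n_samples) * 0.1  # nearly a copy of X1

corr = np.corrcoef(X1, X2)[0, 1]
print(f"Correlation between X1 and X2: {corr:.3f}")  # very close to 1
```

A correlation this close to 1 means the design matrix is nearly rank-deficient, which is exactly the situation where ordinary least squares coefficients become unstable.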
```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Important: Always scale features before applying regularization. The penalty treats all coefficients equally, so features must be on the same scale for fair comparison.
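As a quick sanity check (on made-up data with deliberately mismatched scales), `StandardScaler` brings every training feature to mean 0 and standard deviation 1, so the penalty weighs each coefficient fairly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales (illustrative values)
X_train = np.column_stack([
    rng.normal(0, 1, 50),       # small-scale feature
    rng.normal(1000, 250, 50),  # large-scale feature
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_train_scaled.std(axis=0).round(6))   # ~[1, 1]
```

Without scaling, the large-scale feature would need a much smaller coefficient to express the same effect, so the penalty would punish the two features unevenly.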
```python
# Standard Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)

# Compare coefficients
print("Linear Regression Coefficients:")
print(linear_model.coef_.round(3))
print("\nRidge Regression Coefficients:")
print(ridge_model.coef_.round(3))
```
Output:

```
Linear Regression Coefficients:
[1.892 1.131 1.987 0.021]

Ridge Regression Coefficients:
[1.524 1.467 1.971 0.018]
```
Notice how Ridge distributes the effect more evenly between correlated features (X1 and X2).
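The shrinkage can also be seen numerically. This sketch rebuilds a small correlated-feature dataset (mirroring the setup above, but standalone and illustrative) and compares the L2 norm of the coefficient vectors; the Ridge norm is never larger than the OLS norm:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n) * 0.1  # highly correlated pair
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n) * 0.5

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coef norm:  ", np.linalg.norm(ols.coef_).round(3))
print("Ridge coef norm:", np.linalg.norm(ridge.coef_).round(3))
```

The OLS fit is free to put a large positive weight on one of the twins and a compensating weight on the other; the penalty makes that arrangement expensive, so Ridge settles on a smaller, more balanced pair.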
```python
# Predictions
linear_pred = linear_model.predict(X_test_scaled)
ridge_pred = ridge_model.predict(X_test_scaled)

# Metrics
print("Linear Regression:")
print(f"  R² Score: {r2_score(y_test, linear_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, linear_pred)):.4f}")
print("\nRidge Regression:")
print(f"  R² Score: {r2_score(y_test, ridge_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, ridge_pred)):.4f}")
```
Ridge regression often performs better on test data when multicollinearity or overfitting is present.
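How strongly Ridge intervenes is governed by alpha. A small sketch (synthetic data and an illustrative alpha grid) shows the coefficient norm shrinking monotonically as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

print([round(v, 3) for v in norms])  # decreasing as alpha increases
```

At very small alpha the fit is essentially ordinary least squares; at very large alpha the coefficients are pushed toward (but never exactly to) zero.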
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute weights as the penalty term. This is called L1 regularization.
```
Cost = MSE + α × Σ|wᵢ|
```

Where:

- α (alpha) controls the penalty strength
- |wᵢ| are the absolute values of the model weights
| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Sum of squared weights | Sum of absolute weights |
| Feature Selection | No (all features kept) | Yes (can zero out features) |
| Multicollinearity | Distributes weight among correlated features | Picks one feature, zeros others |
| Best For | Many small effects | Few important features |
```python
from sklearn.linear_model import Lasso

# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Compare all three models
print("Coefficient Comparison:")
print(f"{'Feature':<10} {'Linear':<10} {'Ridge':<10} {'Lasso':<10}")
print("-" * 40)
for i in range(4):
    print(f"X{i+1:<9} {linear_model.coef_[i]:<10.3f} "
          f"{ridge_model.coef_[i]:<10.3f} {lasso_model.coef_[i]:<10.3f}")
```
Output:

```
Coefficient Comparison:
Feature    Linear     Ridge      Lasso
----------------------------------------
X1         1.892      1.524      2.987
X2         1.131      1.467      0.000
X3         1.987      1.971      1.952
X4         0.021      0.018      0.000
```
Lasso sets the coefficients of X2 and X4 to exactly zero, effectively removing them from the model. This automatic feature selection is Lasso's most distinctive characteristic.
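Because irrelevant coefficients become exactly zero, reading off the selected features is a one-liner. This sketch uses standalone synthetic data in which only the first feature truly drives the target (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(size=100) * 0.5  # only feature 0 matters

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients

print("Coefficients:", lasso.coef_.round(3))
print("Selected feature indices:", selected)
```

The noise features are dropped entirely, while the informative feature keeps a (shrunken) positive coefficient.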
Elastic Net combines both L1 and L2 penalties, offering a balance between Ridge and Lasso regression.
```
Cost = MSE + α × [(1 - l1_ratio) × Σ(wᵢ²)/2 + l1_ratio × Σ|wᵢ|]
```

Where:

- α (alpha) controls the overall penalty strength
- l1_ratio mixes the two penalties: 0 gives a pure L2 (Ridge-style) penalty, 1 gives pure Lasso, and values in between blend them
```python
from sklearn.linear_model import ElasticNet

# Elastic Net with 50% L1, 50% L2
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_model.fit(X_train_scaled, y_train)

# Display coefficients
print("Elastic Net Coefficients:")
print(elastic_model.coef_.round(3))
```
```python
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    results.append({
        'Model': name,
        'R² Score': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'Non-zero Coefs': np.sum(model.coef_ != 0)
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
```
This comparison helps identify which regularization approach works best for your specific dataset.
Scikit-learn provides built-in cross-validation classes that automatically find the optimal alpha value.
```python
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# Define alpha values to test
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Ridge with cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")

# Elastic Net with cross-validation
elastic_cv = ElasticNetCV(
    alphas=alphas,
    l1_ratio=[0.1, 0.5, 0.7, 0.9],
    cv=5
)
elastic_cv.fit(X_train_scaled, y_train)
print(f"Best ElasticNet alpha: {elastic_cv.alpha_}")
print(f"Best ElasticNet l1_ratio: {elastic_cv.l1_ratio_}")
```
Cross-validation tests each alpha value on different data subsets, selecting the value that generalizes best.
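The same search can be written out by hand with `cross_val_score`, which makes the mechanism visible: each candidate alpha is scored on five folds and the best mean score wins. The synthetic data and alpha grid below are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Mean 5-fold R² score for each candidate alpha
mean_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean()
               for a in alphas}
best_alpha = max(mean_scores, key=mean_scores.get)

print(f"Best alpha by 5-fold CV: {best_alpha}")
```

`RidgeCV` with `cv=5` performs essentially this grid search internally and stores the winner in `alpha_`.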
```python
import matplotlib.pyplot as plt

alphas = np.logspace(-4, 4, 100)
ridge_coefs = []
lasso_coefs = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)

    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_coefs.append(lasso.coef_)

# Plot Ridge path
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

# Plot Lasso path
plt.subplot(1, 2, 2)
plt.plot(alphas, lasso_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

plt.tight_layout()
plt.show()
```
This visualization shows how coefficients shrink as alpha increases. Notice that Ridge coefficients approach but never reach zero, while Lasso coefficients become exactly zero at certain alpha values.
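The tail end of the two paths can also be verified without a plot. At a deliberately huge alpha (the value 1000 below is illustrative, on standalone synthetic data), Lasso's coefficients are exactly zero while Ridge's are merely tiny:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

big_alpha = 1000.0
ridge = Ridge(alpha=big_alpha).fit(X, y)
lasso = Lasso(alpha=big_alpha).fit(X, y)

print("Ridge:", ridge.coef_.round(5))  # small but non-zero
print("Lasso:", lasso.coef_)           # exactly zero
```

This difference comes from the penalty shapes: the L1 penalty has a corner at zero that can pin coefficients there, while the smooth L2 penalty only ever scales them down.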
Use Ridge when:

- You expect most features to contribute to the prediction
- Features are highly correlated and you want to keep all of them
- You want stable coefficients rather than feature selection

Use Lasso when:

- You suspect only a few features are truly important
- You want automatic feature selection and a sparser, more interpretable model
- You need to reduce the number of features in the model

Use Elastic Net when:

- You have groups of correlated features but still want some feature selection
- The number of features is large relative to the number of samples
- You are unsure whether Ridge or Lasso fits the problem better
Regularization is essential for building robust machine learning models that generalize well to new data.
Key takeaways:

- Regularization adds a penalty for large weights to the cost function, trading a little bias for lower variance and less overfitting
- Ridge (L2) shrinks all coefficients; Lasso (L1) can set some to exactly zero; Elastic Net combines both penalties
- Always scale features before applying regularization so the penalty treats coefficients fairly
- Tune alpha (and l1_ratio for Elastic Net) with cross-validation rather than guessing