Lesson 3

Ridge, Lasso, and Elastic Net Regularization

Master regularization techniques like Ridge, Lasso, and Elastic Net to reduce overfitting and improve model stability. This lesson explains how these methods handle multicollinearity and enhance regression model performance.


Introduction to Regularization

Standard linear regression minimizes the error between predictions and actual values without any constraints on model weights. This can lead to large weight values that cause overfitting, especially with many features or multicollinearity.

Regularization addresses this by adding a penalty term to the cost function that discourages large weights. This technique improves model generalization and produces more robust predictions on unseen data.


Why Regularization Matters

The Overfitting Problem

Overfitting occurs when a model learns the training data too well, including its noise. Signs of overfitting include:

  • High training accuracy but poor test accuracy
  • Large coefficient values
  • Unstable predictions with small data changes

How Regularization Helps

Regularization adds a penalty for model complexity:

Total Cost = Prediction Error + Penalty Term

By penalizing large weights, regularization:

  • Prevents coefficients from growing too large
  • Reduces model variance
  • Improves performance on new data
  • Handles multicollinearity better
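The decomposition above can be computed directly. A minimal sketch with made-up data, previewing the squared-weight (L2) penalty introduced in the next section:

```python
import numpy as np

def ridge_cost(X, y, w, alpha):
    """Total cost = prediction error (MSE) + L2 penalty on the weights."""
    mse = np.mean((X @ w - y) ** 2)
    penalty = alpha * np.sum(w ** 2)
    return mse + penalty

# Toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.0])

print(ridge_cost(X, y, w, alpha=0.0))  # 0.25 -- pure MSE, no penalty
print(ridge_cost(X, y, w, alpha=1.0))  # 0.5  -- MSE plus the weight penalty
```

The same skeleton gives the Lasso cost by swapping the penalty for `alpha * np.sum(np.abs(w))`.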

Ridge Regression (L2 Regularization)

What is Ridge Regression?

Ridge regression adds the sum of squared weights as a penalty term. This is called L2 regularization.

Ridge Cost Function

Cost = MSE + α × Σ(wᵢ²)

Where:

  • MSE = Mean Squared Error
  • α (alpha) = Regularization strength
  • Σ(wᵢ²) = Sum of squared weights

How Ridge Works

  • Shrinks coefficients toward zero but never makes them exactly zero
  • All features remain in the model
  • Works well when many features have small to medium effects
  • Handles multicollinearity effectively
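For intuition, Ridge also has a closed-form solution: w = (XᵀX + αI)⁻¹Xᵀy. A quick check against scikit-learn on synthetic data (note that sklearn's Ridge penalizes the raw sum of squared errors rather than the MSE, and it does not penalize the intercept, hence fit_intercept=False below):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=100)

alpha = 1.0
# Closed form: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# fit_intercept=False makes the two solutions directly comparable
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

print(np.allclose(w_closed, model.coef_))  # True
```

The αI term added to XᵀX is also why Ridge handles multicollinearity: it keeps the matrix well-conditioned even when features are nearly linearly dependent.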

The Alpha Parameter

  • α = 0: No regularization (standard linear regression)
  • Small α: Weak regularization, closer to standard regression
  • Large α: Strong regularization, coefficients shrink significantly
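A quick sketch (synthetic data) showing how the coefficient magnitudes shrink as α grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Squared L2 norm of the fitted coefficients for increasing alpha
norms = [np.sum(Ridge(alpha=a).fit(X, y).coef_ ** 2) for a in (0.01, 1.0, 100.0)]
print(norms)  # strictly decreasing as alpha grows
```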

Implementing Ridge Regression

Step 1: Prepare Data

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data with multicollinearity
np.random.seed(42)
n_samples = 100

X1 = np.random.randn(n_samples)
X2 = X1 + np.random.randn(n_samples) * 0.1  # Correlated with X1
X3 = np.random.randn(n_samples)
X4 = np.random.randn(n_samples)

X = np.column_stack([X1, X2, X3, X4])
y = 3*X1 + 2*X3 + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

This creates data where features X1 and X2 are highly correlated (multicollinearity), which can cause problems for standard linear regression.

Step 2: Scale Features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Important: Always scale features before applying regularization. The penalty treats all coefficients equally, but a feature measured on a large scale ends up with a small coefficient and is therefore barely penalized; standardizing puts every feature, and hence every coefficient, on a comparable footing.
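A convenient way to guarantee scaling always happens before the regularized fit is a scikit-learn pipeline. A sketch with made-up data on wildly different scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Three features on very different scales (~1, ~100, ~0.01)
X = rng.normal(size=(80, 3)) * np.array([1.0, 100.0, 0.01])
y = X[:, 0] + 0.02 * X[:, 1] + rng.normal(scale=0.1, size=80)

# The scaler is fit inside the pipeline, so cross-validation or a later
# train/test split never leaks test-set statistics into the scaling step
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))
```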

Step 3: Compare Linear and Ridge Regression

# Standard Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)

# Compare coefficients
print("Linear Regression Coefficients:")
print(linear_model.coef_.round(3))
print("\nRidge Regression Coefficients:")
print(ridge_model.coef_.round(3))

Output:

Linear Regression Coefficients:
[1.892 1.131 1.987 0.021]

Ridge Regression Coefficients:
[1.524 1.467 1.971 0.018]

Notice how Ridge distributes the effect more evenly between correlated features (X1 and X2).

Step 4: Evaluate Performance

# Predictions
linear_pred = linear_model.predict(X_test_scaled)
ridge_pred = ridge_model.predict(X_test_scaled)

# Metrics
print("Linear Regression:")
print(f"  R² Score: {r2_score(y_test, linear_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, linear_pred)):.4f}")

print("\nRidge Regression:")
print(f"  R² Score: {r2_score(y_test, ridge_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, ridge_pred)):.4f}")

Ridge regression often performs better on test data when multicollinearity or overfitting is present.


Lasso Regression (L1 Regularization)

What is Lasso Regression?

Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute weights as the penalty term. This is called L1 regularization.

Lasso Cost Function

Cost = MSE + α × Σ|wᵢ|

Where:

  • Σ|wᵢ| = Sum of absolute values of weights

How Lasso Works

  • Shrinks coefficients toward zero
  • Can reduce coefficients to exactly zero
  • Automatically performs feature selection
  • Produces sparse models (fewer non-zero coefficients)

Lasso vs Ridge

Aspect               Ridge (L2)                                    Lasso (L1)
-----------------------------------------------------------------------------
Penalty              Sum of squared weights                        Sum of absolute weights
Feature selection    No (all features kept)                        Yes (can zero out features)
Multicollinearity    Distributes weight among correlated features  Picks one feature, zeros the others
Best for             Many small effects                            Few important features

Implementing Lasso Regression

from sklearn.linear_model import Lasso

# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Compare all three models
print("Coefficient Comparison:")
print(f"{'Feature':<10} {'Linear':<10} {'Ridge':<10} {'Lasso':<10}")
print("-" * 40)

for i in range(4):
    print(f"X{i+1:<9} {linear_model.coef_[i]:<10.3f} "
          f"{ridge_model.coef_[i]:<10.3f} {lasso_model.coef_[i]:<10.3f}")

Output:

Coefficient Comparison:
Feature    Linear     Ridge      Lasso     
----------------------------------------
X1         1.892      1.524      2.987     
X2         1.131      1.467      0.000     
X3         1.987      1.971      1.952     
X4         0.021      0.018      0.000     

Lasso sets the coefficients of X2 and X4 to exactly zero, effectively removing them from the model. This automatic feature selection is Lasso's most distinctive characteristic.
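Because zeroed coefficients drop features entirely, you can read the selected feature set straight off the fitted model. A sketch on synthetic data where only features 0 and 2 carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
# Only features 0 and 2 actually influence the target
y = 4.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.3, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of surviving features
print(selected)
```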


Elastic Net: Combining L1 and L2

What is Elastic Net?

Elastic Net combines both L1 and L2 penalties, offering a balance between Ridge and Lasso regression.

Elastic Net Cost Function

Cost = MSE + α × [(1-l1_ratio) × Σ(wᵢ²)/2 + l1_ratio × Σ|wᵢ|]

Where:

  • l1_ratio = Balance between L1 and L2 (0 to 1)
  • l1_ratio = 0: Pure Ridge regression
  • l1_ratio = 1: Pure Lasso regression
  • l1_ratio = 0.5: Equal mix of both
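As a sanity check, at l1_ratio=1.0 the Elastic Net objective collapses to the Lasso objective, and scikit-learn's two estimators agree (synthetic data; for the l1_ratio=0 extreme, scikit-learn recommends using Ridge directly, since the coordinate-descent solver handles the pure-L2 case poorly):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.2, size=60)

# l1_ratio=1.0 puts all of the penalty on the L1 term -- pure Lasso
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.allclose(enet.coef_, lasso.coef_))  # True
```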

When to Use Elastic Net

  • When you have many correlated features
  • When you want feature selection but also want to keep correlated features together
  • When Lasso selects too few features
  • As a robust default choice

Implementing Elastic Net

from sklearn.linear_model import ElasticNet

# Elastic Net with 50% L1, 50% L2
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_model.fit(X_train_scaled, y_train)

# Display coefficients
print("Elastic Net Coefficients:")
print(elastic_model.coef_.round(3))

Comparing All Regularization Methods

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = []

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    results.append({
        'Model': name,
        'R² Score': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'Non-zero Coefs': np.sum(model.coef_ != 0)
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

This comparison helps identify which regularization approach works best for your specific dataset.


Tuning the Alpha Parameter

Using Cross-Validation

Scikit-learn provides built-in cross-validation classes that automatically find the optimal alpha value.

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# Define alpha values to test
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Ridge with cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")

# Elastic Net with cross-validation
elastic_cv = ElasticNetCV(
    alphas=alphas, 
    l1_ratio=[0.1, 0.5, 0.7, 0.9], 
    cv=5
)
elastic_cv.fit(X_train_scaled, y_train)
print(f"Best ElasticNet alpha: {elastic_cv.alpha_}")
print(f"Best ElasticNet l1_ratio: {elastic_cv.l1_ratio_}")

Cross-validation tests each alpha value on different data subsets, selecting the value that generalizes best.


Visualizing Regularization Effects

Coefficient Path Plot

import matplotlib.pyplot as plt

alphas = np.logspace(-4, 4, 100)
ridge_coefs = []
lasso_coefs = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
    
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_coefs.append(lasso.coef_)

# Plot Ridge path
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

plt.subplot(1, 2, 2)
plt.plot(alphas, lasso_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

plt.tight_layout()
plt.show()

This visualization shows how coefficients shrink as alpha increases. Notice that Ridge coefficients approach but never reach zero, while Lasso coefficients become exactly zero at certain alpha values.


Choosing the Right Regularization Method

Decision Guide

Use Ridge when:

  • You want to keep all features
  • Features are correlated
  • You believe most features contribute to the prediction

Use Lasso when:

  • You suspect only a few features matter
  • You want automatic feature selection
  • Interpretability is important

Use Elastic Net when:

  • You have many correlated features
  • Lasso is too aggressive in feature selection
  • You want the benefits of both methods

Summary

Regularization is essential for building robust machine learning models that generalize well to new data.

Key takeaways:

  • Ridge regression (L2) adds squared weights penalty, shrinking coefficients without eliminating them
  • Lasso regression (L1) adds absolute weights penalty, enabling automatic feature selection
  • Elastic Net combines L1 and L2, offering flexibility between the two approaches
  • The alpha parameter controls regularization strength
  • Always scale features before applying regularization
  • Use cross-validation to find optimal hyperparameters
  • Choose the method based on your data characteristics and modeling goals