Master regularization techniques like Ridge, Lasso, and Elastic Net to reduce overfitting and improve model stability. This lesson explains how these methods handle multicollinearity and enhance regression model performance.
Standard linear regression minimizes the error between predictions and actual values without any constraints on model weights. This can lead to large weight values that cause overfitting, especially with many features or multicollinearity.
Regularization addresses this by adding a penalty term to the cost function that discourages large weights. This technique improves model generalization and produces more robust predictions on unseen data.
Overfitting occurs when a model learns the training data too well, including its noise. Signs of overfitting include:

- High accuracy on training data but noticeably worse accuracy on test data
- Very large coefficient values
- Predictions that change dramatically with small changes in the input data
Regularization adds a penalty for model complexity:

```
Total Cost = Prediction Error + Penalty Term
```
By penalizing large weights, regularization:

- Reduces model variance at the cost of a small increase in bias
- Produces smaller, more stable coefficients
- Mitigates the effects of multicollinearity
- Improves generalization to unseen data
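The total cost above can be sketched numerically. The feature matrix, weights, and alpha below are made-up illustrative values, not data from this lesson; the penalty shown is the Ridge-style (L2) version introduced next:

```python
import numpy as np

# Toy data: 3 samples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
w = np.array([1.5, 1.0])   # candidate weights
alpha = 0.5                # penalty strength

mse = np.mean((X @ w - y) ** 2)          # prediction error
l2_penalty = alpha * np.sum(w ** 2)      # penalty term (L2)
total_cost = mse + l2_penalty

print(f"MSE = {mse}, penalty = {l2_penalty}, total = {total_cost}")
# MSE = 1.5, penalty = 1.625, total = 3.125
```

Larger candidate weights would raise the penalty term, so the minimizer of the total cost is pulled toward smaller weights than the minimizer of the MSE alone.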
Ridge regression adds the sum of squared weights as a penalty term. This is called L2 regularization.
```
Cost = MSE + α × Σ(wᵢ²)
```

Where:

- MSE is the mean squared error between predictions and actual values
- α (alpha) controls the penalty strength: α = 0 gives plain linear regression, larger α means stronger shrinkage
- wᵢ are the model weights (the intercept is not penalized)
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data with multicollinearity
np.random.seed(42)
n_samples = 100

X1 = np.random.randn(n_samples)
X2 = X1 + np.random.randn(n_samples) * 0.1  # Correlated with X1
X3 = np.random.randn(n_samples)
X4 = np.random.randn(n_samples)

X = np.column_stack([X1, X2, X3, X4])
y = 3*X1 + 2*X3 + np.random.randn(n_samples) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
This creates data where features X1 and X2 are highly correlated (multicollinearity), which can cause problems for standard linear regression.
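You can confirm the multicollinearity directly. This sketch regenerates the two correlated columns with the same seed as above and checks their Pearson correlation with `np.corrcoef`:

```python
import numpy as np

np.random.seed(42)
n_samples = 100
X1 = np.random.randn(n_samples)
X2 = X1 + np.random.randn(n_samples) * 0.1  # nearly a copy of X1

corr = np.corrcoef(X1, X2)[0, 1]
print(f"Correlation between X1 and X2: {corr:.3f}")  # very close to 1
```

A correlation this close to 1 means the design matrix is nearly rank-deficient, which is exactly the situation where ordinary least squares coefficients become unstable.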
```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Important: Always scale features before applying regularization. The penalty treats all coefficients equally, so features must be on the same scale for fair comparison.
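As a quick sanity check (on made-up data with deliberately mismatched scales), `StandardScaler` brings every training feature to mean 0 and standard deviation 1, so the penalty weighs each coefficient fairly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales (illustrative values)
X_train = np.column_stack([
    rng.normal(0, 1, 50),       # small-scale feature
    rng.normal(1000, 250, 50),  # large-scale feature
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_train_scaled.std(axis=0).round(6))   # ~[1, 1]
```

Without scaling, the large-scale feature would need a much smaller coefficient to express the same effect, so the penalty would punish the two features unevenly.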
```python
# Standard Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)

# Compare coefficients
print("Linear Regression Coefficients:")
print(linear_model.coef_.round(3))
print("\nRidge Regression Coefficients:")
print(ridge_model.coef_.round(3))
```
Output:

```
Linear Regression Coefficients:
[1.892 1.131 1.987 0.021]

Ridge Regression Coefficients:
[1.524 1.467 1.971 0.018]
```
Notice how Ridge distributes the effect more evenly between correlated features (X1 and X2).
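The shrinkage can also be seen numerically. This sketch rebuilds a small correlated-feature dataset (mirroring the setup above, but standalone and illustrative) and compares the L2 norm of the coefficient vectors; the Ridge norm is never larger than the OLS norm:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n) * 0.1  # highly correlated pair
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=n) * 0.5

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coef norm:  ", np.linalg.norm(ols.coef_).round(3))
print("Ridge coef norm:", np.linalg.norm(ridge.coef_).round(3))
```

The OLS fit is free to put a large positive weight on one of the twins and a compensating weight on the other; the penalty makes that arrangement expensive, so Ridge settles on a smaller, more balanced pair.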
```python
# Predictions
linear_pred = linear_model.predict(X_test_scaled)
ridge_pred = ridge_model.predict(X_test_scaled)

# Metrics
print("Linear Regression:")
print(f"  R² Score: {r2_score(y_test, linear_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, linear_pred)):.4f}")
print("\nRidge Regression:")
print(f"  R² Score: {r2_score(y_test, ridge_pred):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, ridge_pred)):.4f}")
```
Ridge regression often performs better on test data when multicollinearity or overfitting is present.
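How strongly Ridge intervenes is governed by alpha. A small sketch (synthetic data and an illustrative alpha grid) shows the coefficient norm shrinking monotonically as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

print([round(v, 3) for v in norms])  # decreasing as alpha increases
```

At very small alpha the fit is essentially ordinary least squares; at very large alpha the coefficients are pushed toward (but never exactly to) zero.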
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute weights as the penalty term. This is called L1 regularization.
```
Cost = MSE + α × Σ|wᵢ|
```

Where:

- α (alpha) controls the penalty strength
- |wᵢ| are the absolute values of the model weights
| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Sum of squared weights | Sum of absolute weights |
| Feature Selection | No (all features kept) | Yes (can zero out features) |
| Multicollinearity | Distributes weight among correlated features | Picks one feature, zeros others |
| Best For | Many small effects | Few important features |
```python
from sklearn.linear_model import Lasso

# Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)

# Compare all three models
print("Coefficient Comparison:")
print(f"{'Feature':<10} {'Linear':<10} {'Ridge':<10} {'Lasso':<10}")
print("-" * 40)
for i in range(4):
    print(f"X{i+1:<9} {linear_model.coef_[i]:<10.3f} "
          f"{ridge_model.coef_[i]:<10.3f} {lasso_model.coef_[i]:<10.3f}")
```
Output:

```
Coefficient Comparison:
Feature    Linear     Ridge      Lasso
----------------------------------------
X1         1.892      1.524      2.987
X2         1.131      1.467      0.000
X3         1.987      1.971      1.952
X4         0.021      0.018      0.000
```
Lasso sets the coefficients of X2 and X4 to exactly zero, effectively removing them from the model. This automatic feature selection is Lasso's most distinctive characteristic.
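Because irrelevant coefficients become exactly zero, reading off the selected features is a one-liner. This sketch uses standalone synthetic data in which only the first feature truly drives the target (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(size=100) * 0.5  # only feature 0 matters

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients

print("Coefficients:", lasso.coef_.round(3))
print("Selected feature indices:", selected)
```

The noise features are dropped entirely, while the informative feature keeps a (shrunken) positive coefficient.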
Elastic Net combines both L1 and L2 penalties, offering a balance between Ridge and Lasso regression.
```
Cost = MSE + α × [(1 - l1_ratio) × Σ(wᵢ²)/2 + l1_ratio × Σ|wᵢ|]
```

Where:

- α (alpha) controls the overall penalty strength
- l1_ratio mixes the two penalties: 0 gives a pure L2 (Ridge-style) penalty, 1 gives pure Lasso, and values in between blend them
```python
from sklearn.linear_model import ElasticNet

# Elastic Net with 50% L1, 50% L2
elastic_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_model.fit(X_train_scaled, y_train)

# Display coefficients
print("Elastic Net Coefficients:")
print(elastic_model.coef_.round(3))
```
```python
models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    results.append({
        'Model': name,
        'R² Score': r2_score(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'Non-zero Coefs': np.sum(model.coef_ != 0)
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
```
This comparison helps identify which regularization approach works best for your specific dataset.
Scikit-learn provides built-in cross-validation classes that automatically find the optimal alpha value.
```python
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# Define alpha values to test
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Ridge with cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")

# Elastic Net with cross-validation
elastic_cv = ElasticNetCV(
    alphas=alphas,
    l1_ratio=[0.1, 0.5, 0.7, 0.9],
    cv=5
)
elastic_cv.fit(X_train_scaled, y_train)
print(f"Best ElasticNet alpha: {elastic_cv.alpha_}")
print(f"Best ElasticNet l1_ratio: {elastic_cv.l1_ratio_}")
```
Cross-validation tests each alpha value on different data subsets, selecting the value that generalizes best.
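The same search can be written out by hand with `cross_val_score`, which makes the mechanism visible: each candidate alpha is scored on five folds and the best mean score wins. The synthetic data and alpha grid below are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Mean 5-fold R² score for each candidate alpha
mean_scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean()
               for a in alphas}
best_alpha = max(mean_scores, key=mean_scores.get)

print(f"Best alpha by 5-fold CV: {best_alpha}")
```

`RidgeCV` with `cv=5` performs essentially this grid search internally and stores the winner in `alpha_`.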
```python
import matplotlib.pyplot as plt

alphas = np.logspace(-4, 4, 100)
ridge_coefs = []
lasso_coefs = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)

    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    lasso_coefs.append(lasso.coef_)

# Plot Ridge path
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

# Plot Lasso path
plt.subplot(1, 2, 2)
plt.plot(alphas, lasso_coefs)
plt.xscale('log')
plt.xlabel('Alpha (log scale)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Coefficient Path')
plt.legend(['X1', 'X2', 'X3', 'X4'])

plt.tight_layout()
plt.show()
```
This visualization shows how coefficients shrink as alpha increases. Notice that Ridge coefficients approach but never reach zero, while Lasso coefficients become exactly zero at certain alpha values.
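The tail end of the two paths can also be verified without a plot. At a deliberately huge alpha (the value 1000 below is illustrative, on standalone synthetic data), Lasso's coefficients are exactly zero while Ridge's are merely tiny:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 2.0, 0.0, 0.0]) + rng.normal(size=100) * 0.5

big_alpha = 1000.0
ridge = Ridge(alpha=big_alpha).fit(X, y)
lasso = Lasso(alpha=big_alpha).fit(X, y)

print("Ridge:", ridge.coef_.round(5))  # small but non-zero
print("Lasso:", lasso.coef_)           # exactly zero
```

This difference comes from the penalty shapes: the L1 penalty has a corner at zero that can pin coefficients there, while the smooth L2 penalty only ever scales them down.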
Use Ridge when:

- You expect most features to contribute to the prediction
- Features are highly correlated and you want to keep all of them
- You want stable coefficients rather than feature selection

Use Lasso when:

- You suspect only a few features are truly important
- You want automatic feature selection and a sparser, more interpretable model
- You need to reduce the number of features in the model

Use Elastic Net when:

- You have groups of correlated features but still want some feature selection
- The number of features is large relative to the number of samples
- You are unsure whether Ridge or Lasso fits the problem better
Regularization is essential for building robust machine learning models that generalize well to new data.
Key takeaways:

- Regularization adds a penalty for large weights to the cost function, trading a little bias for lower variance and less overfitting
- Ridge (L2) shrinks all coefficients; Lasso (L1) can set some to exactly zero; Elastic Net combines both penalties
- Always scale features before applying regularization so the penalty treats coefficients fairly
- Tune alpha (and l1_ratio for Elastic Net) with cross-validation rather than guessing