Explore Multiple and Polynomial Regression techniques to capture complex patterns in data. This lesson teaches how to model multiple features and nonlinear relationships for more powerful and flexible predictions.
Real-world prediction problems rarely depend on just one variable. House prices depend on size, location, bedrooms, and age. Salary depends on experience, education, and skills. Multiple regression handles these multi-feature scenarios effectively.
Similarly, many relationships in nature aren't perfectly linear. Polynomial regression captures curved patterns that simple linear models miss. Together, these techniques significantly expand your predictive modeling capabilities.
Multiple linear regression extends simple linear regression to include multiple input features. Instead of fitting a line, it fits a hyperplane through multi-dimensional space.
ŷ = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + ... + wₙxₙ
Here ŷ is the predicted value, w₀ is the intercept, and each wᵢ is the weight for feature xᵢ. Each weight represents the change in the prediction for a one-unit increase in its feature, assuming all other features remain constant.
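The formula above is just an intercept plus a dot product. A minimal numeric sketch, using hypothetical weights and feature values (not learned from any data), makes this concrete:

```python
import numpy as np

# Hypothetical weights for illustration only: intercept and one weight per feature
w0 = 85.0
w = np.array([0.12, 25.0, -4.0])   # size_sqft, bedrooms, age_years
x = np.array([2000, 3, 7])         # a single house's feature values

# The multiple regression prediction is the intercept plus a dot product
y_hat = w0 + np.dot(w, x)
print(y_hat)
```

With these made-up numbers the prediction is 85 + 240 + 75 − 28 = 372, i.e. $372k under the lesson's price units.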
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Create sample dataset
data = {
'size_sqft': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450],
'bedrooms': [3, 3, 2, 4, 2, 3, 4, 4],
'age_years': [10, 15, 20, 12, 25, 8, 5, 3],
'price': [245, 312, 279, 308, 199, 325, 405, 450]
}
df = pd.DataFrame(data)
This creates a dataset with multiple features (square footage, bedrooms, age) to predict house prices.
X = df[['size_sqft', 'bedrooms', 'age_years']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
The features matrix X contains all input variables, while y holds the target variable (price).
model = LinearRegression()
model.fit(X_train, y_train)
# View coefficients
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_
})
print(feature_importance)
print(f"\nIntercept: {model.intercept_:.2f}")
Output:
Feature Coefficient
0 size_sqft 0.12
1 bedrooms 25.40
2 age_years -3.85
Intercept: 85.23
This reveals that, in this model, each additional square foot adds about $0.12k to the predicted price, each additional bedroom adds about $25.40k, and each year of age reduces the prediction by about $3.85k, with a baseline intercept of $85.23k.
y_pred = model.predict(X_test)
# Predict price for a new house (a DataFrame keeps feature names consistent with training)
new_house = pd.DataFrame([[2000, 3, 7]], columns=X.columns)
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.2f}k")
The model combines all feature values with their respective weights to generate the final prediction.
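You can verify this "weights times features plus intercept" arithmetic yourself. A self-contained sketch (with a small made-up dataset of the same shape as the lesson's) shows that a manual dot product reproduces `model.predict` exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small illustrative dataset: size_sqft, bedrooms, age_years -> price ($k)
X = np.array([[1400, 3, 10], [1600, 3, 15], [2350, 4, 5], [1100, 2, 25]])
y = np.array([245, 312, 405, 199])

model = LinearRegression().fit(X, y)

# A prediction is just intercept + sum(coef_i * feature_i)
new_house = np.array([2000, 3, 7])
manual = model.intercept_ + np.dot(model.coef_, new_house)
sklearn_pred = model.predict(new_house.reshape(1, -1))[0]
print(manual, sklearn_pred)  # the two values agree
```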
Understanding which features matter most helps in feature selection and model interpretation.
# Calculate absolute importance
importance = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_,
'Abs_Importance': np.abs(model.coef_)
})
importance = importance.sort_values('Abs_Importance', ascending=False)
print(importance)
Note: When features have different scales, coefficients alone don't indicate true importance. Feature scaling (covered in preprocessing) allows fair comparison.
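As a preview of that idea, here is a minimal sketch using scikit-learn's `StandardScaler` on the lesson's house data. After standardizing, every feature has mean 0 and standard deviation 1, so coefficient magnitudes become directly comparable:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# size_sqft is on a much larger scale than the other features
X = pd.DataFrame({
    'size_sqft': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450],
    'bedrooms':  [3, 3, 2, 4, 2, 3, 4, 4],
    'age_years': [10, 15, 20, 12, 25, 8, 5, 3],
})
y = [245, 312, 279, 308, 199, 325, 405, 450]

# Standardize, then fit on the scaled features
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Coefficient magnitudes are now comparable across features
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
```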
Polynomial regression models non-linear relationships by adding polynomial terms (squared, cubed, etc.) of the original features. Despite the curved fit, it remains a linear model because it's linear in its coefficients.
For a single feature with degree 2:
ŷ = w₀ + w₁x + w₂x²
For degree 3:
ŷ = w₀ + w₁x + w₂x² + w₃x³
Higher degrees allow the model to capture more complex curves.
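To see why this is still "linear", note that once you compute the powers of x as separate features, the prediction is an ordinary linear combination. A tiny sketch with arbitrary degree-3 weights (made up for demonstration):

```python
import numpy as np

# Arbitrary illustrative weights w0..w3 (not fitted to any data)
w = np.array([2.0, 3.0, -0.5, 0.1])
x = 2.0

# Build the polynomial feature vector [1, x, x^2, x^3];
# the prediction is then a plain dot product, linear in the weights
features = np.array([x**d for d in range(4)])
y_hat = np.dot(w, features)
print(y_hat)
```

Here y_hat = 2 + 3(2) − 0.5(4) + 0.1(8) = 6.8. Fitting the model means learning w; the nonlinearity lives entirely in the features.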
Polynomial regression is appropriate when a scatter plot shows clear curvature, when a linear fit leaves a systematic pattern in its residuals, or when domain knowledge suggests a curved relationship.
Real-world examples include growth curves, projectile trajectories, and diminishing-returns relationships such as crop yield versus fertilizer amount.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Generate data with quadratic relationship
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3*X.flatten() - 0.5*X.flatten()**2 + np.random.randn(50)*2
This generates data following a quadratic pattern with some random noise, simulating real-world non-linear data.
plt.scatter(X, y, color='blue', alpha=0.6)
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Non-Linear Data Pattern')
plt.show()
The scatter plot reveals a curved relationship that a straight line cannot capture effectively.
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features of degree 2
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Polynomial shape: {X_poly.shape}")
print(f"Feature names: {poly_features.get_feature_names_out()}")
Output:
Original shape: (50, 1)
Polynomial shape: (50, 2)
Feature names: ['x0' 'x0^2']
PolynomialFeatures transforms the original feature x into [x, x²], enabling the linear regression model to fit a curve.
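You can confirm this transformation directly: with `include_bias=False`, the first output column is the original x and the second is x². A minimal check:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [3.0]])
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Column 0 is the original x, column 1 is x squared
print(X_poly)
```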
model = LinearRegression()
model.fit(X_poly, y)
# Make predictions
y_pred = model.predict(X_poly)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")
The model learns coefficients for both x and x², effectively fitting a parabola to the data.
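Since the data was generated as y = 2 + 3x − 0.5x² plus noise, the learned coefficients should land near the true generating values. A self-contained sketch that regenerates the data and checks this:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Regenerate the quadratic data from the lesson
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3*X.flatten() - 0.5*X.flatten()**2 + np.random.randn(50)*2

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)

# Coefficients should be close to the true values 3 and -0.5
print(model.coef_, model.intercept_)
```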
# Sort for smooth line plot
sort_idx = X.flatten().argsort()
plt.scatter(X, y, color='blue', alpha=0.6, label='Data')
plt.plot(X[sort_idx], y_pred[sort_idx], color='red',
linewidth=2, label='Polynomial Fit')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Polynomial Regression (Degree 2)')
plt.legend()
plt.show()
The curved red line shows how polynomial regression captures the non-linear pattern in the data.
Scikit-learn's Pipeline combines preprocessing and modeling into a single object, making code cleaner and preventing data leakage.
from sklearn.pipeline import Pipeline
# Create polynomial regression pipeline
poly_pipeline = Pipeline([
('poly_features', PolynomialFeatures(degree=2)),
('linear_regression', LinearRegression())
])
# Fit and predict in one step
poly_pipeline.fit(X, y)
y_pred = poly_pipeline.predict(X)
# Evaluate
r2 = poly_pipeline.score(X, y)
print(f"R² Score: {r2:.4f}")
The pipeline automatically transforms features before fitting, simplifying the workflow considerably.
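A pipeline's fitted steps remain accessible by name through `named_steps`, which is handy when you want to inspect the learned coefficients after fitting. A small sketch (with freshly generated quadratic data for self-containment):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 1 + 2*X.flatten() - 0.3*X.flatten()**2 + np.random.randn(30)

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('reg', LinearRegression()),
])
pipe.fit(X, y)

# Each step is reachable by its name, so the fitted regressor can be inspected
reg = pipe.named_steps['reg']
print(reg.coef_, reg.intercept_)
```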
Selecting the polynomial degree involves balancing underfitting and overfitting:
from sklearn.metrics import mean_squared_error
degrees = [1, 2, 3, 5, 10]
plt.figure(figsize=(12, 4))
for i, degree in enumerate(degrees, 1):
    plt.subplot(1, 5, i)
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    model.fit(X_poly, y)
    y_pred = model.predict(X_poly)
    mse = mean_squared_error(y, y_pred)
    sort_idx = X.flatten().argsort()
    plt.scatter(X, y, alpha=0.5, s=20)
    plt.plot(X[sort_idx], y_pred[sort_idx], 'r-', linewidth=2)
    plt.title(f'Degree {degree}\nMSE: {mse:.2f}')
plt.tight_layout()
plt.show()
This comparison shows how different degrees affect the fit. Degree 2 typically fits quadratic data well, while degree 10 often creates an overly complex curve.
from sklearn.model_selection import cross_val_score
best_degree = 1
best_score = -np.inf
for degree in range(1, 8):
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('model', LinearRegression())
    ])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    mean_score = scores.mean()
    print(f"Degree {degree}: R² = {mean_score:.4f}")
    if mean_score > best_score:
        best_score = mean_score
        best_degree = degree
print(f"\nBest degree: {best_degree}")
Cross-validation tests each degree on held-out data, helping identify the degree that generalizes best.
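Once cross-validation has picked a degree, the usual final step is to refit a pipeline with that degree on all the data. A sketch assuming the selected degree is 2 (which matches the quadratic data generated earlier in the lesson):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Regenerate the quadratic data from the lesson
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3*X.flatten() - 0.5*X.flatten()**2 + np.random.randn(50)*2

best_degree = 2  # assumed result of the cross-validation loop above
final_model = Pipeline([
    ('poly', PolynomialFeatures(degree=best_degree)),
    ('model', LinearRegression()),
]).fit(X, y)

# R² on the training data should be high for the correct degree
print(f"R²: {final_model.score(X, y):.4f}")
```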
When you have multiple features and non-linear relationships, PolynomialFeatures creates both polynomial terms and interaction terms for all features.
# Example with 2 features
X_multi = np.array([[1, 2], [3, 4], [5, 6]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_transformed = poly.fit_transform(X_multi)
print("Feature names:")
print(poly.get_feature_names_out())
Output:
Feature names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
The transformation creates:
- the original features x0 and x1
- the squared terms x0² and x1²
- the interaction term x0 × x1
Warning: Feature count grows rapidly with more features and higher degrees. This can lead to overfitting and computational expense.
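You can quantify that growth directly via the fitted transformer's `n_output_features_` attribute. A small sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Count output features for growing input sizes and degrees
for n_features in [2, 5, 10]:
    for degree in [2, 3]:
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        poly.fit(np.zeros((1, n_features)))
        print(f"{n_features} features, degree {degree}: "
              f"{poly.n_output_features_} polynomial features")
```

With 10 input features at degree 3, the transform already produces 285 columns, which illustrates why high degrees on wide datasets quickly become impractical.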
Multiple and polynomial regression extend basic linear regression to handle real-world complexity.
Key takeaways:
- Multiple regression extends linear regression to several input features; each coefficient measures a feature's effect with the others held constant.
- Polynomial regression fits curves by adding powers of the features while remaining linear in its coefficients.
- Use PolynomialFeatures to transform features before fitting, ideally inside a Pipeline.
- Choose the polynomial degree with cross-validation to balance underfitting and overfitting.