Apply regression techniques in a hands‑on House Price Prediction project. Learn to preprocess data, engineer features, select models, and evaluate performance to build a real‑world predictive analytics solution.
This project brings together all the regression concepts you've learned. You'll build a machine learning model to predict house prices based on various features like size, location characteristics, and property attributes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)
These imports provide all the tools needed for data manipulation, visualization, and machine learning.
# Generate realistic housing data
n_samples = 500
data = {
'square_feet': np.random.randint(800, 4000, n_samples),
'bedrooms': np.random.randint(1, 6, n_samples),
'bathrooms': np.random.randint(1, 4, n_samples),
'age_years': np.random.randint(0, 50, n_samples),
'lot_size': np.random.randint(2000, 15000, n_samples),
'garage_spaces': np.random.randint(0, 4, n_samples),
'neighborhood_score': np.random.randint(1, 10, n_samples),
'has_pool': np.random.randint(0, 2, n_samples)
}
# Create target variable with realistic relationships
df = pd.DataFrame(data)
df['price'] = (
50000 +
df['square_feet'] * 150 +
df['bedrooms'] * 15000 +
df['bathrooms'] * 20000 -
df['age_years'] * 1500 +
df['lot_size'] * 5 +
df['garage_spaces'] * 10000 +
df['neighborhood_score'] * 25000 +
df['has_pool'] * 30000 +
np.random.randn(n_samples) * 30000 # Add noise
)
print(f"Dataset shape: {df.shape}")
print(df.head())
This creates a synthetic but realistic housing dataset where price depends on multiple features with known relationships.
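Because the generating coefficients are known, you can verify that ordinary least squares recovers them. A minimal sketch using just two of the lesson's terms (a stand-alone check, not part of the project code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
sqft = rng.integers(800, 4000, n).astype(float)
beds = rng.integers(1, 6, n).astype(float)
# Same generating process as the lesson's square_feet and bedrooms terms
price = 50000 + 150 * sqft + 15000 * beds + rng.normal(0, 30000, n)

fit = LinearRegression().fit(np.column_stack([sqft, beds]), price)
print(fit.coef_.round(1))  # close to the true values [150, 15000]
```

With enough samples, the estimated coefficients converge on the true ones despite the noise, which is a useful sanity check when working with synthetic data.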
print("Dataset Statistics:")
print(df.describe().round(2))
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
Understanding your data's basic statistics helps identify potential issues and informs preprocessing decisions.
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(df['price'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.title('House Price Distribution')
plt.subplot(1, 2, 2)
plt.boxplot(df['price'])
plt.ylabel('Price ($)')
plt.title('Price Box Plot')
plt.tight_layout()
plt.show()
print(f"Price Range: ${df['price'].min():,.0f} - ${df['price'].max():,.0f}")
print(f"Mean Price: ${df['price'].mean():,.0f}")
print(f"Median Price: ${df['price'].median():,.0f}")
Visualizing the target variable reveals its distribution and potential outliers.
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
The correlation matrix shows relationships between features and identifies potential multicollinearity issues.
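One way to quantify the multicollinearity hinted at by the heatmap is the variance inflation factor (VIF). A minimal sketch that computes VIF manually by regressing each feature on the others (the data here is illustrative, not the housing set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly collinear with a
X = np.column_stack([a, b, c])
print([round(v, 1) for v in vif(X)])  # columns 0 and 2 inflate, column 1 stays near 1
```

A common rule of thumb treats VIF above roughly 5–10 as a sign of problematic collinearity; in this project, `total_rooms` (the sum of bedrooms and bathrooms) is an obvious candidate.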
# Most correlated features with price
price_correlations = correlation_matrix['price'].sort_values(ascending=False)
print("Feature Correlations with Price:")
print(price_correlations)
This identifies which features have the strongest relationship with the target variable.
# Create meaningful derived features
df['price_per_sqft'] = df['price'] / df['square_feet']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['bed_bath_ratio'] = df['bedrooms'] / df['bathrooms']
df['is_new'] = (df['age_years'] <= 5).astype(int)
df['large_lot'] = (df['lot_size'] > 8000).astype(int)
print("New Features Created:")
print(df[['price_per_sqft', 'total_rooms', 'bed_bath_ratio',
'is_new', 'large_lot']].head())
Feature engineering creates new variables that may capture important patterns not present in the original features.
# Remove price_per_sqft as it's derived from target
# Keep it only for analysis, not for training
df_model = df.drop(columns=['price_per_sqft'])
Features derived from the target variable cause data leakage and must be excluded from training.
# Define feature columns
feature_columns = ['square_feet', 'bedrooms', 'bathrooms', 'age_years',
'lot_size', 'garage_spaces', 'neighborhood_score',
'has_pool', 'total_rooms', 'bed_bath_ratio',
'is_new', 'large_lot']
X = df_model[feature_columns]
y = df_model['price']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
The 80-20 split provides enough training data while reserving samples for unbiased evaluation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)
Scaling puts all features on a comparable scale, which is essential for regularized models whose penalties depend on coefficient magnitudes.
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Lasso Regression': Lasso(alpha=100),
'Elastic Net': ElasticNet(alpha=100, l1_ratio=0.5)
}
def evaluate_model(model, X_train, X_test, y_train, y_test):
"""Train model and return evaluation metrics."""
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics = {
'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
'MAE': mean_absolute_error(y_test, y_pred),
'R² Score': r2_score(y_test, y_pred)
}
return metrics, y_pred
# Store results
results = []
for name, model in models.items():
metrics, _ = evaluate_model(model, X_train_scaled, X_test_scaled,
y_train, y_test)
metrics['Model'] = name
results.append(metrics)
results_df = pd.DataFrame(results)
results_df = results_df[['Model', 'R² Score', 'RMSE', 'MAE']]
print("Model Comparison:")
print(results_df.to_string(index=False))
Output:
Model Comparison:
            Model  R² Score      RMSE       MAE
Linear Regression    0.9521  31842.45  25123.67
 Ridge Regression    0.9519  31901.23  25189.34
 Lasso Regression    0.9498  32612.78  25834.12
      Elastic Net    0.9507  32321.56  25623.89
print("\nCross-Validation Results (5-fold):")
print("-" * 50)
for name, model in models.items():
cv_scores = cross_val_score(model, X_train_scaled, y_train,
cv=5, scoring='r2')
print(f"{name}:")
print(f" Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
Cross-validation provides a more reliable estimate of model performance by testing on multiple data splits.
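R² is not the only option for `cross_val_score`; passing a different scoring string yields RMSE per fold (reported negated so that higher is always better). A short sketch with toy data standing in for the scaled training set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy regression data standing in for X_train_scaled / y_train
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse = -scores  # flip the sign back to get RMSE per fold
print(f"Mean CV RMSE: {rmse.mean():.2f}")
```

Reporting RMSE in the target's units (dollars, in this project) is often easier to communicate to stakeholders than R².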
# Train final model (using Linear Regression for interpretability)
final_model = LinearRegression()
final_model.fit(X_train_scaled, y_train)
# Feature importance
feature_importance = pd.DataFrame({
'Feature': feature_columns,
'Coefficient': final_model.coef_,
'Abs_Coefficient': np.abs(final_model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)
print("Feature Importance (by coefficient magnitude):")
print(feature_importance.to_string(index=False))
plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'],
color=colors, alpha=0.7)
plt.xlabel('Coefficient Value')
plt.title('Feature Coefficients (Green = Positive, Red = Negative)')
plt.tight_layout()
plt.show()
This visualization shows which features increase (green) or decrease (red) the predicted price.
y_pred = final_model.predict(X_test_scaled)
residuals = y_test - y_pred
plt.figure(figsize=(12, 4))
# Residuals vs Predicted
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
# Residual Distribution
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Residual Value')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
# Actual vs Predicted
plt.subplot(1, 3, 3)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', linewidth=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted')
plt.tight_layout()
plt.show()
Residual analysis helps identify patterns the model missed and validates modeling assumptions.
What to look for:
- Residuals scattered randomly around zero with no funnel shape (constant variance)
- A roughly symmetric, bell-shaped residual distribution
- Points in the actual-vs-predicted plot clustered tightly along the diagonal
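Beyond visual inspection, these residual checks can be quantified. A minimal numeric sketch (synthetic well-behaved residuals stand in for the model's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in residuals from a well-behaved model: zero-mean Gaussian noise
residuals = rng.normal(0, 30000, 400)
predicted = rng.uniform(100000, 900000, 400)

# 1. The mean residual should be near zero relative to its spread
print("mean/std:", residuals.mean() / residuals.std())

# 2. Skewness near 0 suggests a symmetric error distribution
print("skew:", stats.skew(residuals))

# 3. Near-zero correlation between |residuals| and predictions
#    suggests no obvious heteroscedasticity
r, p = stats.pearsonr(np.abs(residuals), predicted)
print("heteroscedasticity check r:", r)
```

Running the same checks on the model's actual `residuals` and `y_pred` arrays gives a quick numeric complement to the plots.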
from sklearn.linear_model import RidgeCV, LassoCV
# Ridge CV
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0, 1000.0], cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
# Lasso CV
lasso_cv = LassoCV(alphas=[1, 10, 100, 1000, 10000], cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
# Use the best model based on CV results
best_model = Ridge(alpha=ridge_cv.alpha_)
best_model.fit(X_train_scaled, y_train)
y_pred_best = best_model.predict(X_test_scaled)
print(f"\nFinal Model Performance:")
print(f"R² Score: {r2_score(y_test, y_pred_best):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_best)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred_best):,.2f}")
def predict_house_price(model, scaler, house_features):
"""Predict price for a single house."""
# Create DataFrame with feature names
house_df = pd.DataFrame([house_features], columns=feature_columns)
# Scale features
house_scaled = scaler.transform(house_df)
# Predict
predicted_price = model.predict(house_scaled)[0]
return predicted_price
# Example: New house to price
new_house = {
'square_feet': 2200,
'bedrooms': 4,
'bathrooms': 2,
'age_years': 10,
'lot_size': 8500,
'garage_spaces': 2,
'neighborhood_score': 7,
'has_pool': 1,
'total_rooms': 6,
'bed_bath_ratio': 2.0,
'is_new': 0,
'large_lot': 1
}
predicted_price = predict_house_price(best_model, scaler, new_house)
print(f"\nPredicted Price for New House: ${predicted_price:,.2f}")
# Calculate prediction interval approximation
training_rmse = np.sqrt(mean_squared_error(y_train,
best_model.predict(X_train_scaled)))
print(f"Predicted Price: ${predicted_price:,.2f}")
print(f"Approximate Range: ${predicted_price - 2*training_rmse:,.2f} "
f"to ${predicted_price + 2*training_rmse:,.2f}")
Providing a range alongside predictions gives users a sense of the uncertainty involved.
import joblib
# Save model and scaler
joblib.dump(best_model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
print("Model and scaler saved successfully!")
def load_and_predict(house_data):
"""Load saved model and make prediction."""
model = joblib.load('house_price_model.pkl')
scaler = joblib.load('feature_scaler.pkl')
house_df = pd.DataFrame([house_data], columns=feature_columns)
house_scaled = scaler.transform(house_df)
return model.predict(house_scaled)[0]
Saving model components enables deployment and future predictions without retraining.
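A quick sanity check for the save/load step: a loaded model should reproduce the original's predictions exactly. A minimal round-trip sketch with a toy model (filenames are illustrative):

```python
import numpy as np
import joblib
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

scaler = StandardScaler().fit(X)
model = Ridge(alpha=1.0).fit(scaler.transform(X), y)

# Round-trip both components through disk
joblib.dump(model, "toy_model.pkl")
joblib.dump(scaler, "toy_scaler.pkl")
loaded_model = joblib.load("toy_model.pkl")
loaded_scaler = joblib.load("toy_scaler.pkl")

original = model.predict(scaler.transform(X))
restored = loaded_model.predict(loaded_scaler.transform(X))
print("max difference:", np.abs(original - restored).max())
```

Persisting the scaler alongside the model matters: predictions made with a re-fit or missing scaler would silently be wrong.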
You have now built a complete house price prediction system. Key results:
| Metric | Value |
|---|---|
| Best Model | Ridge Regression |
| R² Score | ~0.95 |
| RMSE | ~$32,000 |
| Most Important Feature | Square Feet |
To enhance this project further, consider trying tree-based models (random forests, gradient boosting), tuning hyperparameters with GridSearchCV, engineering interaction or polynomial features, and validating the approach on a real dataset such as the Ames Housing data.
# Full pipeline in minimal code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
# Load and prepare data
# ... (data loading code here)
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train with cross-validation
model = RidgeCV(cv=5)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
This condensed version shows the essential workflow for any regression project.
Congratulations! You've completed a full machine learning regression project, applying concepts from linear regression through regularization to build a practical predictive model.