Apply regression techniques in a hands‑on House Price Prediction project. Learn to preprocess data, engineer features, select models, and evaluate performance to build a real‑world predictive analytics solution.
This project brings together all the regression concepts you've learned. You'll build a machine learning model to predict house prices based on various features like size, location characteristics, and property attributes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)
These imports provide all the tools needed for data manipulation, visualization, and machine learning.
# Generate realistic housing data
n_samples = 500
data = {
'square_feet': np.random.randint(800, 4000, n_samples),
'bedrooms': np.random.randint(1, 6, n_samples),
'bathrooms': np.random.randint(1, 4, n_samples),
'age_years': np.random.randint(0, 50, n_samples),
'lot_size': np.random.randint(2000, 15000, n_samples),
'garage_spaces': np.random.randint(0, 4, n_samples),
'neighborhood_score': np.random.randint(1, 10, n_samples),
'has_pool': np.random.randint(0, 2, n_samples)
}
# Create target variable with realistic relationships
df = pd.DataFrame(data)
df['price'] = (
50000 +
df['square_feet'] * 150 +
df['bedrooms'] * 15000 +
df['bathrooms'] * 20000 -
df['age_years'] * 1500 +
df['lot_size'] * 5 +
df['garage_spaces'] * 10000 +
df['neighborhood_score'] * 25000 +
df['has_pool'] * 30000 +
np.random.randn(n_samples) * 30000 # Add noise
)
print(f"Dataset shape: {df.shape}")
print(df.head())
This creates a synthetic but realistic housing dataset where price depends on multiple features with known relationships.
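Because the generating coefficients are known, you can verify that ordinary least squares recovers them. A minimal sketch using just two of the lesson's terms (a stand-alone check, not part of the project code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 5000
sqft = rng.integers(800, 4000, n).astype(float)
beds = rng.integers(1, 6, n).astype(float)
# Same generating process as the lesson's square_feet and bedrooms terms
price = 50000 + 150 * sqft + 15000 * beds + rng.normal(0, 30000, n)

fit = LinearRegression().fit(np.column_stack([sqft, beds]), price)
print(fit.coef_.round(1))  # close to the true values [150, 15000]
```

With enough samples, the estimated coefficients converge on the true ones despite the noise, which is a useful sanity check when working with synthetic data.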
print("Dataset Statistics:")
print(df.describe().round(2))
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
Understanding your data's basic statistics helps identify potential issues and informs preprocessing decisions.
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(df['price'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.title('House Price Distribution')
plt.subplot(1, 2, 2)
plt.boxplot(df['price'])
plt.ylabel('Price ($)')
plt.title('Price Box Plot')
plt.tight_layout()
plt.show()
print(f"Price Range: ${df['price'].min():,.0f} - ${df['price'].max():,.0f}")
print(f"Mean Price: ${df['price'].mean():,.0f}")
print(f"Median Price: ${df['price'].median():,.0f}")
Visualizing the target variable reveals its distribution and potential outliers.
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
The correlation matrix shows relationships between features and identifies potential multicollinearity issues.
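One way to quantify the multicollinearity hinted at by the heatmap is the variance inflation factor (VIF). A minimal sketch that computes VIF manually by regressing each feature on the others (the data here is illustrative, not the housing set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j of X on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly collinear with a
X = np.column_stack([a, b, c])
print([round(v, 1) for v in vif(X)])  # columns 0 and 2 inflate, column 1 stays near 1
```

A common rule of thumb treats VIF above roughly 5–10 as a sign of problematic collinearity; in this project, `total_rooms` (the sum of bedrooms and bathrooms) is an obvious candidate.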
# Most correlated features with price
price_correlations = correlation_matrix['price'].sort_values(ascending=False)
print("Feature Correlations with Price:")
print(price_correlations)
This identifies which features have the strongest relationship with the target variable.
# Create meaningful derived features
df['price_per_sqft'] = df['price'] / df['square_feet']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['bed_bath_ratio'] = df['bedrooms'] / df['bathrooms']
df['is_new'] = (df['age_years'] <= 5).astype(int)
df['large_lot'] = (df['lot_size'] > 8000).astype(int)
print("New Features Created:")
print(df[['price_per_sqft', 'total_rooms', 'bed_bath_ratio',
'is_new', 'large_lot']].head())
Feature engineering creates new variables that may capture important patterns not present in the original features.
# Remove price_per_sqft as it's derived from target
# Keep it only for analysis, not for training
df_model = df.drop(columns=['price_per_sqft'])
Features derived from the target variable cause data leakage and must be excluded from training.
# Define feature columns
feature_columns = ['square_feet', 'bedrooms', 'bathrooms', 'age_years',
'lot_size', 'garage_spaces', 'neighborhood_score',
'has_pool', 'total_rooms', 'bed_bath_ratio',
'is_new', 'large_lot']
X = df_model[feature_columns]
y = df_model['price']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
The 80-20 split provides enough training data while reserving samples for unbiased evaluation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)
Scaling puts all features on a comparable scale, which is essential for regularized models whose penalties depend on coefficient magnitudes.
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(alpha=1.0),
'Lasso Regression': Lasso(alpha=100),
'Elastic Net': ElasticNet(alpha=100, l1_ratio=0.5)
}
def evaluate_model(model, X_train, X_test, y_train, y_test):
"""Train model and return evaluation metrics."""
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics = {
'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
'MAE': mean_absolute_error(y_test, y_pred),
'R² Score': r2_score(y_test, y_pred)
}
return metrics, y_pred
# Store results
results = []
for name, model in models.items():
metrics, _ = evaluate_model(model, X_train_scaled, X_test_scaled,
y_train, y_test)
metrics['Model'] = name
results.append(metrics)
results_df = pd.DataFrame(results)
results_df = results_df[['Model', 'R² Score', 'RMSE', 'MAE']]
print("Model Comparison:")
print(results_df.to_string(index=False))
Output:
Model Comparison:
            Model  R² Score      RMSE       MAE
Linear Regression    0.9521  31842.45  25123.67
 Ridge Regression    0.9519  31901.23  25189.34
 Lasso Regression    0.9498  32612.78  25834.12
      Elastic Net    0.9507  32321.56  25623.89
print("\nCross-Validation Results (5-fold):")
print("-" * 50)
for name, model in models.items():
cv_scores = cross_val_score(model, X_train_scaled, y_train,
cv=5, scoring='r2')
print(f"{name}:")
print(f" Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
Cross-validation provides a more reliable estimate of model performance by testing on multiple data splits.
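R² is not the only option for `cross_val_score`; passing a different scoring string yields RMSE per fold (reported negated so that higher is always better). A short sketch with toy data standing in for the scaled training set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy regression data standing in for X_train_scaled / y_train
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse = -scores  # flip the sign back to get RMSE per fold
print(f"Mean CV RMSE: {rmse.mean():.2f}")
```

Reporting RMSE in the target's units (dollars, in this project) is often easier to communicate to stakeholders than R².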
# Train final model (using Linear Regression for interpretability)
final_model = LinearRegression()
final_model.fit(X_train_scaled, y_train)
# Feature importance
feature_importance = pd.DataFrame({
'Feature': feature_columns,
'Coefficient': final_model.coef_,
'Abs_Coefficient': np.abs(final_model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)
print("Feature Importance (by coefficient magnitude):")
print(feature_importance.to_string(index=False))
plt.figure(figsize=(10, 6))
colors = ['green' if c > 0 else 'red' for c in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'],
color=colors, alpha=0.7)
plt.xlabel('Coefficient Value')
plt.title('Feature Coefficients (Green = Positive, Red = Negative)')
plt.tight_layout()
plt.show()
This visualization shows which features increase (green) or decrease (red) the predicted price.
y_pred = final_model.predict(X_test_scaled)
residuals = y_test - y_pred
plt.figure(figsize=(12, 4))
# Residuals vs Predicted
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
# Residual Distribution
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Residual Value')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
# Actual vs Predicted
plt.subplot(1, 3, 3)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
'r--', linewidth=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted')
plt.tight_layout()
plt.show()
Residual analysis helps identify patterns the model missed and validates modeling assumptions.
What to look for:
- Residuals scattered randomly around zero with no funnel shape (constant variance)
- A roughly symmetric, bell-shaped residual distribution
- Points in the actual-vs-predicted plot clustered tightly along the diagonal
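Beyond visual inspection, these residual checks can be quantified. A minimal numeric sketch (synthetic well-behaved residuals stand in for the model's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in residuals from a well-behaved model: zero-mean Gaussian noise
residuals = rng.normal(0, 30000, 400)
predicted = rng.uniform(100000, 900000, 400)

# 1. The mean residual should be near zero relative to its spread
print("mean/std:", residuals.mean() / residuals.std())

# 2. Skewness near 0 suggests a symmetric error distribution
print("skew:", stats.skew(residuals))

# 3. Near-zero correlation between |residuals| and predictions
#    suggests no obvious heteroscedasticity
r, p = stats.pearsonr(np.abs(residuals), predicted)
print("heteroscedasticity check r:", r)
```

Running the same checks on the model's actual `residuals` and `y_pred` arrays gives a quick numeric complement to the plots.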
from sklearn.linear_model import RidgeCV, LassoCV
# Ridge CV
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0, 1000.0], cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
# Lasso CV
lasso_cv = LassoCV(alphas=[1, 10, 100, 1000, 10000], cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
# Use the best model based on CV results
best_model = Ridge(alpha=ridge_cv.alpha_)
best_model.fit(X_train_scaled, y_train)
y_pred_best = best_model.predict(X_test_scaled)
print(f"\nFinal Model Performance:")
print(f"R² Score: {r2_score(y_test, y_pred_best):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_best)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred_best):,.2f}")
def predict_house_price(model, scaler, house_features):
"""Predict price for a single house."""
# Create DataFrame with feature names
house_df = pd.DataFrame([house_features], columns=feature_columns)
# Scale features
house_scaled = scaler.transform(house_df)
# Predict
predicted_price = model.predict(house_scaled)[0]
return predicted_price
# Example: New house to price
new_house = {
'square_feet': 2200,
'bedrooms': 4,
'bathrooms': 2,
'age_years': 10,
'lot_size': 8500,
'garage_spaces': 2,
'neighborhood_score': 7,
'has_pool': 1,
'total_rooms': 6,
'bed_bath_ratio': 2.0,
'is_new': 0,
'large_lot': 1
}
predicted_price = predict_house_price(best_model, scaler, new_house)
print(f"\nPredicted Price for New House: ${predicted_price:,.2f}")
# Calculate prediction interval approximation
training_rmse = np.sqrt(mean_squared_error(y_train,
best_model.predict(X_train_scaled)))
print(f"Predicted Price: ${predicted_price:,.2f}")
print(f"Approximate Range: ${predicted_price - 2*training_rmse:,.2f} "
f"to ${predicted_price + 2*training_rmse:,.2f}")
Providing a range alongside predictions gives users a sense of the uncertainty involved.
import joblib
# Save model and scaler
joblib.dump(best_model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
print("Model and scaler saved successfully!")
def load_and_predict(house_data):
"""Load saved model and make prediction."""
model = joblib.load('house_price_model.pkl')
scaler = joblib.load('feature_scaler.pkl')
house_df = pd.DataFrame([house_data], columns=feature_columns)
house_scaled = scaler.transform(house_df)
return model.predict(house_scaled)[0]
Saving model components enables deployment and future predictions without retraining.
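A quick sanity check for the save/load step: a loaded model should reproduce the original's predictions exactly. A minimal round-trip sketch with a toy model (filenames are illustrative):

```python
import numpy as np
import joblib
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

scaler = StandardScaler().fit(X)
model = Ridge(alpha=1.0).fit(scaler.transform(X), y)

# Round-trip both components through disk
joblib.dump(model, "toy_model.pkl")
joblib.dump(scaler, "toy_scaler.pkl")
loaded_model = joblib.load("toy_model.pkl")
loaded_scaler = joblib.load("toy_scaler.pkl")

original = model.predict(scaler.transform(X))
restored = loaded_model.predict(loaded_scaler.transform(X))
print("max difference:", np.abs(original - restored).max())
```

Persisting the scaler alongside the model matters: predictions made with a re-fit or missing scaler would silently be wrong.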
You have now built a complete house price prediction system. Key results:
| Metric | Value |
|---|---|
| Best Model | Ridge Regression |
| R² Score | ~0.95 |
| RMSE | ~$32,000 |
| Most Important Feature | Square Feet |
To enhance this project further, consider trying tree-based models (random forests, gradient boosting), tuning hyperparameters with GridSearchCV, engineering interaction or polynomial features, and validating the approach on a real dataset such as the Ames Housing data.
# Full pipeline in minimal code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
# Load and prepare data
# ... (data loading code here)
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train with cross-validation
model = RidgeCV(cv=5)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):,.2f}")
This condensed version shows the essential workflow for any regression project.
Congratulations! You've completed a full machine learning regression project, applying concepts from linear regression through regularization to build a practical predictive model.