Classification Project - Customer Churn Prediction
Apply classification algorithms to a real Customer Churn Prediction project. Learn to preprocess data, evaluate models, and build insights that help businesses reduce customer loss.
Project Overview
Customer churn prediction is one of the most important applications of classification in business. Predicting which customers are likely to leave allows companies to take proactive retention actions.
Business Context
Acquiring new customers costs significantly more than retaining existing ones. By identifying at-risk customers early, businesses can:
- Offer targeted promotions
- Improve customer service
- Address pain points before customers leave
- Optimize marketing spend
Project Goals
- Analyze customer data to understand churn patterns
- Engineer features that predict churn behavior
- Train and compare multiple classification models
- Select and tune the best performing model
- Build a prediction system for new customers
Step 1: Import Libraries and Create Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report,
roc_auc_score, roc_curve)
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
pd.set_option('display.max_columns', None)
Generate Realistic Churn Dataset
def generate_churn_data(n_samples=2000):
    """Generate synthetic customer churn dataset."""
    np.random.seed(42)
    data = {
        'customer_id': range(1, n_samples + 1),
        'tenure_months': np.random.randint(1, 72, n_samples),
        'monthly_charges': np.random.uniform(20, 120, n_samples),
        'total_charges': np.zeros(n_samples),
        'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'],
                                          n_samples, p=[0.5, 0.3, 0.2]),
        'payment_method': np.random.choice(['Electronic check', 'Mailed check',
                                            'Bank transfer', 'Credit card'], n_samples),
        'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'],
                                             n_samples, p=[0.35, 0.45, 0.2]),
        'online_security': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
        'tech_support': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
        'streaming_tv': np.random.choice(['Yes', 'No', 'No internet'], n_samples),
        'num_support_tickets': np.random.poisson(2, n_samples),
        'num_referrals': np.random.poisson(1, n_samples),
        'satisfaction_score': np.random.randint(1, 6, n_samples),
    }
    df = pd.DataFrame(data)
    # Calculate total charges
    df['total_charges'] = df['tenure_months'] * df['monthly_charges'] * np.random.uniform(0.9, 1.1, n_samples)
    # Generate churn based on realistic patterns
    churn_prob = np.zeros(n_samples)
    # Higher churn for month-to-month contracts
    churn_prob += (df['contract_type'] == 'Month-to-month') * 0.25
    # Higher churn for shorter tenure
    churn_prob += (df['tenure_months'] < 12) * 0.15
    # Higher churn for high monthly charges
    churn_prob += (df['monthly_charges'] > 80) * 0.1
    # Higher churn for fiber optic (faster but more issues)
    churn_prob += (df['internet_service'] == 'Fiber optic') * 0.1
    # Lower churn with security and support
    churn_prob -= (df['online_security'] == 'Yes') * 0.1
    churn_prob -= (df['tech_support'] == 'Yes') * 0.1
    # Higher churn with more support tickets
    churn_prob += df['num_support_tickets'] * 0.03
    # Lower churn with referrals (engaged customers)
    churn_prob -= df['num_referrals'] * 0.05
    # Lower churn with higher satisfaction
    churn_prob -= (df['satisfaction_score'] - 3) * 0.08
    # Add noise and clip
    churn_prob += np.random.uniform(-0.1, 0.1, n_samples)
    churn_prob = np.clip(churn_prob, 0.05, 0.95)
    # Generate churn labels
    df['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)
    return df
# Generate dataset
df = generate_churn_data(2000)
print(f"Dataset shape: {df.shape}")
print(df.head())
Step 2: Exploratory Data Analysis (EDA)
Basic Data Overview
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
print(f"\nBasic Statistics:\n{df.describe()}")
Target Variable Distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
# sort_index() guarantees class 0 (retained) comes before class 1 (churned),
# matching the hard-coded labels below
churn_counts = df['churn'].value_counts().sort_index()
plt.bar(['Retained', 'Churned'], churn_counts.values, color=['green', 'red'], alpha=0.7)
plt.ylabel('Count')
plt.title('Customer Churn Distribution')
for i, v in enumerate(churn_counts.values):
    plt.text(i, v + 20, str(v), ha='center')
plt.subplot(1, 2, 2)
# plt.pie() has no alpha keyword; translucency goes through wedgeprops
plt.pie(churn_counts.values, labels=['Retained', 'Churned'],
        autopct='%1.1f%%', colors=['green', 'red'], wedgeprops={'alpha': 0.7})
plt.title('Churn Percentage')
plt.tight_layout()
plt.show()
print(f"Churn Rate: {df['churn'].mean()*100:.2f}%")
Churn by Categorical Features
categorical_cols = ['contract_type', 'payment_method', 'internet_service',
'online_security', 'tech_support']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for i, col in enumerate(categorical_cols):
    churn_by_cat = df.groupby(col)['churn'].mean().sort_values(ascending=False)
    axes[i].bar(range(len(churn_by_cat)), churn_by_cat.values, color='coral', alpha=0.7)
    axes[i].set_xticks(range(len(churn_by_cat)))
    axes[i].set_xticklabels(churn_by_cat.index, rotation=45, ha='right')
    axes[i].set_ylabel('Churn Rate')
    axes[i].set_title(f'Churn Rate by {col}')
    axes[i].axhline(y=df['churn'].mean(), color='red', linestyle='--', label='Overall Rate')
# Hide empty subplot
axes[5].axis('off')
plt.tight_layout()
plt.show()
Churn by Numerical Features
numerical_cols = ['tenure_months', 'monthly_charges', 'total_charges',
'num_support_tickets', 'satisfaction_score']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for i, col in enumerate(numerical_cols):
    # Box plot by churn status
    df.boxplot(column=col, by='churn', ax=axes[i])
    axes[i].set_xlabel('Churn (0=No, 1=Yes)')
    axes[i].set_ylabel(col)
    axes[i].set_title(f'{col} by Churn Status')
plt.suptitle('') # Remove automatic title
axes[5].axis('off')
plt.tight_layout()
plt.show()
Correlation Analysis
# Encode categorical variables for correlation
df_encoded = df.copy()
label_encoders = {}
for col in ['contract_type', 'payment_method', 'internet_service',
            'online_security', 'tech_support', 'streaming_tv']:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le
# Correlation with target
correlations = df_encoded.drop('customer_id', axis=1).corr()['churn'].sort_values(ascending=False)
print("Feature Correlations with Churn:")
print(correlations)
# Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df_encoded.drop('customer_id', axis=1).corr(),
annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Step 3: Feature Engineering
Create New Features
# Create a copy for feature engineering
df_features = df.copy()
# Average monthly charges relative to tenure
df_features['avg_monthly_value'] = df_features['total_charges'] / (df_features['tenure_months'] + 1)
# Tenure categories
df_features['tenure_group'] = pd.cut(df_features['tenure_months'],
bins=[0, 12, 24, 48, 72],
labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'])
# High value customer flag
df_features['high_value'] = (df_features['monthly_charges'] > df_features['monthly_charges'].median()).astype(int)
# Risk score based on known churn indicators
df_features['risk_score'] = (
(df_features['contract_type'] == 'Month-to-month').astype(int) * 2 +
(df_features['tenure_months'] < 12).astype(int) * 2 +
(df_features['num_support_tickets'] > 3).astype(int) * 1 +
(df_features['satisfaction_score'] < 3).astype(int) * 2 -
(df_features['num_referrals'] > 0).astype(int) * 1
)
# Service count
service_cols = ['online_security', 'tech_support', 'streaming_tv']
df_features['num_services'] = sum((df_features[col] == 'Yes').astype(int) for col in service_cols)
print("New Features Created:")
print(df_features[['avg_monthly_value', 'tenure_group', 'high_value',
'risk_score', 'num_services']].head(10))
Encode Categorical Variables
# One-hot encoding for categorical variables
categorical_features = ['contract_type', 'payment_method', 'internet_service',
'online_security', 'tech_support', 'streaming_tv', 'tenure_group']
df_final = pd.get_dummies(df_features, columns=categorical_features, drop_first=True)
print(f"Final dataset shape: {df_final.shape}")
print(f"Columns: {list(df_final.columns)}")
Step 4: Prepare Data for Modeling
Select Features and Target
# Remove non-feature columns
drop_cols = ['customer_id', 'churn']
feature_cols = [col for col in df_final.columns if col not in drop_cols]
X = df_final[feature_cols]
y = df_final['churn']
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Target distribution:\n{y.value_counts()}")
Split Data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
print(f"Training churn rate: {y_train.mean():.2%}")
print(f"Testing churn rate: {y_test.mean():.2%}")
Scale Numerical Features
# Identify numerical columns to scale
numerical_features = ['tenure_months', 'monthly_charges', 'total_charges',
'num_support_tickets', 'num_referrals', 'satisfaction_score',
'avg_monthly_value', 'risk_score', 'num_services']
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test_scaled[numerical_features] = scaler.transform(X_test[numerical_features])
Step 5: Train and Compare Classification Models
Define Models
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
'SVM': SVC(random_state=42, probability=True),
'Naive Bayes': GaussianNB()
}
Evaluate All Models
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Train and evaluate a classification model."""
    # Train
    model.fit(X_train, y_train)
    # Predict
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    # Calculate metrics
    metrics = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob) if y_prob is not None else None
    }
    return metrics, y_pred, y_prob
# Train and evaluate all models
results = []
predictions = {}
for name, model in models.items():
    # Use scaled data for distance-based models
    if name in ['K-Nearest Neighbors', 'SVM']:
        metrics, y_pred, y_prob = evaluate_model(
            model, X_train_scaled, X_test_scaled, y_train, y_test, name
        )
    else:
        metrics, y_pred, y_prob = evaluate_model(
            model, X_train, X_test, y_train, y_test, name
        )
    results.append(metrics)
    predictions[name] = {'y_pred': y_pred, 'y_prob': y_prob}
# Display results
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('ROC-AUC', ascending=False)
print("Model Comparison Results:")
print(results_df.to_string(index=False))
Visualize Model Comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar chart of metrics
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
x = np.arange(len(results_df))
width = 0.15
for i, metric in enumerate(metrics_to_plot):
    axes[0].bar(x + i*width, results_df[metric], width, label=metric)
axes[0].set_xlabel('Model')
axes[0].set_ylabel('Score')
axes[0].set_title('Model Performance Comparison')
axes[0].set_xticks(x + width * 2)
axes[0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0].legend(loc='lower right')
axes[0].set_ylim(0, 1)
# ROC Curves
for name in models.keys():
    if predictions[name]['y_prob'] is not None:
        fpr, tpr, _ = roc_curve(y_test, predictions[name]['y_prob'])
        auc = roc_auc_score(y_test, predictions[name]['y_prob'])
        axes[1].plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves')
axes[1].legend(loc='lower right')
plt.tight_layout()
plt.show()
Cross-Validation for Robust Comparison
print("Cross-Validation Results (5-fold):")
print("-" * 60)
cv_results = []
for name, model in models.items():
    # Use scaled data for distance-based models, matching the earlier evaluation
    if name in ['K-Nearest Neighbors', 'SVM']:
        X_cv = X_train_scaled
    else:
        X_cv = X_train
    scores = cross_val_score(model, X_cv, y_train, cv=5, scoring='roc_auc')
    cv_results.append({
        'Model': name,
        'Mean ROC-AUC': scores.mean(),
        'Std': scores.std()
    })
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
cv_df = pd.DataFrame(cv_results).sort_values('Mean ROC-AUC', ascending=False)
print(cv_df.to_string(index=False))
Step 6: Select and Tune the Best Model
Based on the comparison, we'll tune Random Forest, which typically performs well on churn problems.
Hyperparameter Tuning
# Define parameter grid for Random Forest
param_grid = {
'n_estimators': [100, 200],
'max_depth': [10, 15, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV ROC-AUC: {grid_search.best_score_:.4f}")
Train Final Model
# Train with best parameters
best_rf = grid_search.best_estimator_
# Evaluate on test set
y_pred_final = best_rf.predict(X_test)
y_prob_final = best_rf.predict_proba(X_test)[:, 1]
print("\nFinal Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_final):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_final):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_final):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob_final):.4f}")
Step 7: Analyze Model Results
Confusion Matrix
cm = confusion_matrix(y_test, y_pred_final)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Predicted: Stay', 'Predicted: Churn'],
yticklabels=['Actual: Stay', 'Actual: Churn'])
plt.title('Confusion Matrix - Final Model')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Calculate business metrics
tn, fp, fn, tp = cm.ravel()
print(f"\nBusiness Impact Analysis:")
print(f"True Negatives (Correct: Stay): {tn}")
print(f"True Positives (Correct: Churn): {tp}")
print(f"False Positives (Wrong: Predicted Churn): {fp}")
print(f"False Negatives (Missed: Actual Churn): {fn}")
print(f"\nChurn Detection Rate: {tp/(tp+fn)*100:.1f}%")
print(f"False Alarm Rate: {fp/(fp+tn)*100:.1f}%")
Feature Importance
# Get feature importance
feature_importance = pd.DataFrame({
'Feature': feature_cols,
'Importance': best_rf.feature_importances_
}).sort_values('Importance', ascending=False)
# Plot top 15 features
plt.figure(figsize=(10, 8))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['Importance'].values)
plt.yticks(range(len(top_features)), top_features['Feature'].values)
plt.xlabel('Importance')
plt.title('Top 15 Most Important Features for Churn Prediction')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("Top 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
Step 8: Threshold Optimization
For business applications, the default 0.5 threshold may not be optimal.
# Test different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
threshold_results = []
for thresh in thresholds:
    y_pred_thresh = (y_prob_final >= thresh).astype(int)
    threshold_results.append({
        'Threshold': thresh,
        'Precision': precision_score(y_test, y_pred_thresh),
        'Recall': recall_score(y_test, y_pred_thresh),
        'F1-Score': f1_score(y_test, y_pred_thresh),
        'Churners Caught': (y_pred_thresh & y_test).sum(),
        'False Alarms': ((y_pred_thresh == 1) & (y_test == 0)).sum()
    })
threshold_df = pd.DataFrame(threshold_results)
# Plot precision-recall tradeoff
plt.figure(figsize=(10, 5))
plt.plot(threshold_df['Threshold'], threshold_df['Precision'], 'b-', label='Precision')
plt.plot(threshold_df['Threshold'], threshold_df['Recall'], 'r-', label='Recall')
plt.plot(threshold_df['Threshold'], threshold_df['F1-Score'], 'g-', label='F1-Score')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision, Recall, and F1-Score vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Find optimal threshold for F1
optimal_idx = threshold_df['F1-Score'].idxmax()
optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
print(f"Optimal Threshold (Max F1): {optimal_threshold:.2f}")
print(threshold_df.loc[optimal_idx])
Step 9: Build Prediction System
Create Prediction Function
def predict_churn(customer_data, model, scaler, feature_cols, numerical_features):
    """
    Predict churn probability for a single customer.

    Parameters:
    - customer_data: dict with customer features
    - model: trained model
    - scaler: fitted StandardScaler
    - feature_cols: list of feature column names
    - numerical_features: list of numerical feature names

    Returns:
    - churn_probability: float
    - churn_prediction: int (0 or 1)
    - risk_level: str
    """
    # Convert to DataFrame
    df_customer = pd.DataFrame([customer_data])
    # Feature engineering (same as training)
    df_customer['avg_monthly_value'] = df_customer['total_charges'] / (df_customer['tenure_months'] + 1)
    # Note: 60 approximates the training-set median of monthly_charges;
    # in production, store and reuse the exact training median instead
    df_customer['high_value'] = (df_customer['monthly_charges'] > 60).astype(int)
    df_customer['risk_score'] = (
        (df_customer['contract_type'] == 'Month-to-month').astype(int) * 2 +
        (df_customer['tenure_months'] < 12).astype(int) * 2 +
        (df_customer['num_support_tickets'] > 3).astype(int) * 1 +
        (df_customer['satisfaction_score'] < 3).astype(int) * 2 -
        (df_customer['num_referrals'] > 0).astype(int) * 1
    )
    service_cols = ['online_security', 'tech_support', 'streaming_tv']
    df_customer['num_services'] = sum(
        (df_customer[col] == 'Yes').astype(int) for col in service_cols
    )
    # Tenure group
    if df_customer['tenure_months'].values[0] <= 12:
        tenure_group = '0-1yr'
    elif df_customer['tenure_months'].values[0] <= 24:
        tenure_group = '1-2yr'
    elif df_customer['tenure_months'].values[0] <= 48:
        tenure_group = '2-4yr'
    else:
        tenure_group = '4+yr'
    # Create feature vector matching training columns
    customer_features = {}
    # Add numerical features
    for col in numerical_features:
        if col in df_customer.columns:
            customer_features[col] = df_customer[col].values[0]
    # Add encoded categorical features (first level of each dummy was dropped in training)
    categorical_mappings = {
        'contract_type': ['One year', 'Two year'],
        'payment_method': ['Credit card', 'Electronic check', 'Mailed check'],
        'internet_service': ['Fiber optic', 'No'],
        'online_security': ['No internet', 'Yes'],
        'tech_support': ['No internet', 'Yes'],
        'streaming_tv': ['No internet', 'Yes'],
        'tenure_group': ['1-2yr', '2-4yr', '4+yr']
    }
    for cat_col, values in categorical_mappings.items():
        for val in values:
            col_name = f"{cat_col}_{val}"
            if cat_col == 'tenure_group':
                customer_features[col_name] = 1 if tenure_group == val else 0
            else:
                customer_features[col_name] = 1 if df_customer[cat_col].values[0] == val else 0
    # Create DataFrame with correct column order
    X_customer = pd.DataFrame([customer_features])
    # Ensure all feature columns exist
    for col in feature_cols:
        if col not in X_customer.columns:
            X_customer[col] = 0
    X_customer = X_customer[feature_cols]
    # Scale numerical features
    X_customer[numerical_features] = scaler.transform(X_customer[numerical_features])
    # Predict (optimal_threshold comes from Step 8 via the enclosing scope;
    # pass it as a parameter if deploying this function standalone)
    churn_prob = model.predict_proba(X_customer)[0][1]
    churn_pred = 1 if churn_prob >= optimal_threshold else 0
    # Determine risk level
    if churn_prob < 0.3:
        risk_level = "Low Risk"
    elif churn_prob < 0.6:
        risk_level = "Medium Risk"
    else:
        risk_level = "High Risk"
    return churn_prob, churn_pred, risk_level
# Test the prediction function
test_customer = {
'tenure_months': 6,
'monthly_charges': 85.50,
'total_charges': 513.00,
'contract_type': 'Month-to-month',
'payment_method': 'Electronic check',
'internet_service': 'Fiber optic',
'online_security': 'No',
'tech_support': 'No',
'streaming_tv': 'Yes',
'num_support_tickets': 4,
'num_referrals': 0,
'satisfaction_score': 2
}
prob, pred, risk = predict_churn(
test_customer, best_rf, scaler, feature_cols, numerical_features
)
print("\n" + "="*50)
print("CUSTOMER CHURN PREDICTION")
print("="*50)
print(f"\nCustomer Profile:")
for key, value in test_customer.items():
    print(f"  {key}: {value}")
print(f"\nPrediction Results:")
print(f" Churn Probability: {prob:.2%}")
print(f" Prediction: {'Will Churn' if pred == 1 else 'Will Stay'}")
print(f" Risk Level: {risk}")
Batch Prediction for Multiple Customers
def predict_batch(customer_list, model, scaler, feature_cols, numerical_features):
    """Predict churn for multiple customers."""
    results = []
    for i, customer in enumerate(customer_list):
        prob, pred, risk = predict_churn(
            customer, model, scaler, feature_cols, numerical_features
        )
        results.append({
            'Customer_ID': i + 1,
            'Churn_Probability': prob,
            'Prediction': 'Churn' if pred == 1 else 'Stay',
            'Risk_Level': risk
        })
    return pd.DataFrame(results)
# Example batch prediction
sample_customers = [
{
'tenure_months': 48, 'monthly_charges': 45.00, 'total_charges': 2160.00,
'contract_type': 'Two year', 'payment_method': 'Bank transfer',
'internet_service': 'DSL', 'online_security': 'Yes', 'tech_support': 'Yes',
'streaming_tv': 'No', 'num_support_tickets': 1, 'num_referrals': 3,
'satisfaction_score': 5
},
{
'tenure_months': 3, 'monthly_charges': 95.00, 'total_charges': 285.00,
'contract_type': 'Month-to-month', 'payment_method': 'Electronic check',
'internet_service': 'Fiber optic', 'online_security': 'No', 'tech_support': 'No',
'streaming_tv': 'Yes', 'num_support_tickets': 5, 'num_referrals': 0,
'satisfaction_score': 1
}
]
batch_results = predict_batch(
sample_customers, best_rf, scaler, feature_cols, numerical_features
)
print("\nBatch Prediction Results:")
print(batch_results.to_string(index=False))
Step 10: Save Model and Create Report
Save Model Components
import joblib
# Save model and preprocessing objects
joblib.dump(best_rf, 'churn_model.pkl')
joblib.dump(scaler, 'churn_scaler.pkl')
joblib.dump(feature_cols, 'churn_features.pkl')
print("Model components saved successfully!")
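The save step above only makes sense with a matching load path. Here is a minimal, self-contained sketch of the joblib round-trip, using a small stand-in model (not `best_rf`) so it runs on its own; in the serving process you would load `churn_model.pkl`, `churn_scaler.pkl`, and `churn_features.pkl` the same way:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Small stand-in model: save and reload, then verify the round-trip
X, y = make_classification(n_samples=200, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
joblib.dump(model, 'churn_model.pkl')

# In the serving process: reload and confirm predictions are unchanged
reloaded = joblib.load('churn_model.pkl')
print((reloaded.predict(X) == model.predict(X)).all())  # expect True
```

The reloaded objects can then be passed straight to `predict_churn()` in place of `best_rf`, `scaler`, and `feature_cols`. Note that pickle-based files are only guaranteed to load under the same scikit-learn version that wrote them.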
Generate Summary Report
print("\n" + "="*60)
print("CUSTOMER CHURN PREDICTION MODEL - SUMMARY REPORT")
print("="*60)
print("\n📊 DATASET OVERVIEW:")
print(f" Total Customers: {len(df):,}")
print(f" Churn Rate: {df['churn'].mean()*100:.1f}%")
print(f" Features Used: {len(feature_cols)}")
print("\n🏆 BEST MODEL: Random Forest Classifier")
print(f" Parameters: {grid_search.best_params_}")
print("\n📈 PERFORMANCE METRICS:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_final)*100:.1f}%")
print(f" Precision: {precision_score(y_test, y_pred_final)*100:.1f}%")
print(f" Recall: {recall_score(y_test, y_pred_final)*100:.1f}%")
print(f" F1-Score: {f1_score(y_test, y_pred_final)*100:.1f}%")
print(f" ROC-AUC: {roc_auc_score(y_test, y_prob_final)*100:.1f}%")
print("\n🔑 TOP 5 CHURN PREDICTORS:")
for i, row in feature_importance.head(5).iterrows():
    print(f"  {row['Feature']}: {row['Importance']:.4f}")
print("\n💡 KEY INSIGHTS:")
print(" - Month-to-month contracts have highest churn risk")
print(" - Low tenure (< 12 months) indicates higher churn probability")
print(" - Customers with tech support are less likely to churn")
print(" - High number of support tickets correlates with churn")
print(" - Satisfaction score is a strong predictor of retention")
print("\n📁 SAVED FILES:")
print(" - churn_model.pkl (Trained model)")
print(" - churn_scaler.pkl (Feature scaler)")
print(" - churn_features.pkl (Feature list)")
print("\n" + "="*60)
Project Summary
What You Built
A complete customer churn prediction system that:
- Analyzes customer data to identify churn patterns
- Engineers features based on domain knowledge
- Compares six different classification algorithms
- Tunes hyperparameters using grid search
- Optimizes the classification threshold for business needs
- Provides individual and batch predictions
- Saves model components for deployment
Key Results
| Metric | Value |
|---|---|
| Best Model | Random Forest |
| ROC-AUC | ~0.85+ |
| Recall (Churn Detection) | ~75%+ |
| Top Predictor | Contract Type |
Business Recommendations
Based on this analysis:
- Target month-to-month customers for retention campaigns
- Engage new customers early (first 12 months are critical)
- Promote tech support and security add-ons to increase stickiness
- Monitor support tickets as an early warning indicator
- Track satisfaction scores and act on declining trends
Next Steps for Enhancement
To improve this project further:
- Incorporate more customer behavior data (login frequency, usage patterns)
- Add time-series features (trend in charges, satisfaction changes)
- Implement cost-sensitive learning (weight false negatives higher)
- Build an API for real-time predictions
- Create a dashboard for monitoring churn metrics
- A/B test retention strategies on high-risk segments
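The cost-sensitive learning idea from the list above can be sketched with scikit-learn's `class_weight` parameter. This is a minimal illustration on synthetic data, not the project's dataset, and the 1:4 weighting is an assumed example ratio, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# Imbalanced toy data standing in for a churn table (~70% stay, ~30% churn)
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline vs. a model that penalizes missed churners 4x more than false alarms
plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight={0: 1, 1: 4},
                                  random_state=42).fit(X_tr, y_tr)

for name, clf in [('plain', plain), ('weighted', weighted)]:
    y_hat = clf.predict(X_te)
    print(f"{name}: recall={recall_score(y_te, y_hat):.3f} "
          f"precision={precision_score(y_te, y_hat):.3f}")
```

Weighting typically trades some precision for higher churn recall, which fits the business framing here: a missed churner costs more than an unnecessary retention offer. Gradient-boosting libraries expose the same idea through parameters such as `scale_pos_weight`.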
Congratulations! You've completed a comprehensive classification project, applying multiple algorithms to solve a real business problem with measurable impact.
Related Lessons
Logistic Regression for Binary Classification
Learn how Logistic Regression predicts binary outcomes using probability-based decision boundaries. This lesson covers theory, implementation, and practical use cases like spam detection and churn prediction.
K-Nearest Neighbors (KNN) Algorithm
Understand how the KNN algorithm classifies data based on similarity. This lesson explains distance metrics, choosing the right K value, and building accurate classification models.
Naive Bayes Classifier
Discover the Naive Bayes classifier, a fast and powerful algorithm based on probability and Bayes’ theorem. This lesson shows how it excels in text classification and other high‑dimensional tasks.