Handling missing values involves identifying, analyzing, and addressing gaps in a dataset to ensure accuracy and reliability. Techniques such as deletion, imputation, and predictive modeling improve data quality, helping machine learning models perform better and produce trustworthy insights.
Missing values occur when no data is stored for a variable in an observation. In machine learning, missing data can significantly impact model training, leading to biased results or complete algorithm failure. Understanding why data is missing and choosing appropriate handling strategies is essential for building robust predictive models.
Before handling missing values, understanding why they occur helps choose the right strategy:
Missing Completely at Random (MCAR): The probability of missing data is unrelated to any observed or unobserved data. For example, a survey respondent accidentally skips a question.
Missing at Random (MAR): The probability of missing data is related to observed data but not to the missing values themselves. For example, younger people might be less likely to report income, but among people of the same age, missingness is random.
Missing Not at Random (MNAR): The probability of missing data is related to the missing values themselves. For example, people with very high incomes might refuse to disclose their earnings.
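The three mechanisms can be made concrete with a small simulation (illustrative only; the data, probabilities, and age cutoff below are assumptions, not from any real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 70, n)
income = 30_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 10% chance of being missing
mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: respondents under 35 are more likely to skip the income question,
# but missingness depends only on the observed age column
mar = df["income"].mask(rng.random(n) < np.where(df["age"] < 35, 0.30, 0.05))

# MNAR: high earners refuse to disclose income, so missingness
# depends on the (unobserved) income value itself
mnar = df["income"].mask(rng.random(n) < np.where(df["income"] > 70_000, 0.40, 0.05))

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

Only MCAR can be diagnosed from the observed data alone; distinguishing MAR from MNAR requires assumptions or domain knowledge, which is why the mechanism matters when choosing a handling strategy.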
The first step in handling missing data is proper detection.
import pandas as pd
import numpy as np
# Create sample dataset with missing values
df = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50],
'income': [50000, np.nan, 75000, np.nan, 100000],
'education': ['Bachelor', 'Master', None, 'PhD', 'Bachelor']
})
# Check for missing values
print(df.isnull().sum())
The isnull() method returns a boolean DataFrame indicating missing positions. Combined with sum(), it counts missing values per column.
# Calculate missing percentage
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(f"\nMissing Percentage:\n{missing_percentage}")
Knowing the percentage of missing data helps determine whether a column should be dropped entirely or imputed.
# Identify rows with any missing values
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing values:\n{rows_with_missing}")
This code filters rows containing at least one missing value, helping you understand the extent of data incompleteness.
Remove entire rows containing any missing values.
# Drop rows with any missing values
df_cleaned = df.dropna()
print(f"Original rows: {len(df)}")
print(f"After dropna: {len(df_cleaned)}")
The dropna() function removes rows with missing values. This approach is simple but can significantly reduce dataset size and may introduce bias if data is not MCAR.
When to Use: When missing data is minimal (less than 5%) and appears to be MCAR.
Remove features with excessive missing values.
# Drop columns with more than 50% missing values
threshold = 0.5
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_reduced = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {list(cols_to_drop)}")
This code calculates the missing proportion for each column and drops those exceeding the threshold. Features with too many missing values provide limited predictive value.
When to Use: When a feature has very high missingness and is not critical for analysis.
Replace missing values with the column mean.
# Mean imputation
df['age_mean_imputed'] = df['age'].fillna(df['age'].mean())
print(df[['age', 'age_mean_imputed']])
The fillna() method replaces NaN values with a specified value—in this case, the column mean. This preserves the overall average but reduces variance.
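The variance-reduction effect is easy to verify on synthetic data (a sketch with made-up numbers; the 30% missing rate is an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(50, 10, 1_000))
s_missing = s.mask(rng.random(len(s)) < 0.3)  # knock out ~30% of values
s_imputed = s_missing.fillna(s_missing.mean())

# The mean is preserved, but the spread shrinks because every
# imputed value sits exactly at the center of the distribution
print(f"Mean: original {s.mean():.2f} vs imputed {s_imputed.mean():.2f}")
print(f"Std:  original {s.std():.2f} vs imputed {s_imputed.std():.2f}")
```

This shrinkage is why mean imputation can understate uncertainty in downstream statistics when the missing fraction is large.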
Replace missing values with the column median.
# Median imputation (better for skewed data)
df['income_median_imputed'] = df['income'].fillna(df['income'].median())
print(df[['income', 'income_median_imputed']])
Median imputation is more robust to outliers than mean imputation. Use it when your data has significant skewness or extreme values.
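A tiny example (hypothetical incomes) shows why the median is the safer fill value when an outlier is present:

```python
import numpy as np
import pandas as pd

# Skewed incomes: one extreme value drags the mean far above typical values
income = pd.Series([30_000, 35_000, 40_000, 45_000, np.nan, 1_000_000])

print(f"mean:   {income.mean():,.0f}")    # 230,000 -- inflated by the outlier
print(f"median: {income.median():,.0f}")  # 40,000  -- near the typical value

filled_mean = income.fillna(income.mean())
filled_median = income.fillna(income.median())
```

Filling with the mean would assign the missing person an income of 230,000, while the median fill of 40,000 is far more representative of a typical record.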
Replace missing values with the most frequent category.
# Mode imputation for categorical data
mode_value = df['education'].mode()[0]
df['education_imputed'] = df['education'].fillna(mode_value)
print(df[['education', 'education_imputed']])
The mode() function returns the most common value. For categorical variables, mode imputation maintains the dominant category distribution.
Real-World Example: In customer segmentation, if 65% of customers have a Bachelor's degree and education data is missing for some records, mode imputation assigns "Bachelor", maintaining the observed distribution pattern.
Scikit-learn provides a standardized imputation interface that integrates with machine learning pipelines.
from sklearn.impute import SimpleImputer
# Create numerical imputer with mean strategy
num_imputer = SimpleImputer(strategy='mean')
# Fit and transform numerical columns
numerical_data = df[['age', 'income']].values
imputed_numerical = num_imputer.fit_transform(numerical_data)
print("Imputed numerical data:")
print(imputed_numerical)
The SimpleImputer class provides consistent imputation across training and test datasets. The fit_transform() method learns parameters from data and applies the transformation.
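The key benefit is that parameters are learned from training data only and then reused on unseen data, avoiding leakage. A minimal sketch (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[25.0, 50_000], [30.0, np.nan], [np.nan, 75_000], [45.0, 100_000]])
test = np.array([[np.nan, np.nan], [50.0, 60_000]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)            # learn column means from training data only
print(imputer.statistics_)    # the per-column means that will be used

test_imputed = imputer.transform(test)  # same training means applied to test data
print(test_imputed)
```

Computing fill values from the test set instead would let information leak across the split, which is exactly what the fit/transform separation prevents.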
# Create categorical imputer with most_frequent strategy
cat_imputer = SimpleImputer(strategy='most_frequent')
# Reshape and impute categorical data
categorical_data = df[['education']].values
imputed_categorical = cat_imputer.fit_transform(categorical_data)
print("Imputed categorical data:")
print(imputed_categorical)
For categorical variables, use strategy='most_frequent' to replace missing values with the mode.
KNN imputation fills missing values based on similar observations.
from sklearn.impute import KNNImputer
# Create sample data
data = pd.DataFrame({
'feature1': [1.0, 2.0, np.nan, 4.0, 5.0],
'feature2': [2.0, np.nan, 3.0, 4.0, 5.0],
'feature3': [1.0, 2.0, 3.0, np.nan, 5.0]
})
# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
imputed_data = knn_imputer.fit_transform(data)
print("KNN Imputed Data:")
print(pd.DataFrame(imputed_data, columns=data.columns))
KNN imputation finds the k most similar complete rows and uses their average to fill missing values. This method considers relationships between features, producing more realistic imputations.
When to Use: When features are correlated and you want imputations that reflect data relationships.
Multiple Imputation by Chained Equations (MICE) models each feature with missing values as a function of the other features. Scikit-learn's IterativeImputer, inspired by MICE, implements this approach.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create iterative imputer
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
# Apply imputation
iterative_imputed = iterative_imputer.fit_transform(data)
print("Iterative Imputed Data:")
print(pd.DataFrame(iterative_imputed, columns=data.columns))
Iterative imputation uses regression models to predict missing values based on other features. It iterates through features multiple times, refining predictions with each pass.
Sometimes, the fact that data is missing carries information. Create indicator variables to capture this.
# Create missing indicator
df['income_was_missing'] = df['income'].isnull().astype(int)
# Then impute the original column
df['income_imputed'] = df['income'].fillna(df['income'].median())
print(df[['income', 'income_imputed', 'income_was_missing']])
This approach preserves information about which values were originally missing. The model can learn whether missingness itself is predictive of the target variable.
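Scikit-learn can produce the same indicator-plus-imputation pattern in one step via SimpleImputer's add_indicator flag, which is convenient inside pipelines (a sketch on made-up data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[50_000.0], [np.nan], [75_000.0], [np.nan], [100_000.0]])

# add_indicator=True appends a binary "was missing" column alongside
# the imputed values, so the model sees both signals
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)  # column 0: imputed income, column 1: missingness indicator
```

The output has one extra column per feature that contained missing values during fit, with 1 marking originally missing entries.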
Sometimes domain knowledge suggests the best imputation value.
# Example: Missing "years_employed" for unemployed people
df_employment = pd.DataFrame({
'employment_status': ['employed', 'unemployed', 'employed', 'unemployed'],
'years_employed': [5.0, np.nan, 3.0, np.nan]
})
# Impute 0 for unemployed people
df_employment['years_employed_fixed'] = df_employment.apply(
lambda row: 0 if row['employment_status'] == 'unemployed'
else row['years_employed'], axis=1
)
print(df_employment)
Domain-specific imputation uses business logic to determine appropriate values. Here, unemployed individuals logically have zero years of employment.
| Scenario | Recommended Strategy |
|---|---|
| < 5% missing, MCAR | Listwise deletion |
| Numerical, normally distributed | Mean imputation |
| Numerical, skewed | Median imputation |
| Categorical | Mode imputation |
| Features are correlated | KNN or Iterative imputation |
| Missingness is informative | Indicator variables |
| Domain logic applies | Domain-specific imputation |
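One way to operationalize the table is a small dispatch helper. This is an illustrative heuristic, not a standard recipe; the 5% and skewness thresholds are assumptions you should tune for your data:

```python
import pandas as pd

def suggest_strategy(s: pd.Series, skew_threshold: float = 1.0) -> str:
    """Suggest an imputation strategy for one column (heuristic sketch)."""
    missing_pct = s.isna().mean()
    if missing_pct == 0:
        return "none needed"
    if missing_pct < 0.05:
        return "listwise deletion (if MCAR)"
    if s.dtype == object or isinstance(s.dtype, pd.CategoricalDtype):
        return "mode imputation"
    if abs(s.skew()) > skew_threshold:
        return "median imputation"
    return "mean imputation"

print(suggest_strategy(pd.Series([1.0, 2.0, None, 100.0])))  # skewed numeric
print(suggest_strategy(pd.Series(["a", None, "a", "b"])))    # categorical
```

The helper only covers the simple table rows; choosing KNN or iterative imputation still requires inspecting feature correlations, and indicator variables require judgment about whether missingness is informative.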
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Define columns
numerical_features = ['age', 'income']
categorical_features = ['education']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='median'), numerical_features),
('cat', SimpleImputer(strategy='most_frequent'), categorical_features)
]
)
# Apply preprocessing
processed_data = preprocessor.fit_transform(df)
print("Processed data shape:", processed_data.shape)
This pipeline applies different imputation strategies to numerical and categorical features simultaneously, ensuring consistent preprocessing for machine learning models.
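The preprocessor can also be chained with an estimator so that imputation happens automatically during both training and prediction. A sketch assuming a hypothetical binary target column "purchased" (the data and target are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with a hypothetical binary target ("purchased")
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 50, 35],
    "income": [50_000, np.nan, 75_000, np.nan, 100_000, 60_000],
    "education": ["Bachelor", "Master", None, "PhD", "Bachelor", "Master"],
    "purchased": [0, 1, 0, 1, 1, 0],
})

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["age", "income"]),
    # Categorical branch: impute the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["education"]),
])

model = Pipeline([("prep", preprocessor), ("clf", LogisticRegression())])
model.fit(df.drop(columns="purchased"), df["purchased"])
print(model.predict(df.drop(columns="purchased")))
```

Because imputation lives inside the pipeline, the fill values are learned only from the training fold during cross-validation, which keeps evaluation honest.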
Handling missing values is a critical preprocessing step in machine learning. From simple deletion and mean imputation to advanced techniques like KNN and iterative imputation, choosing the right strategy depends on the amount of missing data, the missingness mechanism, and the relationships between features. Always consider the impact on your model and validate that your chosen approach doesn't introduce significant bias.