Handling missing values involves identifying, analyzing, and addressing gaps in a dataset to ensure accuracy and reliability. Techniques such as deletion, imputation, and predictive modeling improve data quality, helping machine learning models perform better and produce trustworthy insights.
Missing values occur when no data is stored for a variable in an observation. In machine learning, missing data can significantly impact model training, leading to biased results or complete algorithm failure. Understanding why data is missing and choosing appropriate handling strategies is essential for building robust predictive models.
Before handling missing values, understanding why they occur helps choose the right strategy:
Missing Completely at Random (MCAR): The probability of missing data is unrelated to any observed or unobserved data. For example, a survey respondent accidentally skips a question.
Missing at Random (MAR): The probability of missing data is related to observed data but not to the missing values themselves. For example, younger people might be less likely to report income, but among people of the same age, missingness is random.
Missing Not at Random (MNAR): The probability of missing data is related to the missing values themselves. For example, people with very high incomes might refuse to disclose their earnings.
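The three mechanisms can be made concrete with a small simulation (illustrative only; the data, probabilities, and age cutoff below are assumptions, not from any real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 70, n)
income = 30_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 10% chance of being missing
mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: respondents under 35 are more likely to skip the income question,
# but missingness depends only on the observed age column
mar = df["income"].mask(rng.random(n) < np.where(df["age"] < 35, 0.30, 0.05))

# MNAR: high earners refuse to disclose income, so missingness
# depends on the (unobserved) income value itself
mnar = df["income"].mask(rng.random(n) < np.where(df["income"] > 70_000, 0.40, 0.05))

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

Only MCAR can be diagnosed from the observed data alone; distinguishing MAR from MNAR requires assumptions or domain knowledge, which is why the mechanism matters when choosing a handling strategy.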
The first step in handling missing data is proper detection.
import pandas as pd
import numpy as np
# Create sample dataset with missing values
df = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50],
'income': [50000, np.nan, 75000, np.nan, 100000],
'education': ['Bachelor', 'Master', None, 'PhD', 'Bachelor']
})
# Check for missing values
print(df.isnull().sum())
The isnull() method returns a boolean DataFrame indicating missing positions. Combined with sum(), it counts missing values per column.
# Calculate missing percentage
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(f"\nMissing Percentage:\n{missing_percentage}")
Knowing the percentage of missing data helps determine whether a column should be dropped entirely or imputed.
# Identify rows with any missing values
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing values:\n{rows_with_missing}")
This code filters rows containing at least one missing value, helping you understand the extent of data incompleteness.
Remove entire rows containing any missing values.
# Drop rows with any missing values
df_cleaned = df.dropna()
print(f"Original rows: {len(df)}")
print(f"After dropna: {len(df_cleaned)}")
The dropna() function removes rows with missing values. This approach is simple but can significantly reduce dataset size and may introduce bias if data is not MCAR.
When to Use: When missing data is minimal (less than 5%) and appears to be MCAR.
Remove features with excessive missing values.
# Drop columns with more than 50% missing values
threshold = 0.5
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_reduced = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {list(cols_to_drop)}")
This code calculates the missing proportion for each column and drops those exceeding the threshold. Features with too many missing values provide limited predictive value.
When to Use: When a feature has very high missingness and is not critical for analysis.
Replace missing values with the column mean.
# Mean imputation
df['age_mean_imputed'] = df['age'].fillna(df['age'].mean())
print(df[['age', 'age_mean_imputed']])
The fillna() method replaces NaN values with a specified value—in this case, the column mean. This preserves the overall average but reduces variance.
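The variance-reduction effect is easy to verify on synthetic data (a sketch with made-up numbers; the 30% missing rate is an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(50, 10, 1_000))
s_missing = s.mask(rng.random(len(s)) < 0.3)  # knock out ~30% of values
s_imputed = s_missing.fillna(s_missing.mean())

# The mean is preserved, but the spread shrinks because every
# imputed value sits exactly at the center of the distribution
print(f"Mean: original {s.mean():.2f} vs imputed {s_imputed.mean():.2f}")
print(f"Std:  original {s.std():.2f} vs imputed {s_imputed.std():.2f}")
```

This shrinkage is why mean imputation can understate uncertainty in downstream statistics when the missing fraction is large.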
Replace missing values with the column median.
# Median imputation (better for skewed data)
df['income_median_imputed'] = df['income'].fillna(df['income'].median())
print(df[['income', 'income_median_imputed']])
Median imputation is more robust to outliers than mean imputation. Use it when your data has significant skewness or extreme values.
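A tiny example (hypothetical incomes) shows why the median is the safer fill value when an outlier is present:

```python
import numpy as np
import pandas as pd

# Skewed incomes: one extreme value drags the mean far above typical values
income = pd.Series([30_000, 35_000, 40_000, 45_000, np.nan, 1_000_000])

print(f"mean:   {income.mean():,.0f}")    # 230,000 -- inflated by the outlier
print(f"median: {income.median():,.0f}")  # 40,000  -- near the typical value

filled_mean = income.fillna(income.mean())
filled_median = income.fillna(income.median())
```

Filling with the mean would assign the missing person an income of 230,000, while the median fill of 40,000 is far more representative of a typical record.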
Replace missing values with the most frequent category.
# Mode imputation for categorical data
mode_value = df['education'].mode()[0]
df['education_imputed'] = df['education'].fillna(mode_value)
print(df[['education', 'education_imputed']])
The mode() function returns the most common value. For categorical variables, mode imputation maintains the dominant category distribution.
Real-World Example: In customer segmentation, if 65% of customers have a Bachelor's degree and education data is missing for some records, mode imputation assigns "Bachelor", maintaining the observed distribution pattern.
Scikit-learn provides a standardized imputation interface that integrates with machine learning pipelines.
from sklearn.impute import SimpleImputer
# Create numerical imputer with mean strategy
num_imputer = SimpleImputer(strategy='mean')
# Fit and transform numerical columns
numerical_data = df[['age', 'income']].values
imputed_numerical = num_imputer.fit_transform(numerical_data)
print("Imputed numerical data:")
print(imputed_numerical)
The SimpleImputer class provides consistent imputation across training and test datasets. The fit_transform() method learns parameters from data and applies the transformation.
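The key benefit is that parameters are learned from training data only and then reused on unseen data, avoiding leakage. A minimal sketch (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[25.0, 50_000], [30.0, np.nan], [np.nan, 75_000], [45.0, 100_000]])
test = np.array([[np.nan, np.nan], [50.0, 60_000]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)            # learn column means from training data only
print(imputer.statistics_)    # the per-column means that will be used

test_imputed = imputer.transform(test)  # same training means applied to test data
print(test_imputed)
```

Computing fill values from the test set instead would let information leak across the split, which is exactly what the fit/transform separation prevents.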
# Create categorical imputer with most_frequent strategy
cat_imputer = SimpleImputer(strategy='most_frequent')
# Reshape and impute categorical data
categorical_data = df[['education']].values
imputed_categorical = cat_imputer.fit_transform(categorical_data)
print("Imputed categorical data:")
print(imputed_categorical)
For categorical variables, use strategy='most_frequent' to replace missing values with the mode.
KNN imputation fills missing values based on similar observations.
from sklearn.impute import KNNImputer
# Create sample data
data = pd.DataFrame({
'feature1': [1.0, 2.0, np.nan, 4.0, 5.0],
'feature2': [2.0, np.nan, 3.0, 4.0, 5.0],
'feature3': [1.0, 2.0, 3.0, np.nan, 5.0]
})
# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
imputed_data = knn_imputer.fit_transform(data)
print("KNN Imputed Data:")
print(pd.DataFrame(imputed_data, columns=data.columns))
KNN imputation finds the k most similar complete rows and uses their average to fill missing values. This method considers relationships between features, producing more realistic imputations.
When to Use: When features are correlated and you want imputations that reflect data relationships.
Multiple Imputation by Chained Equations (MICE) models each feature with missing values as a function of the other features. Scikit-learn's IterativeImputer, inspired by MICE, implements this approach.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create iterative imputer
iterative_imputer = IterativeImputer(max_iter=10, random_state=42)
# Apply imputation
iterative_imputed = iterative_imputer.fit_transform(data)
print("Iterative Imputed Data:")
print(pd.DataFrame(iterative_imputed, columns=data.columns))
Iterative imputation uses regression models to predict missing values based on other features. It iterates through features multiple times, refining predictions with each pass.
Sometimes, the fact that data is missing carries information. Create indicator variables to capture this.
# Create missing indicator
df['income_was_missing'] = df['income'].isnull().astype(int)
# Then impute the original column
df['income_imputed'] = df['income'].fillna(df['income'].median())
print(df[['income', 'income_imputed', 'income_was_missing']])
This approach preserves information about which values were originally missing. The model can learn whether missingness itself is predictive of the target variable.
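Scikit-learn can produce the same indicator-plus-imputation pattern in one step via SimpleImputer's add_indicator flag, which is convenient inside pipelines (a sketch on made-up data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[50_000.0], [np.nan], [75_000.0], [np.nan], [100_000.0]])

# add_indicator=True appends a binary "was missing" column alongside
# the imputed values, so the model sees both signals
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)  # column 0: imputed income, column 1: missingness indicator
```

The output has one extra column per feature that contained missing values during fit, with 1 marking originally missing entries.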
Sometimes domain knowledge suggests the best imputation value.
# Example: Missing "years_employed" for unemployed people
df_employment = pd.DataFrame({
'employment_status': ['employed', 'unemployed', 'employed', 'unemployed'],
'years_employed': [5.0, np.nan, 3.0, np.nan]
})
# Impute 0 for unemployed people
df_employment['years_employed_fixed'] = df_employment.apply(
lambda row: 0 if row['employment_status'] == 'unemployed'
else row['years_employed'], axis=1
)
print(df_employment)
Domain-specific imputation uses business logic to determine appropriate values. Here, unemployed individuals logically have zero years of employment.
| Scenario | Recommended Strategy |
|---|---|
| < 5% missing, MCAR | Listwise deletion |
| Numerical, normally distributed | Mean imputation |
| Numerical, skewed | Median imputation |
| Categorical | Mode imputation |
| Features are correlated | KNN or Iterative imputation |
| Missingness is informative | Indicator variables |
| Domain logic applies | Domain-specific imputation |
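One way to operationalize the table is a small dispatch helper. This is an illustrative heuristic, not a standard recipe; the 5% and skewness thresholds are assumptions you should tune for your data:

```python
import pandas as pd

def suggest_strategy(s: pd.Series, skew_threshold: float = 1.0) -> str:
    """Suggest an imputation strategy for one column (heuristic sketch)."""
    missing_pct = s.isna().mean()
    if missing_pct == 0:
        return "none needed"
    if missing_pct < 0.05:
        return "listwise deletion (if MCAR)"
    if s.dtype == object or isinstance(s.dtype, pd.CategoricalDtype):
        return "mode imputation"
    if abs(s.skew()) > skew_threshold:
        return "median imputation"
    return "mean imputation"

print(suggest_strategy(pd.Series([1.0, 2.0, None, 100.0])))  # skewed numeric
print(suggest_strategy(pd.Series(["a", None, "a", "b"])))    # categorical
```

The helper only covers the simple table rows; choosing KNN or iterative imputation still requires inspecting feature correlations, and indicator variables require judgment about whether missingness is informative.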
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Define columns
numerical_features = ['age', 'income']
categorical_features = ['education']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', SimpleImputer(strategy='median'), numerical_features),
('cat', SimpleImputer(strategy='most_frequent'), categorical_features)
]
)
# Apply preprocessing
processed_data = preprocessor.fit_transform(df)
print("Processed data shape:", processed_data.shape)
This pipeline applies different imputation strategies to numerical and categorical features simultaneously, ensuring consistent preprocessing for machine learning models.
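The preprocessor can also be chained with an estimator so that imputation happens automatically during both training and prediction. A sketch assuming a hypothetical binary target column "purchased" (the data and target are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with a hypothetical binary target ("purchased")
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 50, 35],
    "income": [50_000, np.nan, 75_000, np.nan, 100_000, 60_000],
    "education": ["Bachelor", "Master", None, "PhD", "Bachelor", "Master"],
    "purchased": [0, 1, 0, 1, 1, 0],
})

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["age", "income"]),
    # Categorical branch: impute the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["education"]),
])

model = Pipeline([("prep", preprocessor), ("clf", LogisticRegression())])
model.fit(df.drop(columns="purchased"), df["purchased"])
print(model.predict(df.drop(columns="purchased")))
```

Because imputation lives inside the pipeline, the fill values are learned only from the training fold during cross-validation, which keeps evaluation honest.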
Handling missing values is a critical preprocessing step in machine learning. From simple deletion and mean imputation to advanced techniques like KNN and iterative imputation, choosing the right strategy depends on the amount of missing data, the missingness mechanism, and the relationships between features. Always consider the impact on your model and validate that your chosen approach doesn't introduce significant bias.