VIDHYAI
Lesson 4

Feature Selection Techniques

Feature selection techniques identify the most informative variables in a dataset to improve model accuracy, reduce overfitting, and speed up training. Filter, wrapper, and embedded approaches evaluate feature relevance using statistical tests, model performance, and scores built into the learning algorithm, producing cleaner, faster, and more predictive machine learning models.


What is Feature Selection?

Feature selection is the process of identifying and selecting a subset of relevant features for model construction. In datasets with many features, some variables may be irrelevant, redundant, or noisy. Removing these features improves model performance, reduces training time, and enhances interpretability.


Why is Feature Selection Important?

Model Performance

Irrelevant features add noise that can confuse models and reduce accuracy.

Overfitting Prevention

Fewer features mean simpler models that generalize better to unseen data.

Computational Efficiency

Training on fewer features requires less memory and processing time.

Interpretability

Models with fewer features are easier to understand and explain.

Real-World Example: In predicting house prices, features like "number of bedrooms" and "square footage" are relevant, while "owner's favorite color" is not. Including irrelevant features degrades model performance.


Categories of Feature Selection Methods

  1. Filter Methods: Evaluate features independently of the model
  2. Wrapper Methods: Use model performance to evaluate feature subsets
  3. Embedded Methods: Feature selection built into the model training process

Filter Methods

Filter methods score features based on statistical measures and select top-scoring ones.

Variance Threshold

Remove features with low variance—features that are nearly constant.

import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Sample data
data = pd.DataFrame({
    'feature_a': [1, 1, 1, 1, 1],      # Zero variance
    'feature_b': [1, 2, 3, 4, 5],      # High variance
    'feature_c': [0.1, 0.1, 0.1, 0.2, 0.1],  # Low variance
    'feature_d': [10, 20, 30, 40, 50]  # High variance
})

print("Original Data:")
print(data)
print(f"\nVariances: {data.var().values}")
# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.5)
selected_features = selector.fit_transform(data)

# Get selected column names
selected_columns = data.columns[selector.get_support()]
print(f"\nSelected Features: {list(selected_columns)}")
print(f"Transformed Data:\n{selected_features}")

VarianceThreshold removes features where variance falls below the specified threshold. Features with zero or near-zero variance provide no discriminative information.


Correlation Analysis

Remove highly correlated features to reduce redundancy.

import seaborn as sns
import matplotlib.pyplot as plt

# Create correlated data
np.random.seed(42)
data_corr = pd.DataFrame({
    'feature_1': np.random.randn(100),
    'feature_2': np.random.randn(100),
    'feature_3': np.random.randn(100)
})
# Create highly correlated feature
data_corr['feature_4'] = data_corr['feature_1'] * 0.9 + np.random.randn(100) * 0.1

# Calculate correlation matrix
correlation_matrix = data_corr.corr()

print("Correlation Matrix:")
print(correlation_matrix)
# Visualize correlations
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
# Function to remove highly correlated features
def remove_correlated_features(df, threshold=0.9):
    corr_matrix = df.corr().abs()
    # Select upper triangle of correlation matrix
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # Find features with correlation greater than threshold
    to_drop = [column for column in upper_triangle.columns 
               if any(upper_triangle[column] > threshold)]
    return df.drop(columns=to_drop), to_drop

# Apply correlation filter
df_reduced, dropped = remove_correlated_features(data_corr, threshold=0.8)
print(f"\nDropped features: {dropped}")
print(f"Remaining features: {list(df_reduced.columns)}")

This function identifies and removes features that are highly correlated with others, keeping only one feature from each correlated pair.


Chi-Square Test for Categorical Features

The chi-square test measures the dependence between a categorical feature and the target.

from sklearn.feature_selection import chi2, SelectKBest

# Sample classification data
np.random.seed(42)
X = pd.DataFrame({
    'feature_1': np.random.randint(0, 5, 100),
    'feature_2': np.random.randint(0, 3, 100),
    'feature_3': np.random.randint(0, 10, 100),
    'feature_4': np.random.randint(0, 2, 100)
})
y = np.random.randint(0, 2, 100)

# Apply Chi-Square test
chi_selector = SelectKBest(chi2, k=2)
X_selected = chi_selector.fit_transform(X, y)

# Get scores and selected features
scores = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi_selector.scores_,
    'P_Value': chi_selector.pvalues_
})
print("Chi-Square Scores:")
print(scores.sort_values('Chi2_Score', ascending=False))

print(f"\nSelected Features: {list(X.columns[chi_selector.get_support()])}")

SelectKBest with chi2 selects the k features with highest chi-square scores. Higher scores indicate stronger association with the target variable.

Note: Chi-square requires non-negative feature values.
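If your data contains negative values, a common workaround is to scale each feature into a non-negative range first, for example with MinMaxScaler. A minimal sketch with synthetic data (the feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2, SelectKBest

# Synthetic data with negative values, which chi2 would reject as-is
rng = np.random.RandomState(42)
X_signed = pd.DataFrame(rng.randn(100, 3), columns=['f1', 'f2', 'f3'])
y_cls = rng.randint(0, 2, 100)

# Scale each feature into [0, 1] so chi2's non-negativity requirement holds
X_scaled = MinMaxScaler().fit_transform(X_signed)

chi_sel = SelectKBest(chi2, k=2).fit(X_scaled, y_cls)
print(list(X_signed.columns[chi_sel.get_support()]))
```

Keep in mind that rescaling changes the magnitudes chi2 sees, so scores are only comparable within the scaled data.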


Mutual Information

Mutual information measures the dependency between variables, capturing both linear and non-linear relationships.

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# For classification
mi_scores = mutual_info_classif(X, y, random_state=42)

mi_df = pd.DataFrame({
    'Feature': X.columns,
    'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)

print("Mutual Information Scores:")
print(mi_df)
# Select features using mutual information
mi_selector = SelectKBest(mutual_info_classif, k=2)
X_mi_selected = mi_selector.fit_transform(X, y)

print(f"\nSelected by MI: {list(X.columns[mi_selector.get_support()])}")

Mutual information captures non-linear dependencies that correlation might miss. Use mutual_info_regression for regression tasks.
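For a regression target, the same pattern applies with mutual_info_regression. A short sketch on a synthetic regression dataset (names and sizes are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

# Synthetic regression data: 5 of 8 features carry signal
X_r, y_r = make_regression(n_samples=200, n_features=8,
                           n_informative=5, random_state=42)
X_r = pd.DataFrame(X_r, columns=[f'feature_{i}' for i in range(8)])

# Non-negative dependency estimates between each feature and the target
mi_reg = mutual_info_regression(X_r, y_r, random_state=42)
mi_reg_df = pd.DataFrame({'Feature': X_r.columns, 'MI_Score': mi_reg})
print(mi_reg_df.sort_values('MI_Score', ascending=False))
```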


ANOVA F-Test for Numerical Features

ANOVA F-test evaluates if the means of a feature differ significantly across target classes.

from sklearn.feature_selection import f_classif

# Apply ANOVA F-test
f_selector = SelectKBest(f_classif, k=2)
X_f_selected = f_selector.fit_transform(X, y)

f_scores = pd.DataFrame({
    'Feature': X.columns,
    'F_Score': f_selector.scores_,
    'P_Value': f_selector.pvalues_
}).sort_values('F_Score', ascending=False)

print("ANOVA F-Test Scores:")
print(f_scores)

Higher F-scores indicate features with significantly different means across classes, suggesting predictive power.


Wrapper Methods

Wrapper methods evaluate feature subsets by training models and measuring performance.

Recursive Feature Elimination (RFE)

RFE recursively removes the least important features based on model coefficients or importance.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create sample dataset
X, y = make_classification(n_samples=200, n_features=10, 
                           n_informative=5, n_redundant=2,
                           random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])

# Create model and RFE selector
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=5)

# Fit RFE
rfe.fit(X, y)

# Results
rfe_results = pd.DataFrame({
    'Feature': X.columns,
    'Selected': rfe.support_,
    'Ranking': rfe.ranking_
}).sort_values('Ranking')

print("RFE Results:")
print(rfe_results)

RFE fits the model, ranks features by importance, removes the least important, and repeats. Lower ranking indicates higher importance.

# Get selected features
selected_features = X.columns[rfe.support_]
print(f"\nSelected Features: {list(selected_features)}")

RFE with Cross-Validation

from sklearn.feature_selection import RFECV

# RFE with cross-validation to find optimal number of features
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {list(X.columns[rfecv.support_])}")

RFECV automatically determines the optimal number of features using cross-validation, eliminating the need to specify n_features_to_select.

# Plot number of features vs. CV score
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), 
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFECV - Optimal Feature Count')
plt.show()

Sequential Feature Selection

Forward and backward selection strategies for finding optimal feature subsets.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

# Forward Selection
model_rf = RandomForestClassifier(n_estimators=50, random_state=42)

forward_selector = SequentialFeatureSelector(
    model_rf, 
    n_features_to_select=5,
    direction='forward',
    cv=3
)
forward_selector.fit(X, y)

print("Forward Selection Results:")
print(f"Selected: {list(X.columns[forward_selector.get_support()])}")

Forward selection starts with no features and adds the best one at each step.

# Backward Selection
backward_selector = SequentialFeatureSelector(
    model_rf,
    n_features_to_select=5,
    direction='backward',
    cv=3
)
backward_selector.fit(X, y)

print("\nBackward Selection Results:")
print(f"Selected: {list(X.columns[backward_selector.get_support()])}")

Backward selection starts with all features and removes the least important one at each step.


Embedded Methods

Embedded methods perform feature selection during model training.

L1 Regularization (Lasso)

L1 regularization drives unimportant feature coefficients to exactly zero.

from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

# For regression with Lasso
from sklearn.datasets import make_regression

X_reg, y_reg = make_regression(n_samples=200, n_features=10, 
                                n_informative=5, random_state=42)
X_reg = pd.DataFrame(X_reg, columns=[f'feature_{i}' for i in range(10)])

# Fit Lasso model
lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_reg, y_reg)

# View coefficients
coef_df = pd.DataFrame({
    'Feature': X_reg.columns,
    'Coefficient': lasso.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("Lasso Coefficients:")
print(coef_df)
print(f"\nFeatures with non-zero coefficients: {sum(lasso.coef_ != 0)}")

Features with zero coefficients are effectively removed by Lasso regularization.

# SelectFromModel with L1 penalty
l1_selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='saga', max_iter=1000),
    threshold='mean'
)
l1_selector.fit(X, y)

print(f"\nL1 Selected Features: {list(X.columns[l1_selector.get_support()])}")

SelectFromModel wraps any model with feature importance attributes, selecting features above a threshold.
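The threshold argument accepts a number, 'mean', or 'median' (and scaled strings such as '1.5*mean'). A self-contained sketch using a Lasso estimator on synthetic data, assuming the same kind of regression setup as above:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X_reg, y_reg = make_regression(n_samples=200, n_features=10,
                               n_informative=5, random_state=42)
X_reg = pd.DataFrame(X_reg, columns=[f'feature_{i}' for i in range(10)])

# Keep features whose |coefficient| is at least the median importance
sfm = SelectFromModel(Lasso(alpha=0.1, random_state=42), threshold='median')
sfm.fit(X_reg, y_reg)
print(list(X_reg.columns[sfm.get_support()]))
```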


Tree-Based Feature Importance

Tree-based models provide built-in feature importance scores.

from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Get feature importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Random Forest Feature Importances:")
print(importance_df)
# Visualize importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importances')
plt.gca().invert_yaxis()
plt.show()
# Select features using importance threshold
rf_selector = SelectFromModel(rf_model, threshold='mean')
rf_selector.fit(X, y)

print(f"\nRF Selected Features: {list(X.columns[rf_selector.get_support()])}")

Features with importance above the mean (or specified threshold) are retained.


Permutation Importance

Permutation importance measures how much model performance decreases when a feature is randomly shuffled.

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model
rf_model.fit(X_train, y_train)

# Calculate permutation importance
perm_importance = permutation_importance(
    rf_model, X_test, y_test, 
    n_repeats=10, random_state=42
)

perm_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance_Mean': perm_importance.importances_mean,
    'Importance_Std': perm_importance.importances_std
}).sort_values('Importance_Mean', ascending=False)

print("Permutation Importance:")
print(perm_df)

Permutation importance is model-agnostic and reflects the actual impact of features on predictions.
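One hedged way to turn these scores into a selection rule is to keep only features whose mean importance exceeds its own standard deviation, i.e. features whose effect is distinguishable from shuffle noise. A self-contained sketch on synthetic data (the rule of thumb is an illustrative choice, not a fixed convention):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_p, y_p = make_classification(n_samples=200, n_features=10,
                               n_informative=5, random_state=42)
X_p = pd.DataFrame(X_p, columns=[f'feature_{i}' for i in range(10)])
X_tr, X_te, y_tr, y_te = train_test_split(X_p, y_p, test_size=0.3,
                                          random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

# Keep features whose mean score drop exceeds its standard deviation
keep = X_p.columns[perm.importances_mean > perm.importances_std]
print(list(keep))
```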


Comparing Feature Selection Methods

# Function to compare methods
def compare_feature_selection(X, y):
    results = {}
    
    # Variance Threshold
    var_sel = VarianceThreshold(threshold=0.1)
    var_sel.fit(X)
    results['Variance'] = list(X.columns[var_sel.get_support()])
    
    # Mutual Information
    mi_sel = SelectKBest(mutual_info_classif, k=5)
    mi_sel.fit(X, y)
    results['Mutual_Info'] = list(X.columns[mi_sel.get_support()])
    
    # RFE
    rfe_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
    rfe_sel.fit(X, y)
    results['RFE'] = list(X.columns[rfe_sel.support_])
    
    # Random Forest
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf_sel = SelectFromModel(rf, max_features=5)
    rf_sel.fit(X, y)
    results['RF_Importance'] = list(X.columns[rf_sel.get_support()])
    
    return pd.DataFrame(dict([(k, pd.Series(v)) for k, v in results.items()]))

# Compare methods
comparison = compare_feature_selection(X, y)
print("Feature Selection Comparison:")
print(comparison)

Comparing multiple methods helps identify consistently important features across different approaches.
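One simple way to aggregate such a comparison is to count how often each feature is chosen across methods. A minimal sketch using two of the selectors above on synthetic data:

```python
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X_c, y_c = make_classification(n_samples=200, n_features=10,
                               n_informative=5, random_state=42)
X_c = pd.DataFrame(X_c, columns=[f'feature_{i}' for i in range(10)])

rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=5).fit(X_c, y_c)
mi = SelectKBest(mutual_info_classif, k=5).fit(X_c, y_c)

# Count votes across both methods; features chosen by both are strong candidates
votes = Counter(list(X_c.columns[rfe.support_]) +
                list(X_c.columns[mi.get_support()]))
consensus = [f for f, v in votes.items() if v == 2]
print(consensus)
```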


Feature Selection Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create complete pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42),
        max_features=5
    )),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Fit pipeline
pipeline.fit(X_train, y_train)

# Evaluate
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Train Accuracy: {train_score:.3f}")
print(f"Test Accuracy: {test_score:.3f}")

Pipelines ensure feature selection is part of the cross-validation process, preventing data leakage.
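To make the leakage point concrete: cross-validating the whole pipeline re-runs feature selection inside each training fold, so the held-out fold never influences which features are chosen. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = make_classification(n_samples=200, n_features=10,
                                 n_informative=5, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42),
        max_features=5)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Scaling and selection are refit inside each training fold -- no leakage
scores = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```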


Best Practices for Feature Selection

  • Start Simple: Begin with filter methods before moving to computationally expensive wrapper methods
  • Use Domain Knowledge: Combine statistical methods with domain expertise
  • Validate Selections: Use cross-validation to ensure selected features generalize
  • Consider Stability: Run selection multiple times to identify consistently chosen features
  • Avoid Data Leakage: Perform feature selection within cross-validation folds
  • Document Decisions: Record which features were selected and why
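The stability point above can be checked by rerunning a selector on bootstrap resamples and counting how often each feature survives. A rough sketch, using RFE on synthetic data (the number of rounds is an arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_s, y_s = make_classification(n_samples=200, n_features=10,
                               n_informative=5, random_state=42)
X_s = pd.DataFrame(X_s, columns=[f'feature_{i}' for i in range(10)])

rng = np.random.RandomState(0)
counts = pd.Series(0, index=X_s.columns)
n_rounds = 10
for _ in range(n_rounds):
    idx = rng.randint(0, len(X_s), len(X_s))  # bootstrap resample with replacement
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
    rfe.fit(X_s.iloc[idx], y_s[idx])
    counts[X_s.columns[rfe.support_]] += 1

# Features selected in most rounds are stable choices
print(counts.sort_values(ascending=False))
```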

Summary

Feature selection is essential for building efficient, interpretable, and well-performing machine learning models. Filter methods like variance threshold and mutual information provide quick initial screening. Wrapper methods like RFE optimize feature subsets based on model performance. Embedded methods leverage built-in model mechanisms like L1 regularization and tree-based importance. Combining multiple approaches and validating with cross-validation leads to robust feature selection that improves model quality.
