VIDHYAI
Your Gateway to AI Knowledge

Lesson 1

Feature Scaling

Feature scaling is the process of transforming data values so they fit within a similar range, improving model stability and performance. Normalization rescales values to a bounded range, typically between 0 and 1, and suits algorithms that expect bounded input. Standardization transforms data to have a mean of 0 and a standard deviation of 1, making it a good default for most machine learning models that are sensitive to feature magnitudes.


What is Feature Scaling?

Feature scaling is the process of transforming numerical features to a similar scale. In machine learning, features often have different ranges—age might range from 0 to 100 while income ranges from 0 to millions. Without scaling, features with larger magnitudes dominate the learning process, leading to suboptimal model performance.


Why is Feature Scaling Important?

Distance-Based Algorithms

Algorithms like K-Nearest Neighbors (KNN), K-Means clustering, and Support Vector Machines (SVM) calculate distances between data points. Features with larger scales contribute more to distance calculations, overshadowing smaller-scale features.

Example: If calculating the distance between two customers using age (20-60) and income (30000-150000), income differences dominate because of its larger numerical range.
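To make this concrete, here is a small sketch with made-up customer values (the population ranges used for scaling are assumptions for illustration) comparing Euclidean distances before and after min-max scaling:

```python
import numpy as np

# Two customers described by (age, income); ages differ by 30 years,
# incomes by 10,000 -- a modest income gap in context.
a = np.array([25.0, 40000.0])
b = np.array([55.0, 50000.0])

# Unscaled, the income term swamps the age term in the distance.
raw_dist = np.linalg.norm(a - b)  # sqrt(30^2 + 10000^2), essentially 10000

# Min-max scale each feature using assumed population ranges:
# age in [20, 60], income in [30000, 150000].
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

a_scaled = np.array([minmax(a[0], 20, 60), minmax(a[1], 30000, 150000)])
b_scaled = np.array([minmax(b[0], 20, 60), minmax(b[1], 30000, 150000)])
scaled_dist = np.linalg.norm(a_scaled - b_scaled)

print(raw_dist)     # ~10000: dominated entirely by income
print(scaled_dist)  # ~0.75: the 30-year age gap now matters most
```

After scaling, both features contribute on comparable terms, and the large age difference is no longer drowned out by raw income magnitudes.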

Gradient Descent Optimization

Neural networks and linear models using gradient descent converge faster when features are on similar scales. Unscaled features cause the optimization landscape to be elongated, requiring more iterations to find the minimum.

Regularization Effects

Regularization techniques penalize large coefficients. Without scaling, features with naturally larger values receive unfairly large penalties.
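A quick illustration of this effect, using Ridge regression on synthetic data (the data and coefficients here are invented purely to show the mechanism):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 500)
income = rng.uniform(30000, 150000, 500)
X = np.column_stack([age, income])

# The target depends equally on both features *in standardized terms*.
y = (age - age.mean()) / age.std() + (income - income.mean()) / income.std()

# Unscaled: the income coefficient is numerically tiny, so the L2 penalty
# falls almost entirely on the age coefficient.
print(Ridge(alpha=1.0).fit(X, y).coef_)

# Standardized: both coefficients are comparable and penalized on
# equal footing.
X_std = StandardScaler().fit_transform(X)
print(Ridge(alpha=1.0).fit(X_std, y).coef_)
```

With standardized features, the two coefficients come out nearly identical, reflecting the features' equal true importance; without scaling, their magnitudes differ by orders of magnitude purely because of units.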


The Two Main Scaling Techniques

Normalization (Min-Max Scaling)

Normalization scales features to a fixed range, typically [0, 1].

Formula:

X_normalized = (X - X_min) / (X_max - X_min)

Standardization (Z-Score Scaling)

Standardization transforms features to have zero mean and unit variance.

Formula:

X_standardized = (X - μ) / σ

Where μ is the mean and σ is the standard deviation.
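As a quick worked example, applying both formulas by hand to a single feature x = [10, 20, 30, 40]:

```python
x = [10, 20, 30, 40]

# Normalization: (x - min) / (max - min)
x_min, x_max = min(x), max(x)
normalized = [(v - x_min) / (x_max - x_min) for v in x]
print(normalized)  # [0.0, 0.333..., 0.666..., 1.0]

# Standardization: (x - mu) / sigma, using the population standard deviation
mu = sum(x) / len(x)                                      # 25.0
sigma = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5   # sqrt(125) ~ 11.18
standardized = [(v - mu) / sigma for v in x]
print(standardized)  # roughly [-1.34, -0.45, 0.45, 1.34]
```

Note how the normalized values land exactly in [0, 1], while the standardized values are symmetric around 0 with no fixed bounds.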


Implementing Normalization in Python

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 70000, 90000, 110000],
    'experience': [1, 3, 5, 8, 12]
})

print("Original Data:")
print(data)

This creates a sample dataset with features at different scales for demonstration.

# Apply Min-Max Normalization
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

# Convert back to DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)
print("\nNormalized Data (0-1 range):")
print(normalized_df)

The MinMaxScaler transforms all values to the range [0, 1]. The fit_transform() method learns the min and max values and applies the transformation.

Custom Range Normalization

# Normalize to custom range [0, 10]
custom_scaler = MinMaxScaler(feature_range=(0, 10))
custom_normalized = custom_scaler.fit_transform(data)

print("\nCustom Range Normalized (0-10):")
print(pd.DataFrame(custom_normalized, columns=data.columns))

You can specify any desired range using the feature_range parameter.
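One convenience worth knowing: a fitted scaler can also undo its transformation via inverse_transform, which is useful for reporting results back in the original units. A small self-contained sketch:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 70000, 90000, 110000]
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

# inverse_transform maps scaled values back to the original units.
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=data.columns)
print(restored)  # matches the original data (up to floating point)
```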

Manual Normalization

# Manual min-max normalization
def normalize(column):
    return (column - column.min()) / (column.max() - column.min())

data_manual_norm = data.apply(normalize)
print("\nManually Normalized:")
print(data_manual_norm)

Understanding the manual calculation helps grasp what the scaler does internally.


Implementing Standardization in Python

from sklearn.preprocessing import StandardScaler

# Apply Standardization
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(data)

standardized_df = pd.DataFrame(standardized_data, columns=data.columns)
print("Standardized Data:")
print(standardized_df)

StandardScaler transforms data to have mean of 0 and standard deviation of 1.

# Verify mean and std of standardized data
print(f"\nMean of standardized features: {standardized_df.mean().values}")
print(f"Std of standardized features: {standardized_df.std().values}")

After standardization, each feature has approximately zero mean and unit variance. Note that pandas' .std() reports values slightly above 1 here because it uses the sample formula (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0).

Manual Standardization

# Manual z-score standardization
def standardize(column):
    # ddof=0 uses the population standard deviation, matching StandardScaler;
    # pandas' default (ddof=1) would give slightly different values.
    return (column - column.mean()) / column.std(ddof=0)

data_manual_std = data.apply(standardize)
print("\nManually Standardized:")
print(data_manual_std)

This shows the underlying z-score calculation performed by StandardScaler.


Normalization vs Standardization: When to Use Which

Use Normalization When:

  • Bounded data required: Neural networks with sigmoid or tanh activation functions expect input in specific ranges
  • Image data: Pixel values are typically normalized to [0, 1]
  • No strong outliers: Normalization is sensitive to outliers because min and max are used
  • Algorithms requiring bounded input: Some algorithms assume features are within specific ranges

# Demonstrating outlier sensitivity
data_with_outlier = pd.DataFrame({
    'value': [10, 15, 12, 14, 11, 200]  # 200 is an outlier
})

normalized_outlier = MinMaxScaler().fit_transform(data_with_outlier)
print("Normalization with outlier:")
print(normalized_outlier.flatten())
# Most values compressed near 0

Notice how the outlier causes most values to cluster near 0, demonstrating normalization's sensitivity to extreme values.

Use Standardization When:

  • Outliers present: Standardization is more robust because it uses mean and standard deviation
  • Algorithms assuming normal distribution: Many statistical techniques assume standardized data
  • SVM, logistic regression, neural networks: These benefit from standardized features
  • PCA and clustering: These algorithms work better with standardized data

# Demonstrating outlier robustness
standardized_outlier = StandardScaler().fit_transform(data_with_outlier)
print("\nStandardization with outlier:")
print(standardized_outlier.flatten())
# Values more spread out

Standardization handles outliers more gracefully, though extreme values still affect the transformation.


Comparison Summary Table

Aspect                | Normalization            | Standardization
----------------------|--------------------------|-------------------------
Output range          | [0, 1] or custom         | No fixed range
Mean after scaling    | Not necessarily 0        | 0
Sensitive to outliers | Yes                      | Less sensitive
Best for              | Neural networks, images  | SVM, linear models, PCA
Preserves zero values | Only if the minimum is 0 | Only if the mean is 0

Practical Machine Learning Pipeline Example

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create sample classification data
np.random.seed(42)
X = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'income': np.random.randint(30000, 150000, 100),
    'score': np.random.uniform(0, 100, 100)
})
y = np.random.randint(0, 2, 100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# IMPORTANT: Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use transform, not fit_transform

print("Training data mean:", X_train_scaled.mean(axis=0))
print("Test data mean:", X_test_scaled.mean(axis=0))

Critical Point: Always fit the scaler on training data only, then transform both training and test data. Using fit_transform on test data causes data leakage.

# Train model on scaled data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f"\nTrain accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")

This demonstrates the complete workflow of scaling data before model training.
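The fit-on-train / transform-on-test discipline above can also be automated with a scikit-learn Pipeline, which bundles the scaler and model so the scaling statistics are always learned from training data only (synthetic data again, so the accuracy itself is meaningless):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(20, 60, 100),         # age
    rng.integers(30000, 150000, 100),  # income
    rng.uniform(0, 100, 100),          # score
])
y = rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline calls fit_transform on the training data and plain
# transform at predict time, so leakage cannot happen by accident.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

This is also the safe way to use scaling inside cross-validation, since each fold refits the scaler on that fold's training portion only.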


Other Scaling Techniques

Robust Scaler

Uses median and interquartile range, making it robust to outliers.

from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(data_with_outlier)
print("Robust Scaled:")
print(robust_scaled.flatten())

RobustScaler is ideal when your data contains significant outliers that you cannot remove.

Max Absolute Scaler

Scales by dividing by the maximum absolute value, preserving sparsity.

from sklearn.preprocessing import MaxAbsScaler

maxabs_scaler = MaxAbsScaler()
maxabs_scaled = maxabs_scaler.fit_transform(data)
print("MaxAbs Scaled:")
print(maxabs_scaled)

MaxAbsScaler is useful for sparse data where you want to preserve zero values.


Scaling in Column Transformer

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Different scaling for different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('normalize', MinMaxScaler(), ['age']),
        ('standardize', StandardScaler(), ['income', 'experience'])
    ]
)

processed = preprocessor.fit_transform(data)
print("Mixed Scaling Result:")
print(processed)

ColumnTransformer allows applying different scaling techniques to different columns based on their characteristics.


Common Mistakes to Avoid

  1. Fitting on entire dataset: Causes data leakage; always fit on training data only
  2. Scaling the target variable: Usually unnecessary; if you do scale a regression target, remember to invert the transform on predictions
  3. Scaling categorical features: Scaling only applies to numerical features
  4. Forgetting to scale new data: Production data must be scaled using the same fitted scaler
  5. Using wrong technique: Match the scaling method to your algorithm's requirements

Best Practices for Feature Scaling

  • Save your scaler: Persist the fitted scaler for production use
  • Document your choice: Record which scaling technique was used and why
  • Consider feature distributions: Examine distributions before choosing a method
  • Test both methods: Compare model performance with different scaling approaches
  • Handle new data consistently: Apply the same transformation to all incoming data
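For the "save your scaler" point, one common approach is persisting the fitted scaler with joblib, the tool scikit-learn recommends for model persistence (a sketch; the filename is an arbitrary example):

```python
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25, 30000], [40, 90000], [55, 150000]], dtype=float)

scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.joblib")   # store next to the trained model

# Later (e.g., in the serving process): reload and apply the SAME
# fitted statistics to incoming data -- never refit in production.
loaded = joblib.load("scaler.joblib")
new_data = np.array([[30, 50000]], dtype=float)
print(loaded.transform(new_data))
```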

Summary

Feature scaling is a fundamental preprocessing step that significantly impacts machine learning model performance. Normalization (Min-Max scaling) transforms features to a bounded range and works well for neural networks and image data. Standardization (Z-score scaling) centers data around zero with unit variance and is preferred for algorithms using distance calculations or gradient descent. Understanding when to use each technique and implementing them correctly prevents common pitfalls and ensures your models learn effectively from all features.
