Feature scaling transforms data values so they fall within a similar range, improving model stability and performance. Normalization rescales values to a bounded range, typically [0, 1], and works well for neural networks and image data. Standardization transforms data to have a mean of 0 and a standard deviation of 1, making it the usual choice for algorithms that rely on distance calculations or gradient descent, such as SVMs and linear models.
Feature scaling is the process of transforming numerical features to a similar scale. In machine learning, features often have different ranges—age might range from 0 to 100 while income ranges from 0 to millions. Without scaling, features with larger magnitudes dominate the learning process, leading to suboptimal model performance.
Algorithms like K-Nearest Neighbors (KNN), K-Means clustering, and Support Vector Machines (SVM) calculate distances between data points. Features with larger scales contribute more to distance calculations, overshadowing smaller-scale features.
Example: If calculating the distance between two customers using age (20-60) and income (30000-150000), income differences dominate because of its larger numerical range.
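To make this concrete, here is a minimal sketch with two hypothetical customers (the specific values are illustrative) comparing Euclidean distances before and after min-max scaling:

```python
import math

# Two hypothetical customers: (age, income)
a = (25, 40000)
b = (55, 42000)

# Raw Euclidean distance: the income gap (2000) swamps the age gap (30)
raw = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Min-max scale each feature to [0, 1] using the ranges from the text
def scale(x, lo, hi):
    return (x - lo) / (hi - lo)

a_s = (scale(a[0], 20, 60), scale(a[1], 30000, 150000))
b_s = (scale(b[0], 20, 60), scale(b[1], 30000, 150000))
scaled = math.sqrt((a_s[0] - b_s[0]) ** 2 + (a_s[1] - b_s[1]) ** 2)

print(f"raw distance:    {raw:.1f}")     # ~2000.2, dominated by income
print(f"scaled distance: {scaled:.3f}")  # ~0.750, age and income both contribute
```

After scaling, the 30-year age gap (0.75 of its range) contributes far more to the distance than the relatively small income gap, reversing the raw picture.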
Neural networks and linear models using gradient descent converge faster when features are on similar scales. Unscaled features cause the optimization landscape to be elongated, requiring more iterations to find the minimum.
Regularization techniques penalize coefficient magnitudes uniformly. Without scaling, a feature with naturally large values needs only a tiny coefficient and is barely penalized, while a small-scale feature needs a large coefficient and is penalized disproportionately, so the penalty falls unevenly across features.
Normalization scales features to a fixed range, typically [0, 1].
Formula:
X_normalized = (X - X_min) / (X_max - X_min)
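Plugging sample numbers into the formula (values chosen for illustration): an age of 30 in a column with minimum 25 and maximum 45 normalizes to (30 − 25) / (45 − 25) = 0.25:

```python
# Worked min-max example: age 30, column min 25, column max 45
x, x_min, x_max = 30, 25, 45
x_norm = (x - x_min) / (x_max - x_min)
print(x_norm)  # 0.25
```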
Standardization transforms features to have zero mean and unit variance.
Formula:
X_standardized = (X - μ) / σ
Where μ is the mean and σ is the standard deviation.
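A quick worked example on the age column used below, computing σ as the population standard deviation (ddof = 0, which is what scikit-learn's StandardScaler uses):

```python
import math

ages = [25, 30, 35, 40, 45]
mu = sum(ages) / len(ages)  # 35.0

# Population standard deviation (divide by n, not n - 1)
sigma = math.sqrt(sum((a - mu) ** 2 for a in ages) / len(ages))  # ~7.071

z = (45 - mu) / sigma
print(f"z-score of 45: {z:.3f}")  # ~1.414
```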
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 50000, 70000, 90000, 110000],
    'experience': [1, 3, 5, 8, 12]
})
print("Original Data:")
print(data)
This creates a sample dataset with features at different scales for demonstration.
# Apply Min-Max Normalization
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
# Convert back to DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)
print("\nNormalized Data (0-1 range):")
print(normalized_df)
The MinMaxScaler transforms all values to the range [0, 1]. The fit_transform() method learns the min and max values and applies the transformation.
# Normalize to custom range [0, 10]
custom_scaler = MinMaxScaler(feature_range=(0, 10))
custom_normalized = custom_scaler.fit_transform(data)
print("\nCustom Range Normalized (0-10):")
print(pd.DataFrame(custom_normalized, columns=data.columns))
You can specify any desired range using the feature_range parameter.
# Manual min-max normalization
def normalize(column):
    # Rescale a Series to [0, 1] using its own min and max
    return (column - column.min()) / (column.max() - column.min())
data_manual_norm = data.apply(normalize)
print("\nManually Normalized:")
print(data_manual_norm)
Understanding the manual calculation helps grasp what the scaler does internally.
from sklearn.preprocessing import StandardScaler
# Apply Standardization
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(data)
standardized_df = pd.DataFrame(standardized_data, columns=data.columns)
print("Standardized Data:")
print(standardized_df)
StandardScaler transforms data to have mean of 0 and standard deviation of 1.
# Verify mean and std of standardized data
print(f"\nMean of standardized features: {standardized_df.mean().values}")
print(f"Std of standardized features: {standardized_df.std(ddof=0).values}")
After standardization, each feature has approximately zero mean and unit variance. Passing ddof=0 matches StandardScaler, which uses the population standard deviation; pandas' default ddof=1 would report a slightly larger value on this small sample.
# Manual z-score standardization
def standardize(column):
    # ddof=0 gives the population standard deviation, matching StandardScaler;
    # pandas' default std() uses ddof=1 (sample standard deviation)
    return (column - column.mean()) / column.std(ddof=0)
data_manual_std = data.apply(standardize)
print("\nManually Standardized:")
print(data_manual_std)
This shows the underlying z-score calculation performed by StandardScaler.
# Demonstrating outlier sensitivity
data_with_outlier = pd.DataFrame({
    'value': [10, 15, 12, 14, 11, 200]  # 200 is an outlier
})
normalized_outlier = MinMaxScaler().fit_transform(data_with_outlier)
print("Normalization with outlier:")
print(normalized_outlier.flatten())
# Most values compressed near 0
Notice how the outlier causes most values to cluster near 0, demonstrating normalization's sensitivity to extreme values.
# Demonstrating outlier robustness
standardized_outlier = StandardScaler().fit_transform(data_with_outlier)
print("\nStandardization with outlier:")
print(standardized_outlier.flatten())
# Values more spread out
Standardization handles outliers more gracefully, though extreme values still affect the transformation.
| Aspect | Normalization | Standardization |
|---|---|---|
| Output range | [0, 1] or custom | No fixed range |
| Mean after scaling | Not necessarily 0 | 0 |
| Sensitive to outliers | Yes | Less sensitive |
| Best for | Neural networks, images | SVM, linear models, PCA |
| Preserves zero values | Only if the minimum is 0 | Only if the mean is 0 |
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Create sample classification data
np.random.seed(42)
X = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'income': np.random.randint(30000, 150000, 100),
    'score': np.random.uniform(0, 100, 100)
})
y = np.random.randint(0, 2, 100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# IMPORTANT: Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use transform, not fit_transform
print("Training data mean:", X_train_scaled.mean(axis=0))
print("Test data mean:", X_test_scaled.mean(axis=0))
Critical Point: Always fit the scaler on training data only, then transform both training and test data. Using fit_transform on test data causes data leakage.
# Train model on scaled data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Evaluate
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f"\nTrain accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
This demonstrates the complete workflow of scaling data before model training.
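One way to make the fit-on-train-only rule automatic is to wrap the scaler and model in a scikit-learn Pipeline; a sketch using synthetic data in the same spirit as above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(20, 60, 100),         # age
    rng.integers(30000, 150000, 100),  # income
]).astype(float)
y = rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on training data only; score()/predict() then
# apply transform (never fit_transform) to any new data automatically
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, there is no way to accidentally leak test-set statistics into the fit.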
RobustScaler uses the median and interquartile range (IQR) instead of the mean and standard deviation, making it robust to outliers.
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(data_with_outlier)
print("Robust Scaled:")
print(robust_scaled.flatten())
RobustScaler is ideal when your data contains significant outliers that you cannot remove.
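The underlying formula is (X − median) / IQR; a NumPy sketch reproduces it by hand on the same outlier data:

```python
import numpy as np

values = np.array([10, 15, 12, 14, 11, 200])

median = np.median(values)                # 13.0
q1, q3 = np.percentile(values, [25, 75])  # 11.25, 14.75
iqr = q3 - q1                             # 3.5

robust = (values - median) / iqr
print(robust)  # the outlier no longer compresses the other values toward 0
```

Because the median and IQR ignore the extreme value, the five typical points keep a usable spread instead of being squashed together.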
MaxAbsScaler scales each feature by dividing by its maximum absolute value, preserving sparsity.
from sklearn.preprocessing import MaxAbsScaler
maxabs_scaler = MaxAbsScaler()
maxabs_scaled = maxabs_scaler.fit_transform(data)
print("MaxAbs Scaled:")
print(maxabs_scaled)
MaxAbsScaler is useful for sparse data where you want to preserve zero values.
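A quick manual check of the idea (with a made-up sparse-looking column) shows why zeros survive: dividing by the maximum absolute value maps 0 to 0 and keeps the sign of every entry:

```python
import numpy as np

sparse_col = np.array([0.0, 4.0, 0.0, -2.0, 8.0])
scaled = sparse_col / np.abs(sparse_col).max()  # divide by max |x| = 8
print(scaled)  # [ 0.    0.5   0.   -0.25  1.  ]
```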
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Different scaling for different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('normalize', MinMaxScaler(), ['age']),
        ('standardize', StandardScaler(), ['income', 'experience'])
    ]
)
processed = preprocessor.fit_transform(data)
print("Mixed Scaling Result:")
print(processed)
ColumnTransformer allows applying different scaling techniques to different columns based on their characteristics.
Feature scaling is a fundamental preprocessing step that significantly impacts machine learning model performance. Normalization (Min-Max scaling) transforms features to a bounded range and works well for neural networks and image data. Standardization (Z-score scaling) centers data around zero with unit variance and is preferred for algorithms using distance calculations or gradient descent. Understanding when to use each technique and implementing them correctly prevents common pitfalls and ensures your models learn effectively from all features.