Encoding categorical variables is the process of converting non‑numerical data into numerical formats so machine learning models can understand and learn from them. Techniques like one‑hot encoding, label encoding, and target encoding help transform categories into meaningful numeric values, improving model accuracy and performance.
Categorical variables represent discrete groups or categories rather than continuous numerical values.
Machine learning algorithms perform mathematical operations that require numerical input, making categorical encoding a crucial preprocessing step.
- **Nominal variables:** categories with no inherent order or ranking. Examples: colors, country names, product types, gender.
- **Ordinal variables:** categories with a meaningful order or ranking. Examples: education level (High School < Bachelor's < Master's < PhD), satisfaction rating (Poor < Fair < Good < Excellent).
Understanding the variable type determines which encoding technique to use.
Label encoding assigns a unique integer to each category.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data with categorical variables
data = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green'],
    'size': ['Small', 'Large', 'Medium', 'Small', 'Large', 'Medium']
})
print("Original Data:")
print(data)

# Apply label encoding
label_encoder = LabelEncoder()
data['color_encoded'] = label_encoder.fit_transform(data['color'])
print("\nAfter Label Encoding:")
print(data[['color', 'color_encoded']])

# View the mapping
print(f"\nClasses: {label_encoder.classes_}")
```
LabelEncoder converts each unique category to an integer starting from 0. The classes_ attribute shows the mapping between original categories and encoded values.
Real-World Example: With the alphabetical mapping above (Blue=0, Green=1, Red=2), a linear model might interpret Red as "more" than Green and Blue, which is meaningless for colors.
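To see why this matters, here is a small sketch with hypothetical target values: fitting a straight line to the integer codes forces predictions to change monotonically with the arbitrary alphabetical order, so a category in the middle of the ordering can never get the highest prediction.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression

colors = np.array(['Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'])
# Hypothetical target: Green rows have high values, Blue and Red low ones
y = np.array([10, 50, 12, 11, 52, 13])

le = LabelEncoder()
X = le.fit_transform(colors).reshape(-1, 1)  # Blue=0, Green=1, Red=2

model = LinearRegression().fit(X, y)
# A single slope cannot place Green (code 1) above both Blue (0)
# and Red (2): predictions change by a constant step per code
print(model.predict([[0], [1], [2]]))
```

Tree-based models are largely immune to this, because splits like "code <= 0.5" isolate categories without assuming a linear relationship.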
One-hot encoding creates binary columns for each category.
```python
from sklearn.preprocessing import OneHotEncoder

# Create the encoder (dense output for easy viewing)
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
color_encoded = onehot_encoder.fit_transform(data[['color']])

# Build a DataFrame with the encoded columns
encoded_df = pd.DataFrame(
    color_encoded,
    columns=onehot_encoder.get_feature_names_out(['color'])
)
print("One-Hot Encoded:")
print(encoded_df)
```
Each category becomes a separate column with values 0 or 1. A value of 1 indicates the presence of that category.
```python
# Simpler approach with pandas
encoded_pandas = pd.get_dummies(data['color'], prefix='color')
print("\nUsing pd.get_dummies():")
print(encoded_pandas)
```
get_dummies() is a quick alternative for one-hot encoding during exploratory analysis.
```python
# Drop the first category to avoid multicollinearity
encoded_drop_first = pd.get_dummies(data['color'],
                                    prefix='color',
                                    drop_first=True)
print("\nWith drop_first=True:")
print(encoded_drop_first)
```
Dropping one encoded column prevents perfect multicollinearity in linear models. With the first category (Blue) dropped, if color_Green=0 and color_Red=0, the color must be Blue.
Ordinal encoding assigns integers based on a specified order.
```python
from sklearn.preprocessing import OrdinalEncoder

# Data with an ordinal variable
education_data = pd.DataFrame({
    'education': ['High School', 'Master', 'Bachelor', 'PhD',
                  'High School', 'Bachelor']
})

# Define the order from lowest to highest
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# Create and apply OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_data['education_encoded'] = ordinal_encoder.fit_transform(
    education_data[['education']]
)
print("Ordinal Encoded:")
print(education_data)
```
The categories parameter specifies the explicit order. High School=0, Bachelor=1, Master=2, PhD=3, preserving the meaningful progression.
```python
# Manual mapping approach
education_mapping = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}
education_data['manual_encoded'] = education_data['education'].map(
    education_mapping
)
print("\nManual Mapping:")
print(education_data)
```
Manual mapping gives you complete control over the encoding values.
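One caveat with `.map()`: categories missing from the dictionary silently become NaN. A small sketch of how to catch that (the `-1` sentinel is an illustrative choice, not a standard):

```python
import pandas as pd

education_mapping = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
s = pd.Series(['Bachelor', 'PhD', 'Diploma'])  # 'Diploma' is not in the mapping

encoded = s.map(education_mapping)   # unmapped 'Diploma' becomes NaN
encoded = encoded.fillna(-1).astype(int)  # flag unknowns with a -1 sentinel
print(encoded.tolist())
```

Checking for NaN after mapping (or asserting `s.isin(education_mapping).all()` beforehand) avoids silent data loss downstream.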
Binary encoding converts categories to binary code, then splits into columns.
```python
import category_encoders as ce

# Sample high-cardinality data
country_data = pd.DataFrame({
    'country': ['USA', 'Canada', 'Mexico', 'USA', 'Brazil',
                'Canada', 'Argentina', 'USA']
})

# Apply binary encoding
binary_encoder = ce.BinaryEncoder(cols=['country'])
binary_encoded = binary_encoder.fit_transform(country_data)
print("Binary Encoded:")
print(binary_encoded)
```
Binary encoding represents N categories with only about ⌈log₂(N)⌉ columns instead of N, making it efficient for high-cardinality variables.
Installation: pip install category_encoders
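The savings can be sketched with a quick calculation; `binary_columns` is a hypothetical helper, and the `+ 1` reflects the assumption that the underlying ordinal codes start at 1 rather than 0:

```python
import math

def binary_columns(n_categories):
    # Bits needed to represent codes 1..N in binary
    return math.ceil(math.log2(n_categories + 1))

for n in (4, 50, 1000):
    print(f"{n} categories: one-hot = {n} columns, binary = {binary_columns(n)} columns")
```

For 1000 categories this is 10 columns versus 1000 with one-hot encoding, which is why binary encoding scales so much better.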
Target encoding replaces categories with the mean of the target variable.
```python
# Sample data with a binary target
target_data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA',
             'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Replace each city with the mean of its target values
target_means = target_data.groupby('city')['purchased'].mean()
target_data['city_encoded'] = target_data['city'].map(target_means)
print("Target Encoded:")
print(target_data)
print(f"\nCity Means:\n{target_means}")
```
Target encoding captures the relationship between categories and the target, useful for high-cardinality features.
Warning: Target encoding can cause overfitting. Use cross-validation or smoothing techniques.
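One common cross-validation remedy is out-of-fold encoding: each row receives a mean computed only from the other folds, so a row never "sees" its own target. A minimal sketch; the column name `city_cv_encoded`, the 4-fold split, and the global-mean fallback are illustrative choices:

```python
import pandas as pd
from sklearn.model_selection import KFold

target_data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

global_mean = target_data['purchased'].mean()
target_data['city_cv_encoded'] = global_mean  # fallback for unseen cities

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(target_data):
    # Means computed only on the training fold, applied to the held-out fold
    fold_means = target_data.iloc[train_idx].groupby('city')['purchased'].mean()
    target_data.loc[target_data.index[val_idx], 'city_cv_encoded'] = (
        target_data.iloc[val_idx]['city'].map(fold_means)
        .fillna(global_mean).values
    )

print(target_data)
```

Because every encoded value excludes the row's own target, the leakage that makes naive target encoding overfit is greatly reduced.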
```python
# Using category_encoders with smoothing
target_encoder = ce.TargetEncoder(cols=['city'], smoothing=1.0)
target_encoded = target_encoder.fit_transform(
    target_data[['city']],
    target_data['purchased']
)
print("\nWith Smoothing:")
print(target_encoded)
```
Smoothing blends category means with the global mean, reducing overfitting for rare categories.
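The blending idea can be written out by hand. Note that category_encoders actually uses a sigmoid-based weight, so this additive formula is a simplified illustration of the concept rather than the library's exact computation:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

m = 1.0  # smoothing strength: higher pulls rare categories toward the global mean
global_mean = df['purchased'].mean()
stats = df.groupby('city')['purchased'].agg(['mean', 'count'])

# Additive smoothing: blend each category mean with the global mean by count
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(smoothed)
```

Categories with few observations end up close to the global mean, while well-populated categories keep values close to their own mean.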
Frequency encoding replaces categories with their occurrence count or proportion.
```python
# Count encoding: map each category to its occurrence count
frequency = data['color'].value_counts()
data['color_frequency'] = data['color'].map(frequency)
print("Frequency Encoded:")
print(data[['color', 'color_frequency']])

# Proportion encoding (normalized frequency)
proportion = data['color'].value_counts(normalize=True)
data['color_proportion'] = data['color'].map(proportion)
print("\nProportion Encoded:")
print(data[['color', 'color_proportion']])
```
Frequency encoding is useful when category frequency correlates with the target variable.
| Variable Type | Recommended Encoding | Algorithm |
|---|---|---|
| Nominal, low cardinality | One-Hot Encoding | Linear models, Neural Networks |
| Nominal, low cardinality | Label Encoding | Tree-based models |
| Ordinal | Ordinal Encoding | Any algorithm |
| High cardinality | Binary or Target Encoding | Any algorithm |
| Unknown categories expected | Handle with 'unknown' category | Any algorithm |
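The table above can be condensed into a rough decision helper; the function name, return strings, and the cardinality threshold of 15 are illustrative judgment calls, not a standard API:

```python
def suggest_encoding(n_categories, is_ordinal, model_family):
    """Heuristic mirroring the table above; thresholds are illustrative."""
    if is_ordinal:
        return 'ordinal'
    if n_categories > 15:  # where "high cardinality" begins is a judgment call
        return 'binary or target'
    return 'label' if model_family == 'tree' else 'one-hot'

print(suggest_encoding(3, False, 'linear'))   # nominal, low cardinality, linear model
print(suggest_encoding(50, False, 'tree'))    # high cardinality
print(suggest_encoding(4, True, 'linear'))    # ordinal variable
```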
```python
from sklearn.preprocessing import OneHotEncoder

# Training data
train_data = pd.DataFrame({'color': ['Red', 'Blue', 'Green']})

# Encoder that tolerates unknown categories at transform time
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(train_data[['color']])

# Test data with a category unseen during training
test_data = pd.DataFrame({'color': ['Red', 'Yellow']})
test_encoded = encoder.transform(test_data[['color']])
print("Handling unknown 'Yellow':")
print(test_encoded)  # the 'Yellow' row is all zeros
```
Setting handle_unknown='ignore' prevents errors when encountering categories not seen during training.
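If you use `pd.get_dummies()` instead of scikit-learn, you can get similar unknown-safe behavior by reindexing the test columns to the columns seen at training time:

```python
import pandas as pd

train = pd.DataFrame({'color': ['Red', 'Blue', 'Green']})
train_encoded = pd.get_dummies(train['color'], prefix='color')
train_columns = train_encoded.columns  # remember the training-time columns

test = pd.DataFrame({'color': ['Red', 'Yellow']})
test_encoded = pd.get_dummies(test['color'], prefix='color')
# Reindex to training columns: drops the unseen 'Yellow' column,
# fills columns absent from the test data with 0
test_encoded = test_encoded.reindex(columns=train_columns, fill_value=0)
print(test_encoded)
```

This makes the test matrix line up column-for-column with the training matrix, which is what a fitted model expects.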
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Sample dataset
df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red'],
    'size': ['Small', 'Large', 'Medium', 'Small'],
    'quality': ['Good', 'Excellent', 'Poor', 'Good'],
    'price': [100, 150, 80, 110]
})

# Define column groups and category orders
nominal_cols = ['color']
quality_order = [['Poor', 'Good', 'Excellent']]
size_order = [['Small', 'Medium', 'Large']]

# Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(drop='first'), nominal_cols),
        ('ordinal_quality', OrdinalEncoder(categories=quality_order),
         ['quality']),
        ('ordinal_size', OrdinalEncoder(categories=size_order),
         ['size'])
    ],
    remainder='passthrough'  # keep other columns (price) unchanged
)

# Apply preprocessing
processed = preprocessor.fit_transform(df)
print("Processed Data:")
print(processed)
```
This pipeline applies different encoding strategies to different columns, creating a reproducible preprocessing workflow.
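A ColumnTransformer like this can also feed directly into a model inside a single Pipeline, so encoding and training happen in one `fit` call. A sketch under illustrative assumptions: the data and binary target here are hypothetical, and `handle_unknown='ignore'` is added for inference safety:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large'],
    'price': [100, 150, 80, 110, 95, 140],
})
y = [1, 0, 1, 1, 0, 0]  # hypothetical binary target

preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color']),
        ('ordinal', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]),
         ['size']),
    ],
    remainder='passthrough',
)

# Encoding and the classifier travel together: no manual transform at predict time
model = Pipeline([
    ('encode', preprocessor),
    ('classify', LogisticRegression()),
])
model.fit(df, y)
print(model.predict(df))
```

Bundling the encoder with the model prevents train/serve skew: the exact same fitted transformations are applied at prediction time.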
Encoding categorical variables is essential for machine learning model training. Label encoding works well for ordinal variables and tree-based algorithms. One-hot encoding is the standard choice for nominal variables with low cardinality. For high-cardinality features, consider binary encoding or target encoding. Always match your encoding strategy to both the variable type and the algorithm requirements, and remember to handle unknown categories in production systems.
Feature scaling is the process of transforming data values so they fit within a similar range, improving model stability and performance. Normalization scales values between 0 and 1, ideal for distance‑based algorithms. Standardization transforms data to have a mean of 0 and standard deviation of 1, making it suitable for most machine learning models that assume normally distributed features.
Feature selection techniques help identify the most important variables in a dataset to improve model accuracy, reduce overfitting, and speed up training. Methods like filter, wrapper, and embedded approaches evaluate feature relevance using statistics, model performance, and built‑in algorithm scores, ensuring cleaner, more efficient, and highly predictive machine learning models.
Dimensionality reduction with PCA (Principal Component Analysis) is a technique used to simplify large datasets by converting many features into a smaller set of important components. PCA reduces noise, improves model performance, and speeds up processing while preserving the most meaningful patterns and variability in the data.