Encoding categorical variables is the process of converting non‑numerical data into numerical formats so machine learning models can understand and learn from them. Techniques like one‑hot encoding, label encoding, and target encoding help transform categories into meaningful numeric values, improving model accuracy and performance.
Categorical variables represent discrete groups or categories rather than continuous numerical values.
Machine learning algorithms perform mathematical operations that require numerical input, making categorical encoding a crucial preprocessing step.
- **Nominal variables:** categories with no inherent order or ranking. Examples: colors, country names, product types, gender.
- **Ordinal variables:** categories with a meaningful order or ranking. Examples: education level (High School < Bachelor's < Master's < PhD), satisfaction rating (Poor < Fair < Good < Excellent).
Understanding the variable type determines which encoding technique to use.
Label encoding assigns a unique integer to each category.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data with categorical variables
data = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green'],
    'size': ['Small', 'Large', 'Medium', 'Small', 'Large', 'Medium']
})
print("Original Data:")
print(data)

# Apply label encoding
label_encoder = LabelEncoder()
data['color_encoded'] = label_encoder.fit_transform(data['color'])
print("\nAfter Label Encoding:")
print(data[['color', 'color_encoded']])

# View the mapping
print(f"\nClasses: {label_encoder.classes_}")
```
LabelEncoder converts each unique category to an integer starting from 0. The classes_ attribute shows the mapping between original categories and encoded values.
Real-World Example: With the alphabetical mapping above (Blue=0, Green=1, Red=2), a linear model might interpret Red as "more" than Green and Blue, which is meaningless for colors.
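To see why this matters, here is a small sketch with hypothetical target values: fitting a straight line to the integer codes forces predictions to change monotonically with the arbitrary alphabetical order, so a category in the middle of the ordering can never get the highest prediction.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression

colors = np.array(['Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'])
# Hypothetical target: Green rows have high values, Blue and Red low ones
y = np.array([10, 50, 12, 11, 52, 13])

le = LabelEncoder()
X = le.fit_transform(colors).reshape(-1, 1)  # Blue=0, Green=1, Red=2

model = LinearRegression().fit(X, y)
# A single slope cannot place Green (code 1) above both Blue (0)
# and Red (2): predictions change by a constant step per code
print(model.predict([[0], [1], [2]]))
```

Tree-based models are largely immune to this, because splits like "code <= 0.5" isolate categories without assuming a linear relationship.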
One-hot encoding creates binary columns for each category.
```python
from sklearn.preprocessing import OneHotEncoder

# Create the encoder (dense output for easy viewing)
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
color_encoded = onehot_encoder.fit_transform(data[['color']])

# Build a DataFrame with the encoded columns
encoded_df = pd.DataFrame(
    color_encoded,
    columns=onehot_encoder.get_feature_names_out(['color'])
)
print("One-Hot Encoded:")
print(encoded_df)
```
Each category becomes a separate column with values 0 or 1. A value of 1 indicates the presence of that category.
```python
# Simpler approach with pandas
encoded_pandas = pd.get_dummies(data['color'], prefix='color')
print("\nUsing pd.get_dummies():")
print(encoded_pandas)
```
get_dummies() is a quick alternative for one-hot encoding during exploratory analysis.
```python
# Drop the first category to avoid multicollinearity
encoded_drop_first = pd.get_dummies(data['color'],
                                    prefix='color',
                                    drop_first=True)
print("\nWith drop_first=True:")
print(encoded_drop_first)
```
Dropping one encoded column prevents perfect multicollinearity in linear models. With the first category (Blue) dropped, if color_Green=0 and color_Red=0, the color must be Blue.
Ordinal encoding assigns integers based on a specified order.
```python
from sklearn.preprocessing import OrdinalEncoder

# Data with an ordinal variable
education_data = pd.DataFrame({
    'education': ['High School', 'Master', 'Bachelor', 'PhD',
                  'High School', 'Bachelor']
})

# Define the order from lowest to highest
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# Create and apply OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_data['education_encoded'] = ordinal_encoder.fit_transform(
    education_data[['education']]
)
print("Ordinal Encoded:")
print(education_data)
```
The categories parameter specifies the explicit order. High School=0, Bachelor=1, Master=2, PhD=3, preserving the meaningful progression.
```python
# Manual mapping approach
education_mapping = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}
education_data['manual_encoded'] = education_data['education'].map(
    education_mapping
)
print("\nManual Mapping:")
print(education_data)
```
Manual mapping gives you complete control over the encoding values.
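One caveat with `.map()`: categories missing from the dictionary silently become NaN. A small sketch of how to catch that (the `-1` sentinel is an illustrative choice, not a standard):

```python
import pandas as pd

education_mapping = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
s = pd.Series(['Bachelor', 'PhD', 'Diploma'])  # 'Diploma' is not in the mapping

encoded = s.map(education_mapping)   # unmapped 'Diploma' becomes NaN
encoded = encoded.fillna(-1).astype(int)  # flag unknowns with a -1 sentinel
print(encoded.tolist())
```

Checking for NaN after mapping (or asserting `s.isin(education_mapping).all()` beforehand) avoids silent data loss downstream.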
Binary encoding converts categories to binary code, then splits into columns.
```python
import category_encoders as ce

# Sample high-cardinality data
country_data = pd.DataFrame({
    'country': ['USA', 'Canada', 'Mexico', 'USA', 'Brazil',
                'Canada', 'Argentina', 'USA']
})

# Apply binary encoding
binary_encoder = ce.BinaryEncoder(cols=['country'])
binary_encoded = binary_encoder.fit_transform(country_data)
print("Binary Encoded:")
print(binary_encoded)
```
Binary encoding represents N categories with only about ⌈log₂(N)⌉ columns instead of N, making it efficient for high-cardinality variables.
Installation: pip install category_encoders
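The savings can be sketched with a quick calculation; `binary_columns` is a hypothetical helper, and the `+ 1` reflects the assumption that the underlying ordinal codes start at 1 rather than 0:

```python
import math

def binary_columns(n_categories):
    # Bits needed to represent codes 1..N in binary
    return math.ceil(math.log2(n_categories + 1))

for n in (4, 50, 1000):
    print(f"{n} categories: one-hot = {n} columns, binary = {binary_columns(n)} columns")
```

For 1000 categories this is 10 columns versus 1000 with one-hot encoding, which is why binary encoding scales so much better.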
Target encoding replaces categories with the mean of the target variable.
```python
# Sample data with a binary target
target_data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA',
             'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

# Replace each city with the mean of its target values
target_means = target_data.groupby('city')['purchased'].mean()
target_data['city_encoded'] = target_data['city'].map(target_means)
print("Target Encoded:")
print(target_data)
print(f"\nCity Means:\n{target_means}")
```
Target encoding captures the relationship between categories and the target, useful for high-cardinality features.
Warning: Target encoding can cause overfitting. Use cross-validation or smoothing techniques.
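One common cross-validation remedy is out-of-fold encoding: each row receives a mean computed only from the other folds, so a row never "sees" its own target. A minimal sketch; the column name `city_cv_encoded`, the 4-fold split, and the global-mean fallback are illustrative choices:

```python
import pandas as pd
from sklearn.model_selection import KFold

target_data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

global_mean = target_data['purchased'].mean()
target_data['city_cv_encoded'] = global_mean  # fallback for unseen cities

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(target_data):
    # Means computed only on the training fold, applied to the held-out fold
    fold_means = target_data.iloc[train_idx].groupby('city')['purchased'].mean()
    target_data.loc[target_data.index[val_idx], 'city_cv_encoded'] = (
        target_data.iloc[val_idx]['city'].map(fold_means)
        .fillna(global_mean).values
    )

print(target_data)
```

Because every encoded value excludes the row's own target, the leakage that makes naive target encoding overfit is greatly reduced.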
```python
# Using category_encoders with smoothing
target_encoder = ce.TargetEncoder(cols=['city'], smoothing=1.0)
target_encoded = target_encoder.fit_transform(
    target_data[['city']],
    target_data['purchased']
)
print("\nWith Smoothing:")
print(target_encoded)
```
Smoothing blends category means with the global mean, reducing overfitting for rare categories.
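The blending idea can be written out by hand. Note that category_encoders actually uses a sigmoid-based weight, so this additive formula is a simplified illustration of the concept rather than the library's exact computation:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Chicago', 'NYC', 'LA'],
    'purchased': [1, 0, 1, 0, 1, 0, 1, 0]
})

m = 1.0  # smoothing strength: higher pulls rare categories toward the global mean
global_mean = df['purchased'].mean()
stats = df.groupby('city')['purchased'].agg(['mean', 'count'])

# Additive smoothing: blend each category mean with the global mean by count
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(smoothed)
```

Categories with few observations end up close to the global mean, while well-populated categories keep values close to their own mean.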
Frequency encoding replaces categories with their occurrence count or proportion.
```python
# Count encoding: map each category to its occurrence count
frequency = data['color'].value_counts()
data['color_frequency'] = data['color'].map(frequency)
print("Frequency Encoded:")
print(data[['color', 'color_frequency']])

# Proportion encoding (normalized frequency)
proportion = data['color'].value_counts(normalize=True)
data['color_proportion'] = data['color'].map(proportion)
print("\nProportion Encoded:")
print(data[['color', 'color_proportion']])
```
Frequency encoding is useful when category frequency correlates with the target variable.
| Variable Type | Recommended Encoding | Algorithm |
|---|---|---|
| Nominal, low cardinality | One-Hot Encoding | Linear models, Neural Networks |
| Nominal, low cardinality | Label Encoding | Tree-based models |
| Ordinal | Ordinal Encoding | Any algorithm |
| High cardinality | Binary or Target Encoding | Any algorithm |
| Unknown categories expected | Handle with 'unknown' category | Any algorithm |
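The table above can be condensed into a rough decision helper; the function name, return strings, and the cardinality threshold of 15 are illustrative judgment calls, not a standard API:

```python
def suggest_encoding(n_categories, is_ordinal, model_family):
    """Heuristic mirroring the table above; thresholds are illustrative."""
    if is_ordinal:
        return 'ordinal'
    if n_categories > 15:  # where "high cardinality" begins is a judgment call
        return 'binary or target'
    return 'label' if model_family == 'tree' else 'one-hot'

print(suggest_encoding(3, False, 'linear'))   # nominal, low cardinality, linear model
print(suggest_encoding(50, False, 'tree'))    # high cardinality
print(suggest_encoding(4, True, 'linear'))    # ordinal variable
```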
```python
from sklearn.preprocessing import OneHotEncoder

# Training data
train_data = pd.DataFrame({'color': ['Red', 'Blue', 'Green']})

# Encoder that tolerates unknown categories at transform time
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(train_data[['color']])

# Test data with a category unseen during training
test_data = pd.DataFrame({'color': ['Red', 'Yellow']})
test_encoded = encoder.transform(test_data[['color']])
print("Handling unknown 'Yellow':")
print(test_encoded)  # the 'Yellow' row is all zeros
```
Setting handle_unknown='ignore' prevents errors when encountering categories not seen during training.
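If you use `pd.get_dummies()` instead of scikit-learn, you can get similar unknown-safe behavior by reindexing the test columns to the columns seen at training time:

```python
import pandas as pd

train = pd.DataFrame({'color': ['Red', 'Blue', 'Green']})
train_encoded = pd.get_dummies(train['color'], prefix='color')
train_columns = train_encoded.columns  # remember the training-time columns

test = pd.DataFrame({'color': ['Red', 'Yellow']})
test_encoded = pd.get_dummies(test['color'], prefix='color')
# Reindex to training columns: drops the unseen 'Yellow' column,
# fills columns absent from the test data with 0
test_encoded = test_encoded.reindex(columns=train_columns, fill_value=0)
print(test_encoded)
```

This makes the test matrix line up column-for-column with the training matrix, which is what a fitted model expects.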
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Sample dataset
df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red'],
    'size': ['Small', 'Large', 'Medium', 'Small'],
    'quality': ['Good', 'Excellent', 'Poor', 'Good'],
    'price': [100, 150, 80, 110]
})

# Define column groups and category orders
nominal_cols = ['color']
quality_order = [['Poor', 'Good', 'Excellent']]
size_order = [['Small', 'Medium', 'Large']]

# Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(drop='first'), nominal_cols),
        ('ordinal_quality', OrdinalEncoder(categories=quality_order),
         ['quality']),
        ('ordinal_size', OrdinalEncoder(categories=size_order),
         ['size'])
    ],
    remainder='passthrough'  # keep other columns (price) unchanged
)

# Apply preprocessing
processed = preprocessor.fit_transform(df)
print("Processed Data:")
print(processed)
```
This pipeline applies different encoding strategies to different columns, creating a reproducible preprocessing workflow.
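A ColumnTransformer like this can also feed directly into a model inside a single Pipeline, so encoding and training happen in one `fit` call. A sketch under illustrative assumptions: the data and binary target here are hypothetical, and `handle_unknown='ignore'` is added for inference safety:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'size': ['Small', 'Large', 'Medium', 'Small', 'Medium', 'Large'],
    'price': [100, 150, 80, 110, 95, 140],
})
y = [1, 0, 1, 1, 0, 0]  # hypothetical binary target

preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color']),
        ('ordinal', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]),
         ['size']),
    ],
    remainder='passthrough',
)

# Encoding and the classifier travel together: no manual transform at predict time
model = Pipeline([
    ('encode', preprocessor),
    ('classify', LogisticRegression()),
])
model.fit(df, y)
print(model.predict(df))
```

Bundling the encoder with the model prevents train/serve skew: the exact same fitted transformations are applied at prediction time.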
Encoding categorical variables is essential for machine learning model training. Label encoding works well for ordinal variables and tree-based algorithms. One-hot encoding is the standard choice for nominal variables with low cardinality. For high-cardinality features, consider binary encoding or target encoding. Always match your encoding strategy to both the variable type and the algorithm requirements, and remember to handle unknown categories in production systems.
Feature scaling is the process of transforming data values so they fit within a similar range, improving model stability and performance. Normalization scales values between 0 and 1, ideal for distance‑based algorithms. Standardization transforms data to have a mean of 0 and standard deviation of 1, making it suitable for most machine learning models that assume normally distributed features.
Feature selection techniques help identify the most important variables in a dataset to improve model accuracy, reduce overfitting, and speed up training. Methods like filter, wrapper, and embedded approaches evaluate feature relevance using statistics, model performance, and built‑in algorithm scores, ensuring cleaner, more efficient, and highly predictive machine learning models.
Dimensionality reduction with PCA (Principal Component Analysis) is a technique used to simplify large datasets by converting many features into a smaller set of important components. PCA reduces noise, improves model performance, and speeds up processing while preserving the most meaningful patterns and variability in the data.