Lesson 4

Support Vector Machines

Master Support Vector Machines for high‑accuracy classification. Learn how SVMs create optimal boundaries and handle linear and nonlinear data with kernel functions.


Introduction to Support Vector Machines

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression. SVM finds the hyperplane that best separates classes with the maximum margin, making it effective for both linear and non-linear classification.

The SVM Intuition

Imagine drawing a line to separate two groups of points. Many lines could work, but SVM chooses the one with the largest gap (margin) between the line and the nearest points from each class. This maximum margin makes SVM robust and helps it generalize well.

Real-world applications:

  • Text categorization and spam filtering
  • Image classification
  • Bioinformatics (gene classification)
  • Handwriting recognition
  • Face detection

Key Concepts in SVM

Hyperplane

A hyperplane is a decision boundary that separates different classes:

  • In 2D: a line
  • In 3D: a plane
  • In higher dimensions: a hyperplane

Margin

The margin is the distance between the hyperplane and the nearest data points from each class. SVM maximizes this margin for better generalization.

Support Vectors

Support vectors are the data points closest to the hyperplane. These critical points define the margin and the hyperplane position. Removing or moving other points doesn't affect the model.
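These ideas have a compact form for a linear SVM. Writing the decision function as f(x) = wᵀx + b and scaling w and b so that the support vectors satisfy |f(x)| = 1 (the standard canonical form), the margin width follows directly:

```latex
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b
\qquad \text{(hyperplane: } f(\mathbf{x}) = 0\text{)}

% Support vectors lie exactly on the canonical margin lines:
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) = 1

% Distance between the two margin lines:
\text{margin width} = \frac{2}{\lVert \mathbf{w} \rVert}
```

This is why maximizing the margin is equivalent to minimizing ||w||.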

Visual Explanation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

np.random.seed(42)

# Generate linearly separable data
class_1 = np.random.randn(20, 2) + np.array([2, 2])
class_2 = np.random.randn(20, 2) + np.array([6, 6])
X = np.vstack([class_1, class_2])
y = np.array([0]*20 + [1]*20)

# Train SVM
svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# Get the separating hyperplane
w = svm.coef_[0]
b = svm.intercept_[0]
x_line = np.linspace(0, 8, 100)
y_line = -(w[0] * x_line + b) / w[1]

# Calculate margin boundaries
# (the perpendicular margin 1/||w|| is converted to a vertical offset for plotting)
margin = 1 / np.linalg.norm(w)
y_margin_up = y_line + np.sqrt(1 + (w[0]/w[1])**2) * margin
y_margin_down = y_line - np.sqrt(1 + (w[0]/w[1])**2) * margin

# Plot
plt.figure(figsize=(10, 7))
plt.scatter(class_1[:, 0], class_1[:, 1], c='blue', label='Class 0', s=100)
plt.scatter(class_2[:, 0], class_2[:, 1], c='red', label='Class 1', s=100)
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], 
            s=300, facecolors='none', edgecolors='green', linewidths=2,
            label='Support Vectors')
plt.plot(x_line, y_line, 'k-', linewidth=2, label='Decision Boundary')
plt.plot(x_line, y_margin_up, 'k--', linewidth=1, alpha=0.5)
plt.plot(x_line, y_margin_down, 'k--', linewidth=1, alpha=0.5)
plt.fill_between(x_line, y_margin_down, y_margin_up, alpha=0.1, color='gray')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM: Maximum Margin Classifier')
plt.legend()
plt.xlim(0, 9)
plt.ylim(0, 10)
plt.show()

print(f"Number of support vectors: {len(svm.support_vectors_)}")

The green circles highlight support vectors - the points that define the margin and decision boundary.


Hard Margin vs Soft Margin SVM

Hard Margin SVM

Hard margin SVM requires perfect separation between classes. It only works when data is linearly separable with no noise or outliers.

Soft Margin SVM

Real-world data often has noise and overlapping classes. Soft margin SVM allows some misclassifications by introducing slack variables, controlled by the C parameter.
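In optimization terms, soft margin SVM minimizes a trade-off between margin width and total slack, with C weighting the slack term. This is the standard primal formulation:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \ge 1 - \xi_i,
\;\; \xi_i \ge 0
```

Each ξᵢ measures how far point i violates its margin; C is exactly the parameter passed to SVC.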

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
C_values = [0.01, 1, 100]

# Add some noise/overlap
np.random.seed(42)
X_noisy = np.vstack([
    np.random.randn(30, 2) + np.array([2, 2]),
    np.random.randn(30, 2) + np.array([4, 4])
])
y_noisy = np.array([0]*30 + [1]*30)

for ax, C in zip(axes, C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_noisy, y_noisy)
    
    # Decision boundary
    ax.set_xlim(-1, 8)
    ax.set_ylim(-1, 8)
    
    xx, yy = np.meshgrid(np.linspace(-1, 8, 100), np.linspace(-1, 8, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, 
                colors=['blue', 'white', 'red'])
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='black', linestyles=['--', '-', '--'])
    ax.scatter(X_noisy[:, 0], X_noisy[:, 1], c=y_noisy, cmap='coolwarm', edgecolors='black')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=150, facecolors='none', edgecolors='green', linewidths=2)
    ax.set_title(f'C = {C}\nSupport Vectors: {len(svm.support_vectors_)}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

The C parameter:

  • Small C: tolerates more margin violations, giving a wider margin and stronger regularization (may underfit)
  • Large C: penalizes violations heavily, giving a narrower margin that fits the training data more tightly (may overfit)

Implementing Linear SVM

Step 1: Import Libraries and Load Data

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Classes: {data.target_names}")

Step 2: Prepare Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features - important for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Important: SVM is sensitive to feature scales. Always standardize features before training.

Step 3: Train Linear SVM

# Create and train SVM
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Evaluate
y_pred = svm_linear.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

The Kernel Trick

Linear SVM works well when data is linearly separable. For non-linear data, SVM uses the kernel trick to project data into a higher-dimensional space where it becomes linearly separable.

How Kernels Work

Instead of explicitly transforming data (computationally expensive), kernels compute the similarity between points in the transformed space directly.
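As a small sketch of this idea (not how libsvm computes things internally): for 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x·z)² gives exactly the dot product under the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²), so the transformed vectors never need to be built:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D vector."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = phi(x) @ phi(z)   # dot product in the transformed space
kernel = (x @ z) ** 2        # computed directly in the original space

print(explicit, kernel)      # both equal 121.0
```

For higher degrees and dimensions the explicit feature space grows combinatorially, while the kernel evaluation stays a single dot product and a power.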

Common Kernels

# Visual comparison of kernels
from sklearn.datasets import make_circles

# Create non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

for ax, kernel in zip(axes.flatten(), kernels):
    svm = SVC(kernel=kernel, C=1.0, gamma='auto')
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm', edgecolors='black')
    ax.set_title(f'{kernel.upper()} Kernel\nTrain Accuracy: {svm.score(X_circles, y_circles):.2f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

Kernel Descriptions

Kernel          Use Case                                  Key Parameter
Linear          Linearly separable data                   -
RBF (Gaussian)  Most non-linear problems                  gamma
Polynomial      Data with polynomial patterns             degree, gamma
Sigmoid         Similar to a neural network activation    gamma

RBF Kernel: The Most Popular

The Radial Basis Function (RBF) kernel is the most commonly used kernel for non-linear classification.

RBF Formula

K(x, x') = exp(-γ ||x - x'||²)

Where γ (gamma) controls the influence of individual training examples.
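As a quick sanity check, the formula can be computed by hand and compared with scikit-learn's rbf_kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 3.0]])
gamma = 0.5

# ||x - z||^2 = 1 + 1 = 2, so K = exp(-0.5 * 2) = exp(-1)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(manual, library)  # both ≈ 0.3679
```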

The Gamma Parameter

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
gamma_values = [0.1, 1, 10]

for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, 
               cmap='coolwarm', edgecolors='black')
    ax.set_title(f'Gamma = {gamma}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

Gamma effects:

  • Small gamma: Smooth decision boundary, may underfit
  • Large gamma: Complex boundary around each point, may overfit
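If gamma is not set explicitly, scikit-learn (since version 0.22) defaults to gamma='scale', which resolves to 1 / (n_features * X.var()), a data-dependent starting point. A minimal sketch of that computation:

```python
import numpy as np

# gamma='scale' (the scikit-learn default) is computed as
# 1 / (n_features * X.var()) -- reproduced here by hand
X = np.random.RandomState(0).randn(100, 4) * 3.0  # synthetic data, std ≈ 3
gamma_scale = 1.0 / (X.shape[1] * X.var())

print(gamma_scale)  # roughly 1 / (4 * 9) ≈ 0.028 for this data
```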

Hyperparameter Tuning

Grid Search for Optimal Parameters

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Grid search
svm = SVC(random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

View Grid Search Results

results = pd.DataFrame(grid_search.cv_results_)
results = results[['param_C', 'param_gamma', 'param_kernel', 'mean_test_score']]
results = results.sort_values('mean_test_score', ascending=False)
print(results.head(10).to_string(index=False))

Probability Predictions

By default, SVM doesn't provide probability estimates. Enable with probability=True.

svm_prob = SVC(kernel='rbf', C=1.0, probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)

# Get probability predictions
y_prob = svm_prob.predict_proba(X_test_scaled)

print("Sample predictions with probabilities:")
for i in range(5):
    print(f"Actual: {y_test[i]}, Predicted: {svm_prob.predict(X_test_scaled[[i]])[0]}, "
          f"Probabilities: {y_prob[i].round(3)}")

Note: Enabling probability estimation uses Platt scaling and increases training time.


Multi-Class Classification

SVM natively handles binary classification. For multi-class problems, scikit-learn uses one-vs-one (OvO) or one-vs-rest (OvR) strategies.

from sklearn.datasets import load_iris

# Load multi-class dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train multi-class SVM
svm_multi = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr', random_state=42)
svm_multi.fit(X_train_iris_scaled, y_train_iris)

# Evaluate
y_pred_iris = svm_multi.predict(X_test_iris_scaled)
print(f"Multi-class accuracy: {accuracy_score(y_test_iris, y_pred_iris):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

Multi-class strategies:

  • One-vs-Rest (OvR): Trains one classifier per class against all others
  • One-vs-One (OvO): Trains one classifier for each pair of classes
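The number of underlying binary classifiers differs between the two strategies: for k classes, OvR trains k models while OvO trains k(k-1)/2. (Note that SVC always trains OvO internally; the decision_function_shape argument only changes the shape of the returned decision values.)

```python
def n_ovr(k):
    """One-vs-Rest: one binary classifier per class."""
    return k

def n_ovo(k):
    """One-vs-One: one binary classifier per pair of classes."""
    return k * (k - 1) // 2

for k in (3, 5, 10):
    print(k, n_ovr(k), n_ovo(k))  # e.g. 10 classes -> 10 OvR vs 45 OvO models
```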

SVM for Regression (SVR)

Support Vector Regression (SVR) applies the same principles to regression: it fits a function so that as many points as possible lie within a tolerance margin of width ε around the prediction.
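The tolerance is formalized by the ε-insensitive loss: errors smaller than ε cost nothing, and only points outside the ε-tube become support vectors:

```latex
L_\varepsilon\bigl(y, f(\mathbf{x})\bigr)
= \max\bigl(0,\; \lvert y - f(\mathbf{x}) \rvert - \varepsilon\bigr)
```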

from sklearn.svm import SVR
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

# Generate regression data
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Scale
scaler_reg = StandardScaler()
X_reg_scaled = scaler_reg.fit_transform(X_reg)

# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_reg_scaled, y_reg)

# Predict
y_pred_reg = svr.predict(X_reg_scaled)

# Visualize
plt.figure(figsize=(10, 5))
sort_idx = X_reg_scaled.flatten().argsort()
plt.scatter(X_reg_scaled, y_reg, alpha=0.5, label='Data')
plt.plot(X_reg_scaled[sort_idx], y_pred_reg[sort_idx], 'r-', linewidth=2, label='SVR Prediction')
plt.xlabel('Feature (scaled)')
plt.ylabel('Target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

print(f"R² Score (on training data): {r2_score(y_reg, y_pred_reg):.4f}")

Advantages and Disadvantages

Advantages

  • Effective in high dimensions: Works well with many features
  • Memory efficient: Uses only support vectors
  • Versatile: Different kernels for different problems
  • Relatively robust to outliers: the soft margin tolerates some noisy points
  • Strong theoretical foundation: Maximizes margin

Disadvantages

  • Slow on large datasets: Training time scales poorly
  • Requires feature scaling: Sensitive to feature magnitudes
  • Difficult to interpret: No direct feature importance
  • Kernel selection: Choosing the right kernel requires experimentation
  • Probability estimation: Requires additional computation

Summary

Support Vector Machines find the optimal separating hyperplane by maximizing the margin between classes.

Key takeaways:

  • SVM maximizes the margin between classes for better generalization
  • Support vectors are the critical points defining the decision boundary
  • C parameter controls the trade-off between margin and misclassification
  • Kernel trick enables non-linear classification without explicit transformation
  • RBF kernel is most popular; gamma controls its flexibility
  • Always scale features before training SVM
  • Use GridSearchCV to tune C and gamma
  • SVM works for both classification and regression