Lesson 4

Support Vector Machines

Master Support Vector Machines for high‑accuracy classification. Learn how SVMs create optimal boundaries and handle linear and nonlinear data with kernel functions.


Introduction to Support Vector Machines

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression. SVM finds the hyperplane that best separates classes with the maximum margin, making it effective for both linear and non-linear classification.

The SVM Intuition

Imagine drawing a line to separate two groups of points. Many lines could work, but SVM chooses the one with the largest gap (margin) between the line and the nearest points from each class. This maximum margin makes SVM robust and helps it generalize well.

Real-world applications:

  • Text categorization and spam filtering
  • Image classification
  • Bioinformatics (gene classification)
  • Handwriting recognition
  • Face detection

Key Concepts in SVM

Hyperplane

A hyperplane is a decision boundary that separates different classes:

  • In 2D: a line
  • In 3D: a plane
  • In higher dimensions: a hyperplane

Margin

The margin is the distance between the hyperplane and the nearest data points from each class. SVM maximizes this margin for better generalization.

Support Vectors

Support vectors are the data points closest to the hyperplane. These critical points define the margin and the hyperplane position. Removing or moving other points doesn't affect the model.
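These ideas have a compact form for a linear SVM. Writing the decision function as f(x) = wᵀx + b and scaling w and b so that the support vectors satisfy |f(x)| = 1 (the standard canonical form), the margin width follows directly:

```latex
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b
\qquad \text{(hyperplane: } f(\mathbf{x}) = 0\text{)}

% Support vectors lie exactly on the canonical margin lines:
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) = 1

% Distance between the two margin lines:
\text{margin width} = \frac{2}{\lVert \mathbf{w} \rVert}
```

This is why maximizing the margin is equivalent to minimizing ||w||.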

Visual Explanation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

np.random.seed(42)

# Generate linearly separable data
class_1 = np.random.randn(20, 2) + np.array([2, 2])
class_2 = np.random.randn(20, 2) + np.array([6, 6])
X = np.vstack([class_1, class_2])
y = np.array([0]*20 + [1]*20)

# Train SVM
svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)

# Get the separating hyperplane
w = svm.coef_[0]
b = svm.intercept_[0]
x_line = np.linspace(0, 8, 100)
y_line = -(w[0] * x_line + b) / w[1]

# Calculate margin boundaries
# (the perpendicular margin 1/||w|| is converted to a vertical offset for plotting)
margin = 1 / np.linalg.norm(w)
y_margin_up = y_line + np.sqrt(1 + (w[0]/w[1])**2) * margin
y_margin_down = y_line - np.sqrt(1 + (w[0]/w[1])**2) * margin

# Plot
plt.figure(figsize=(10, 7))
plt.scatter(class_1[:, 0], class_1[:, 1], c='blue', label='Class 0', s=100)
plt.scatter(class_2[:, 0], class_2[:, 1], c='red', label='Class 1', s=100)
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1], 
            s=300, facecolors='none', edgecolors='green', linewidths=2,
            label='Support Vectors')
plt.plot(x_line, y_line, 'k-', linewidth=2, label='Decision Boundary')
plt.plot(x_line, y_margin_up, 'k--', linewidth=1, alpha=0.5)
plt.plot(x_line, y_margin_down, 'k--', linewidth=1, alpha=0.5)
plt.fill_between(x_line, y_margin_down, y_margin_up, alpha=0.1, color='gray')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM: Maximum Margin Classifier')
plt.legend()
plt.xlim(0, 9)
plt.ylim(0, 10)
plt.show()

print(f"Number of support vectors: {len(svm.support_vectors_)}")

The green circles highlight support vectors - the points that define the margin and decision boundary.


Hard Margin vs Soft Margin SVM

Hard Margin SVM

Hard margin SVM requires perfect separation between classes. It only works when data is linearly separable with no noise or outliers.

Soft Margin SVM

Real-world data often has noise and overlapping classes. Soft margin SVM allows some misclassifications by introducing slack variables, controlled by the C parameter.
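In optimization terms, soft margin SVM minimizes a trade-off between margin width and total slack, with C weighting the slack term. This is the standard primal formulation:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \ge 1 - \xi_i,
\;\; \xi_i \ge 0
```

Each ξᵢ measures how far point i violates its margin; C is exactly the parameter passed to SVC.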

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
C_values = [0.01, 1, 100]

# Add some noise/overlap
np.random.seed(42)
X_noisy = np.vstack([
    np.random.randn(30, 2) + np.array([2, 2]),
    np.random.randn(30, 2) + np.array([4, 4])
])
y_noisy = np.array([0]*30 + [1]*30)

for ax, C in zip(axes, C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_noisy, y_noisy)
    
    # Decision boundary
    ax.set_xlim(-1, 8)
    ax.set_ylim(-1, 8)
    
    xx, yy = np.meshgrid(np.linspace(-1, 8, 100), np.linspace(-1, 8, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, 
                colors=['blue', 'white', 'red'])
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='black', linestyles=['--', '-', '--'])
    ax.scatter(X_noisy[:, 0], X_noisy[:, 1], c=y_noisy, cmap='coolwarm', edgecolors='black')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=150, facecolors='none', edgecolors='green', linewidths=2)
    ax.set_title(f'C = {C}\nSupport Vectors: {len(svm.support_vectors_)}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

The C parameter:

  • Small C: tolerates more margin violations, giving a wider margin and stronger regularization (may underfit)
  • Large C: penalizes violations heavily, giving a narrower margin that fits the training data more tightly (may overfit)

Implementing Linear SVM

Step 1: Import Libraries and Load Data

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Classes: {data.target_names}")

Step 2: Prepare Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features - important for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Important: SVM is sensitive to feature scales. Always standardize features before training.

Step 3: Train Linear SVM

# Create and train SVM
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Evaluate
y_pred = svm_linear.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

The Kernel Trick

Linear SVM works well when data is linearly separable. For non-linear data, SVM uses the kernel trick to project data into a higher-dimensional space where it becomes linearly separable.

How Kernels Work

Instead of explicitly transforming data (computationally expensive), kernels compute the similarity between points in the transformed space directly.
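As a small sketch of this idea (not how libsvm computes things internally): for 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x·z)² gives exactly the dot product under the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²), so the transformed vectors never need to be built:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D vector."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = phi(x) @ phi(z)   # dot product in the transformed space
kernel = (x @ z) ** 2        # computed directly in the original space

print(explicit, kernel)      # both equal 121.0
```

For higher degrees and dimensions the explicit feature space grows combinatorially, while the kernel evaluation stays a single dot product and a power.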

Common Kernels

# Visual comparison of kernels
from sklearn.datasets import make_circles

# Create non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

for ax, kernel in zip(axes.flatten(), kernels):
    svm = SVC(kernel=kernel, C=1.0, gamma='auto')
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm', edgecolors='black')
    ax.set_title(f'{kernel.upper()} Kernel\nTrain Accuracy: {svm.score(X_circles, y_circles):.2f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

Kernel Descriptions

Kernel          Use Case                                  Key Parameter
Linear          Linearly separable data                   -
RBF (Gaussian)  Most non-linear problems                  gamma
Polynomial      Data with polynomial patterns             degree, gamma
Sigmoid         Similar to a neural network activation    gamma

RBF Kernel: The Most Popular

The Radial Basis Function (RBF) kernel is the most commonly used kernel for non-linear classification.

RBF Formula

K(x, x') = exp(-γ ||x - x'||²)

Where γ (gamma) controls the influence of individual training examples.
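As a quick sanity check, the formula can be computed by hand and compared with scikit-learn's rbf_kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 3.0]])
gamma = 0.5

# ||x - z||^2 = 1 + 1 = 2, so K = exp(-0.5 * 2) = exp(-1)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(manual, library)  # both ≈ 0.3679
```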

The Gamma Parameter

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
gamma_values = [0.1, 1, 10]

for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_circles, y_circles)
    
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, 
               cmap='coolwarm', edgecolors='black')
    ax.set_title(f'Gamma = {gamma}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

Gamma effects:

  • Small gamma: Smooth decision boundary, may underfit
  • Large gamma: Complex boundary around each point, may overfit
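If gamma is not set explicitly, scikit-learn (since version 0.22) defaults to gamma='scale', which resolves to 1 / (n_features * X.var()), a data-dependent starting point. A minimal sketch of that computation:

```python
import numpy as np

# gamma='scale' (the scikit-learn default) is computed as
# 1 / (n_features * X.var()) -- reproduced here by hand
X = np.random.RandomState(0).randn(100, 4) * 3.0  # synthetic data, std ≈ 3
gamma_scale = 1.0 / (X.shape[1] * X.var())

print(gamma_scale)  # roughly 1 / (4 * 9) ≈ 0.028 for this data
```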

Hyperparameter Tuning

Grid Search for Optimal Parameters

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Grid search
svm = SVC(random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

View Grid Search Results

results = pd.DataFrame(grid_search.cv_results_)
results = results[['param_C', 'param_gamma', 'param_kernel', 'mean_test_score']]
results = results.sort_values('mean_test_score', ascending=False)
print(results.head(10).to_string(index=False))

Probability Predictions

By default, SVM doesn't provide probability estimates. Enable with probability=True.

svm_prob = SVC(kernel='rbf', C=1.0, probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)

# Get probability predictions
y_prob = svm_prob.predict_proba(X_test_scaled)

print("Sample predictions with probabilities:")
for i in range(5):
    print(f"Actual: {y_test[i]}, Predicted: {svm_prob.predict(X_test_scaled[[i]])[0]}, "
          f"Probabilities: {y_prob[i].round(3)}")

Note: Enabling probability estimation uses Platt scaling and increases training time.


Multi-Class Classification

SVM natively handles binary classification. For multi-class problems, scikit-learn uses one-vs-one (OvO) or one-vs-rest (OvR) strategies.

from sklearn.datasets import load_iris

# Load multi-class dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Train multi-class SVM
svm_multi = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr', random_state=42)
svm_multi.fit(X_train_iris_scaled, y_train_iris)

# Evaluate
y_pred_iris = svm_multi.predict(X_test_iris_scaled)
print(f"Multi-class accuracy: {accuracy_score(y_test_iris, y_pred_iris):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

Multi-class strategies:

  • One-vs-Rest (OvR): Trains one classifier per class against all others
  • One-vs-One (OvO): Trains one classifier for each pair of classes
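The number of underlying binary classifiers differs between the two strategies: for k classes, OvR trains k models while OvO trains k(k-1)/2. (Note that SVC always trains OvO internally; the decision_function_shape argument only changes the shape of the returned decision values.)

```python
def n_ovr(k):
    """One-vs-Rest: one binary classifier per class."""
    return k

def n_ovo(k):
    """One-vs-One: one binary classifier per pair of classes."""
    return k * (k - 1) // 2

for k in (3, 5, 10):
    print(k, n_ovr(k), n_ovo(k))  # e.g. 10 classes -> 10 OvR vs 45 OvO models
```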

SVM for Regression (SVR)

Support Vector Regression (SVR) applies the same principles to regression: it fits a function so that as many points as possible lie within a tolerance margin of width ε around the prediction.
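The tolerance is formalized by the ε-insensitive loss: errors smaller than ε cost nothing, and only points outside the ε-tube become support vectors:

```latex
L_\varepsilon\bigl(y, f(\mathbf{x})\bigr)
= \max\bigl(0,\; \lvert y - f(\mathbf{x}) \rvert - \varepsilon\bigr)
```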

from sklearn.svm import SVR
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

# Generate regression data
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Scale
scaler_reg = StandardScaler()
X_reg_scaled = scaler_reg.fit_transform(X_reg)

# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_reg_scaled, y_reg)

# Predict
y_pred_reg = svr.predict(X_reg_scaled)

# Visualize
plt.figure(figsize=(10, 5))
sort_idx = X_reg_scaled.flatten().argsort()
plt.scatter(X_reg_scaled, y_reg, alpha=0.5, label='Data')
plt.plot(X_reg_scaled[sort_idx], y_pred_reg[sort_idx], 'r-', linewidth=2, label='SVR Prediction')
plt.xlabel('Feature (scaled)')
plt.ylabel('Target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

print(f"R² Score (on training data): {r2_score(y_reg, y_pred_reg):.4f}")

Advantages and Disadvantages

Advantages

  • Effective in high dimensions: Works well with many features
  • Memory efficient: Uses only support vectors
  • Versatile: Different kernels for different problems
  • Relatively robust to outliers: the soft margin tolerates some noisy points
  • Strong theoretical foundation: Maximizes margin

Disadvantages

  • Slow on large datasets: Training time scales poorly
  • Requires feature scaling: Sensitive to feature magnitudes
  • Difficult to interpret: No direct feature importance
  • Kernel selection: Choosing the right kernel requires experimentation
  • Probability estimation: Requires additional computation

Summary

Support Vector Machines find the optimal separating hyperplane by maximizing the margin between classes.

Key takeaways:

  • SVM maximizes the margin between classes for better generalization
  • Support vectors are the critical points defining the decision boundary
  • C parameter controls the trade-off between margin and misclassification
  • Kernel trick enables non-linear classification without explicit transformation
  • RBF kernel is most popular; gamma controls its flexibility
  • Always scale features before training SVM
  • Use GridSearchCV to tune C and gamma
  • SVM works for both classification and regression