Master Support Vector Machines for high‑accuracy classification. Learn how SVMs create optimal boundaries and handle linear and nonlinear data with kernel functions.
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification and regression. SVM finds the hyperplane that best separates classes with the maximum margin, making it effective for both linear and non-linear classification.
Imagine drawing a line to separate two groups of points. Many lines could work, but SVM finds the line that has the largest gap (margin) between the line and the nearest points from each class. This maximum margin makes SVM robust and generalizes well.
Real-world applications:
- Text classification and spam filtering
- Image classification and face detection
- Handwriting recognition
- Bioinformatics, such as classifying genes and proteins
A hyperplane is a decision boundary that separates different classes:
- In 2D, it is a line
- In 3D, it is a plane
- In higher dimensions, it is a hyperplane (hard to visualize, but the math is the same)
The margin is the distance between the hyperplane and the nearest data points from each class. SVM maximizes this margin for better generalization.
Support vectors are the data points closest to the hyperplane. These critical points define the margin and the hyperplane position. Removing or moving other points doesn't affect the model.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
np.random.seed(42)
# Generate linearly separable data
class_1 = np.random.randn(20, 2) + np.array([2, 2])
class_2 = np.random.randn(20, 2) + np.array([6, 6])
X = np.vstack([class_1, class_2])
y = np.array([0]*20 + [1]*20)
# Train SVM
svm = SVC(kernel='linear', C=1.0)
svm.fit(X, y)
# Get the separating hyperplane
w = svm.coef_[0]
b = svm.intercept_[0]
x_line = np.linspace(0, 8, 100)
y_line = -(w[0] * x_line + b) / w[1]
# Calculate margin boundaries
margin = 1 / np.linalg.norm(w)
y_margin_up = y_line + np.sqrt(1 + (w[0]/w[1])**2) * margin
y_margin_down = y_line - np.sqrt(1 + (w[0]/w[1])**2) * margin
# Plot
plt.figure(figsize=(10, 7))
plt.scatter(class_1[:, 0], class_1[:, 1], c='blue', label='Class 0', s=100)
plt.scatter(class_2[:, 0], class_2[:, 1], c='red', label='Class 1', s=100)
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
            s=300, facecolors='none', edgecolors='green', linewidths=2,
            label='Support Vectors')
plt.plot(x_line, y_line, 'k-', linewidth=2, label='Decision Boundary')
plt.plot(x_line, y_margin_up, 'k--', linewidth=1, alpha=0.5)
plt.plot(x_line, y_margin_down, 'k--', linewidth=1, alpha=0.5)
plt.fill_between(x_line, y_margin_down, y_margin_up, alpha=0.1, color='gray')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM: Maximum Margin Classifier')
plt.legend()
plt.xlim(0, 9)
plt.ylim(0, 10)
plt.show()
print(f"Number of support vectors: {len(svm.support_vectors_)}")
The green circles highlight support vectors - the points that define the margin and decision boundary.
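The claim that only the support vectors matter is easy to verify: refit the model on the support vectors alone and the decision boundary stays the same. A minimal sketch with freshly generated data (similar clusters to the listing above, but not the same arrays):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([2, 2], 1, (20, 2)),
               rng.normal([6, 6], 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

svm_full = SVC(kernel='linear', C=1.0).fit(X, y)

# Refit using only the support vectors (svm.support_ holds their indices)
sv_idx = svm_full.support_
svm_sv = SVC(kernel='linear', C=1.0).fit(X[sv_idx], y[sv_idx])

# The discarded points never influenced the fit, so predictions match
same = (svm_full.predict(X) == svm_sv.predict(X)).all()
print(same)
```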
Hard margin SVM requires perfect separation between classes. It only works when data is linearly separable with no noise or outliers.
Real-world data often has noise and overlapping classes. Soft margin SVM allows some misclassifications by introducing slack variables, controlled by the C parameter.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
C_values = [0.01, 1, 100]
# Add some noise/overlap
np.random.seed(42)
X_noisy = np.vstack([
    np.random.randn(30, 2) + np.array([2, 2]),
    np.random.randn(30, 2) + np.array([4, 4])
])
y_noisy = np.array([0]*30 + [1]*30)
for ax, C in zip(axes, C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_noisy, y_noisy)
    # Decision boundary
    ax.set_xlim(-1, 8)
    ax.set_ylim(-1, 8)
    xx, yy = np.meshgrid(np.linspace(-1, 8, 100), np.linspace(-1, 8, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2,
                colors=['blue', 'white', 'red'])
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='black', linestyles=['--', '-', '--'])
    ax.scatter(X_noisy[:, 0], X_noisy[:, 1], c=y_noisy, cmap='coolwarm', edgecolors='black')
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=150, facecolors='none', edgecolors='green', linewidths=2)
    ax.set_title(f'C = {C}\nSupport Vectors: {len(svm.support_vectors_)}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
The C parameter controls the penalty for misclassification:
- Small C: wide margin, more misclassifications tolerated (stronger regularization, risk of underfitting)
- Large C: narrow margin, fewer misclassifications tolerated (weaker regularization, risk of overfitting)
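The effect shows up directly in the support-vector count: with overlapping clusters (synthetic data generated below, mirroring the plots above), lowering C admits more margin violations and therefore more support vectors. A quick sketch:

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(42)
X = np.vstack([np.random.randn(30, 2) + [2, 2],
               np.random.randn(30, 2) + [4, 4]])
y = np.array([0] * 30 + [1] * 30)

counts = {}
for C in [0.01, 1, 100]:
    svm = SVC(kernel='linear', C=C).fit(X, y)
    counts[C] = len(svm.support_vectors_)

print(counts)  # support-vector count shrinks as C grows
```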
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer
# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Classes: {data.target_names}")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features - important for SVM
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Important: SVM is sensitive to feature scales. Always standardize features before training.
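To see why, compare the same RBF model with and without standardization on this dataset (a quick sketch; the exact accuracies depend on the scikit-learn version, but scaling consistently helps):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same model, raw vs standardized features
raw = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)

scaler = StandardScaler().fit(X_tr)
scaled = SVC(kernel='rbf').fit(scaler.transform(X_tr), y_tr) \
                          .score(scaler.transform(X_te), y_te)

print(f"unscaled: {raw:.3f}, scaled: {scaled:.3f}")
```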
# Create and train SVM
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train_scaled, y_train)
# Evaluate
y_pred = svm_linear.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
Linear SVM works well when data is linearly separable. For non-linear data, SVM uses the kernel trick to project data into a higher-dimensional space where it becomes linearly separable.
Instead of explicitly transforming data (computationally expensive), kernels compute the similarity between points in the transformed space directly.
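A concrete illustration with the degree-2 polynomial kernel K(x, z) = (x·z + 1)²: the kernel value computed in the original 2D space equals a dot product in an explicit 6-dimensional feature space. The feature map `phi` below is the standard expansion, written out purely for illustration:

```python
import numpy as np

def phi(v):
    # Explicit feature map whose inner product reproduces (x.z + 1)**2 in 2D
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)    # dot product in the 6-dimensional space
kernel = (x @ z + 1) ** 2     # same value, computed directly in 2D

print(explicit, kernel)       # both ~ 25
```

The kernel evaluation never builds the 6-dimensional vectors, which is why kernels stay cheap even when the implicit space is huge (or, for RBF, infinite-dimensional).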
# Visual comparison of kernels
from sklearn.datasets import make_circles
# Create non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for ax, kernel in zip(axes.flatten(), kernels):
    svm = SVC(kernel=kernel, C=1.0, gamma='auto')
    svm.fit(X_circles, y_circles)
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm', edgecolors='black')
    ax.set_title(f'{kernel.upper()} Kernel\nAccuracy: {svm.score(X_circles, y_circles):.2f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
| Kernel | Use Case | Key Parameter |
|---|---|---|
| Linear | Linearly separable data | - |
| RBF (Gaussian) | Most non-linear problems | gamma |
| Polynomial | When data has polynomial patterns | degree, gamma |
| Sigmoid | Behaves like a neural network activation (tanh) | gamma, coef0 |
The Radial Basis Function (RBF) kernel is the most commonly used kernel for non-linear classification.
K(x, x') = exp(-γ ||x - x'||²)
Where γ (gamma) controls the influence of individual training examples.
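The formula is simple enough to check by hand against scikit-learn's own implementation (`sklearn.metrics.pairwise.rbf_kernel`):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.0]])
gamma = 0.5

# ||x - z||^2 = 1 + 4 = 5, so K = exp(-0.5 * 5) = exp(-2.5)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]

print(manual, library)
```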
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
gamma_values = [0.1, 1, 10]
for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_circles, y_circles)
    xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles,
               cmap='coolwarm', edgecolors='black')
    ax.set_title(f'Gamma = {gamma}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
Gamma effects:
- Low gamma: each training point influences a wide region, producing smooth boundaries (risk of underfitting)
- High gamma: influence is very local, producing tight, irregular boundaries (risk of overfitting)
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
# Grid search
svm = SVC(random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")
# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
results = pd.DataFrame(grid_search.cv_results_)
results = results[['param_C', 'param_gamma', 'param_kernel', 'mean_test_score']]
results = results.sort_values('mean_test_score', ascending=False)
print(results.head(10).to_string(index=False))
By default, SVM doesn't provide probability estimates. Enable with probability=True.
svm_prob = SVC(kernel='rbf', C=1.0, probability=True, random_state=42)
svm_prob.fit(X_train_scaled, y_train)
# Get probability predictions
y_prob = svm_prob.predict_proba(X_test_scaled)
print("Sample predictions with probabilities:")
for i in range(5):
    print(f"Actual: {y_test[i]}, Predicted: {svm_prob.predict(X_test_scaled[[i]])[0]}, "
          f"Probabilities: {y_prob[i].round(3)}")
Note: Enabling probability estimation uses Platt scaling and increases training time.
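If you only need a confidence score rather than calibrated probabilities, decision_function avoids that overhead; its sign matches the predicted class in the binary case. A small sketch on synthetic blobs (assumed data, not the breast cancer split above):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=42)
svm = SVC(kernel='linear').fit(X, y)

# Signed distance to the hyperplane: sign gives the class, magnitude the confidence
scores = svm.decision_function(X[:5])
agree = ((scores > 0).astype(int) == svm.predict(X[:5])).all()
print(scores)
print(agree)
```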
SVM natively handles binary classification. For multi-class problems, scikit-learn uses one-vs-one (OvO) or one-vs-rest (OvR) strategies.
from sklearn.datasets import load_iris
# Load multi-class dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)
# Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)
# Train multi-class SVM
svm_multi = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr', random_state=42)
svm_multi.fit(X_train_iris_scaled, y_train_iris)
# Evaluate
y_pred_iris = svm_multi.predict(X_test_iris_scaled)
print(f"Multi-class accuracy: {accuracy_score(y_test_iris, y_pred_iris):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))
Multi-class strategies:
- One-vs-one (OvO): trains a binary classifier for every pair of classes, n(n-1)/2 in total; this is what SVC uses internally
- One-vs-rest (OvR): trains one classifier per class against all the others; decision_function_shape='ovr' exposes the scores in this per-class form
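The two strategies are visible in the shape of decision_function. With 4 classes (synthetic blobs, an assumed example), OvO produces one column per pair of classes and OvR one per class:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=4, random_state=42)  # 4 classes

ovo = SVC(decision_function_shape='ovo').fit(X, y)
ovr = SVC(decision_function_shape='ovr').fit(X, y)

print(ovo.decision_function(X[:1]).shape)  # (1, 6): 4*3/2 pairwise classifiers
print(ovr.decision_function(X[:1]).shape)  # (1, 4): one score per class
```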
Support Vector Regression uses similar principles to fit data within a margin of tolerance.
from sklearn.svm import SVR
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
# Generate regression data
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
# Scale
scaler_reg = StandardScaler()
X_reg_scaled = scaler_reg.fit_transform(X_reg)
# Train SVR
svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X_reg_scaled, y_reg)
# Predict
y_pred_reg = svr.predict(X_reg_scaled)
# Visualize
plt.figure(figsize=(10, 5))
sort_idx = X_reg_scaled.flatten().argsort()
plt.scatter(X_reg_scaled, y_reg, alpha=0.5, label='Data')
plt.plot(X_reg_scaled[sort_idx], y_pred_reg[sort_idx], 'r-', linewidth=2, label='SVR Prediction')
plt.xlabel('Feature (scaled)')
plt.ylabel('Target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
print(f"R² Score: {r2_score(y_reg, y_pred_reg):.4f}")
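The epsilon parameter sets the width of the tube within which errors are ignored: points inside it carry no loss and are not support vectors, so widening the tube shrinks the support set. A quick sketch on the same kind of synthetic data:

```python
from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

counts = {}
for eps in [0.1, 5, 20]:
    svr = SVR(kernel='rbf', C=100, epsilon=eps).fit(X, y)
    counts[eps] = len(svr.support_)  # support_ holds support-vector indices

print(counts)  # fewer support vectors as the tube widens
```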
Support Vector Machines find the optimal separating hyperplane by maximizing the margin between classes.
Key takeaways:
- SVM maximizes the margin, and only the support vectors determine the decision boundary
- Always standardize features before training; SVM is sensitive to feature scales
- C controls the margin/misclassification trade-off; gamma controls the reach of each RBF training point
- The kernel trick handles non-linear data without explicit transformation; RBF is the most common choice
- SVC handles multi-class problems via OvO/OvR strategies, and SVR extends the same ideas to regression