VIDHYAI
HomeBlogTutorialsNewsAboutContact
VIDHYAI

Your Gateway to AI Knowledge

CONTENT

  • Blog
  • Tutorials
  • News

COMPANY

  • About
  • Contact

LEGAL

  • Privacy Policy
  • Terms of Service
  • Disclaimer
Home
Tutorials
Machine Learning
Supervised Learning - Classification
Classification Algorithms
K-Nearest Neighbors (KNN) Algorithm
Back to Classification Algorithms
Progress2/6 lessons (33%)
Lesson 2

K-Nearest Neighbors (KNN) Algorithm

Understand how the KNN algorithm classifies data based on similarity. This lesson explains distance metrics, choosing the right K value, and building accurate classification models.

10 min read22 views

Introduction to K-Nearest Neighbors

K-Nearest Neighbors (KNN) is an intuitive algorithm based on a simple principle: similar things are close to each other. To classify a new data point, KNN looks at the K closest training examples and assigns the most common class among them.

The KNN Intuition

Imagine you move to a new neighborhood and want to predict your political preference. KNN would look at your K nearest neighbors and predict you'll have the same preference as the majority of them.

Real-world applications:

  • Recommendation systems (similar users, similar products)
  • Image recognition
  • Medical diagnosis based on similar patient records
  • Anomaly detection
  • Handwriting recognition

How KNN Works

The Algorithm Steps

  1. Choose K: Select the number of neighbors to consider
  2. Calculate Distance: Measure distance from the new point to all training points
  3. Find Neighbors: Identify the K nearest training points
  4. Vote: Count the classes of the K neighbors
  5. Predict: Assign the most common class (majority vote)

Visual Explanation

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Create sample data
class_a = np.random.randn(20, 2) + np.array([2, 2])
class_b = np.random.randn(20, 2) + np.array([5, 5])

# New point to classify
new_point = np.array([3.5, 3.5])

plt.figure(figsize=(8, 6))
plt.scatter(class_a[:, 0], class_a[:, 1], c='blue', s=100, label='Class A')
plt.scatter(class_b[:, 0], class_b[:, 1], c='red', s=100, label='Class B')
plt.scatter(new_point[0], new_point[1], c='green', s=200, marker='*', 
            label='New Point', edgecolors='black', linewidths=2)

# Draw circle representing K=5 nearest neighbors
circle = plt.Circle((new_point[0], new_point[1]), 1.8, fill=False, 
                     linestyle='--', color='gray', label='K=5 Neighborhood')
plt.gca().add_patch(circle)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Classification: Finding Nearest Neighbors')
plt.legend()
plt.axis('equal')
plt.show()

The algorithm finds the K points closest to the green star and uses their classes to make a prediction.


Distance Metrics in KNN

KNN relies on distance to determine "nearness." The most common distance metrics are:

Euclidean Distance (Default)

d = √[(x₂-x₁)² + (y₂-y₁)² + ... + (zₙ-z₁)²]

This is the straight-line distance, like measuring with a ruler.

Manhattan Distance

d = |x₂-x₁| + |y₂-y₁| + ... + |zₙ-z₁|

This measures distance along axes only, like walking on a city grid.

Calculating Distances in Python

from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances
import numpy as np

point_a = np.array([[1, 2]])
point_b = np.array([[4, 6]])

euclidean = euclidean_distances(point_a, point_b)[0][0]
manhattan = manhattan_distances(point_a, point_b)[0][0]

print(f"Euclidean Distance: {euclidean:.2f}")
print(f"Manhattan Distance: {manhattan:.2f}")

Output:

Euclidean Distance: 5.00
Manhattan Distance: 7.00

Implementing KNN in Python

Step 1: Import Libraries and Load Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Dataset shape: {X.shape}")

The Iris dataset contains flower measurements for three species, making it perfect for demonstrating multi-class classification.

Step 2: Split and Scale Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features - crucial for KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Important: Always scale features before using KNN. Since KNN relies on distance, features with larger scales would dominate the distance calculation unfairly.

Step 3: Train the KNN Model

# Create and train KNN classifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

The n_neighbors parameter specifies how many neighbors to consider when making predictions.

Step 4: Make Predictions and Evaluate

y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Step 5: Predict for New Data

# New flower measurements
new_flower = np.array([[5.0, 3.5, 1.5, 0.3]])

# Scale the new data
new_flower_scaled = scaler.transform(new_flower)

# Predict
prediction = knn.predict(new_flower_scaled)
probabilities = knn.predict_proba(new_flower_scaled)

print(f"Predicted class: {iris.target_names[prediction[0]]}")
print(f"Class probabilities: {probabilities[0].round(3)}")

The predict_proba() method returns the proportion of each class among the K neighbors.


Choosing the Optimal K Value

The choice of K significantly impacts model performance:

  • Small K (e.g., K=1): Highly sensitive to noise, may overfit
  • Large K: Smoother boundaries, may underfit
  • K should be odd: Avoids ties in binary classification

Finding the Best K with Cross-Validation

from sklearn.model_selection import cross_val_score

k_values = range(1, 31)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    cv_scores.append(scores.mean())

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, cv_scores, 'bo-')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Finding Optimal K Value')
plt.xticks(k_values[::2])
plt.grid(True, alpha=0.3)
plt.show()

best_k = k_values[np.argmax(cv_scores)]
print(f"Best K: {best_k} with accuracy: {max(cv_scores):.4f}")

This approach tests multiple K values and selects the one with the highest cross-validation accuracy.

Visualizing Different K Values

from sklearn.datasets import make_classification

# Create 2D data for visualization
X_2d, y_2d = make_classification(
    n_samples=200, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42
)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
k_values = [1, 5, 15]

for ax, k in zip(axes, k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_2d, y_2d)
    
    # Create mesh grid
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, cmap='coolwarm', edgecolors='black')
    ax.set_title(f'K = {k}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

Notice how K=1 creates jagged boundaries (overfitting), while K=15 creates smoother boundaries.


Weighted KNN

Standard KNN treats all K neighbors equally. Weighted KNN gives closer neighbors more influence.

# Compare uniform and distance-weighted KNN
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn_uniform.fit(X_train_scaled, y_train)
knn_weighted.fit(X_train_scaled, y_train)

print(f"Uniform weights accuracy: {knn_uniform.score(X_test_scaled, y_test):.4f}")
print(f"Distance weights accuracy: {knn_weighted.score(X_test_scaled, y_test):.4f}")

Weights options:

  • 'uniform': All neighbors weighted equally
  • 'distance': Closer neighbors have more influence (weight = 1/distance)

KNN for Regression

KNN can also perform regression by averaging the target values of K neighbors instead of voting.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

# Generate regression data
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train KNN regressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_reg, y_train_reg)

# Evaluate
y_pred_reg = knn_reg.predict(X_test_reg)
print(f"R² Score: {r2_score(y_test_reg, y_pred_reg):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_reg)):.4f}")

Advantages and Disadvantages of KNN

Advantages

  • Simple and intuitive: Easy to understand and implement
  • No training phase: Model stores training data directly
  • Naturally handles multi-class: No modification needed
  • Non-parametric: Makes no assumptions about data distribution

Disadvantages

  • Slow predictions: Must calculate distance to all training points
  • Memory intensive: Stores entire training dataset
  • Sensitive to irrelevant features: All features contribute to distance
  • Curse of dimensionality: Performance degrades with many features
  • Requires feature scaling: Distance-based algorithm

Best Practices for KNN

# Complete KNN pipeline with best practices
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Define parameter grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")

Using a pipeline ensures scaling is applied consistently and prevents data leakage.


Summary

K-Nearest Neighbors is a simple yet effective algorithm that classifies based on similarity to nearby training examples.

Key takeaways:

  • KNN predicts based on the majority class of K nearest neighbors
  • Distance metrics (Euclidean, Manhattan) measure similarity
  • Feature scaling is essential for KNN
  • Small K may overfit; large K may underfit
  • Use cross-validation to find optimal K
  • Weighted KNN gives closer neighbors more influence
  • KNN is instance-based (lazy learning) with no explicit training phase
  • Consider computational cost for large datasets
Back to Classification Algorithms

Previous Lesson

Logistic Regression for Binary Classification

Next Lesson

Decision Trees

Related Lessons

1

Decision Trees

Explore Decision Trees and how they split data into meaningful decision rules. This lesson teaches tree-building, visualization, and practical classification applications.

2

Naive Bayes Classifier

Discover the Naive Bayes classifier, a fast and powerful algorithm based on probability and Bayes’ theorem. This lesson shows how it excels in text classification and other high‑dimensional tasks.

3

Logistic Regression for Binary Classification

Learn how Logistic Regression predicts binary outcomes using probability-based decision boundaries. This lesson covers theory, implementation, and practical use cases like spam detection and churn prediction.

In this track (6)

1Logistic Regression for Binary Classification2K-Nearest Neighbors (KNN) Algorithm3Decision Trees4Support Vector Machines5Naive Bayes Classifier6Classification Project - Customer Churn Prediction