Lesson 5

Naive Bayes Classifier

Discover the Naive Bayes classifier, a fast and powerful algorithm based on probability and Bayes’ theorem. This lesson shows how it excels in text classification and other high‑dimensional tasks.


Introduction to Naive Bayes

Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. Despite its simplicity and "naive" assumption of feature independence, it performs remarkably well on many real-world problems, especially text classification.

Why "Naive"?

The algorithm assumes all features are independent of each other given the class label. While this assumption rarely holds true in practice, Naive Bayes still achieves excellent results.
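The independence assumption is what makes the model so cheap: the joint likelihood of all features factors into a product of per-feature likelihoods. A minimal sketch of that factorization, using made-up per-word probabilities for a hypothetical spam model:

```python
# Naive assumption: P(x1, x2, ..., xn | class) = P(x1|class) * P(x2|class) * ...
# Hypothetical per-word likelihoods (made-up numbers for illustration)
p_word_given_spam = {"free": 0.8, "winner": 0.6, "meeting": 0.05}

def joint_likelihood(words, per_word_probs):
    """Multiply per-feature likelihoods under the independence assumption."""
    result = 1.0
    for word in words:
        result *= per_word_probs[word]
    return result

# P("free", "winner" | spam) = 0.8 * 0.6 = 0.48 under the naive assumption
print(joint_likelihood(["free", "winner"], p_word_given_spam))
```

In reality the words "free" and "winner" are correlated in spam, so the true joint probability is not exactly this product, but the approximation is often good enough for classification.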

Real-world applications:

  • Email spam filtering
  • Sentiment analysis
  • Document categorization
  • Medical diagnosis
  • Real-time prediction (due to speed)

Bayes' Theorem

Bayes' theorem describes the probability of an event based on prior knowledge of related conditions.

The Formula

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:

  • P(A|B): Posterior probability - probability of A given B
  • P(B|A): Likelihood - probability of B given A
  • P(A): Prior probability of A
  • P(B): Evidence - probability of B

Applied to Classification

P(Class|Features) = [P(Features|Class) × P(Class)] / P(Features)

Since P(Features) is identical for every class, it can be ignored when comparing classes: the classifier simply predicts the class with the highest posterior probability.

Simple Example

# Spam classification example
# Given: Word "free" appears in email
# Question: Is it spam?

# Prior probabilities (from training data)
p_spam = 0.3          # 30% of emails are spam
p_not_spam = 0.7      # 70% are not spam

# Likelihoods
p_free_given_spam = 0.8      # "free" appears in 80% of spam
p_free_given_not_spam = 0.1  # "free" appears in 10% of non-spam

# Calculate posterior using Bayes' theorem
p_free = (p_free_given_spam * p_spam) + (p_free_given_not_spam * p_not_spam)

p_spam_given_free = (p_free_given_spam * p_spam) / p_free
p_not_spam_given_free = (p_free_given_not_spam * p_not_spam) / p_free

print(f"P(Spam | 'free'): {p_spam_given_free:.4f}")
print(f"P(Not Spam | 'free'): {p_not_spam_given_free:.4f}")
print(f"Prediction: {'Spam' if p_spam_given_free > p_not_spam_given_free else 'Not Spam'}")

Output:

P(Spam | 'free'): 0.7742
P(Not Spam | 'free'): 0.2258
Prediction: Spam

Types of Naive Bayes Classifiers

Scikit-learn provides three main Naive Bayes variants, each suited for different types of data.

1. Gaussian Naive Bayes

Assumes continuous features follow a normal (Gaussian) distribution.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load data
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Best for: Continuous numerical features

2. Multinomial Naive Bayes

Works with discrete count data, typically word counts in text classification.

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Example: Document classification with word counts
# Feature matrix: rows=documents, columns=word frequencies
X_counts = np.array([
    [3, 0, 1, 2, 0],  # Document 1
    [2, 1, 0, 3, 1],  # Document 2
    [0, 2, 3, 0, 2],  # Document 3
    [1, 3, 2, 0, 1],  # Document 4
])
y_docs = np.array([0, 0, 1, 1])  # Class labels

mnb = MultinomialNB()
mnb.fit(X_counts, y_docs)

# Predict for new document
new_doc = np.array([[1, 1, 2, 1, 1]])
prediction = mnb.predict(new_doc)
probabilities = mnb.predict_proba(new_doc)

print(f"Predicted class: {prediction[0]}")
print(f"Class probabilities: {probabilities[0].round(3)}")

Best for: Text classification, word count features

3. Bernoulli Naive Bayes

Works with binary/boolean features (presence or absence).

from sklearn.naive_bayes import BernoulliNB

# Example: Binary features (word present=1, absent=0)
X_binary = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
])
y_binary = np.array([0, 0, 1, 1])

bnb = BernoulliNB()
bnb.fit(X_binary, y_binary)

print(f"Accuracy: {bnb.score(X_binary, y_binary):.4f}")

Best for: Binary features, short text classification

Comparison Summary

Variant        | Feature Type    | Use Case
GaussianNB     | Continuous      | General numerical data
MultinomialNB  | Discrete counts | Text (word frequencies)
BernoulliNB    | Binary          | Text (word presence/absence)
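To make the comparison concrete, the sketch below fits each variant on synthetic data matched to its assumed feature type (continuous, counts, binary). The data-generation choices here are illustrative, not from the lesson:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)

# Two classes of 10 samples each, 3 features
y = np.array([0] * 10 + [1] * 10)

# Continuous features: class 1 shifted by +2 -> GaussianNB
X_cont = rng.normal(size=(20, 3)) + np.repeat([[0], [2]], 10, axis=0)
# Count features: class 1 has higher Poisson rates -> MultinomialNB
X_count = rng.poisson(lam=np.repeat([[1], [4]], 10, axis=0), size=(20, 3))
# Binary features: threshold the counts -> BernoulliNB
X_bin = (X_count > 2).astype(int)

scores = {}
for name, model, X in [
    ("GaussianNB", GaussianNB(), X_cont),
    ("MultinomialNB", MultinomialNB(), X_count),
    ("BernoulliNB", BernoulliNB(), X_bin),
]:
    model.fit(X, y)
    scores[name] = model.score(X, y)
    print(f"{name}: training accuracy = {scores[name]:.2f}")
```

Each variant performs well on data that matches its distributional assumption; using the wrong variant (e.g. MultinomialNB on negative continuous values) either errors out or degrades accuracy.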

Text Classification with Naive Bayes

Text classification is the most common application of Naive Bayes. Let's build a complete sentiment classifier.

Step 1: Prepare Text Data

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np

# Sample movie reviews
reviews = [
    "This movie was absolutely wonderful and amazing",
    "Terrible film, complete waste of time",
    "I loved every minute of this masterpiece",
    "Boring and dull, would not recommend",
    "Fantastic acting and brilliant storyline",
    "Awful movie, very disappointing",
    "A beautiful and touching experience",
    "Worst movie I have ever seen",
    "Incredible performance by the lead actor",
    "Painfully slow and uninteresting",
    "Highly entertaining and fun to watch",
    "Dreadful acting and poor direction",
    "An absolute joy from start to finish",
    "Complete disaster of a film",
    "Superb cinematography and music",
    "Tedious plot with no redeeming qualities"
]

# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42
)

Step 2: Convert Text to Features

# Using Count Vectorizer (word frequencies)
count_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_counts.shape}")
print(f"\nSample vocabulary words: {list(count_vectorizer.vocabulary_.keys())[:10]}")

Step 3: Train and Evaluate

# Train Multinomial Naive Bayes
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, y_train)

# Predict
y_pred = nb_classifier.predict(X_test_counts)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Step 4: Classify New Reviews

def classify_review(review, vectorizer, model):
    """Classify a single review."""
    review_vector = vectorizer.transform([review])
    prediction = model.predict(review_vector)[0]
    probability = model.predict_proba(review_vector)[0]
    
    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = probability[prediction]
    
    return sentiment, confidence

# Test new reviews
new_reviews = [
    "This was an amazing experience, loved it!",
    "Terrible waste of money, very bad",
    "Not great but not terrible either"
]

print("New Review Classifications:")
print("-" * 50)
for review in new_reviews:
    sentiment, confidence = classify_review(review, count_vectorizer, nb_classifier)
    print(f"Review: '{review[:40]}...'")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})\n")

Using TF-IDF for Better Results

TF-IDF (Term Frequency-Inverse Document Frequency) often improves text classification by weighting words by how informative they are: words that appear in most documents are down-weighted, while distinctive words are emphasized.

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train with TF-IDF features
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)

# Compare results
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)
print(f"Count Vectorizer Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"TF-IDF Accuracy: {accuracy_score(y_test, y_pred_tfidf):.4f}")

Inspecting Model Internals

Naive Bayes provides interpretable results by showing which features indicate each class.

Feature Log Probabilities

# Get feature names
feature_names = count_vectorizer.get_feature_names_out()

# Get log probabilities for each class
log_probs = nb_classifier.feature_log_prob_

# Find most indicative words for each class
def get_top_features(log_probs, feature_names, class_idx, n=5):
    """Get top features for a class."""
    sorted_idx = log_probs[class_idx].argsort()[::-1][:n]
    return [(feature_names[i], log_probs[class_idx][i]) for i in sorted_idx]

print("Top words indicating NEGATIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 0):
    print(f"  {word}: {prob:.3f}")

print("\nTop words indicating POSITIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 1):
    print(f"  {word}: {prob:.3f}")

Handling the Zero Probability Problem

If a word never appears with a class in training data, its probability is zero, making the entire product zero. Laplace smoothing (additive smoothing) prevents this.
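With additive smoothing, the estimate becomes P(word|class) = (count(word, class) + alpha) / (total words in class + alpha * V), where V is the vocabulary size. A small hand computation with made-up counts shows how a never-seen word keeps a nonzero probability:

```python
# Laplace (additive) smoothing computed by hand, using made-up counts
alpha = 1.0
vocab_size = 4

# Word counts observed in the "spam" class; "meeting" never appeared
spam_counts = {"free": 6, "winner": 3, "offer": 1, "meeting": 0}
total = sum(spam_counts.values())  # 10

def smoothed_prob(word):
    """(count + alpha) / (total + alpha * V) keeps every probability > 0."""
    return (spam_counts[word] + alpha) / (total + alpha * vocab_size)

print(f"P('meeting'|spam) without smoothing: {spam_counts['meeting'] / total}")
print(f"P('meeting'|spam) with smoothing:    {smoothed_prob('meeting'):.4f}")
# The smoothed probabilities still sum to 1 over the vocabulary
print(sum(smoothed_prob(w) for w in spam_counts))
```

Without smoothing, P('meeting'|spam) = 0 would zero out the posterior for any email containing "meeting", no matter how spammy its other words are.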

# Alpha parameter controls smoothing
# alpha=1.0 is Laplace smoothing (default)
# alpha=0 means no smoothing (can cause zero probability issues)

alphas = [0.01, 0.1, 1.0, 10.0]

print("Effect of Smoothing Parameter (alpha):")
print("-" * 40)

for alpha in alphas:
    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train_counts, y_train)
    score = nb.score(X_test_counts, y_test)
    print(f"Alpha = {alpha}: Accuracy = {score:.4f}")

Smoothing effects:

  • Small alpha: Less smoothing, more sensitive to training data
  • Large alpha: More smoothing, more uniform probabilities

Gaussian Naive Bayes for Numerical Data

For continuous features, Gaussian Naive Bayes assumes a normal distribution.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42
)

# Train Gaussian NB (feature scaling is not required; per-feature
# linear scaling does not change its predictions)
gnb = GaussianNB()
gnb.fit(X_train_c, y_train_c)

# Evaluate
y_pred_c = gnb.predict(X_test_c)
print(f"Accuracy: {accuracy_score(y_test_c, y_pred_c):.4f}")

# View learned parameters
print(f"\nClass priors: {gnb.class_prior_}")
print(f"Feature means shape: {gnb.theta_.shape}")  # Mean of each feature per class
print(f"Feature variances shape: {gnb.var_.shape}")  # Variance of each feature per class

Complete Pipeline for Text Classification

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create pipeline
text_clf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Define parameter grid
param_grid = {
    'vectorizer__max_features': [100, 500, 1000],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__stop_words': [None, 'english'],
    'classifier__alpha': [0.1, 0.5, 1.0]
}

# Grid search (using full data for demonstration)
grid_search = GridSearchCV(
    text_clf_pipeline,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(reviews, labels)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")

Advantages and Disadvantages

Advantages

  • Extremely fast: Training and prediction are very quick
  • Simple implementation: Easy to understand and implement
  • Works well with small data: Requires less training data
  • Handles high dimensions: Effective with many features
  • Probabilistic predictions: Provides probability estimates
  • Interpretable: Can inspect feature probabilities

Disadvantages

  • Independence assumption: Rarely true in practice
  • Zero frequency problem: Requires smoothing
  • Limited expressiveness: Cannot learn feature interactions
  • Continuous features: Gaussian assumption may not hold
  • Imbalanced data: Can be biased toward majority class

When to Use Naive Bayes

Naive Bayes excels when:

  • You need fast training and prediction
  • Data is high-dimensional (many features)
  • Training data is limited
  • Features are reasonably independent
  • You need probabilistic predictions

Consider alternatives when:

  • Feature interactions are important
  • You have abundant training data
  • Maximum accuracy is critical
  • Features are highly correlated

Summary

Naive Bayes is a fast, probabilistic classifier that performs surprisingly well despite its simplifying assumptions.

Key takeaways:

  • Based on Bayes' theorem with the naive independence assumption
  • Gaussian NB for continuous features assuming normal distribution
  • Multinomial NB for discrete counts (text word frequencies)
  • Bernoulli NB for binary features (word presence/absence)
  • Excellent for text classification tasks
  • Laplace smoothing prevents zero probability issues
  • Extremely fast training and prediction
  • Provides interpretable probability estimates
  • Works well with high-dimensional data