Discover the Naive Bayes classifier, a fast and powerful algorithm based on probability and Bayes’ theorem. This lesson shows how it excels in text classification and other high‑dimensional tasks.
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. Despite its simplicity and "naive" assumption of feature independence, it performs remarkably well on many real-world problems, especially text classification.
The algorithm assumes all features are independent of each other given the class label. While this assumption rarely holds true in practice, Naive Bayes still achieves excellent results.
Real-world applications:
- Spam filtering and email classification
- Sentiment analysis of reviews and social media posts
- Document categorization (news topics, support tickets)
- Real-time predictions where training and inference speed matter
Bayes' theorem describes the probability of an event based on prior knowledge of related conditions.
P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
- P(A|B): the posterior, the probability of A given that B occurred
- P(B|A): the likelihood, the probability of B given A
- P(A): the prior probability of A, before seeing B
- P(B): the evidence, the overall probability of B

Applied to classification, this becomes:
P(Class|Features) = [P(Features|Class) × P(Class)] / P(Features)
The classifier predicts the class with the highest posterior probability.
# Spam classification example
# Given: Word "free" appears in email
# Question: Is it spam?
# Prior probabilities (from training data)
p_spam = 0.3 # 30% of emails are spam
p_not_spam = 0.7 # 70% are not spam
# Likelihoods
p_free_given_spam = 0.8 # "free" appears in 80% of spam
p_free_given_not_spam = 0.1 # "free" appears in 10% of non-spam
# Calculate posterior using Bayes' theorem
p_free = (p_free_given_spam * p_spam) + (p_free_given_not_spam * p_not_spam)
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
p_not_spam_given_free = (p_free_given_not_spam * p_not_spam) / p_free
print(f"P(Spam | 'free'): {p_spam_given_free:.4f}")
print(f"P(Not Spam | 'free'): {p_not_spam_given_free:.4f}")
print(f"Prediction: {'Spam' if p_spam_given_free > p_not_spam_given_free else 'Not Spam'}")
Output:
P(Spam | 'free'): 0.7742
P(Not Spam | 'free'): 0.2258
Prediction: Spam
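The same calculation extends to multiple words: the "naive" independence assumption lets us multiply per-word likelihoods. A sketch with illustrative numbers (the likelihoods for "winner" are made up for this example):

```python
# Extending the spam example to two words under the independence assumption:
# P(words | class) = product of P(word | class) over the words.
# All likelihood values below are illustrative, not from real data.
p_spam, p_not_spam = 0.3, 0.7

likelihoods_spam = {"free": 0.8, "winner": 0.6}
likelihoods_not_spam = {"free": 0.1, "winner": 0.05}

words = ["free", "winner"]

# Unnormalized scores: prior times the product of likelihoods
score_spam = p_spam
score_not_spam = p_not_spam
for w in words:
    score_spam *= likelihoods_spam[w]
    score_not_spam *= likelihoods_not_spam[w]

# Normalize to posterior probabilities
total = score_spam + score_not_spam
print(f"P(Spam | 'free', 'winner'): {score_spam / total:.4f}")
```

Each additional spam-indicating word multiplies the evidence, so the posterior climbs quickly even from a 30% prior.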
Scikit-learn provides three main Naive Bayes variants, each suited for different types of data.
GaussianNB assumes continuous features follow a normal (Gaussian) distribution within each class.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Best for: Continuous numerical features
MultinomialNB works with discrete count data, typically word counts in text classification.
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Example: Document classification with word counts
# Feature matrix: rows=documents, columns=word frequencies
X_counts = np.array([
[3, 0, 1, 2, 0], # Document 1
[2, 1, 0, 3, 1], # Document 2
[0, 2, 3, 0, 2], # Document 3
[1, 3, 2, 0, 1], # Document 4
])
y_docs = np.array([0, 0, 1, 1]) # Class labels
mnb = MultinomialNB()
mnb.fit(X_counts, y_docs)
# Predict for new document
new_doc = np.array([[1, 1, 2, 1, 1]])
prediction = mnb.predict(new_doc)
probabilities = mnb.predict_proba(new_doc)
print(f"Predicted class: {prediction[0]}")
print(f"Class probabilities: {probabilities[0].round(3)}")
Best for: Text classification, word count features
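Under the hood, MultinomialNB works in log space: a document's joint log likelihood for a class is the class log prior plus the dot product of its word counts with the per-class log word probabilities. As a sketch, we can reproduce predict_proba from the fitted attributes, refitting the same toy data as above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_counts = np.array([
    [3, 0, 1, 2, 0],
    [2, 1, 0, 3, 1],
    [0, 2, 3, 0, 2],
    [1, 3, 2, 0, 1],
])
y_docs = np.array([0, 0, 1, 1])
mnb = MultinomialNB().fit(X_counts, y_docs)

new_doc = np.array([[1, 1, 2, 1, 1]])

# log P(class) + sum over words of count * log P(word | class)
joint = mnb.class_log_prior_ + new_doc @ mnb.feature_log_prob_.T

# Normalize in log space to get posterior probabilities
probs = np.exp(joint - joint.max())
probs /= probs.sum()
print(probs.round(3))  # matches mnb.predict_proba(new_doc)
```

Working in log space is also why the zero-probability problem matters: a single log(0) term would send the whole score to negative infinity.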
BernoulliNB works with binary/boolean features (presence or absence of each feature).
from sklearn.naive_bayes import BernoulliNB
# Example: Binary features (word present=1, absent=0)
X_binary = np.array([
[1, 0, 1, 1, 0],
[1, 1, 0, 1, 0],
[0, 1, 1, 0, 1],
[0, 1, 1, 0, 1],
])
y_binary = np.array([0, 0, 1, 1])
bnb = BernoulliNB()
bnb.fit(X_binary, y_binary)
print(f"Accuracy: {bnb.score(X_binary, y_binary):.4f}")
Best for: Binary features, short text classification
| Variant | Feature Type | Use Case |
|---|---|---|
| GaussianNB | Continuous | General numerical data |
| MultinomialNB | Discrete counts | Text (word frequencies) |
| BernoulliNB | Binary | Text (word presence/absence) |
Text classification is the most common application of Naive Bayes. Let's build a complete sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
# Sample movie reviews
reviews = [
"This movie was absolutely wonderful and amazing",
"Terrible film, complete waste of time",
"I loved every minute of this masterpiece",
"Boring and dull, would not recommend",
"Fantastic acting and brilliant storyline",
"Awful movie, very disappointing",
"A beautiful and touching experience",
"Worst movie I have ever seen",
"Incredible performance by the lead actor",
"Painfully slow and uninteresting",
"Highly entertaining and fun to watch",
"Dreadful acting and poor direction",
"An absolute joy from start to finish",
"Complete disaster of a film",
"Superb cinematography and music",
"Tedious plot with no redeeming qualities"
]
# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(
reviews, labels, test_size=0.25, random_state=42
)
# Using Count Vectorizer (word frequencies)
count_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_counts.shape}")
print(f"\nSample vocabulary words: {list(count_vectorizer.vocabulary_.keys())[:10]}")
# Train Multinomial Naive Bayes
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, y_train)
# Predict
y_pred = nb_classifier.predict(X_test_counts)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
def classify_review(review, vectorizer, model):
    """Classify a single review and return its sentiment with confidence."""
    review_vector = vectorizer.transform([review])
    prediction = model.predict(review_vector)[0]
    probability = model.predict_proba(review_vector)[0]
    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = probability[prediction]
    return sentiment, confidence
# Test new reviews
new_reviews = [
"This was an amazing experience, loved it!",
"Terrible waste of money, very bad",
"Not great but not terrible either"
]
print("New Review Classifications:")
print("-" * 50)
for review in new_reviews:
    sentiment, confidence = classify_review(review, count_vectorizer, nb_classifier)
    print(f"Review: '{review[:40]}...'")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})\n")
TF-IDF (Term Frequency-Inverse Document Frequency) often improves text classification by weighting words by how distinctive they are across documents, rather than by raw frequency alone.
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
# Train with TF-IDF features
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
# Compare results
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)
print(f"Count Vectorizer Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"TF-IDF Accuracy: {accuracy_score(y_test, y_pred_tfidf):.4f}")
Naive Bayes provides interpretable results by showing which features indicate each class.
# Get feature names
feature_names = count_vectorizer.get_feature_names_out()
# Get log probabilities for each class
log_probs = nb_classifier.feature_log_prob_
# Find most indicative words for each class
def get_top_features(log_probs, feature_names, class_idx, n=5):
    """Get the n highest-probability features for a class."""
    sorted_idx = log_probs[class_idx].argsort()[::-1][:n]
    return [(feature_names[i], log_probs[class_idx][i]) for i in sorted_idx]
print("Top words indicating NEGATIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 0):
    print(f"  {word}: {prob:.3f}")
print("\nTop words indicating POSITIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 1):
    print(f"  {word}: {prob:.3f}")
If a word never appears with a class in training data, its probability is zero, making the entire product zero. Laplace smoothing (additive smoothing) prevents this.
# Alpha parameter controls smoothing
# alpha=1.0 is Laplace smoothing (default)
# alpha=0 means no smoothing (can cause zero probability issues)
alphas = [0.01, 0.1, 1.0, 10.0]
print("Effect of Smoothing Parameter (alpha):")
print("-" * 40)
for alpha in alphas:
    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train_counts, y_train)
    score = nb.score(X_test_counts, y_test)
    print(f"Alpha = {alpha}: Accuracy = {score:.4f}")
Smoothing effects:
- Small alpha (near 0): probabilities stay close to the observed frequencies, but rare words carry outsized weight and unseen words remain problematic
- alpha = 1.0 (Laplace): a safe default that adds one pseudo-count to every word
- Large alpha: probabilities are pushed toward uniform, which can underfit
For continuous features, Gaussian Naive Bayes assumes a normal distribution.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_cancer, y_cancer, test_size=0.2, random_state=42
)
# Train Gaussian NB (feature scaling is unnecessary: the model fits a
# separate mean and variance for each feature in each class)
gnb = GaussianNB()
gnb.fit(X_train_c, y_train_c)
# Evaluate
y_pred_c = gnb.predict(X_test_c)
print(f"Accuracy: {accuracy_score(y_test_c, y_pred_c):.4f}")
# View learned parameters
print(f"\nClass priors: {gnb.class_prior_}")
print(f"Feature means shape: {gnb.theta_.shape}") # Mean of each feature per class
print(f"Feature variances shape: {gnb.var_.shape}") # Variance of each feature per class
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Create pipeline
text_clf_pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('classifier', MultinomialNB())
])
# Define parameter grid
param_grid = {
'vectorizer__max_features': [100, 500, 1000],
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__stop_words': [None, 'english'],
'classifier__alpha': [0.1, 0.5, 1.0]
}
# Grid search (using full data for demonstration)
grid_search = GridSearchCV(
text_clf_pipeline,
param_grid,
cv=3,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(reviews, labels)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")
Naive Bayes excels when:
- Features are high-dimensional and sparse (e.g., text)
- Training data is limited
- Training and prediction speed matter
- You need a strong, simple baseline
Consider alternatives when:
- Features are strongly correlated, violating the independence assumption
- You need well-calibrated probability estimates
- Complex feature interactions drive the outcome
Naive Bayes is a fast, probabilistic classifier that performs surprisingly well despite its simplifying assumptions.
Key takeaways:
- Naive Bayes applies Bayes' theorem with a simplifying assumption of feature independence
- Choose the variant that matches your data: GaussianNB for continuous features, MultinomialNB for counts, BernoulliNB for binary features
- Laplace smoothing (the alpha parameter) prevents zero probabilities for unseen feature-class combinations
- It is a fast, interpretable baseline, especially for text classification