Discover the Naive Bayes classifier, a fast and powerful algorithm based on probability and Bayes’ theorem. This lesson shows how it excels in text classification and other high‑dimensional tasks.
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. Despite its simplicity and "naive" assumption of feature independence, it performs remarkably well on many real-world problems, especially text classification.
The algorithm assumes all features are independent of each other given the class label. While this assumption rarely holds true in practice, Naive Bayes still achieves excellent results.
Real-world applications:
- Spam filtering and email classification
- Sentiment analysis of reviews and social media posts
- Document categorization (news topics, support tickets)
- Real-time predictions where training and inference speed matter
Bayes' theorem describes the probability of an event based on prior knowledge of related conditions.
P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
- P(A|B): the posterior, the probability of A given that B occurred
- P(B|A): the likelihood, the probability of B given A
- P(A): the prior probability of A, before seeing B
- P(B): the evidence, the overall probability of B

Applied to classification, this becomes:
P(Class|Features) = [P(Features|Class) × P(Class)] / P(Features)
The classifier predicts the class with the highest posterior probability.
# Spam classification example
# Given: Word "free" appears in email
# Question: Is it spam?
# Prior probabilities (from training data)
p_spam = 0.3 # 30% of emails are spam
p_not_spam = 0.7 # 70% are not spam
# Likelihoods
p_free_given_spam = 0.8 # "free" appears in 80% of spam
p_free_given_not_spam = 0.1 # "free" appears in 10% of non-spam
# Calculate posterior using Bayes' theorem
p_free = (p_free_given_spam * p_spam) + (p_free_given_not_spam * p_not_spam)
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
p_not_spam_given_free = (p_free_given_not_spam * p_not_spam) / p_free
print(f"P(Spam | 'free'): {p_spam_given_free:.4f}")
print(f"P(Not Spam | 'free'): {p_not_spam_given_free:.4f}")
print(f"Prediction: {'Spam' if p_spam_given_free > p_not_spam_given_free else 'Not Spam'}")
Output:
P(Spam | 'free'): 0.7742
P(Not Spam | 'free'): 0.2258
Prediction: Spam
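The same calculation extends to multiple words: the "naive" independence assumption lets us multiply per-word likelihoods. A sketch with illustrative numbers (the likelihoods for "winner" are made up for this example):

```python
# Extending the spam example to two words under the independence assumption:
# P(words | class) = product of P(word | class) over the words.
# All likelihood values below are illustrative, not from real data.
p_spam, p_not_spam = 0.3, 0.7

likelihoods_spam = {"free": 0.8, "winner": 0.6}
likelihoods_not_spam = {"free": 0.1, "winner": 0.05}

words = ["free", "winner"]

# Unnormalized scores: prior times the product of likelihoods
score_spam = p_spam
score_not_spam = p_not_spam
for w in words:
    score_spam *= likelihoods_spam[w]
    score_not_spam *= likelihoods_not_spam[w]

# Normalize to posterior probabilities
total = score_spam + score_not_spam
print(f"P(Spam | 'free', 'winner'): {score_spam / total:.4f}")
```

Each additional spam-indicating word multiplies the evidence, so the posterior climbs quickly even from a 30% prior.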
Scikit-learn provides three main Naive Bayes variants, each suited for different types of data.
GaussianNB assumes continuous features follow a normal (Gaussian) distribution within each class.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Best for: Continuous numerical features
MultinomialNB works with discrete count data, typically word counts in text classification.
from sklearn.naive_bayes import MultinomialNB
import numpy as np
# Example: Document classification with word counts
# Feature matrix: rows=documents, columns=word frequencies
X_counts = np.array([
[3, 0, 1, 2, 0], # Document 1
[2, 1, 0, 3, 1], # Document 2
[0, 2, 3, 0, 2], # Document 3
[1, 3, 2, 0, 1], # Document 4
])
y_docs = np.array([0, 0, 1, 1]) # Class labels
mnb = MultinomialNB()
mnb.fit(X_counts, y_docs)
# Predict for new document
new_doc = np.array([[1, 1, 2, 1, 1]])
prediction = mnb.predict(new_doc)
probabilities = mnb.predict_proba(new_doc)
print(f"Predicted class: {prediction[0]}")
print(f"Class probabilities: {probabilities[0].round(3)}")
Best for: Text classification, word count features
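Under the hood, MultinomialNB works in log space: a document's joint log likelihood for a class is the class log prior plus the dot product of its word counts with the per-class log word probabilities. As a sketch, we can reproduce predict_proba from the fitted attributes, refitting the same toy data as above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_counts = np.array([
    [3, 0, 1, 2, 0],
    [2, 1, 0, 3, 1],
    [0, 2, 3, 0, 2],
    [1, 3, 2, 0, 1],
])
y_docs = np.array([0, 0, 1, 1])
mnb = MultinomialNB().fit(X_counts, y_docs)

new_doc = np.array([[1, 1, 2, 1, 1]])

# log P(class) + sum over words of count * log P(word | class)
joint = mnb.class_log_prior_ + new_doc @ mnb.feature_log_prob_.T

# Normalize in log space to get posterior probabilities
probs = np.exp(joint - joint.max())
probs /= probs.sum()
print(probs.round(3))  # matches mnb.predict_proba(new_doc)
```

Working in log space is also why the zero-probability problem matters: a single log(0) term would send the whole score to negative infinity.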
BernoulliNB works with binary/boolean features (presence or absence of each feature).
from sklearn.naive_bayes import BernoulliNB
# Example: Binary features (word present=1, absent=0)
X_binary = np.array([
[1, 0, 1, 1, 0],
[1, 1, 0, 1, 0],
[0, 1, 1, 0, 1],
[0, 1, 1, 0, 1],
])
y_binary = np.array([0, 0, 1, 1])
bnb = BernoulliNB()
bnb.fit(X_binary, y_binary)
print(f"Accuracy: {bnb.score(X_binary, y_binary):.4f}")
Best for: Binary features, short text classification
| Variant | Feature Type | Use Case |
|---|---|---|
| GaussianNB | Continuous | General numerical data |
| MultinomialNB | Discrete counts | Text (word frequencies) |
| BernoulliNB | Binary | Text (word presence/absence) |
Text classification is the most common application of Naive Bayes. Let's build a complete sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
# Sample movie reviews
reviews = [
"This movie was absolutely wonderful and amazing",
"Terrible film, complete waste of time",
"I loved every minute of this masterpiece",
"Boring and dull, would not recommend",
"Fantastic acting and brilliant storyline",
"Awful movie, very disappointing",
"A beautiful and touching experience",
"Worst movie I have ever seen",
"Incredible performance by the lead actor",
"Painfully slow and uninteresting",
"Highly entertaining and fun to watch",
"Dreadful acting and poor direction",
"An absolute joy from start to finish",
"Complete disaster of a film",
"Superb cinematography and music",
"Tedious plot with no redeeming qualities"
]
# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(
reviews, labels, test_size=0.25, random_state=42
)
# Using Count Vectorizer (word frequencies)
count_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_counts.shape}")
print(f"\nSample vocabulary words: {list(count_vectorizer.vocabulary_.keys())[:10]}")
# Train Multinomial Naive Bayes
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, y_train)
# Predict
y_pred = nb_classifier.predict(X_test_counts)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
def classify_review(review, vectorizer, model):
    """Classify a single review and return its sentiment with confidence."""
    review_vector = vectorizer.transform([review])
    prediction = model.predict(review_vector)[0]
    probability = model.predict_proba(review_vector)[0]
    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = probability[prediction]
    return sentiment, confidence
# Test new reviews
new_reviews = [
"This was an amazing experience, loved it!",
"Terrible waste of money, very bad",
"Not great but not terrible either"
]
print("New Review Classifications:")
print("-" * 50)
for review in new_reviews:
    sentiment, confidence = classify_review(review, count_vectorizer, nb_classifier)
    print(f"Review: '{review[:40]}...'")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})\n")
TF-IDF (Term Frequency-Inverse Document Frequency) often improves text classification by weighting words by how distinctive they are across documents, rather than by raw frequency alone.
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
# Train with TF-IDF features
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
# Compare results
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)
print(f"Count Vectorizer Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"TF-IDF Accuracy: {accuracy_score(y_test, y_pred_tfidf):.4f}")
Naive Bayes provides interpretable results by showing which features indicate each class.
# Get feature names
feature_names = count_vectorizer.get_feature_names_out()
# Get log probabilities for each class
log_probs = nb_classifier.feature_log_prob_
# Find most indicative words for each class
def get_top_features(log_probs, feature_names, class_idx, n=5):
    """Get the n highest-probability features for a class."""
    sorted_idx = log_probs[class_idx].argsort()[::-1][:n]
    return [(feature_names[i], log_probs[class_idx][i]) for i in sorted_idx]
print("Top words indicating NEGATIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 0):
    print(f"  {word}: {prob:.3f}")
print("\nTop words indicating POSITIVE sentiment:")
for word, prob in get_top_features(log_probs, feature_names, 1):
    print(f"  {word}: {prob:.3f}")
If a word never appears with a class in training data, its probability is zero, making the entire product zero. Laplace smoothing (additive smoothing) prevents this.
# Alpha parameter controls smoothing
# alpha=1.0 is Laplace smoothing (default)
# alpha=0 means no smoothing (can cause zero probability issues)
alphas = [0.01, 0.1, 1.0, 10.0]
print("Effect of Smoothing Parameter (alpha):")
print("-" * 40)
for alpha in alphas:
    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train_counts, y_train)
    score = nb.score(X_test_counts, y_test)
    print(f"Alpha = {alpha}: Accuracy = {score:.4f}")
Smoothing effects:
- Small alpha (near 0): probabilities stay close to the observed frequencies, but rare words carry outsized weight and unseen words remain problematic
- alpha = 1.0 (Laplace): a safe default that adds one pseudo-count to every word
- Large alpha: probabilities are pushed toward uniform, which can underfit
For continuous features, Gaussian Naive Bayes assumes a normal distribution.
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_cancer, y_cancer, test_size=0.2, random_state=42
)
# Train Gaussian NB (feature scaling is unnecessary: the model fits a
# separate mean and variance for each feature in each class)
gnb = GaussianNB()
gnb.fit(X_train_c, y_train_c)
# Evaluate
y_pred_c = gnb.predict(X_test_c)
print(f"Accuracy: {accuracy_score(y_test_c, y_pred_c):.4f}")
# View learned parameters
print(f"\nClass priors: {gnb.class_prior_}")
print(f"Feature means shape: {gnb.theta_.shape}") # Mean of each feature per class
print(f"Feature variances shape: {gnb.var_.shape}") # Variance of each feature per class
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Create pipeline
text_clf_pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('classifier', MultinomialNB())
])
# Define parameter grid
param_grid = {
'vectorizer__max_features': [100, 500, 1000],
'vectorizer__ngram_range': [(1, 1), (1, 2)],
'vectorizer__stop_words': [None, 'english'],
'classifier__alpha': [0.1, 0.5, 1.0]
}
# Grid search (using full data for demonstration)
grid_search = GridSearchCV(
text_clf_pipeline,
param_grid,
cv=3,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(reviews, labels)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")
Naive Bayes excels when:
- Features are high-dimensional and sparse (e.g., text)
- Training data is limited
- Training and prediction speed matter
- You need a strong, simple baseline
Consider alternatives when:
- Features are strongly correlated, violating the independence assumption
- You need well-calibrated probability estimates
- Complex feature interactions drive the outcome
Naive Bayes is a fast, probabilistic classifier that performs surprisingly well despite its simplifying assumptions.
Key takeaways:
- Naive Bayes applies Bayes' theorem with a simplifying assumption of feature independence
- Choose the variant that matches your data: GaussianNB for continuous features, MultinomialNB for counts, BernoulliNB for binary features
- Laplace smoothing (the alpha parameter) prevents zero probabilities for unseen feature-class combinations
- It is a fast, interpretable baseline, especially for text classification