Probability distributions describe how data values are spread and are essential for modeling and inference in machine learning. This section covers common distributions—such as normal, binomial, and uniform—and explains their role in understanding data, estimating probabilities, and building probabilistic models.
Bernoulli Distribution: Models binary outcomes (success/failure, yes/no).
import numpy as np
from scipy import stats
# Bernoulli: single trial with probability p
p = 0.7 # Probability of success
# Generate samples
np.random.seed(42)
samples = stats.bernoulli.rvs(p, size=1000)
print(f"Bernoulli (p={p}):")
print(f"Sample mean: {samples.mean():.3f} (expected: {p})")
print(f"Sample variance: {samples.var():.3f} (expected: {p*(1-p):.3f})")
Binomial Distribution: Models the number of successes in n independent trials.
import numpy as np
from scipy import stats
# Binomial: n trials, each with probability p
n, p = 10, 0.3
# Generate samples
np.random.seed(42)
samples = stats.binom.rvs(n, p, size=1000)
print(f"Binomial (n={n}, p={p}):")
print(f"Sample mean: {samples.mean():.3f} (expected: {n*p})")
print(f"Sample variance: {samples.var():.3f} (expected: {n*p*(1-p):.3f})")
# Probability calculations
print(f"\nP(X = 3): {stats.binom.pmf(3, n, p):.4f}")
print(f"P(X ≤ 3): {stats.binom.cdf(3, n, p):.4f}")
Poisson Distribution: Models the number of events in a fixed interval.
import numpy as np
from scipy import stats
# Poisson: average rate λ
lambda_rate = 5 # Average 5 events per interval
np.random.seed(42)
samples = stats.poisson.rvs(lambda_rate, size=1000)
print(f"Poisson (λ={lambda_rate}):")
print(f"Sample mean: {samples.mean():.3f} (expected: {lambda_rate})")
print(f"Sample variance: {samples.var():.3f} (expected: {lambda_rate})")
# ML application: modeling count data (website visits, defects, etc.)
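The same pmf/cdf helpers shown for the binomial above exist for the Poisson as well. A short sketch, reusing λ = 5 from the block above; P(X ≥ 8) is computed via the survival function:
from scipy import stats
lambda_rate = 5
print(f"P(X = 3): {stats.poisson.pmf(3, lambda_rate):.4f}")
print(f"P(X ≤ 3): {stats.poisson.cdf(3, lambda_rate):.4f}")
# P(X ≥ 8) via the survival function: sf(7) = 1 - cdf(7)
print(f"P(X ≥ 8): {stats.poisson.sf(7, lambda_rate):.4f}")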
Normal (Gaussian) Distribution: The most important distribution in statistics and ML.
import numpy as np
from scipy import stats
# Normal distribution parameters
mu, sigma = 0, 1 # Standard normal
# Generate samples
np.random.seed(42)
samples = stats.norm.rvs(mu, sigma, size=1000)
print(f"Normal (μ={mu}, σ={sigma}):")
print(f"Sample mean: {samples.mean():.3f}")
print(f"Sample std: {samples.std():.3f}")
# Probability calculations
print(f"\nP(X < 0): {stats.norm.cdf(0, mu, sigma):.4f}")
print(f"P(-1 < X < 1): {stats.norm.cdf(1, mu, sigma) - stats.norm.cdf(-1, mu, sigma):.4f}")
print(f"P(-2 < X < 2): {stats.norm.cdf(2, mu, sigma) - stats.norm.cdf(-2, mu, sigma):.4f}")
Why Normal Distribution Matters in ML:
- The Central Limit Theorem: sums and averages of many independent effects tend toward a normal distribution, whatever the underlying distribution looks like (see the sketch below).
- Many methods assume it: linear regression assumes normally distributed errors, and Gaussian Naive Bayes (demonstrated later in this section) assumes normally distributed features within each class.
- It is fully specified by two parameters (mean μ and standard deviation σ), which makes it analytically convenient.
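As a quick illustration of the Central Limit Theorem, this minimal sketch averages draws from a (non-normal) uniform distribution and checks that the sample means concentrate around the theoretical mean and standard error:
import numpy as np
# CLT sketch: averages of 30 uniform draws look approximately normal
np.random.seed(42)
sample_means = np.random.uniform(0, 10, size=(2000, 30)).mean(axis=1)
print(f"Mean of sample means: {sample_means.mean():.3f} (theory: 5.0)")
print(f"Std of sample means: {sample_means.std():.3f} (theory: {np.sqrt(100/12/30):.3f})")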
Uniform Distribution: Models equal probability over an interval [a, b].
import numpy as np
from scipy import stats
# Uniform distribution: equal probability across range [a, b]
a, b = 0, 10
np.random.seed(42)
samples = stats.uniform.rvs(loc=a, scale=(b-a), size=1000)
print(f"Uniform ({a}, {b}):")
print(f"Sample mean: {samples.mean():.3f} (expected: {(a+b)/2})")
print(f"Sample variance: {samples.var():.3f} (expected: {(b-a)**2/12:.3f})")
# Used for random initialization in neural networks
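As a concrete illustration of that last comment, here is a minimal sketch of Glorot/Xavier-style uniform initialization, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)); the layer sizes here are hypothetical:
import numpy as np
# Sketch: Glorot/Xavier-style uniform weight initialization
fan_in, fan_out = 128, 64  # hypothetical layer sizes
limit = np.sqrt(6 / (fan_in + fan_out))
np.random.seed(42)
W = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
print(f"Init range: ±{limit:.4f}, sample std: {W.std():.4f}")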
Exponential Distribution: Models the time between events (e.g., waiting times, arrivals).
import numpy as np
from scipy import stats
# Exponential: models time between events
lambda_rate = 0.5 # Rate parameter
np.random.seed(42)
samples = stats.expon.rvs(scale=1/lambda_rate, size=1000)
print(f"Exponential (λ={lambda_rate}):")
print(f"Sample mean: {samples.mean():.3f} (expected: {1/lambda_rate})")
# Used for modeling: time until failure, customer arrivals, etc.
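One property of the exponential worth seeing numerically is memorylessness: P(X > s + t | X > s) = P(X > t). A short sketch using scipy's survival function (sf), reusing λ = 0.5 from above:
from scipy import stats
lambda_rate = 0.5
s, t = 3.0, 2.0
# Survival function: sf(x) = P(X > x) = exp(-lambda * x)
p_t = stats.expon.sf(t, scale=1/lambda_rate)
p_cond = stats.expon.sf(s + t, scale=1/lambda_rate) / stats.expon.sf(s, scale=1/lambda_rate)
print(f"P(X > {t}): {p_t:.4f}")
print(f"P(X > {s + t} | X > {s}): {p_cond:.4f}  # same value: memoryless")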
In ML, features are often modeled jointly using multivariate distributions.
import numpy as np
# Multivariate normal distribution
mean = np.array([0, 0])
covariance = np.array([[1, 0.8],
                       [0.8, 1]])  # Correlated features
# Generate samples
np.random.seed(42)
samples = np.random.multivariate_normal(mean, covariance, size=500)
print(f"Samples shape: {samples.shape}")
print(f"Sample mean: {samples.mean(axis=0)}")
print(f"Sample correlation: {np.corrcoef(samples.T)[0,1]:.3f}")
# Used in: Gaussian Mixture Models, Bayesian methods
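Under the hood, correlated Gaussian samples can be produced from independent standard normals via a Cholesky factor of the covariance matrix; a minimal sketch of that construction:
import numpy as np
# If z ~ N(0, I) and cov = L @ L.T, then L @ z ~ N(0, cov)
np.random.seed(42)
covariance = np.array([[1.0, 0.8],
                       [0.8, 1.0]])
L = np.linalg.cholesky(covariance)
z = np.random.randn(500, 2)  # independent standard normals
samples = z @ L.T            # rows are now correlated pairs
print(f"Sample correlation: {np.corrcoef(samples.T)[0, 1]:.3f}")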
Fitting distributions to data is common in ML preprocessing and analysis.
import numpy as np
from scipy import stats
# Generate mystery data
np.random.seed(42)
mystery_data = np.random.gamma(shape=2, scale=2, size=1000)
# Try fitting different distributions
distributions = ['norm', 'expon', 'gamma', 'lognorm']
print("Distribution fitting results:")
for dist_name in distributions:
    dist = getattr(stats, dist_name)
    params = dist.fit(mystery_data)
    # Kolmogorov-Smirnov test for goodness of fit
    ks_stat, p_value = stats.kstest(mystery_data, dist_name, args=params)
    print(f"{dist_name:10s}: KS stat = {ks_stat:.4f}, p-value = {p_value:.4f}")
Distributional assumptions also drive classifiers: Gaussian Naive Bayes models each feature as normally distributed within each class.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate classification data
np.random.seed(42)
# Class 0: centered at (0, 0)
class_0 = np.random.randn(100, 2) + np.array([0, 0])
# Class 1: centered at (3, 3)
class_1 = np.random.randn(100, 2) + np.array([3, 3])
X = np.vstack([class_0, class_1])
y = np.array([0]*100 + [1]*100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Gaussian Naive Bayes assumes features are normally distributed
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predictions
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes Accuracy: {accuracy:.2%}")
print(f"\nLearned class means:")
print(f"Class 0: {gnb.theta_[0]}")
print(f"Class 1: {gnb.theta_[1]}")