Learn the fundamentals of Linear Regression, including how it works, key assumptions, and step‑by‑step implementation. This lesson helps you understand relationships between variables and build accurate predictive models using real data.
Linear regression is one of the most fundamental algorithms in machine learning. It establishes a relationship between input features (independent variables) and a continuous output (dependent variable) by fitting a straight line through the data points.
Despite its simplicity, linear regression remains widely used in real-world applications due to its interpretability, efficiency, and effectiveness for linearly related data.
Linear regression attempts to model the relationship between variables by fitting a linear equation to observed data. The goal is to find the best-fitting line that minimizes the difference between predicted values and actual values.
Real-world applications include predicting salary from years of experience, estimating house prices, and forecasting sales.
The equation for simple linear regression with one feature is:
y = mx + b
In machine learning terminology:
ŷ = w₁x + w₀
Where:
- ŷ is the predicted output
- x is the input feature
- w₁ is the weight (slope)
- w₀ is the bias (intercept)
The weight (w₁) determines how much the input feature influences the prediction. A larger weight means the feature has a stronger impact on the output.
The bias (w₀) shifts the line up or down, allowing the model to fit data that doesn't pass through the origin.
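As a minimal sketch (with made-up values for w₁ and w₀), the prediction is just the line evaluated at x:

```python
# Hypothetical learned parameters (illustrative values, not fitted to real data)
w1 = 6.0   # weight: each unit increase in x raises the prediction by 6
w0 = 28.0  # bias: the prediction when x = 0

def predict(x):
    # Simple linear model: y-hat = w1 * x + w0
    return w1 * x + w0

print(predict(5))  # 6.0 * 5 + 28.0 = 58.0
print(predict(0))  # with x = 0, only the bias remains: 28.0
```

Changing w₀ slides the whole line up or down without changing its slope, which is why the model can fit data that does not pass through the origin.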
Linear regression learns by minimizing a cost function. The most common cost function is Mean Squared Error (MSE).
MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Where:
- n is the number of data points
- yᵢ is the actual value for the i-th data point
- ŷᵢ is the predicted value for the i-th data point
The algorithm adjusts weights and bias to minimize this error, finding the line that best fits the training data.
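The MSE formula above translates directly into a few lines of NumPy. This sketch uses small made-up numbers so the arithmetic is easy to follow:

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([35.0, 40.0, 45.0])
y_hat  = np.array([34.0, 42.0, 44.0])

# MSE = (1/n) * sum((y_i - y-hat_i)^2)
mse = np.mean((y_true - y_hat) ** 2)
print(mse)  # (1 + 4 + 1) / 3 = 2.0
```

Because the errors are squared, large mistakes are penalized much more heavily than small ones, which pushes the fitted line toward the bulk of the data.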
Ordinary Least Squares is a mathematical method that calculates the optimal weights directly by minimizing the sum of squared residuals. This approach provides a closed-form solution without iterative optimization.
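The closed-form OLS solution can be computed with the normal equation, w = (XᵀX)⁻¹Xᵀy. A minimal sketch on tiny synthetic data (with a column of ones appended so the bias is learned alongside the weight):

```python
import numpy as np

# Tiny synthetic dataset that lies exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Prepend a column of ones so the first coefficient acts as the bias
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) w = X^T y instead of inverting explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1.0, 2.0] -> bias 1, weight 2
```

Solving the linear system with `np.linalg.solve` is preferred over computing the inverse directly, as it is both faster and numerically more stable.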
Let's implement linear regression step by step using Python and scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
This code imports NumPy for numerical operations, Matplotlib for visualization, and scikit-learn modules for building and evaluating the model.
# Years of experience
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
# Salary in thousands
y = np.array([35, 40, 45, 50, 55, 62, 68, 75, 82, 90])
This creates a simple dataset representing years of experience and corresponding salaries. The reshape(-1, 1) converts the array into a 2D format required by scikit-learn.
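To see what `reshape(-1, 1)` does, compare the shapes before and after:

```python
import numpy as np

# reshape(-1, 1) turns a 1-D array into a single-column 2-D array;
# -1 tells NumPy to infer the number of rows automatically
a = np.array([1, 2, 3])
print(a.shape)                 # (3,)
print(a.reshape(-1, 1).shape)  # (3, 1)
```

scikit-learn estimators expect the feature matrix to be 2-D (samples × features), even when there is only one feature.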
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
This splits the data into 80% training and 20% testing sets. The training set teaches the model, while the testing set evaluates its performance on unseen data.
model = LinearRegression()
model.fit(X_train, y_train)
The fit() method trains the model by finding the optimal weight and bias values that minimize the cost function.
print(f"Weight (slope): {model.coef_[0]:.2f}")
print(f"Bias (intercept): {model.intercept_:.2f}")
Output:
Weight (slope): 6.09
Bias (intercept): 26.53
The weight indicates that for each additional year of experience, the salary increases by approximately $6,090. The bias is the model's predicted salary at zero years of experience.
y_pred = model.predict(X_test)
# Predict salary for 12 years of experience
new_experience = np.array([[12]])
predicted_salary = model.predict(new_experience)
print(f"Predicted salary for 12 years: ${predicted_salary[0]:.2f}k")
The predict() method uses the learned parameters to generate predictions for new data points.
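Under the hood, `predict()` simply evaluates the learned line, ŷ = coef_·x + intercept_. A small self-contained sketch (on tiny illustrative data, not the salary dataset) confirms this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Manual prediction using the learned parameters...
manual = model.coef_[0] * 4.0 + model.intercept_
# ...matches what predict() returns
auto = model.predict(np.array([[4.0]]))[0]
print(manual)  # approximately 9.0
print(auto)    # same value
```

There is no extra machinery at prediction time; all the work happens in `fit()`.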
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
Understanding the metrics:
MSE (Mean Squared Error): Average of squared differences between actual and predicted values. Lower is better.
RMSE (Root Mean Squared Error): Square root of MSE, expressed in the same units as the target variable. Easier to interpret.
R² Score (Coefficient of Determination): Measures how well the model explains the variance in the data. A value of 1 indicates perfect prediction, 0 means the model does no better than always predicting the mean, and it can even be negative for models that fit worse than the mean.
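R² can also be computed from its definition, R² = 1 − SS_res / SS_tot, which makes the "variance explained" interpretation concrete. A sketch with illustrative numbers:

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([35.0, 45.0, 55.0, 68.0])
y_pred = np.array([36.0, 44.0, 56.0, 67.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # close to 1, since predictions track the actual values well
```

When the residuals are tiny relative to the data's total spread, the ratio SS_res/SS_tot approaches zero and R² approaches 1.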
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary (thousands)')
plt.title('Linear Regression: Experience vs Salary')
plt.legend()
plt.show()
This visualization displays the actual data points as blue dots and the fitted regression line in red, helping you understand how well the model captures the underlying relationship.
For linear regression to work effectively, certain assumptions should be met:
- Linearity: the relationship between the features and the target is linear
- Independence: the observations (and their errors) are independent of one another
- Homoscedasticity: the residuals have constant variance across all feature values
- Normality: the residuals are approximately normally distributed
Violating these assumptions may reduce model accuracy and reliability.
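A quick numerical sanity check on the residuals can flag assumption violations. This sketch uses made-up values standing in for a fitted model's output; for a well-specified linear model, residuals should average near zero and show no systematic relationship with the predictions:

```python
import numpy as np

# Illustrative actual values and model predictions
y_true = np.array([35.0, 45.0, 55.0, 68.0, 82.0])
y_pred = np.array([34.0, 46.0, 54.0, 69.0, 81.0])
residuals = y_true - y_pred

# Mean residual near zero: no overall over- or under-prediction
print(residuals.mean())

# Correlation between predictions and residuals near zero: no leftover trend
corr = np.corrcoef(y_pred, residuals)[0, 1]
print(corr)
```

In practice a residual scatter plot (residuals vs. predictions) is the standard visual check: a funnel shape suggests heteroscedasticity, and a curve suggests the relationship is not linear.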
Linear regression works well when:
- The relationship between the variables is approximately linear
- Interpretability of the model matters
- You need a fast, simple baseline, even on small datasets
Consider alternatives when:
- The relationship is strongly nonlinear
- The data contains many outliers, which pull the fitted line toward them
- Features are highly correlated with one another (multicollinearity)
Linear regression is the cornerstone of predictive modeling in machine learning. It provides a straightforward approach to understanding relationships between variables and making predictions. By minimizing the mean squared error, linear regression finds the optimal line that best fits your data.
Key takeaways:
- Linear regression fits a line ŷ = w₁x + w₀ relating input features to a continuous output
- The model learns by minimizing Mean Squared Error between predictions and actual values
- Evaluate performance with MSE, RMSE, and the R² score
- Check the model's assumptions before trusting its predictions
Master regularization techniques like Ridge, Lasso, and Elastic Net to reduce overfitting and improve model stability. This lesson explains how these methods handle multicollinearity and enhance regression model performance.
Apply regression techniques in a hands‑on House Price Prediction project. Learn to preprocess data, engineer features, select models, and evaluate performance to build a real‑world predictive analytics solution.
Explore Multiple and Polynomial Regression techniques to capture complex patterns in data. This lesson teaches how to model multiple features and nonlinear relationships for more powerful and flexible predictions.