Learn the fundamentals of Linear Regression, including how it works, key assumptions, and step‑by‑step implementation. This lesson helps you understand relationships between variables and build accurate predictive models using real data.
Linear regression is one of the most fundamental algorithms in machine learning. It establishes a relationship between input features (independent variables) and a continuous output (dependent variable) by fitting a straight line through the data points.
Despite its simplicity, linear regression remains widely used in real-world applications due to its interpretability, efficiency, and effectiveness for linearly related data.
Linear regression attempts to model the relationship between variables by fitting a linear equation to observed data. The goal is to find the best-fitting line that minimizes the difference between predicted values and actual values.
Real-world applications include predicting salary from years of experience, estimating house prices, and forecasting sales.
The equation for simple linear regression with one feature is:
y = mx + b
In machine learning terminology:
ŷ = w₁x + w₀
Where:
- ŷ is the predicted output
- x is the input feature
- w₁ is the weight (slope)
- w₀ is the bias (intercept)
The weight (w₁) determines how much the input feature influences the prediction. A larger weight means the feature has a stronger impact on the output.
The bias (w₀) shifts the line up or down, allowing the model to fit data that doesn't pass through the origin.
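As a minimal sketch (with made-up values for w₁ and w₀), the prediction is just the line evaluated at x:

```python
# Hypothetical learned parameters (illustrative values, not fitted to real data)
w1 = 6.0   # weight: each unit increase in x raises the prediction by 6
w0 = 28.0  # bias: the prediction when x = 0

def predict(x):
    # Simple linear model: y-hat = w1 * x + w0
    return w1 * x + w0

print(predict(5))  # 6.0 * 5 + 28.0 = 58.0
print(predict(0))  # with x = 0, only the bias remains: 28.0
```

Changing w₀ slides the whole line up or down without changing its slope, which is why the model can fit data that does not pass through the origin.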
Linear regression learns by minimizing a cost function. The most common cost function is Mean Squared Error (MSE).
MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Where:
- n is the number of data points
- yᵢ is the actual value for the i-th data point
- ŷᵢ is the predicted value for the i-th data point
The algorithm adjusts weights and bias to minimize this error, finding the line that best fits the training data.
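The MSE formula above translates directly into a few lines of NumPy. This sketch uses small made-up numbers so the arithmetic is easy to follow:

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([35.0, 40.0, 45.0])
y_hat  = np.array([34.0, 42.0, 44.0])

# MSE = (1/n) * sum((y_i - y-hat_i)^2)
mse = np.mean((y_true - y_hat) ** 2)
print(mse)  # (1 + 4 + 1) / 3 = 2.0
```

Because the errors are squared, large mistakes are penalized much more heavily than small ones, which pushes the fitted line toward the bulk of the data.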
Ordinary Least Squares is a mathematical method that calculates the optimal weights directly by minimizing the sum of squared residuals. This approach provides a closed-form solution without iterative optimization.
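The closed-form OLS solution can be computed with the normal equation, w = (XᵀX)⁻¹Xᵀy. A minimal sketch on tiny synthetic data (with a column of ones appended so the bias is learned alongside the weight):

```python
import numpy as np

# Tiny synthetic dataset that lies exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Prepend a column of ones so the first coefficient acts as the bias
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) w = X^T y instead of inverting explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1.0, 2.0] -> bias 1, weight 2
```

Solving the linear system with `np.linalg.solve` is preferred over computing the inverse directly, as it is both faster and numerically more stable.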
Let's implement linear regression step by step using Python and scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
This code imports NumPy for numerical operations, Matplotlib for visualization, and scikit-learn modules for building and evaluating the model.
# Years of experience
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
# Salary in thousands
y = np.array([35, 40, 45, 50, 55, 62, 68, 75, 82, 90])
This creates a simple dataset representing years of experience and corresponding salaries. The reshape(-1, 1) converts the array into a 2D format required by scikit-learn.
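To see what `reshape(-1, 1)` does, compare the shapes before and after:

```python
import numpy as np

# reshape(-1, 1) turns a 1-D array into a single-column 2-D array;
# -1 tells NumPy to infer the number of rows automatically
a = np.array([1, 2, 3])
print(a.shape)                 # (3,)
print(a.reshape(-1, 1).shape)  # (3, 1)
```

scikit-learn estimators expect the feature matrix to be 2-D (samples × features), even when there is only one feature.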
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
This splits the data into 80% training and 20% testing sets. The training set teaches the model, while the testing set evaluates its performance on unseen data.
model = LinearRegression()
model.fit(X_train, y_train)
The fit() method trains the model by finding the optimal weight and bias values that minimize the cost function.
print(f"Weight (slope): {model.coef_[0]:.2f}")
print(f"Bias (intercept): {model.intercept_:.2f}")
Output:
Weight (slope): 6.09
Bias (intercept): 26.53
The weight indicates that for each additional year of experience, the salary increases by approximately $6,090. The bias is the model's predicted salary at zero years of experience.
y_pred = model.predict(X_test)
# Predict salary for 12 years of experience
new_experience = np.array([[12]])
predicted_salary = model.predict(new_experience)
print(f"Predicted salary for 12 years: ${predicted_salary[0]:.2f}k")
The predict() method uses the learned parameters to generate predictions for new data points.
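Under the hood, `predict()` simply evaluates the learned line, ŷ = coef_·x + intercept_. A small self-contained sketch (on tiny illustrative data, not the salary dataset) confirms this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset lying exactly on y = 2x + 1
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Manual prediction using the learned parameters...
manual = model.coef_[0] * 4.0 + model.intercept_
# ...matches what predict() returns
auto = model.predict(np.array([[4.0]]))[0]
print(manual)  # approximately 9.0
print(auto)    # same value
```

There is no extra machinery at prediction time; all the work happens in `fit()`.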
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
Understanding the metrics:
MSE (Mean Squared Error): Average of squared differences between actual and predicted values. Lower is better.
RMSE (Root Mean Squared Error): Square root of MSE, expressed in the same units as the target variable. Easier to interpret.
R² Score (Coefficient of Determination): Measures how well the model explains the variance in the data. A value of 1 indicates perfect prediction, 0 means the model does no better than always predicting the mean, and it can even be negative for models that fit worse than the mean.
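R² can also be computed from its definition, R² = 1 − SS_res / SS_tot, which makes the "variance explained" interpretation concrete. A sketch with illustrative numbers:

```python
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([35.0, 45.0, 55.0, 68.0])
y_pred = np.array([36.0, 44.0, 56.0, 67.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # close to 1, since predictions track the actual values well
```

When the residuals are tiny relative to the data's total spread, the ratio SS_res/SS_tot approaches zero and R² approaches 1.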
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary (thousands)')
plt.title('Linear Regression: Experience vs Salary')
plt.legend()
plt.show()
This visualization displays the actual data points as blue dots and the fitted regression line in red, helping you understand how well the model captures the underlying relationship.
For linear regression to work effectively, certain assumptions should be met:
- Linearity: the relationship between the features and the target is linear
- Independence: the observations (and their errors) are independent of one another
- Homoscedasticity: the residuals have constant variance across all feature values
- Normality: the residuals are approximately normally distributed
Violating these assumptions may reduce model accuracy and reliability.
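A quick numerical sanity check on the residuals can flag assumption violations. This sketch uses made-up values standing in for a fitted model's output; for a well-specified linear model, residuals should average near zero and show no systematic relationship with the predictions:

```python
import numpy as np

# Illustrative actual values and model predictions
y_true = np.array([35.0, 45.0, 55.0, 68.0, 82.0])
y_pred = np.array([34.0, 46.0, 54.0, 69.0, 81.0])
residuals = y_true - y_pred

# Mean residual near zero: no overall over- or under-prediction
print(residuals.mean())

# Correlation between predictions and residuals near zero: no leftover trend
corr = np.corrcoef(y_pred, residuals)[0, 1]
print(corr)
```

In practice a residual scatter plot (residuals vs. predictions) is the standard visual check: a funnel shape suggests heteroscedasticity, and a curve suggests the relationship is not linear.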
Linear regression works well when:
- The relationship between the variables is approximately linear
- Interpretability of the model matters
- You need a fast, simple baseline, even on small datasets
Consider alternatives when:
- The relationship is strongly nonlinear
- The data contains many outliers, which pull the fitted line toward them
- Features are highly correlated with one another (multicollinearity)
Linear regression is the cornerstone of predictive modeling in machine learning. It provides a straightforward approach to understanding relationships between variables and making predictions. By minimizing the mean squared error, linear regression finds the optimal line that best fits your data.
Key takeaways:
- Linear regression fits a line ŷ = w₁x + w₀ relating input features to a continuous output
- The model learns by minimizing Mean Squared Error between predictions and actual values
- Evaluate performance with MSE, RMSE, and the R² score
- Check the model's assumptions before trusting its predictions
Master regularization techniques like Ridge, Lasso, and Elastic Net to reduce overfitting and improve model stability. This lesson explains how these methods handle multicollinearity and enhance regression model performance.
Apply regression techniques in a hands‑on House Price Prediction project. Learn to preprocess data, engineer features, select models, and evaluate performance to build a real‑world predictive analytics solution.
Explore Multiple and Polynomial Regression techniques to capture complex patterns in data. This lesson teaches how to model multiple features and nonlinear relationships for more powerful and flexible predictions.