VIDHYAI
Lesson 1

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of exploring and visualizing data to understand its patterns, trends, and relationships before building models. This guide shows you how to analyze data effectively, detect anomalies, summarize key insights, and make informed decisions using charts, statistics, and data exploration techniques.


What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of examining and investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions. Before building any machine learning model, data scientists perform EDA to understand the structure, relationships, and quality of their data.

Think of EDA as a detective investigation. Before solving a case, a detective gathers clues, examines evidence, and forms initial theories. Similarly, EDA helps you understand what story your data tells before you start building predictive models.


Why is EDA Important in Machine Learning?

Exploratory Data Analysis serves several crucial purposes:

  • Data Quality Assessment: Identify missing values, duplicates, and inconsistencies
  • Feature Understanding: Learn what each variable represents and its distribution
  • Relationship Discovery: Find correlations and dependencies between variables
  • Assumption Validation: Check if data meets requirements for specific algorithms
  • Feature Engineering Ideas: Generate insights for creating new features

Without proper EDA, you risk building models on flawed data, leading to poor predictions and unreliable insights.


The EDA Process: A Structured Approach

Step 1: Data Collection and Loading

The first step involves loading your dataset and getting a basic overview.

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('customer_data.csv')

# Display first few rows
print(df.head())

This code imports the necessary libraries and loads a CSV file into a Pandas DataFrame. The head() method displays the first five rows by default, giving you an immediate glimpse of your data structure.

# Check dataset dimensions
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# View column names and data types
print(df.info())

The shape attribute reveals how many records and features exist in your dataset. The info() method displays column names, non-null counts, and data types—essential information for understanding your data.
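Duplicate records are another basic quality check worth running at this stage. A minimal sketch, using a small stand-in frame (column names here are illustrative, not from the tutorial's dataset):

```python
import pandas as pd

# Tiny stand-in frame (the tutorial itself loads customer_data.csv)
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'plan': ['basic', 'pro', 'pro', 'basic'],
})

# Count rows that exactly duplicate an earlier row
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")

# Remove duplicates, keeping the first occurrence of each row
df = df.drop_duplicates()
print(f"Rows after deduplication: {len(df)}")
```

duplicated() flags every repeat of a row already seen, so summing it gives a quick duplicate count before you decide whether the repeats are data errors or legitimate records.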


Step 2: Understanding Data Types

Proper data type identification is fundamental to exploratory data analysis.

# Separate numerical and categorical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

print(f"Numerical columns: {list(numerical_cols)}")
print(f"Categorical columns: {list(categorical_cols)}")

This code automatically identifies numerical and categorical variables. Understanding data types helps you choose appropriate visualization and analysis techniques for each variable type.
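A quick cardinality check complements the dtype split: object columns with only a handful of unique values are good candidates for categorical treatment (bar charts, grouping, encoding). A small sketch with assumed column names:

```python
import pandas as pd

# Illustrative frame; column names are assumptions, not from the dataset
df = pd.DataFrame({
    'age': [25, 32, 41, 25],
    'city': ['Chennai', 'Mumbai', 'Chennai', 'Delhi'],
})

# Cardinality check: count distinct values in each object column
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}: {df[col].nunique()} unique values")
```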


Step 3: Statistical Summary

Descriptive statistics provide a mathematical overview of your data.

# Summary statistics for numerical columns
print(df.describe())

The describe() function calculates count, mean, standard deviation, minimum, maximum, and quartile values. These statistics reveal the central tendency, spread, and range of your numerical features.

# Summary for categorical columns
print(df.describe(include=['object']))

For categorical variables, describe() shows count, unique values, most frequent category, and its frequency. This helps identify dominant categories and data diversity.

Real-World Example: In a customer churn dataset, statistical summaries might reveal that average customer tenure is 32 months with high variability (standard deviation of 24 months), suggesting diverse customer segments requiring different retention strategies.


Step 4: Missing Value Analysis

Identifying and understanding missing data is a critical EDA component.

# Count missing values per column
missing_counts = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100

# Create missing value summary
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percentage': missing_percent
})
print(missing_summary[missing_summary['Missing Count'] > 0])

This code calculates both the count and percentage of missing values for each column. Columns with high missing percentages may require special handling or removal.
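One common follow-up is to drop columns whose missing share crosses a cutoff. A minimal sketch on a toy frame; the 50% threshold is purely illustrative, and the right value depends on the problem:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    'age': [25.0, 32.0, np.nan, 41.0],          # 25% missing
    'notes': [np.nan, np.nan, np.nan, 'vip'],   # 75% missing
})

# Keep only columns whose missing share is at or below the cutoff
threshold = 0.5
missing_share = df.isnull().mean()
df_clean = df.loc[:, missing_share <= threshold]
print(list(df_clean.columns))
```

isnull().mean() gives the missing fraction per column directly, and the boolean mask selects only the columns worth keeping.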

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Value Heatmap')
plt.show()

A missing value heatmap provides a visual representation of data gaps. With the default color map, lighter bands mark missing cells, making gap patterns immediately visible.


Step 5: Univariate Analysis

Univariate analysis examines each variable individually to understand its distribution.

Analyzing Numerical Variables

# Histogram for numerical variable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(df['age'], bins=30, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Box plot for the same variable
plt.subplot(1, 2, 2)
plt.boxplot(df['age'])
plt.title('Age Box Plot')
plt.ylabel('Age')
plt.tight_layout()
plt.show()

Histograms reveal the shape of data distribution—whether it's normal, skewed, or multimodal. Box plots highlight the median, quartiles, and potential outliers in a compact visual format.

Analyzing Categorical Variables

# Count plot for categorical variable
plt.figure(figsize=(8, 5))
df['subscription_type'].value_counts().plot(kind='bar')
plt.title('Subscription Type Distribution')
plt.xlabel('Subscription Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Bar charts display the frequency of each category. This visualization helps identify class imbalances that might affect model performance.
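To put a number on that imbalance, value_counts with normalize=True returns proportions instead of raw counts. A sketch with a hypothetical label distribution:

```python
import pandas as pd

# Hypothetical subscription labels (70/25/5 split)
s = pd.Series(['basic'] * 70 + ['premium'] * 25 + ['family'] * 5,
              name='subscription_type')

# normalize=True returns relative frequencies, exposing imbalance directly
shares = s.value_counts(normalize=True)
print(shares)
```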


Step 6: Bivariate Analysis

Bivariate analysis explores relationships between two variables.

Numerical vs Numerical

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['income'], df['spending_score'], alpha=0.5)
plt.title('Income vs Spending Score')
plt.xlabel('Income')
plt.ylabel('Spending Score')
plt.show()

Scatter plots reveal relationships between numerical variables. Look for linear patterns, clusters, or unusual groupings that might indicate important data segments.

# Correlation coefficient
correlation = df['income'].corr(df['spending_score'])
print(f"Correlation: {correlation:.3f}")

The correlation coefficient quantifies the linear relationship strength. Values near +1 or -1 indicate strong relationships, while values near 0 suggest weak or no linear relationship.
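Because Pearson (the default) measures only linear association, a strongly curved but monotonic relationship can score well below 1. Spearman correlation ranks the data first and captures any monotonic trend, as this small sketch illustrates:

```python
import pandas as pd

# A perfectly monotonic but non-linear relationship
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = x ** 3

pearson = x.corr(y)                      # below 1: not perfectly linear
spearman = x.corr(y, method='spearman')  # exactly 1: perfectly monotonic
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Comparing the two coefficients during EDA hints at whether a non-linear transformation might strengthen a relationship.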

Categorical vs Numerical

# Box plot by category
plt.figure(figsize=(10, 6))
sns.boxplot(x='subscription_type', y='monthly_charges', data=df)
plt.title('Monthly Charges by Subscription Type')
plt.show()

Grouped box plots compare numerical distributions across categories, revealing whether different groups have significantly different values.
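A groupby summary gives the exact numbers behind such a plot. A sketch with hypothetical billing data mirroring the example above:

```python
import pandas as pd

# Hypothetical billing data (values are illustrative)
df = pd.DataFrame({
    'subscription_type': ['basic', 'basic', 'premium', 'premium'],
    'monthly_charges': [20.0, 25.0, 60.0, 70.0],
})

# Per-group descriptive statistics: the numbers behind a grouped box plot
summary = df.groupby('subscription_type')['monthly_charges'].describe()
print(summary[['mean', '50%', 'min', 'max']])
```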


Step 7: Multivariate Analysis

Multivariate analysis examines relationships among multiple variables simultaneously.

# Correlation matrix
plt.figure(figsize=(12, 8))
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

A correlation heatmap displays relationships between all numerical variables at once. Strong correlations (dark red or dark blue) might indicate redundant features or important relationships for your model.

# Pair plot for selected variables
selected_vars = ['age', 'income', 'spending_score', 'tenure']
sns.pairplot(df[selected_vars], diag_kind='hist')
plt.show()

Pair plots create a matrix of scatter plots, showing relationships between every pair of selected variables. The diagonal shows each variable's distribution.


Step 8: Distribution Analysis

Understanding data distributions helps choose appropriate transformations and algorithms.

from scipy import stats

# Check skewness
for col in numerical_cols:
    skewness = df[col].skew()
    print(f"{col}: Skewness = {skewness:.3f}")

Skewness measures distribution asymmetry. Values between -0.5 and 0.5 indicate approximately symmetric distributions. Higher absolute values suggest significant skew requiring transformation.
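For positively skewed, strictly positive variables, a log transform is a common remedy. A minimal sketch on income-like toy values:

```python
import numpy as np
import pandas as pd

# A right-skewed, strictly positive variable (toy values)
income = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 200.0])

# log1p compresses the long right tail, reducing positive skew
log_income = np.log1p(income)

print(f"Skew before: {income.skew():.3f}")
print(f"Skew after:  {log_income.skew():.3f}")
```

Re-checking skewness after the transform confirms whether it actually helped; square-root or Box-Cox transforms are alternatives when log is too aggressive or too weak.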

# Q-Q plot for normality check
from scipy.stats import probplot

plt.figure(figsize=(8, 6))
probplot(df['income'], dist="norm", plot=plt)
plt.title('Q-Q Plot: Income')
plt.show()

Quantile-Quantile (Q-Q) plots compare your data distribution against a theoretical normal distribution. Points following the diagonal line indicate normality.


Real-World EDA Example: E-Commerce Customer Data

Consider analyzing customer data for an online retail company:

# Comprehensive EDA function
def perform_eda(df):
    print("="*50)
    print("DATASET OVERVIEW")
    print("="*50)
    print(f"Shape: {df.shape}")
    print(f"\nData Types:\n{df.dtypes}")
    
    print("\n" + "="*50)
    print("MISSING VALUES")
    print("="*50)
    missing = df.isnull().sum()
    print(missing[missing > 0])
    
    print("\n" + "="*50)
    print("STATISTICAL SUMMARY")
    print("="*50)
    print(df.describe())
    

# Execute EDA
perform_eda(df)

This reusable function creates a structured EDA report, making the analysis process consistent and reproducible across different datasets.


Best Practices for Exploratory Data Analysis

  • Document Your Findings: Keep notes about patterns, anomalies, and insights discovered
  • Use Multiple Visualizations: Different charts reveal different aspects of data
  • Question Everything: Ask why patterns exist and whether they make business sense
  • Iterate: EDA is not a one-time process; revisit as you learn more
  • Consider Domain Knowledge: Combine statistical findings with business understanding

Summary

Exploratory Data Analysis is the foundation of successful machine learning projects. By systematically examining your data through statistical summaries, visualizations, and relationship analysis, you gain crucial insights that inform feature engineering, model selection, and validation strategies. Master EDA techniques to build more accurate and reliable machine learning models.
