📊 Lecture 4: Exploratory Data Analysis

Master data exploration and visualization with pandas, matplotlib, and real gene expression data. Transform raw data into biological insights through powerful visual analysis.

🎯 Getting Started

Vectorization for Speed - Learn why vectorized pandas operations can be around 100× faster than explicit Python loops.

Master GroupBy - Compare gene expression across cancer types effortlessly.

Scientific Visualization - Create publication-ready figures with matplotlib.


⚡ Pandas Vectorization

Harness NumPy's power for lightning-fast data operations

  • NumPy arrays vs Python lists
  • Vectorized operations on DataFrames
  • Broadcasting for efficient computations
  • Performance comparison: loops vs vectorization
  • Speed improvements of roughly 100-1000×, depending on data size and operation
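As a rough illustration of the loop-versus-vectorization comparison listed above, here is a minimal sketch. The data are synthetic and the column name `expression` is a hypothetical stand-in, not the course dataset; the measured speed-up will vary with machine and data size.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 200,000 expression-like values for one gene.
rng = np.random.default_rng(0)
df = pd.DataFrame({"expression": rng.lognormal(mean=2.0, sigma=1.0, size=200_000)})

# Loop version: transform one value at a time in Python.
start = time.perf_counter()
loop_result = [np.log2(x + 1) for x in df["expression"]]
loop_seconds = time.perf_counter() - start

# Vectorized version: a single call that operates on the whole column.
start = time.perf_counter()
vec_result = np.log2(df["expression"] + 1)
vec_seconds = time.perf_counter() - start

# Broadcasting: the scalar mean is applied to every row without a loop.
centered = df["expression"] - df["expression"].mean()

print(f"loop: {loop_seconds:.4f} s, vectorized: {vec_seconds:.4f} s")
print(f"speed-up: about {loop_seconds / vec_seconds:.0f}x")
```

Both versions compute the same values; only the time taken differs.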

📊 Data Inspection & Quality Control

Master exploratory data analysis techniques

  • Checking data structure with .head(), .info(), .describe()
  • Identifying missing values and outliers
  • Understanding data types and ranges
  • Quality control for biological datasets
  • Building good EDA habits
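The inspection steps above might look like the following sketch. The tiny table, its gene values, and the planted outlier are invented for illustration; the 1.5 × IQR rule is one common outlier convention, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows are cell lines, columns are genes.
df = pd.DataFrame(
    {
        "lineage": ["Breast", "Breast", "Lung", "Lung", "Skin"],
        "BRCA1": [4.2, 3.9, np.nan, 5.1, 4.7],
        "TP53": [6.0, 5.8, 6.3, 25.0, 6.1],  # 25.0 is a planted outlier
    }
)

print(df.head())      # first rows of the table
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns

# Count missing values per column.
missing = df.isna().sum()
print(missing)

# Flag outliers in one gene with the 1.5 * IQR rule.
q1, q3 = df["TP53"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["TP53"] < q1 - 1.5 * iqr) | (df["TP53"] > q3 + 1.5 * iqr)]
print(outliers)
```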

🔄 GroupBy Operations

Split-Apply-Combine for comparing cancer types

  • Understanding the split-apply-combine pattern
  • Grouping by cancer lineage
  • Aggregating with .mean(), .std(), .count()
  • Using .agg() for multiple statistics
  • Comparing gene expression across groups
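A minimal sketch of split-apply-combine, assuming a table with a `lineage` column and one gene column. The values below are made up so the group statistics are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: two lineages, one gene.
df = pd.DataFrame(
    {
        "lineage": ["Breast", "Breast", "Lung", "Lung", "Lung"],
        "BRCA1": [4.0, 6.0, 2.0, 3.0, 4.0],
    }
)

# Split by lineage, apply the mean, combine into one Series.
mean_by_lineage = df.groupby("lineage")["BRCA1"].mean()
print(mean_by_lineage)

# Several statistics at once with .agg().
summary = df.groupby("lineage")["BRCA1"].agg(["mean", "std", "count"])
print(summary)
```

Here `summary` has one row per lineage and one column per statistic, which makes group comparisons a matter of reading across rows.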

📊 Introduction to Matplotlib

Create publication-quality scientific figures

  • The fig, ax (object-oriented) API, the interface matplotlib recommends for complex plots
  • Understanding Figure and Axes objects
  • Customizing plots with labels and titles
  • Saving figures for publications
  • Best practices for scientific visualization
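The fig, ax pattern can be sketched as below. The plotted curve is placeholder data and the output file name `example_figure.png` is a hypothetical choice; the `dpi` value is a common print setting, not a requirement.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data, not from the course dataset.
x = np.linspace(0, 10, 50)
y = np.sin(x)

# fig is the whole canvas; ax is the single plotting area inside it.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, color="tab:blue", label="sin(x)")

# Labels and title are set on the Axes object.
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Minimal fig, ax example")
ax.legend()

# Save at high resolution; bbox_inches="tight" trims extra whitespace.
fig.savefig("example_figure.png", dpi=300, bbox_inches="tight")
```

In a notebook the figure renders automatically; in a script, `plt.show()` would display it.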

📈 Histograms & Data Distribution

Visualize how your data is distributed

  • Creating histograms with ax.hist()
  • Choosing appropriate bin numbers
  • Using .flatten() on the underlying NumPy array to analyze all genes at once
  • Understanding distribution shapes
  • Identifying normal vs skewed data
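A sketch of the histogram workflow above, assuming a samples × genes table. The gene names and the normally distributed values are synthetic, and `bins=30` is an arbitrary starting point worth adjusting for real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic expression table: 100 samples x 4 hypothetical genes.
rng = np.random.default_rng(1)
expr = pd.DataFrame(
    rng.normal(loc=5.0, scale=1.5, size=(100, 4)),
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# A DataFrame has no .flatten(); convert to a NumPy array first,
# then flatten the 2-D array into one long 1-D array of values.
all_values = expr.to_numpy().flatten()

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(all_values, bins=30, edgecolor="black")
ax.set_xlabel("Expression")
ax.set_ylabel("Number of values")
ax.set_title("Distribution across all genes")
```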

🎨 Creating Subplots

Compare multiple genes in multi-panel figures

  • Side-by-side subplots (1 row, 2 columns)
  • Stacked subplots (2 rows, 1 column)
  • Grid layouts (2×2 and beyond)
  • Shared axes for better comparisons
  • Using tight_layout() for professional figures
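The layouts listed above could be built roughly as follows. The two "genes" are synthetic samples with hypothetical names; the point is the shape of the returned `axes` array and the effect of `sharey`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for two hypothetical genes.
rng = np.random.default_rng(2)
gene_a = rng.normal(5.0, 1.0, 200)
gene_b = rng.normal(7.0, 2.0, 200)

# 1 row, 2 columns; sharey=True gives both panels the same y-axis.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3), sharey=True)
axes[0].hist(gene_a, bins=20)
axes[0].set_title("GENE_A")
axes[1].hist(gene_b, bins=20)
axes[1].set_title("GENE_B")

# A 2x2 grid returns a 2-D array of Axes, indexed as grid_axes[row, col].
grid_fig, grid_axes = plt.subplots(nrows=2, ncols=2, figsize=(6, 6))
grid_axes[0, 0].set_title("top left")

# tight_layout() adjusts spacing so labels and titles do not overlap.
fig.tight_layout()
grid_fig.tight_layout()
```

For stacked panels, swapping to `nrows=2, ncols=1` gives the same 1-D `axes` array, indexed top to bottom.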

🔍 Scatter Plots & Correlations

Discover relationships between genes

  • Creating scatter plots with ax.scatter()
  • BRCA1 vs BRCA2 expression patterns
  • TSC1 vs TSC2 - strong correlations
  • Coloring points by cancer type
  • Identifying co-regulated genes
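A sketch of a scatter plot colored by cancer type. The data are synthetic, with a correlation deliberately planted between two hypothetical genes (`GENE_X`, `GENE_Y`); they say nothing about the real BRCA1/BRCA2 or TSC1/TSC2 relationships.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: GENE_Y is built to track GENE_X, plus noise.
rng = np.random.default_rng(3)
n = 90
gene_x = rng.normal(5.0, 1.0, n)
df = pd.DataFrame(
    {
        "lineage": np.repeat(["Breast", "Lung", "Skin"], n // 3),
        "GENE_X": gene_x,
        "GENE_Y": 0.8 * gene_x + rng.normal(0.0, 0.3, n),
    }
)

# One scatter call per lineage gives each group its own color and legend entry.
fig, ax = plt.subplots()
for lineage, group in df.groupby("lineage"):
    ax.scatter(group["GENE_X"], group["GENE_Y"], label=lineage, alpha=0.7)
ax.set_xlabel("GENE_X expression")
ax.set_ylabel("GENE_Y expression")
ax.legend(title="Lineage")

# Pearson correlation coefficient between the two genes.
r = df["GENE_X"].corr(df["GENE_Y"])
print(f"Pearson r = {r:.2f}")
```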

📦 Box Plots for Group Comparisons

Compare distributions across cancer types

  • Understanding box plot anatomy (Q1, median, Q3, IQR)
  • Creating box plots with matplotlib and pandas
  • Comparing gene expression across lineages
  • Identifying outliers and variability
  • Interpreting biological significance
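The box plot anatomy and the two plotting routes above can be sketched like this. The lineages, the gene name `GENE_X`, and all values are synthetic assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic expression values for one hypothetical gene in three lineages.
rng = np.random.default_rng(4)
df = pd.DataFrame(
    {
        "lineage": np.repeat(["Breast", "Lung", "Skin"], 40),
        "GENE_X": np.concatenate(
            [rng.normal(5, 1, 40), rng.normal(6, 1.5, 40), rng.normal(4, 0.5, 40)]
        ),
    }
)

# Box plot anatomy by hand for one group: quartiles and interquartile range.
breast = df.loc[df["lineage"] == "Breast", "GENE_X"]
q1, median, q3 = breast.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")

# matplotlib route: pass one array of values per group.
labels = sorted(df["lineage"].unique())
data = [df.loc[df["lineage"] == name, "GENE_X"].to_numpy() for name in labels]
fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_xticks(list(range(1, len(labels) + 1)))
ax.set_xticklabels(labels)
ax.set_ylabel("GENE_X expression")

# pandas route: group and plot in one call.
ax2 = df.boxplot(column="GENE_X", by="lineage")
```

The box spans Q1 to Q3, the line inside marks the median, and points beyond the whiskers are drawn individually as potential outliers.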

🗺️ Learning Path

Part 1: Data Manipulation

Vectorization and GroupBy operations for efficient analysis

Part 2: Data Inspection

Quality control and exploratory data analysis techniques

Part 3: Visualization

Create histograms, scatter plots, and box plots with matplotlib

🚀 Essential EDA Skills You'll Master

By completing these notebooks, you'll be able to:

  • Use vectorization for much faster data processing (often around 100× over Python loops)
  • Compare gene expression across multiple cancer types
  • Visualize data distributions with histograms
  • Create multi-panel figures for comparisons
  • Discover gene correlations with scatter plots
  • Identify outliers and variability with box plots
  • Apply quality control checks to biological datasets
  • Generate publication-ready scientific figures

📊 Three Essential Plot Types

📈 Histograms

Show distribution of a single variable

Use for: Understanding data spread, normality, and outliers

🔍 Scatter Plots

Reveal relationships between two genes

Use for: Finding correlations, co-regulation, and dependencies

📦 Box Plots

Compare distributions across groups

Use for: Comparing cancer types, identifying group differences

📚 Part of the Python for Biologists course by Helfrid Hochegger

University of Sussex | Year 3 Biology, Biochemistry & Neuroscience