Course Summary
Your Python Journey So Far

Today we put everything together in a complete end-to-end data analysis project.
These skills are vital for your final course assignment!

🧮Lecture 1: Lab Calculations

Basic Python Syntax

Variables, types, operators, and assignment

For Loops

Iteration and automation of repetitive tasks

Lists & File I/O

Working with collections and reading/writing files

🧬Lecture 2: Analyzing DNA

String Operations

Manipulating DNA sequences with Python strings

Functions

Creating reusable code blocks for analysis

Biopython

Using packages for biological data formats

📊Lecture 3: DepMap Data

Object-Oriented Programming

Classes, objects, and methods for data structures

Pandas DataFrames

Powerful data manipulation and analysis

Error Handling

Try-except blocks for robust code

📈Lecture 4: Data Visualization

Matplotlib

Creating professional scientific plots

Exploratory Analysis

Techniques for understanding complex datasets

Vectorization

Fast computations with NumPy arrays

🎯

Today: End-to-End Analysis Project

🔬 The Research Question

Which genes correlate with ATR in cancer cell lines?

  • Find genes with similar dependency patterns
  • Apply statistical corrections for multiple testing
  • Analyze network overlap between top genes
  • Interpret biological significance

🛠️ Skills You'll Apply

Functions
Modular analysis code
Pandas
Data manipulation
Matplotlib
Publication plots
Statistics
Correlation & FDR
💡

This is How Real Bioinformatics Works

Combining Python fundamentals, statistical analysis, and biological interpretation to answer research questions - exactly what you'll do in your final assignment!

Gene Dependency Analysis & Visualization
with Python

Master linear regression and create complex visualizations to understand cancer gene relationships

📈Linear Regression

Statistical Modeling

Build predictive models to understand gene expression relationships

Correlation Analysis

Discover how different genes influence each other in cancer

Model Evaluation

Assess model quality using statistical validation

🎨Data Visualization

Matplotlib Foundations

Create professional scientific plots and customize every detail

Seaborn Analytics

Generate statistical visualizations with minimal code

Detailed Visualizations

Build advanced visualizations for data exploration

🧬Cancer Data Insights

+

Gene Dependency Patterns

Visualize relationships between essential genes across cancer types

+

Comparative Analysis

Create scatter plots to assess correlations between genes

+

Venn Diagrams

Analyze gene dependency network overlap

🎯 What You'll Create Today

1.

Linear Regression Models

Predict gene dependencies using statistical relationships

2.

Correlation Networks

Visualize gene interaction overlap with Venn diagrams

3.

Scatter Plot Analysis

Compare dependency patterns between genes

4.

Professional Visualizations

Create publication-quality figures with custom styling

🛠️ Python Libraries We'll Master

📊
SciPy

Scientific computing

🐼
Pandas

Data manipulation

📈
Matplotlib

Plot creation

🎨
Seaborn

Statistical plots

📚 Your Python Journey Continues

✓ Lecture 1: Python Fundamentals
✓ Lecture 2: DNA Analysis
✓ Lecture 3: OOP & Pandas
→ Lecture 4: EDA & Visualization
Lecture 5: A complete analysis project

💡 Today's Key Insight

Move beyond simple data analysis to predictive modeling and visual storytelling - the hallmarks of professional data science in cancer research

🚀 Complete End-to-End Analysis Notebook

Follow along with the full analysis workflow in this comprehensive Jupyter notebook

Open End-to-End Analysis in Colab

The Next Challenge: Gene Correlation & Visualization 📈

Sarah analyzing gene correlations

Sarah needs to find genes that correlate with ATR

The New Challenge: ATR Gene Correlations

Sarah discovered essential genes, but now needs to understand relationships between them. Which genes correlate with ATR (a key DNA repair gene) across cancer types?

🎯 Analysis Goals:

  • Find genes that correlate with ATR
  • Compare breast vs myeloid cancer patterns
  • Create publication-ready visualizations
  • Build predictive statistical models

🔬 Her Research Questions:

• "Which genes show similar dependency patterns to ATR?"

• "How do these correlations differ between cancer types?"

• "Can I predict ATR dependency from other genes?"

📊 Solution: Statistical Analysis & Visualization!

Use linear regression, correlation analysis, and create stunning plots with Matplotlib & Seaborn

🎨 Today's Tools:

Linear Regression Matplotlib Seaborn Correlation Analysis

Understanding Gene Dependency Correlation

🤝 What Are We Measuring?

Gene Dependency Correlation = How similarly two genes behave across different cancer cell lines

✓ Positive Correlation

Cells that depend on Gene X also depend on Gene Y

✗ Negative Correlation

Cells that do not need Gene X need Gene Y and vice versa

✓ Positive: ATR vs ATRIP

ATR vs ATRIP (r = 0.85)

[Scatter plot: ATR dependency (x) vs ATRIP dependency (y), axes from -5 to 0, upward diagonal trend]

Cells that depend on ATR also depend on ATRIP

  • Upward diagonal trend
  • Both genes work together
  • DNA repair partners

✗ Negative: ATR vs MDM2

ATR vs MDM2 (r = -0.72)

[Scatter plot: ATR dependency (x) vs MDM2 dependency (y), axes from -5 to 0, downward diagonal trend]

Cells that don't need ATR need MDM2 and vice versa

  • Downward diagonal trend
  • Opposite dependencies
  • Compensatory pathways

○ None: ATR vs COL1A1

ATR vs COL1A1 (r = 0.02)

[Scatter plot: ATR dependency (x) vs COL1A1 dependency (y), axes from -5 to 0, random scatter]

No relationship between genes

  • Random scatter pattern
  • Independent functions
  • Different pathways

📊 What Each Point Represents

Each Dot = One Cancer Cell Line

20 different breast cancer cell lines from our DepMap dataset

X-Axis = ATR Dependency

More negative = more essential for cell survival

Y-Axis = Other Gene

ATRIP (partner) or COL1A1 (unrelated)

💡 Key Insight: Correlation = Functional Relationship

Positive correlation (+0.85) = Genes co-essential, work together

Negative correlation (-0.72) = Opposite patterns, compensatory pathways

No correlation (0.02) = Genes function independently

🎯 This is how we discover biological networks from data!
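The three patterns above can be reproduced with simulated data. A minimal sketch (the dependency scores below are randomly generated, not real DepMap values):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Toy dependency scores for 20 hypothetical cell lines
atr = rng.normal(-1.0, 1.0, 20)
partner = atr + rng.normal(0, 0.3, 20)   # tracks ATR -> strong positive r
unrelated = rng.normal(-1.0, 1.0, 20)    # independent -> r near 0

r_pos, _ = pearsonr(atr, partner)
r_none, _ = pearsonr(atr, unrelated)
print(f"partner r = {r_pos:.2f}, unrelated r = {r_none:.2f}")
```

A gene built from ATR plus noise shows a high r, while an independent gene hovers near zero, mirroring the ATRIP vs COL1A1 contrast.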

Biological Context: Why Study ATR?

🧬 ATR: The DNA Damage Guardian

What is ATR?

ATR = Ataxia Telangiectasia and Rad3-related protein

  • A protein kinase that detects DNA damage
  • Acts as a "checkpoint" - stops cell division when DNA is broken
  • Essential for genome stability
  • Part of the DNA damage response pathway

Why is ATR Important?

  • 🛡️ Genome Protection:
    Prevents cells from dividing with damaged DNA
  • 🎯 Cancer Vulnerability:
    Cancer cells often rely heavily on ATR
  • 💊 Drug Target:
    ATR inhibitors are being developed as cancer treatments

🔄 ATR Pathway: How It Works

💥

DNA Damage

UV, chemicals,
replication stress

🔍

ATR Detection

ATRIP helps ATR
find the damage

⏸️

Cell Cycle Stop

ATR activates
CHEK1 checkpoint

🔧

DNA Repair

Fix damage before
cell division

🤝 ATR's Network Partners

👥Direct Partners

  • ATRIP - Activates ATR
  • CHEK1 - Main target kinase
  • RPA1 - DNA binding protein
  • TOPBP1 - ATR activator

🔧Repair Machinery

  • BRCA1 - Homologous recombination
  • RAD51 - DNA strand exchange
  • PARP1 - DNA break detection
  • 53BP1 - Damage focus formation

🎯Cancer Relevance

  • Cancer cells have damaged DNA
  • They depend on ATR for survival
  • ATR inhibitors cause selective cancer death
  • Synthetic lethality opportunity

🔬 Why Analyze ATR Correlations in Cancer Data?

Research Questions

  • Which genes show similar dependency patterns to ATR?
  • Can we identify unknown members of the DNA damage response?
  • How does ATR dependency vary across cancer types?
  • Which combinations create synthetic lethal interactions?

Clinical Applications

  • Identify patients most likely to respond to ATR inhibitors
  • Find combination therapy targets
  • Predict drug resistance mechanisms
  • Discover biomarkers for treatment selection

💡 Key Insight: From Biology to Data Science

By understanding ATR's biological role, we can interpret our correlation analysis meaningfully

High correlations with ATR likely represent DNA repair pathway members - potential therapeutic targets!

🎯 Biology guides our data analysis interpretation

Project Overview
End-to-End Analysis Workflow

🎯 Our Research Question

Which genes correlate with ATR in cancer cell lines?
And does the top correlated gene share a similar correlation network with ATR?

1

📊 Exploratory Data Analysis (EDA)

What We Do:

  • Load DepMap dependency data
  • Check data structure and quality
  • Handle missing values
  • Visualize ATR distribution

Why It Matters:

Understanding our data quality and distribution before analysis ensures reliable results

2

🔍 Correlation Analysis: ATR vs All Genes

What We Do:

  • Calculate Pearson correlation (17,000+ genes)
  • Calculate Spearman correlation
  • Compute p-values for significance
  • Create volcano plots

Output:

List of genes ranked by correlation strength with ATR

3

📈 Statistical Correction (FDR)

What We Do:

  • Apply Benjamini-Hochberg FDR correction
  • Filter for FDR < 0.05
  • Identify statistically significant genes
  • Find top correlated gene (e.g., SLU7)

Why It Matters:

Testing 17,000 genes = high risk of false positives. FDR correction removes statistical noise

4

🔁 Repeat Analysis: Top Gene vs All Genes

What We Do:

  • Run same correlation analysis for SLU7
  • Find genes correlated with SLU7
  • Apply FDR correction again
  • Identify SLU7's correlation network

Goal:

Discover which genes are associated with our top ATR partner

5

🕸️ Compare Correlation Networks

What We Do:

  • Find intersection of ATR and SLU7 networks
  • Calculate Jaccard Index (overlap score)
  • Create Venn diagrams
  • Interpret biological significance

Key Question:

Do ATR and SLU7 share correlated genes? High overlap = functional module/pathway

🚀 Complete End-to-End Analysis Notebook

Follow along with the full analysis workflow in this comprehensive Jupyter notebook

Open End-to-End Analysis in Colab

💡 Why This Workflow Matters

🔬 Rigorous Science

Each step builds on the last with proper statistical validation

♻️ Reusable Code

Write functions once, apply to any gene in the dataset

🧬 Biological Discovery

Uncover functional relationships and gene modules from data

Let's dive into each step and see how we implement this analysis in Python!

Step 1: Exploratory Data Analysis

📊 Always Start with EDA!

Before any analysis, we must understand our data
Bad data in = bad results out. Quality control is essential!

✓ EDA Checklist

1. Dataset Structure

  • How many cell lines? (N = 94)
  • How many genes? (~17,000)
  • What cancer types? (Breast, Myeloid)

2. Data Distribution

  • What's the range of values?
  • Are values centered around 0?
  • Any outliers or strange patterns?

3. Gene of Interest (ATR)

  • Is ATR present in the dataset?
  • Does it show variation across cells?
  • What's the mean dependency?

⚠️ Critical: Missing Values (NaN)

Why NaN Values Matter

Missing data can break your analysis or produce misleading results

  • Correlation functions may fail
  • Statistical tests become invalid
  • Plots show gaps or errors
  • Sample size effectively reduced
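A quick sketch of why this matters in practice: pandas silently skips missing values, while raw NumPy propagates them, so the same column can give very different answers (the scores below are toy values, not real DepMap data).

```python
import numpy as np
import pandas as pd

# Toy dependency column with one missing score
scores = pd.Series([-1.2, np.nan, -0.8, -2.1])

print(scores.isnull().sum())        # detects 1 missing value
print(scores.mean())                # pandas skips NaN: mean of the 3 valid values
print(np.mean(scores.to_numpy()))   # NumPy propagates NaN: result is nan
```

Always check `.isnull().sum()` before analysis so you know which behavior you are getting.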

Handling Missing Values

Option 1: Remove Genes

If >20% missing, exclude gene from analysis

Option 2: Impute Values

Fill with median/mean if <20% missing

Option 3: Exclude Rows

Remove cell lines with missing data (use with caution!)

💻 Quick EDA & Missing Value Check

import pandas as pd
# Load data
df = pd.read_csv('DepMap_data.csv')
print(f"Shape: {df.shape}")
# Check for missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(f"Genes with >20% missing: {(missing_pct > 20).sum()}")
# Impute median for genes with <20% missing
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].median())

📈 Our Dataset After EDA

Clean Data

94 cell lines, 17,205 genes

No Major Issues

Only 5 genes with NaN (<1%)

Ready for Analysis

All NaN imputed with median

💡

Never Skip EDA!

Spending 10 minutes on EDA can save hours of debugging later. Missing values are the #1 cause of analysis failures!

Dealing with Missing Values (NaN)

⚠️ Two-Step Strategy for NaN Values

Step 1: Remove rows or columns with too much missing data (threshold: 20%)
Step 2: Impute the remaining missing values

🗑️ Step 1: Remove Data Above Threshold (20%)

📊Remove Columns (Genes)

If ≥20% of cell lines are missing data for a gene

Example:

Gene ABC: 25 missing / 100 cells = 25%
Remove entire gene ABC
# Remove genes
missing_pct = (df.isnull().sum() / len(df)) * 100
keep_genes = missing_pct < 20
df = df.loc[:, keep_genes]

🧬Remove Rows (Cell Lines)

If ≥20% of genes are missing for a cell line

Example:

Cell Line 42: 3500 missing / 17000 genes = 20.6%
Remove cell line 42
# Remove cell lines
missing_pct = (df.isnull().sum(axis=1) / df.shape[1]) * 100
keep_cells = missing_pct < 20
df = df.loc[keep_cells, :]

⚡ Why 20%?

This is a judgment call - not a hard rule! Common thresholds range from 10-30% depending on your data. The key principle: too much missing data makes imputation unreliable and can introduce bias.

🔧 Step 2: Impute Values Below Threshold

What is Imputation?

Imputation = Filling missing values with estimated values based on existing data

🧬 For Our Gene Dependency Data

Impute across columns (within each gene)

Logic:

  • Each gene has similar dependency across cell lines
  • Use median/mean of that gene's values
  • More biologically meaningful
# Impute per gene (column)
for col in df.columns:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)

📊 Imputation Methods

✓ Median (Recommended)

Middle value - robust to outliers

Best for: Gene dependency data (has outliers)

○ Mean (Alternative)

Average value - sensitive to outliers

Best for: Normally distributed data

△ Forward/Backward Fill

Copy previous/next value

Best for: Time series with ordered index

◇ Constant Value

Fill with 0 or specific number

Best for: When missing = zero (rare)
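The four methods above each map onto a pandas one-liner. A minimal sketch with a toy series containing an outlier (100.0), which shows why median is the safer default for dependency-style data:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, 100.0])

print(s.fillna(s.median()))   # median of [1, 3, 100] is 3.0 -> robust fill
print(s.fillna(s.mean()))     # mean is ~34.67, dragged up by the outlier
print(s.ffill())              # forward fill: copies the previous value (1.0)
print(s.fillna(0))            # constant fill with 0
```

Note how a single extreme value shifts the mean-imputed result far more than the median-imputed one.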

🌳 Decision Tree: Which Method to Use?

Time Series Data?

Yes → Use forward/backward fill

Example: Temperature over time, stock prices

Has Outliers?

Yes → Use median

Example: Gene dependencies (our case!)

Normal Distribution?

Yes → Use mean

Example: Height, weight measurements

✓ Our Approach for This Analysis

What We Do:

  1. Remove genes with >20% missing (none in our data!)
  2. Remove cell lines with >20% missing (none!)
  3. Impute remaining NaN with median per gene

Why This Works:

  • Only 5 genes have NaN (<1% each)
  • Median is robust to outliers
  • Per-gene imputation preserves biological meaning
  • Minimal impact on downstream analysis
💡

Imputation Strategy Matters!

Different data types need different approaches. For gene dependencies: median imputation per gene is the gold standard.

🚀 Practice Missing Value Analysis

Work through different NaN detection and imputation strategies with real data

Open Missing Values Notebook in Colab

Genome-Wide Correlation Analysis
ATR vs 17,000 Genes

🎯 The Challenge

Calculate correlation between ATR and every other gene (~17,000 comparisons!)
We'll use a for loop to iterate through all genes and store results in a new DataFrame

📊 Step 1: Prepare Input Data

Extract ATR Values (pd.Series)

# Get ATR dependency values for all cell lines
gene_values = gene_df['ATR']
print(type(gene_values))
# Output: <class 'pandas.core.series.Series'>
print(len(gene_values))
# Output: 94 (cell lines)

This gives us ATR's dependency score for each of the 94 cell lines

Extract Other Genes (pd.DataFrame)

# Get all genes except ATR
other_genes = gene_df[numeric_cols].drop(columns=['ATR'])
print(type(other_genes))
# Output: <class 'pandas.core.frame.DataFrame'>
print(other_genes.shape)
# Output: (94, 17204)

DataFrame with 94 rows (cell lines) × 17,204 columns (genes)

💡 Why This Structure?

gene_values = single column (ATR) → pd.Series
other_genes = multiple columns (all other genes) → pd.DataFrame

🔧 Step 2: The calculate_gene_correlations() Function

from scipy.stats import pearsonr, spearmanr
import pandas as pd

def calculate_gene_correlations(gene_values, other_genes_df):
    """Calculate correlations between one gene and all others"""
    print(f'Calculating for {len(other_genes_df.columns)} genes...')
    # Initialize empty lists to store results
    pearson_r = []
    spearman_r = []
    pearson_p = []
    spearman_p = []
    # Loop through each gene
    for gene in other_genes_df.columns:
        # Extract values for this gene
        gene_values_other = other_genes_df[gene]
        # Calculate both correlations
        r_pearson, p_pearson = pearsonr(gene_values, gene_values_other)
        r_spearman, p_spearman = spearmanr(gene_values, gene_values_other)
        # Append results to lists
        pearson_r.append(r_pearson)
        spearman_r.append(r_spearman)
        pearson_p.append(p_pearson)
        spearman_p.append(p_spearman)
    # Build new DataFrame from lists ⭐ NEW SKILL!
    results_df = pd.DataFrame({
        'gene': other_genes_df.columns,
        'pearson_r': pearson_r,
        'pearson_p': pearson_p,
        'spearman_r': spearman_r,
        'spearman_p': spearman_p
    })
    return results_df

📝 What Happens in the Loop?

  1. Extract each gene's values (one column at a time)
  2. Calculate Pearson & Spearman correlations with ATR
  3. Store r-values and p-values in separate lists
  4. Repeat 17,204 times!

⭐ Building a DataFrame from Lists

New skill! Create DataFrame from dictionary of lists:

  • Keys → column names
  • Values (lists) → column data
  • All lists must be same length
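Here is that pattern in isolation, with three made-up genes and illustrative correlation values:

```python
import pandas as pd

# Parallel lists collected during a loop (values here are illustrative)
genes     = ['ATRIP', 'CHEK1', 'COL1A1']
pearson_r = [0.847, 0.731, 0.021]
pearson_p = [1.2e-25, 3.5e-18, 0.842]

# Dictionary keys become column names; the lists become columns
demo_df = pd.DataFrame({
    'gene': genes,
    'pearson_r': pearson_r,
    'pearson_p': pearson_p,
})
print(demo_df.shape)  # (3, 3)
```

Once the results are in a DataFrame, sorting by `pearson_r` or filtering by `pearson_p` is a one-liner.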

📊 Step 3: Understanding the Results DataFrame

What We Get Back

results = calculate_gene_correlations(
    gene_values,
    other_genes
)
print(results.shape)
# Output: (17204, 5)
print(results.head())

Example Output

gene   | pearson_r | pearson_p
ATRIP  | 0.847     | 1.2e-25
CHEK1  | 0.731     | 3.5e-18
RPA1   | 0.692     | 8.4e-15
COL1A1 | 0.021     | 0.842

+ spearman_r, spearman_p columns

17,204 rows

One row per gene (except ATR)

5 columns

gene name, 2 r-values, 2 p-values

Organized data!

Easy to sort, filter, and analyze

💡

Functions + For Loops + DataFrames = Powerful Analysis!

By combining for loops to iterate, lists to collect results, and pd.DataFrame() to organize data, we can analyze 17,000+ genes efficiently!

🚀 Practice Creating DataFrames

Learn how to build new DataFrames from lists and dictionaries with hands-on examples

Open DataFrame Creation Notebook in Colab

Correlation Methods
Pearson vs Spearman

📊 Measuring Relationships Between Genes

Correlation quantifies how two variables change together
We'll use two complementary methods: Pearson (linear) and Spearman (rank-based)

📊 What Pearson Correlation Looks Like

r = 1.0

Perfect positive

R² = 1.0 (100%)

r = 0.8

Strong positive

R² = 0.64 (64%)

r = 0.0

No correlation

R² = 0.0 (0%)

r = -0.9

Strong negative

R² = 0.81 (81%)

💡 Understanding R² (R-squared)

R² = r² (square of Pearson correlation)

R² tells you the percentage of variance in one variable explained by the other
Example: r = 0.8 → R² = 0.64 → 64% of variance explained
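The r → R² relationship is easy to verify numerically. A minimal sketch with assumed toy data showing a strong linear trend:

```python
from scipy.stats import pearsonr

# Toy data with a strong linear trend
x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

r, _ = pearsonr(x, y)
r_squared = r ** 2  # R² is literally r squared
print(f"r = {r:.3f}, R² = {r_squared:.3f}")
```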

📈 Pearson Correlation (r)

What It Measures

Linear relationship between two continuous variables

  • Measures strength and direction of linear association
  • Values range from -1 to +1
  • +1 = perfect positive correlation
  • 0 = no linear correlation
  • -1 = perfect negative correlation

Formula

r = Σ[(x - x̄)(y - ȳ)] / ( √[Σ(x - x̄)²] × √[Σ(y - ȳ)²] )

Measures covariance normalized by standard deviations

When to Use Pearson

  ✓ Data is continuous
  ✓ Relationship is linear
  ✓ Data is roughly normally distributed
  ✓ No major outliers
  ✓ Variables are measured on similar scales

⚠️ Limitations

  • Sensitive to outliers
  • Only detects linear relationships
  • Assumes normality for significance testing

📊 Spearman Correlation (ρ or rho)

What It Measures

Monotonic relationship using ranks instead of raw values

  • Converts data to ranks (1st, 2nd, 3rd...)
  • Then calculates Pearson on ranks
  • Values range from -1 to +1
  • Detects any monotonic relationship (not just linear)
  • Robust to outliers

How It Works

Step 1: Rank gene1: [-3.2, -2.8, -4.1]
→ [2, 3, 1]
Step 2: Rank gene2: [-3.0, -2.9, -3.8]
→ [2, 3, 1]
Step 3: Calculate Pearson on ranks
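You can confirm that Spearman really is Pearson-on-ranks with scipy.stats.rankdata (the four values per gene below are toy data):

```python
from scipy.stats import pearsonr, spearmanr, rankdata

gene1 = [-3.2, -2.8, -4.1, -1.0]
gene2 = [-3.0, -3.8, -2.9, -0.5]

# Rank both variables, then run Pearson on the ranks
r_on_ranks, _ = pearsonr(rankdata(gene1), rankdata(gene2))
rho, _ = spearmanr(gene1, gene2)
print(r_on_ranks, rho)  # the two values are identical
```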

When to Use Spearman

  • ✓ Data has outliers
  • ✓ Relationship is non-linear but monotonic
  • ✓ Data is ordinal (ranked)
  • ✓ Data is not normally distributed
  • ✓ Want a robust measure

✓ Advantages

  • Robust to outliers
  • No normality assumption
  • Detects monotonic (not just linear) relationships

⚖️ Pearson vs Spearman: Quick Comparison

Feature              | Pearson (r)        | Spearman (ρ)
Relationship Type    | Linear only        | Any monotonic
Data Type            | Continuous values  | Ranks
Outlier Sensitivity  | High 😟            | Low 😊
Normality Assumption | Yes (for p-values) | No
Computational Speed  | Fast ⚡            | Slower (ranking step)
Best For             | Clean, linear data | Robust validation

🎯 Our Approach: Calculate Both!

Why Calculate Both?

  • Pearson: Primary measure for linear relationships
  • Spearman: Confirms findings, handles outliers
  • If they agree → strong evidence
  • If they disagree → investigate outliers/non-linearity

Code Example

from scipy.stats import pearsonr, spearmanr
# Calculate both correlations
r_pearson, p_pearson = pearsonr(atr, gene_x)
r_spearman, p_spearman = spearmanr(atr, gene_x)
print(f"Pearson: {r_pearson:.3f}, p={p_pearson:.2e}")
print(f"Spearman: {r_spearman:.3f}, p={p_spearman:.2e}")
💡

Two Methods, Better Validation

Pearson for linear relationships, Spearman for robust validation. When both agree, you have strong evidence of correlation!

Introducing SciPy
Scientific Computing for Python

🔬 What is SciPy?

SciPy is a comprehensive library for scientific and technical computing
Built on NumPy, it provides tools for statistics, optimization, signal processing, and more

📦 SciPy Ecosystem - What's Inside?

📊scipy.stats

Statistical functions & tests

Distributions, t-tests, ANOVA, correlations

🔍scipy.optimize

Optimization algorithms

Curve fitting, root finding, minimization

📈scipy.interpolate

Interpolation tools

1D/2D interpolation, splines

scipy.integrate

Integration & ODEs

Numerical integration, differential equations

🧮scipy.linalg

Linear algebra

Matrix operations, eigenvalues, decompositions

📡scipy.signal

Signal processing

Filtering, convolution, spectral analysis

🎯 What We'll Use Today: scipy.stats

pearsonr()

Calculates Pearson correlation coefficient and p-value

from scipy.stats import pearsonr
r, p = pearsonr(gene1, gene2)

Returns: correlation coefficient (r) and significance (p-value)

spearmanr()

Calculates Spearman rank correlation and p-value

from scipy.stats import spearmanr
rho, p = spearmanr(gene1, gene2)

Returns: rank correlation (rho) and significance (p-value)

🤔 Why Use SciPy for Statistics?

Validated & Trusted

Industry-standard implementations used in research worldwide

Fast & Optimized

Written in C/Fortran for maximum performance

📚

Comprehensive

Includes p-values, confidence intervals, and more

💻 Installing SciPy

# Using pip
pip install scipy
# Using conda
conda install scipy
# Import in your code
from scipy.stats import pearsonr, spearmanr
💡

SciPy = Scientific Python Powerhouse

For correlation analysis, scipy.stats provides battle-tested functions that return both correlation coefficients and p-values in one call!

🚀 Practice Correlation Analysis with SciPy

Learn how to use scipy.stats to calculate Pearson and Spearman correlations with real data

Open SciPy Stats Notebook in Colab

The Multiple Testing Problem
Why We Need FDR Correction

⚠️ The Problem

We just tested 17,204 genes for correlation with ATR
With that many tests, we'll get hundreds of false positives by pure chance!

🪙 Understanding the Problem: The Coin Flip Analogy

Single Test (p < 0.05)

Flip a coin 20 times. Get 15+ heads? That's unusual (p < 0.05)

🪙

Chance of false positive: 5%

✓ Acceptable risk for one test

17,204 Tests!

Flip 17,204 different coins 20 times each. How many give 15+ heads?

🪙🪙🪙🪙🪙...

Expected false positives: ~860 genes!

17,204 × 0.05 = 860

💥 With p < 0.05, we'd falsely call 860 genes "correlated" just by chance!
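The coin-flip intuition can be simulated directly: correlate a random vector against thousands of pure-noise "genes" and count the p < 0.05 hits. A sketch (2,000 genes instead of 17,204 just to keep it fast; all names are made up):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_cells, n_genes = 94, 2000

# A random "ATR" vector: there is no real signal anywhere in this data
atr = rng.normal(size=n_cells)

# Every simulated gene is pure noise, so every p < 0.05 hit is a false positive
false_hits = 0
for _ in range(n_genes):
    noise_gene = rng.normal(size=n_cells)
    _, p = pearsonr(atr, noise_gene)
    if p < 0.05:
        false_hits += 1

print(f"{false_hits} false positives out of {n_genes} null genes (~5% expected)")
```

Roughly 5% of completely random genes come out "significant", which is exactly the problem FDR correction fixes.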

📊 Our Real Data Example

17,204

Genes Tested

Every gene in our dataset (except ATR)

~860

False Positives Expected

Using p < 0.05 without correction

???

True Positives

How do we know which are real?

✅ The Solution: False Discovery Rate (FDR)

What is FDR?

False Discovery Rate = Expected proportion of false positives among all discoveries

FDR < 0.05 means:

"At most 5% of our significant genes are false positives"

Much more conservative than uncorrected p-values!

How Does It Work?

Benjamini-Hochberg method adjusts p-values:

  1. Sort all p-values from smallest to largest
  2. Adjust each p-value based on its rank
  3. Account for the total number of tests
  4. Control the false discovery rate

Don't worry about the math - the function does it for us!

📉 Impact of FDR Correction

❌ Without FDR Correction

Genes with p < 0.05: ~2,500

False positives: ~860 (34%)

Problem: 1 in 3 is fake!

✓ With FDR < 0.05

Genes with FDR < 0.05: ~450

False positives: ~23 (5%)

Confidence: 95% are real!

💡

FDR Correction is Essential for Genome-Wide Studies

Testing thousands of genes means we must correct for multiple testing. FDR gives us confidence that our significant genes are truly biologically meaningful, not just statistical noise!

Applying FDR Correction
Using SciPy

🎉 Good News: SciPy Has FDR Built-In!

No need for another library - scipy.stats includes false_discovery_control()
We're already using SciPy for correlations, so this keeps things simple!

🔧 The Function: false_discovery_control()

from scipy.stats import false_discovery_control
# Apply FDR correction - returns adjusted p-values
adjusted_p = false_discovery_control(
    results_df['pearson_p'],
    method='bh'
)
# Compare adjusted p-values to alpha threshold
alpha = 0.05
significant = adjusted_p < alpha

📥 Inputs

ps:Array/list of p-values to correct
method='bh':Benjamini-Hochberg FDR method (default)

📤 Output

adjusted_p:Array of adjusted p-values (same length as input)

⚠️ Important:

You must compare adjusted p-values to your alpha (e.g., 0.05) to determine significance

📝 Step-by-Step: Adding FDR to Our Results

# Step 1: Import the function (we already have scipy.stats!)
from scipy.stats import false_discovery_control
# Step 2: Apply FDR correction to get adjusted p-values
adjusted_pearson_p = false_discovery_control(results_df['pearson_p'], method='bh')
adjusted_spearman_p = false_discovery_control(results_df['spearman_p'], method='bh')
# Step 3: Determine significance by comparing to alpha
alpha = 0.05
results_df['pearson_p_adjusted'] = adjusted_pearson_p
results_df['spearman_p_adjusted'] = adjusted_spearman_p
results_df['significant_pearson'] = adjusted_pearson_p < alpha
results_df['significant_spearman'] = adjusted_spearman_p < alpha
# Step 4: Check how many survived correction
print(f"Pearson: {results_df['significant_pearson'].sum()} significant")
print(f"Spearman: {results_df['significant_spearman'].sum()} significant")

✓ What We Get

Four new columns in our DataFrame:

  • pearson_p_adjusted: Adjusted p-values
  • spearman_p_adjusted: Adjusted p-values
  • significant_pearson: True/False (adjusted_p < 0.05)
  • significant_spearman: True/False (adjusted_p < 0.05)

💡 Why This Works

  • No extra library to install
  • Adjusted p-values control false discovery rate
  • Same proven Benjamini-Hochberg method
  • Easy to filter: adjusted_p < 0.05

🔍 Finding Significant Genes

Filter by Pearson

# Get genes significant by Pearson
sig_genes = results_df[
    results_df['significant_pearson']
]
print(f"Found {len(sig_genes)} genes")

Filter by Both Methods

# Get genes significant in BOTH tests
both_sig = results_df[
    results_df['significant_pearson'] &
    results_df['significant_spearman']
]
print(f"Both agree: {len(both_sig)} genes")

Example Output

Found 453 genes significant by Pearson
Found 412 genes significant by Spearman
Both methods agree on 387 genes

→ Use the 387 genes where both agree for highest confidence!

📊 Quick Count Summary

# Count significant genes
n_pearson = results_df['significant_pearson'].sum()
n_spearman = results_df['significant_spearman'].sum()
n_both = (results_df['significant_pearson'] &
          results_df['significant_spearman']).sum()
print(f"Pearson: {n_pearson}")
print(f"Spearman: {n_spearman}")
print(f"Both: {n_both}")

💡 Pro Tip:

.sum() on a Boolean array counts the True values! This is a quick way to count how many genes passed the FDR threshold.
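A tiny sketch of the Boolean-counting trick:

```python
import pandas as pd

significant = pd.Series([True, False, True, True, False])
print(significant.sum())   # 3   (each True counts as 1)
print(significant.mean())  # 0.6 (fraction that passed)
```

`.mean()` on the same array gives the fraction of True values for free.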

💡

FDR Correction in One SciPy Function!

false_discovery_control() returns adjusted p-values. Compare them to your alpha (0.05) to determine significance. No extra libraries needed - everything we need is in scipy.stats! Filter for genes where both Pearson and Spearman agree for the most trustworthy results.

Visualizing Results: Volcano Plots
Show Both Effect Size & Significance

🌋 What is a Volcano Plot?

A scatter plot that shows correlation strength (x-axis) vs statistical significance (y-axis)
Named for its shape - significant genes "erupt" from the top like a volcano!

📊 Anatomy of a Volcano Plot

[Volcano plot: x-axis Correlation (r) from -1.0 to 1.0, y-axis -log₁₀(adjusted p-value) from 0 to 3; dashed lines mark FDR < 0.05 and r = ±0.5; labeled hits include CHEK1 and MDM2]

🎨 Color Key

Strong positive correlation (r > 0.5, FDR < 0.05)
Strong negative correlation (r < -0.5, FDR < 0.05)
Weak or not significant

📍 The Four Quadrants

Top Right:Strong positive, significant ✓
Top Left:Strong negative, significant ✓
Bottom:Not significant (ignore)

🤔 Why -log₁₀(p-value)?

The Problem with P-values

P-values are tiny numbers (e.g., 0.0001, 0.00000023)

p = 0.05 → not very significant
p = 0.001 → quite significant
p = 0.0000001 → very significant!

Hard to visualize and compare on a regular axis

The Solution: -log₁₀

Transform p-values to make differences visible

p = 0.05 → -log₁₀(0.05) = 1.3
p = 0.001 → -log₁₀(0.001) = 3
p = 0.0000001 → -log₁₀(0.0000001) = 7

Higher values = more significant (easier to read!)

💡 Key Point:

The negative sign flips small p-values (good) into large numbers (easy to plot). The log₁₀ spreads out the tiny differences in very small p-values.
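The transformation is a single NumPy call, reproducing the three examples above:

```python
import numpy as np

pvals = np.array([0.05, 0.001, 0.0000001])
neg_log = -np.log10(pvals)
print(neg_log)  # roughly [1.3, 3.0, 7.0] - bigger means more significant
```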

🐍 Creating a Volcano Plot in Python

# Step 1: Calculate -log10(adjusted p-value)
import numpy as np
import matplotlib.pyplot as plt
results_df['neg_log_p'] = -np.log10(results_df['pearson_p_adjusted'])
# Step 2: Create figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Step 3: Create the scatter plot
ax.scatter(results_df['pearson_r'], results_df['neg_log_p'], alpha=0.5)
# Step 4: Add threshold lines
ax.axhline(y=-np.log10(0.05), color='red', linestyle='--')
ax.axvline(x=0.5, color='blue', linestyle='--')
ax.axvline(x=-0.5, color='blue', linestyle='--')
# Step 5: Labels and title
ax.set_xlabel('Correlation (r)')
ax.set_ylabel('-log₁₀(adjusted p-value)')
ax.set_title('Volcano Plot: Gene Correlation with ATR')
plt.show()

✓ What You Get

A clear visualization showing which genes have strong correlations that are also statistically significant - the best candidates for follow-up research!

🌋

Volcano Plots Show the Complete Story

X-axis = effect size (correlation), Y-axis = significance (-log₁₀ p-value). Look for genes in the top corners - they have both strong correlation AND statistical significance!

Visualizing the Top Hit
ATR vs SLU7 Linear Regression

🎯 Our Top Correlation Result

SLU7 shows the strongest correlation with ATR
Let's visualize this relationship with a linear regression plot to understand the trend

📐 What is Linear Regression?

The Linear Model

y = w × x + b
w (weight/slope):How steep is the relationship?
b (bias/intercept):Where does the line cross y-axis?

For Our Data

The regression line shows the best linear fit through our scatter plot of ATR vs SLU7 dependencies.

x = ATR dependency scores
y = SLU7 dependency scores
Line = optimal w and b
[Diagram: scatter of ATR dependency (x) vs SLU7 dependency (y) with fitted line y = wx + b; slope w = rise/run, intercept b = y-axis crossing]
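The w and b of the model can be recovered with scipy.stats.linregress. A sketch on simulated scores (the true slope 0.9 and intercept -0.2 are assumptions of the toy data, not real DepMap values):

```python
import numpy as np
from scipy.stats import linregress

# Simulated dependency scores: y follows w*x + b plus small noise
rng = np.random.default_rng(1)
x = rng.normal(-1.0, 1.0, 50)                 # stand-in for ATR scores
y = 0.9 * x - 0.2 + rng.normal(0, 0.1, 50)    # stand-in for SLU7 scores

fit = linregress(x, y)
print(f"w ≈ {fit.slope:.2f}, b ≈ {fit.intercept:.2f}, r = {fit.rvalue:.2f}")
```

The fitted slope and intercept land close to the values used to generate the data, and `fit.rvalue` is the same Pearson r we computed earlier.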

🤔 Why Show the Regression Line?

👁️

Visual Clarity

The line makes the trend immediately obvious - positive or negative correlation

📊

Prediction

We can predict SLU7 dependency from ATR values using the fitted line

🔬

Biological Insight

The slope tells us how strongly SLU7 tracks with ATR across cell lines

🎨

Next: Creating This Plot with Seaborn

Seaborn makes it incredibly easy to create beautiful regression plots with confidence intervals. Let's see how to generate this visualization in just a few lines of Python!

Creating Regression Plots with Seaborn
Beautiful Statistical Visualizations

🎨 What is Seaborn?

A high-level Python visualization library built on top of matplotlib

🎯

Statistical plots made simple

🌈

Beautiful default styles

📊

Automatic confidence intervals

🐍 Creating the ATR vs SLU7 Plot

# Step 1: Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 2: Prepare the data for the top hit (SLU7)
top_gene = 'SLU7'
plot_data = pd.DataFrame({
    'ATR': atr_values,
    'SLU7': depmap_df[top_gene]
})
# Step 3: Create figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Step 4: Create the regression plot with seaborn
sns.regplot(
    x='ATR',
    y='SLU7',
    data=plot_data,
    ax=ax,
    scatter_kws={'alpha': 0.6, 's': 50},
    line_kws={'color': 'red', 'linewidth': 2}
)
# Step 5: Add labels and title
ax.set_xlabel('ATR Dependency Score', fontsize=12)
ax.set_ylabel('SLU7 Dependency Score', fontsize=12)
ax.set_title('ATR vs SLU7 Gene Dependencies (r=0.89)', fontsize=14)
plt.show()

🎯 Key Parameters

x, y, data:Column names and DataFrame
scatter_kws:Customize scatter points (alpha, size)
line_kws:Customize regression line (color, width)

✨ Automatic Features

  • ✓ Calculates regression line automatically
  • ✓ Adds confidence interval shading
  • ✓ Handles missing values gracefully
  • ✓ Beautiful default styling
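One caveat: the title above hard-codes r=0.89. You can compute r from the data and format it into the title instead. A sketch with toy values standing in for the real dependency columns:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy stand-ins for the ATR and SLU7 dependency columns
atr_values = np.array([-1.1, -0.7, -0.4, -0.1, 0.2, 0.4])
slu7_vals = np.array([-1.0, -0.8, -0.3, -0.2, 0.1, 0.5])

# Compute the correlation, then build the title from it
r, p = pearsonr(atr_values, slu7_vals)
title = f'ATR vs SLU7 Gene Dependencies (r={r:.2f})'
print(title)
```

This way the title always matches the data actually plotted.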

📊 The Result: ATR vs SLU7 Correlation

ATR vs SLU7 regression plot showing strong positive correlation
r = 0.89

Very strong positive correlation

Linear Trend

Clear upward slope visible

Confidence

Shaded region shows uncertainty

🎨

Seaborn Makes Statistical Plots Easy

sns.regplot() automatically fits a regression line, calculates confidence intervals, and creates a beautiful visualization - all in one function call! Perfect for exploring relationships in biological data.

🚀 Practice Seaborn Regression Plots

Explore different regression plot styles and customization options with real biological data

Open Seaborn Notebook in Colab

Validating the Discovery: Repeat Analysis
Finding Network Overlap

🔍 What We Found

ATR

DNA damage response kinase

Well-studied role in replication stress

SLU7

mRNA splicing factor

Involved in pre-mRNA processing

⚠️ Unexpected Correlation!

ATR (DNA repair) and SLU7 (RNA splicing) are in different biological pathways
Why do cells that depend on one also depend on the other?

🤔 The Key Question

Is this correlation spurious (coincidence) or real (biologically meaningful)?

❌ Spurious

Random coincidence with no biological meaning. ATR and SLU7 correlate by chance, not because they work together.

✓ Real

A novel biological interaction! ATR and SLU7 may work in the same pathway or compensate for each other in certain contexts.

🎯 Validation Strategy: Network Overlap Analysis

The Logic:

1

We found genes that correlate with ATR

2

Now let's find genes that correlate with SLU7

3

Check the overlap between the two gene lists

Expected Outcomes:

Low Overlap → Spurious
(Diagram: ATR and SLU7 circles with few shared genes)

ATR and SLU7 have different correlation partners → likely random

High Overlap → Real Interaction!
(Diagram: ATR and SLU7 circles with many shared genes!)

ATR and SLU7 share correlation partners → likely co-regulated

🔬

Next: Run the Same Analysis with SLU7

We'll use the exact same correlation pipeline, but this time with SLU7 as our query gene. Then we'll calculate the intersection to see how many genes appear in both lists!
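One way to run "the exact same pipeline" with a different query gene is to wrap it in a function that takes the query gene as a parameter. This is a hypothetical sketch: `correlate_with_query` and the cell-lines-by-genes layout of `depmap_df` are assumptions, not the course's exact code:

```python
import pandas as pd
from scipy.stats import pearsonr

def correlate_with_query(depmap_df: pd.DataFrame, query_gene: str) -> pd.DataFrame:
    """Correlate every gene column against one query gene.

    Assumes cell lines as rows and genes as columns.
    """
    query = depmap_df[query_gene]
    records = []
    for gene in depmap_df.columns:
        if gene == query_gene:
            continue  # skip the trivial self-correlation (r = 1)
        r, p = pearsonr(query, depmap_df[gene])
        records.append({'gene': gene, 'pearson_r': r, 'pearson_p': p})
    return pd.DataFrame(records)

# Same pipeline, different query gene:
# atr_results_df = correlate_with_query(depmap_df, 'ATR')
# slu7_results_df = correlate_with_query(depmap_df, 'SLU7')
```

Parameterizing the query gene means the validation run is one extra function call, not a copy-pasted notebook.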

Visualizing Overlap: Venn Diagrams
Finding Set Intersections

⭕ What is a Venn Diagram?

A visual representation of set relationships using overlapping circles

Perfect for showing which genes are unique to ATR, unique to SLU7, or shared by both!

📊 Anatomy of a Venn Diagram

(Diagram: two overlapping circles; 320 genes unique to ATR, 298 genes unique to SLU7, and 180 shared genes in the overlap)

🔵 Three Regions

Left only:Genes correlating only with ATR
Right only:Genes correlating only with SLU7
Intersection:Genes correlating with BOTH!

📐 The Math

Total ATR genes = 320 + 180 = 500
Total SLU7 genes = 298 + 180 = 478
Overlap = 180

Each circle's total = (unique) + (shared)

📊 Is the Overlap Significant?

The Question:

Is 180 overlapping genes more than we'd expect by chance?

Random Overlap

If gene lists were random, we'd expect small overlap just by chance

Expected: ~10-20 genes
Our Overlap

180 genes is much larger than expected!

Observed: 180 genes ✓

🎲 Hypergeometric Test

Statistical test that calculates: "What's the probability of getting this much overlap by random chance?"

p-value < 0.001

Very significant!

p-value = 0.03

Significant

p-value = 0.25

Not significant

💡 What This Means:

Low p-value (e.g., < 0.01) means the overlap is unlikely to be random - ATR and SLU7 likely share a biological relationship!
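The hypergeometric test is available in scipy. Here is a sketch using the illustrative numbers from the diagram above; the background size of 18,000 genes is an assumption:

```python
from scipy.stats import hypergeom

M = 18000  # assumed background: total genes tested
n = 500    # genes correlated with ATR
N = 478    # genes correlated with SLU7
k = 180    # observed overlap

# P(overlap >= k) if the two lists were independent random draws
p_value = hypergeom.sf(k - 1, M, n, N)

# Expected overlap under randomness, for comparison
expected = n * N / M  # ≈ 13 genes
print(f"expected ≈ {expected:.1f}, observed = {k}, p = {p_value:.2e}")
```

With an expected overlap of roughly 13 genes, an observed overlap of 180 gives a vanishingly small p-value, matching the "very significant" case above.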

Venn Diagrams Show Set Relationships

The intersection (overlap) tells us which genes are shared. The hypergeometric test tells us if that overlap is statistically significant or just random chance.

Python Sets
The Math Behind Venn Diagrams

🔢 What is a Set?

A collection of unique items with no duplicates or order

Mathematical Set

A = {1, 2, 3, 4, 5}

B = {4, 5, 6, 7}

From mathematics - theory of sets

Python Set

genes_a = {'ATR','CHEK1','TP53'}
genes_b = {'TP53','MDM2'}

Python implementation - curly braces {}

✨ Key Properties of Sets

🎯

Unique Elements

# Duplicates removed!
genes = set(['ATR', 'ATR', 'TP53'])
print(genes)
# {'ATR','TP53'}

Each element appears only once

🔀

No Order

# Order doesn't matter
a = {'ATR','TP53'}
b = {'TP53','ATR'}
a == b # True!

Sets are unordered collections

⚡

Fast Lookups

# Very fast membership test
genes = {'ATR','TP53', ...}
'ATR' in genes
# Instant! O(1)

Checking membership is instant

🔄 Converting Lists to Sets

From a List

# Start with a list of gene names
gene_list = ['ATR', 'CHEK1', 'TP53', 'MDM2']
# Convert to set
gene_set = set(gene_list)

Why? Lists can have duplicates and are slow for lookups. Sets remove duplicates and are fast!

From a DataFrame Column

# Get significant genes from results
sig_genes_df = results_df[
    results_df['significant_both']
]
# Convert column to set
gene_set = set(sig_genes_df['gene'])

Perfect for our analysis! We convert our significant gene lists to sets.

🔣 Set Operations - The Math Behind Venn Diagrams

⭕ Intersection (AND)

Elements in BOTH sets

A & B
atr_genes = {'ATR','CHEK1','TP53'}
slu7_genes = {'TP53','MDM2','SLU7'}
# Method 1: .intersection()
overlap = atr_genes.intersection(slu7_genes)
# Method 2: & operator
overlap = atr_genes & slu7_genes
# Result: {'TP53'}

⭕ Union (OR)

Elements in EITHER set

A | B
# Method 1: .union()
all_genes = atr_genes.union(slu7_genes)
# Method 2: | operator
all_genes = atr_genes | slu7_genes
# Result: {'ATR','CHEK1','TP53','MDM2','SLU7'}

⭕ Difference (NOT)

Elements in A but NOT in B

A - B
# Method 1: .difference()
atr_only = atr_genes.difference(slu7_genes)
# Method 2: - operator
atr_only = atr_genes - slu7_genes
# Result: {'ATR','CHEK1'}
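The three Venn regions map one-to-one onto these operations. A small example with made-up gene sets:

```python
# Made-up gene sets for illustration
atr_genes = {'ATR', 'CHEK1', 'TP53', 'RPA1'}
slu7_genes = {'TP53', 'RPA1', 'MDM2', 'SLU7'}

both = atr_genes & slu7_genes       # center: genes in BOTH lists
atr_only = atr_genes - slu7_genes   # left circle only
slu7_only = slu7_genes - atr_genes  # right circle only

print(sorted(both))  # → ['RPA1', 'TP53']
print(len(atr_only), len(slu7_only), len(both))  # → 2 2 2
```

Note that the three region sizes always sum to the union: len(atr_only) + len(slu7_only) + len(both) == len(atr_genes | slu7_genes).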
🔢

Sets Are Perfect for Gene Overlap Analysis

Python sets implement mathematical set theory, making it easy to find intersections (genes in both lists), unions (all unique genes), and differences (unique to one list). This is exactly what we need for Venn diagram analysis!

🚀 Practice Python Sets

Master set operations with hands-on exercises using gene lists and biological data

Open Python Sets Notebook in Colab

Creating Venn Diagrams: matplotlib-venn
Community-Built Open Source Tool

🎨 What is matplotlib-venn?

A small open-source package that extends matplotlib to create Venn diagrams

❤️

Community Built

Created by kind developers to help everyone

🔓

Free & Open

Available to all researchers worldwide

🔧

Simple API

Easy to use, integrates with matplotlib

💚 The Power of Open Source

How matplotlib-venn Came to Be

Someone needed Venn diagrams in Python, couldn't find a good solution, so they built one and shared it with the world!

👤

One Developer

Had a problem to solve

🔨

Built a Solution

Created matplotlib-venn

🌍

Shared Freely

Now thousands benefit!

✨ This is the Spirit of Open Source!

Scientists and developers sharing tools helps the entire research community move faster. Today we use matplotlib-venn; tomorrow you might create something others need!

🐍 Creating a Venn Diagram

# Step 1: Install the package (if needed)
# pip install matplotlib-venn
# Step 2: Import the libraries
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
# Step 3: Prepare your gene sets
atr_genes = set(atr_results_df[atr_results_df['significant_both']]['gene'])
slu7_genes = set(slu7_results_df[slu7_results_df['significant_both']]['gene'])
# Step 4: Create figure and axes
fig, ax = plt.subplots(figsize=(8, 6))
# Step 5: Create the Venn diagram
venn2(
    [atr_genes, slu7_genes],
    set_labels=('ATR correlations', 'SLU7 correlations'),
    ax=ax
)
# Step 6: Add title and show
ax.set_title('Overlap of ATR and SLU7 Correlation Networks', fontsize=14)
plt.show()

🎯 Key Points

set():Convert lists to Python sets for overlap calculation
venn2():Creates 2-circle Venn diagram (venn3 for 3 circles)
set_labels:Labels for each circle

✨ What You Get

  • ✓ Automatic overlap calculation
  • ✓ Counts displayed in each region
  • ✓ Proportional circle sizes (optional)
  • ✓ Customizable colors and styling

📊 Bonus: Getting the Intersection in Python

# Python sets have built-in intersection operations!
overlap_genes = atr_genes.intersection(slu7_genes)
# Or use the & operator
overlap_genes = atr_genes & slu7_genes
# Get the count
overlap_count = len(overlap_genes)
# Print the results
print(f"ATR genes: {len(atr_genes)}")
print(f"SLU7 genes: {len(slu7_genes)}")
print(f"Overlap: {overlap_count} genes")

💡 Pro Tip:

Python's set data structure is perfect for finding overlaps, unions, and differences. It's fast and has intuitive operators like & (intersection), | (union), and - (difference)!

❤️

Small Tools, Big Impact

matplotlib-venn is a perfect example of open-source collaboration. A developer created a useful tool and shared it freely, and now researchers worldwide use it to visualize their data. This is how science moves forward together!

Summary & Open Questions
What Have We Discovered?

🔬 Our Analysis: ATR Gene Dependency

We analyzed ATR gene dependency correlations in breast and myeloid cancer cell lines

📊

Genome-wide correlation

Tested all genes vs ATR

🎯

Top hit: SLU7

Highly correlated with ATR

🔍

Validation analysis

Network overlap check

🧬 Shared Correlation Partners

The intersection of correlations for ATR and SLU7 revealed these genes:

CHEK1

DEFB121

FUBP1

HIGD1A

RPA1

RPA2

U2SURP

These 7 genes correlate with both ATR and SLU7 in our dataset

💡 What Do We Know?

ATR: DNA damage response, replication stress

SLU7: RNA splicing factor

Some shared genes: CHEK1, RPA1, RPA2 are known DNA repair proteins

🤔 Questions to Think About

Is this overlap significant?

We found 7 genes that correlate with both ATR and SLU7. But is this more than we'd expect by chance?

Think about: How could we test if this overlap is statistically significant? What would we compare it to?

Could this point to a new pathway?

ATR is involved in DNA repair, while SLU7 is an RNA splicing factor. Finding them correlated is unexpected!

Think about: Could there be a biological connection between DNA repair and splicing? What would it mean if these pathways interact?

What further analyses could we do?

We've done correlation analysis and checked for overlap. What's the next step?

• Could we test if specific biological pathways are enriched in the shared genes?

• Should we look at protein interactions between ATR and SLU7?

• Would experimental validation help confirm this connection?

• Are there published studies linking DNA repair and splicing?

🎓

From Data to Discovery

You've learned how to go from raw data to biological insights:

Load & explore data

Statistical analysis

Visualization

Interpretation

But science doesn't end with code - it ends with questions. The tools we've learned give you the power to ask better questions and design experiments to answer them. Now it's your turn to explore!

📓 Practice Notebooks

Master statistical analysis and visualization with hands-on exercises

View All Lecture 5 Notebooks →

Practice SciPy statistics, Seaborn visualization, FDR correction, and complete end-to-end analysis

Next Steps with Python
Your Journey Continues

🎉 You've Completed the Core Python Course!

You now have the fundamental skills to analyze biological data with Python.
But this is just the beginning of your coding journey...

💡 The Best Way to Learn: Apply Your Skills

Reading tutorials won't make you a programmer - solving real problems will.

🔬

Start Small

Automate repetitive tasks in your lab work

📊

Build Projects

Analyze your own research data

🚀

Keep Challenging

Tackle increasingly complex problems

🧬 Bioinformatics Pathways

🧬Next-Generation Sequencing (NGS) Analysis

RNA-seq, ChIP-seq, ATAC-seq, single-cell sequencing

Biopython

Sequence parsing, alignment, BLAST

pysam

BAM/SAM file manipulation

scanpy

Single-cell RNA-seq analysis

🔬Image Analysis

Microscopy, cell counting, segmentation, feature extraction

scikit-image

Image processing algorithms

CellProfiler

High-throughput cell analysis

napari

Multi-dimensional image viewer

🤖Machine Learning & AI

Predictive modeling, classification, deep learning for biology

scikit-learn

Classical ML algorithms

PyTorch

Deep learning framework

AlphaFold

Protein structure prediction

🛠️ Level Up Your Development Skills

Move Beyond Colab

VS Code

Professional code editor with Python support

code.visualstudio.com →

uv

Fast Python package & project manager

docs.astral.sh/uv →

Terminal Skills

Learn bash commands: cd, ls, grep, find

Version Control

Git Basics

Track changes: git add, commit, push

Essential for any serious project

GitHub

Share code, collaborate, build portfolio

github.com →

Why It Matters

Backup your work, collaborate with others, showcase your skills

🚀

You're Ready to Code

You've learned the fundamentals. Now the real learning begins through building, breaking, and fixing real projects.

✓ You can load and manipulate data with pandas

✓ You can perform statistical analysis with scipy

✓ You can create visualizations with matplotlib and seaborn

✓ You can read documentation and learn new packages independently

These skills are transferable - whether you pursue academia, industry, or clinical research, Python will amplify your impact.

Keep coding. Keep learning. Keep building. 💚