Course Summary
Your Python Journey So Far
Today we put everything together in a complete end-to-end data analysis project.
These skills are vital for your final course assignment!
🧮Lecture 1: Lab Calculations
Basic Python Syntax
Variables, types, operators, and assignment
For Loops
Iteration and automation of repetitive tasks
Lists & File I/O
Working with collections and reading/writing files
🧬Lecture 2: Analyzing DNA
String Operations
Manipulating DNA sequences with Python strings
Functions
Creating reusable code blocks for analysis
Biopython
Using packages for biological data formats
📊Lecture 3: DepMap Data
Object-Oriented Programming
Classes, objects, and methods for data structures
Pandas DataFrames
Powerful data manipulation and analysis
Error Handling
Try-except blocks for robust code
📈Lecture 4: Data Visualization
Matplotlib
Creating professional scientific plots
Exploratory Analysis
Techniques for understanding complex datasets
Vectorization
Fast computations with NumPy arrays
Today: End-to-End Analysis Project
🔬 The Research Question
Which genes correlate with ATR in cancer cell lines?
- • Find genes with similar dependency patterns
- • Apply statistical corrections for multiple testing
- • Analyze network overlap between top genes
- • Interpret biological significance
🛠️ Skills You'll Apply
This is How Real Bioinformatics Works
Combining Python fundamentals, statistical analysis, and biological interpretation to answer research questions - exactly what you'll do in your final assignment!
Gene Dependency Analysis & Visualization
with Python
Master linear regression and create complex visualizations to understand cancer gene relationships
📈Linear Regression
Statistical Modeling
Build predictive models to understand gene expression relationships
Correlation Analysis
Discover how different genes influence each other in cancer
Model Evaluation
Assess model quality using statistical validation
🎨Data Visualization
Matplotlib Foundations
Create professional scientific plots and customize every detail
Seaborn Analytics
Generate statistical visualizations with minimal code
Detailed Visualizations
Build advanced visualizations for data exploration
🧬Cancer Data Insights
Gene Dependency Patterns
Visualize relationships between essential genes across cancer types
Comparative Analysis
Create scatter plots to assess correlations between genes
Venn Diagrams
Analyse gene dependency network overlap
🎯 What You'll Create Today
Linear Regression Models
Predict gene dependencies using statistical relationships
Correlation Networks
Visualize gene network overlap with Venn diagrams
Scatter Plot Analysis
Compare dependency scores between pairs of genes
Professional Visualizations
Create publication-quality figures with custom styling
🛠️ Python Libraries We'll Master
Scipy
Scientific computing
Pandas
Data manipulation
Matplotlib
Plot creation
Seaborn
Statistical plots
📚 Your Python Journey Continues
💡 Today's Key Insight
Move beyond simple data analysis to predictive modeling and visual storytelling - the hallmarks of professional data science in cancer research
🚀 Complete End-to-End Analysis Notebook
Follow along with the full analysis workflow in this comprehensive Jupyter notebook
Open End-to-End Analysis in Colab
The Next Challenge: Gene Correlation & Visualization 📈

Sarah needs to find genes that correlate with ATR
The New Challenge: ATR Gene Correlations
Sarah discovered essential genes, but now needs to understand relationships between them. Which genes correlate with ATR (a key DNA repair gene) across cancer types?
🎯 Analysis Goals:
- • Find genes that correlate with ATR
- • Compare breast vs myeloid cancer patterns
- • Create publication-ready visualizations
- • Build predictive statistical models
🔬 Her Research Questions:
• "Which genes show similar dependency patterns to ATR?"
• "How do these correlations differ between cancer types?"
• "Can I predict ATR dependency from other genes?"
📊 Solution: Statistical Analysis & Visualization!
Use linear regression, correlation analysis, and create stunning plots with Matplotlib & Seaborn
🎨 Today's Tools:
Understanding Gene Dependency Correlation
🤝 What Are We Measuring?
Gene Dependency Correlation = How similarly two genes behave across different cancer cell lines
✓ Positive Correlation
Cells that depend on Gene X also depend on Gene Y
✗ Negative Correlation
Cells that do not need Gene X need Gene Y and vice versa
✓ Positive: ATR vs ATRIP
ATR vs ATRIP (r = 0.85)
Cells that depend on ATR also depend on ATRIP
- • Upward diagonal trend
- • Both genes work together
- • DNA repair partners
✗ Negative: ATR vs MDM2
ATR vs MDM2 (r = -0.72)
Cells that don't need ATR need MDM2 and vice versa
- • Downward diagonal trend
- • Opposite dependencies
- • Compensatory pathways
○ None: ATR vs COL1A1
ATR vs COL1A1 (r = 0.02)
No relationship between genes
- • Random scatter pattern
- • Independent functions
- • Different pathways
📊 What Each Point Represents
Each Dot = One Cancer Cell Line
20 different breast cancer cell lines from our DepMap dataset
X-Axis = ATR Dependency
More negative = more essential for cell survival
Y-Axis = Other Gene
ATRIP (partner), MDM2 (opposite pattern), or COL1A1 (unrelated)
💡 Key Insight: Correlation = Functional Relationship
Positive correlation (+0.85) = Genes co-essential, work together
Negative correlation (-0.72) = Opposite patterns, compensatory pathways
No correlation (0.02) = Genes function independently
🎯 This is how we discover biological networks from data!
Biological Context: Why Study ATR?
🧬 ATR: The DNA Damage Guardian
What is ATR?
ATR = Ataxia Telangiectasia and Rad3-related protein
- • A protein kinase that detects DNA damage
- • Acts as a "checkpoint" - stops cell division when DNA is broken
- • Essential for genome stability
- • Part of the DNA damage response pathway
Why is ATR Important?
- 🛡️ Genome Protection: Prevents cells from dividing with damaged DNA
- 🎯 Cancer Vulnerability: Cancer cells often rely heavily on ATR
- 💊 Drug Target: ATR inhibitors are being developed as cancer treatments
🔄 ATR Pathway: How It Works
DNA Damage
UV, chemicals,
replication stress
ATR Detection
ATRIP helps ATR
find the damage
Cell Cycle Stop
ATR activates
CHEK1 checkpoint
DNA Repair
Fix damage before
cell division
🤝 ATR's Network Partners
👥Direct Partners
- • ATRIP - Recruits ATR to damaged DNA
- • CHEK1 - Main target kinase
- • RPA1 - DNA binding protein
- • TOPBP1 - ATR activator
🔧Repair Machinery
- • BRCA1 - Homologous recombination
- • RAD51 - DNA strand exchange
- • PARP1 - DNA break detection
- • 53BP1 - Damage focus formation
🎯Cancer Relevance
- • Cancer cells often carry high levels of DNA damage and replication stress
- • They depend on ATR for survival
- • ATR inhibitors cause selective cancer death
- • Synthetic lethality opportunity
🔬 Why Analyze ATR Correlations in Cancer Data?
Research Questions
- • Which genes show similar dependency patterns to ATR?
- • Can we identify unknown members of the DNA damage response?
- • How does ATR dependency vary across cancer types?
- • Which combinations create synthetic lethal interactions?
Clinical Applications
- • Identify patients most likely to respond to ATR inhibitors
- • Find combination therapy targets
- • Predict drug resistance mechanisms
- • Discover biomarkers for treatment selection
💡 Key Insight: From Biology to Data Science
By understanding ATR's biological role, we can interpret our correlation analysis meaningfully
High correlations with ATR likely represent DNA repair pathway members - potential therapeutic targets!
🎯 Biology guides our data analysis interpretation
Project Overview
End-to-End Analysis Workflow
🎯 Our Research Question
Which genes correlate with ATR in cancer cell lines?
And once we find the top gene, does it share a similar correlation network with ATR?
📊 Exploratory Data Analysis (EDA)
What We Do:
- • Load DepMap dependency data
- • Check data structure and quality
- • Handle missing values
- • Visualize ATR distribution
Why It Matters:
Understanding our data quality and distribution before analysis ensures reliable results
🔍 Correlation Analysis: ATR vs All Genes
What We Do:
- • Calculate Pearson correlation (17,000+ genes)
- • Calculate Spearman correlation
- • Compute p-values for significance
- • Create volcano plots
Output:
List of genes ranked by correlation strength with ATR
📈 Statistical Correction (FDR)
What We Do:
- • Apply Benjamini-Hochberg FDR correction
- • Filter for FDR < 0.05
- • Identify statistically significant genes
- • Find top correlated gene (e.g., SLU7)
Why It Matters:
Testing 17,000 genes = high risk of false positives. FDR correction keeps the expected share of false positives among our hits low
🔁 Repeat Analysis: Top Gene vs All Genes
What We Do:
- • Run same correlation analysis for SLU7
- • Find genes correlated with SLU7
- • Apply FDR correction again
- • Identify SLU7's correlation network
Goal:
Discover which genes are associated with our top ATR partner
🕸️ Compare Correlation Networks
What We Do:
- • Find intersection of ATR and SLU7 networks
- • Calculate Jaccard Index (overlap score)
- • Create Venn diagrams
- • Interpret biological significance
Key Question:
Do ATR and SLU7 share correlated genes? High overlap = functional module/pathway
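The overlap score mentioned above can be sketched with plain Python sets. This is a minimal illustration: the helper name `jaccard_index` and the gene lists are hypothetical placeholders, not results from the dataset.

```python
def jaccard_index(set_a, set_b):
    """Jaccard index = |A and B| / |A or B| (0 = no overlap, 1 = identical sets)."""
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


# Hypothetical gene lists, for illustration only
atr_network = {"CHEK1", "RPA1", "RPA2", "GENE_A"}
slu7_network = {"CHEK1", "RPA1", "GENE_B", "GENE_C"}

shared = atr_network & slu7_network
score = jaccard_index(atr_network, slu7_network)
print(sorted(shared), round(score, 3))
```

Here 2 shared genes out of 6 unique genes give a Jaccard index of about 0.33.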
🚀 Complete End-to-End Analysis Notebook
Follow along with the full analysis workflow in this comprehensive Jupyter notebook
Open End-to-End Analysis in Colab
💡 Why This Workflow Matters
🔬 Rigorous Science
Each step builds on the last with proper statistical validation
♻️ Reusable Code
Write functions once, apply to any gene in the dataset
🧬 Biological Discovery
Uncover functional relationships and gene modules from data
Let's dive into each step and see how we implement this analysis in Python!
Step 1: Exploratory Data Analysis
📊 Always Start with EDA!
Before any analysis, we must understand our data
Bad data in = bad results out. Quality control is essential!
✓ EDA Checklist
1. Dataset Structure
- • How many cell lines? (N = 94)
- • How many genes? (~17,000)
- • What cancer types? (Breast, Myeloid)
2. Data Distribution
- • What's the range of values?
- • Are values centered around 0?
- • Any outliers or strange patterns?
3. Gene of Interest (ATR)
- • Is ATR present in the dataset?
- • Does it show variation across cells?
- • What's the mean dependency?
⚠️ Critical: Missing Values (NaN)
Why NaN Values Matter
Missing data can break your analysis or produce misleading results
- • Correlation functions may fail
- • Statistical tests become invalid
- • Plots show gaps or errors
- • Sample size effectively reduced
Handling Missing Values
Option 1: Remove Genes
If >20% missing, exclude gene from analysis
Option 2: Impute Values
Fill with median/mean if <20% missing
Option 3: Exclude Rows
Remove cell lines with missing data (use with caution!)
💻 Quick EDA & Missing Value Check
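A quick check along the lines of the checklist above might look like the sketch below. It assumes the table has cell lines as rows and genes as columns. The small DataFrame is a synthetic stand-in, not the real DepMap file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DepMap table: rows = cell lines, columns = genes (assumed layout)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(loc=-0.2, scale=0.5, size=(6, 4)),
    index=[f"cell_line_{i}" for i in range(6)],
    columns=["ATR", "ATRIP", "CHEK1", "COL1A1"],
)
df.loc["cell_line_2", "CHEK1"] = np.nan  # inject one missing value for the demo

print(df.shape)               # (number of cell lines, number of genes)
print("ATR" in df.columns)    # is the gene of interest present?
print(df["ATR"].describe())   # range, mean, and spread of ATR dependency

nan_per_gene = df.isna().sum()                   # NaN count for each gene column
genes_with_nan = nan_per_gene[nan_per_gene > 0]  # keep only genes that have gaps
print(genes_with_nan)
```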
📈 Our Dataset After EDA
Clean Data
94 cell lines, 17,205 genes
No Major Issues
Only 5 genes with NaN (<1%)
Ready for Analysis
All NaN imputed with median
Never Skip EDA!
Spending 10 minutes on EDA can save hours of debugging later. Missing values are one of the most common causes of analysis failures!
Dealing with Missing Values (NaN)
⚠️ Two-Step Strategy for NaN Values
Step 1: Remove genes or cell lines with too many missing values (threshold: 20%)
Step 2: Impute remaining missing values
🗑️ Step 1: Remove Data Above Threshold (20%)
📊Remove Columns (Genes)
If >20% of cell lines are missing data for a gene
Example:
→ Remove entire gene ABC
🧬Remove Rows (Cell Lines)
If >20% of genes are missing for a cell line
Example:
→ Remove cell line 42
⚡ Why 20%?
This is a judgment call - not a hard rule! Common thresholds range from 10-30% depending on your data. The key principle: too much missing data makes imputation unreliable and can introduce bias.
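As a sketch of Step 1, the fraction of missing values can be computed per gene and per cell line and compared to the threshold. The table, gene names, and the choice to drop anything strictly above 20% are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic table: 10 cell lines x 10 genes (hypothetical names and values)
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.normal(size=(10, 10)),
    index=[f"line_{i}" for i in range(10)],
    columns=[f"GENE_{j}" for j in range(10)],
)
df.iloc[0:4, 3] = np.nan        # GENE_3: 40% missing -> the gene should be removed
df.iloc[7, [0, 1, 2]] = np.nan  # line_7: 3 genes missing -> the cell line should be removed
df.iloc[5, 8] = np.nan          # a single NaN that stays for imputation in Step 2

threshold = 0.20  # a judgment call, as discussed above

# Drop genes (columns) whose missing fraction exceeds the threshold
missing_per_gene = df.isna().mean(axis=0)
kept_genes = df.loc[:, missing_per_gene <= threshold]

# Then drop cell lines (rows) whose missing fraction exceeds the threshold
missing_per_line = kept_genes.isna().mean(axis=1)
cleaned = kept_genes.loc[missing_per_line <= threshold]

print(df.shape, "->", cleaned.shape)
```

Removing genes first and cell lines second is one possible order. With real data it is worth checking whether the order changes what gets dropped.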
🔧 Step 2: Impute Values Below Threshold
What is Imputation?
Imputation = Filling missing values with estimated values based on existing data
🧬 For Our Gene Dependency Data
Impute within each column (one gene at a time, using its values across cell lines)
Logic:
- • A gene's dependency scores tend to be similar across cell lines
- • Use median/mean of that gene's values
- • More biologically meaningful
📊 Imputation Methods
✓ Median (Recommended)
Middle value - robust to outliers
Best for: Gene dependency data (has outliers)
○ Mean (Alternative)
Average value - sensitive to outliers
Best for: Normally distributed data
△ Forward/Backward Fill
Copy previous/next value
Best for: Time series with ordered index
◇ Constant Value
Fill with 0 or specific number
Best for: When missing = zero (rare)
🌳 Decision Tree: Which Method to Use?
Time Series Data?
Yes → Use forward/backward fill
Example: Temperature over time, stock prices
Has Outliers?
Yes → Use median
Example: Gene dependencies (our case!)
Normal Distribution?
Yes → Use mean
Example: Height, weight measurements
✓ Our Approach for This Analysis
What We Do:
- 1. Remove genes with >20% missing (none in our data!)
- 2. Remove cell lines with >20% missing (none!)
- 3. Impute remaining NaN with median per gene
Why This Works:
- • Only 5 genes have NaN (<1% each)
- • Median is robust to outliers
- • Per-gene imputation preserves biological meaning
- • Minimal impact on downstream analysis
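Per-gene median imputation can be sketched in two lines of pandas. The values below are made up. GENE_A includes an outlier to show why the median is the safer choice here.

```python
import numpy as np
import pandas as pd

# Toy data: two genes (columns), five cell lines (rows); values are invented
df = pd.DataFrame(
    {
        "GENE_A": [-1.0, -0.8, np.nan, -0.9, 2.0],  # 2.0 is an outlier
        "GENE_B": [0.1, np.nan, 0.3, 0.2, 0.0],
    }
)

gene_medians = df.median()         # one median per gene; NaN values are skipped
imputed = df.fillna(gene_medians)  # each NaN is replaced by its own gene's median

print(gene_medians)
print(imputed)
```

For GENE_A the median is -0.85, while the mean would be pulled up to -0.175 by the single outlier.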
Imputation Strategy Matters!
Different data types need different approaches. For gene dependencies, median imputation per gene is a robust and widely used default.
🚀 Practice Missing Value Analysis
Work through different NaN detection and imputation strategies with real data
Open Missing Values Notebook in Colab
Genome-Wide Correlation Analysis
ATR vs 17,000 Genes
🎯 The Challenge
Calculate correlation between ATR and every other gene (~17,000 comparisons!)
We'll use a for loop to iterate through all genes and store results in a new DataFrame
📊 Step 1: Prepare Input Data
Extract ATR Values (pd.Series)
This gives us ATR's dependency score for each of the 94 cell lines
Extract Other Genes (pd.DataFrame)
DataFrame with 94 rows (cell lines) × 17,204 columns (genes)
💡 Why This Structure?
gene_values = single column (ATR) → pd.Series
other_genes = multiple columns (all other genes) → pd.DataFrame
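The two inputs can be extracted roughly as follows. The variable names `gene_values` and `other_genes` follow the slide. The DataFrame itself is a small synthetic stand-in for the cleaned dependency table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 8 cell lines x 4 genes
rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(8, 4)),
    columns=["ATR", "ATRIP", "CHEK1", "COL1A1"],
)

gene_values = df["ATR"]               # single column -> pd.Series
other_genes = df.drop(columns="ATR")  # all remaining columns -> pd.DataFrame

print(type(gene_values).__name__, gene_values.shape)
print(type(other_genes).__name__, other_genes.shape)
```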
🔧 Step 2: The calculate_gene_correlations() Function
📝 What Happens in the Loop?
- 1. Extract each gene's values (one column at a time)
- 2. Calculate Pearson & Spearman correlations with ATR
- 3. Store r-values and p-values in separate lists
- 4. Repeat 17,204 times!
⭐ Building a DataFrame from Lists
New skill! Create DataFrame from dictionary of lists:
- • Keys → column names
- • Values (lists) → column data
- • All lists must be same length
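One possible shape for this function is sketched below. The signature and column names are assumptions based on the description above, and the notebook version may differ. The sketch also assumes missing values were already imputed.

```python
import numpy as np
import pandas as pd
from scipy import stats


def calculate_gene_correlations(gene_values, other_genes):
    """Correlate one gene (pd.Series) with every column of other_genes (pd.DataFrame)."""
    genes, pearson_r, pearson_p, spearman_r, spearman_p = [], [], [], [], []
    x = gene_values.to_numpy()

    for gene in other_genes.columns:
        y = other_genes[gene].to_numpy()      # 1. one gene (column) at a time
        r, p = stats.pearsonr(x, y)           # 2. linear correlation + p-value
        rho, rho_p = stats.spearmanr(x, y)    # 2. rank correlation + p-value
        genes.append(gene)                    # 3. store results in separate lists
        pearson_r.append(r)
        pearson_p.append(p)
        spearman_r.append(rho)
        spearman_p.append(rho_p)

    # Dictionary of equal-length lists -> DataFrame (keys become column names)
    return pd.DataFrame(
        {
            "gene": genes,
            "pearson_r": pearson_r,
            "pearson_p": pearson_p,
            "spearman_r": spearman_r,
            "spearman_p": spearman_p,
        }
    )


# Synthetic stand-in data: 30 cell lines, 2 genes (not real DepMap values)
rng = np.random.default_rng(0)
atr = pd.Series(rng.normal(size=30), name="ATR")
other_genes = pd.DataFrame(
    {
        "PARTNER": atr + rng.normal(scale=0.2, size=30),
        "UNRELATED": rng.normal(size=30),
    }
)

results = calculate_gene_correlations(atr, other_genes)
print(results.sort_values("pearson_r", ascending=False))
```

Because the function takes any Series and DataFrame, the same call can later be reused with a different query gene.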
📊 Step 3: Understanding the Results DataFrame
What We Get Back
Example Output
| gene | pearson_r | pearson_p |
|---|---|---|
| ATRIP | 0.847 | 1.2e-25 |
| CHEK1 | 0.731 | 3.5e-18 |
| RPA1 | 0.692 | 8.4e-15 |
| COL1A1 | 0.021 | 0.842 |
+ spearman_r, spearman_p columns
17,204 rows
One row per gene (except ATR)
5 columns
gene name, 2 r-values, 2 p-values
Organized data!
Easy to sort, filter, and analyze
Functions + For Loops + DataFrames = Powerful Analysis!
By combining for loops to iterate, lists to collect results, and pd.DataFrame() to organize data, we can analyze 17,000+ genes efficiently!
🚀 Practice Creating DataFrames
Learn how to build new DataFrames from lists and dictionaries with hands-on examples
Open DataFrame Creation Notebook in Colab
Correlation Methods
Pearson vs Spearman
📊 Measuring Relationships Between Genes
Correlation quantifies how two variables change together
We'll use two complementary methods: Pearson (linear) and Spearman (rank-based)
📊 What Pearson Correlation Looks Like
r = 1.0
Perfect positive
R² = 1.0 (100%)
r = 0.8
Strong positive
R² = 0.64 (64%)
r = 0.0
No correlation
R² = 0.0 (0%)
r = -0.9
Strong negative
R² = 0.81 (81%)
💡 Understanding R² (R-squared)
R² = r² (square of Pearson correlation)
R² tells you the percentage of variance in one variable explained by the other
Example: r = 0.8 → R² = 0.64 → 64% of variance explained
📈 Pearson Correlation (r)
What It Measures
Linear relationship between two continuous variables
- • Measures strength and direction of linear association
- • Values range from -1 to +1
- • +1 = perfect positive correlation
- • 0 = no linear correlation
- • -1 = perfect negative correlation
Formula
Measures covariance normalized by standard deviations
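In code, "covariance normalized by standard deviations" can be checked directly against SciPy. The numbers below are arbitrary example values.

```python
import numpy as np
from scipy import stats

# Arbitrary example values for two variables
x = np.array([-1.2, -0.8, -0.5, -0.3, 0.1, 0.4])
y = np.array([-1.0, -0.9, -0.4, -0.1, 0.0, 0.5])

# r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1) throughout
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy)  # the two values should agree
```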
When to Use Pearson
- ✓ Data is continuous
- ✓ Relationship is linear
- ✓ Data is roughly normally distributed
- ✓ No major outliers
- ✓ Variables are measured on an interval or ratio scale (Pearson's r does not depend on the units)
⚠️ Limitations
- • Sensitive to outliers
- • Only detects linear relationships
- • Assumes normality for significance testing
📊 Spearman Correlation (ρ or rho)
What It Measures
Monotonic relationship using ranks instead of raw values
- • Converts data to ranks (1st, 2nd, 3rd...)
- • Then calculates Pearson on ranks
- • Values range from -1 to +1
- • Detects any monotonic relationship (not just linear)
- • Robust to outliers
How It Works
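The "rank first, then Pearson" idea can be sketched with `scipy.stats.rankdata`. The example uses a made-up exponential relationship, which is perfectly monotonic but far from linear.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)  # always increasing, but strongly non-linear

# Step 1: convert raw values to ranks (1 = smallest)
x_ranks = stats.rankdata(x)
y_ranks = stats.rankdata(y)

# Step 2: Pearson correlation on the ranks equals Spearman's rho
rho_manual, _ = stats.pearsonr(x_ranks, y_ranks)
rho_scipy, _ = stats.spearmanr(x, y)

# For comparison: Pearson on the raw values is lower because the trend is curved
r_pearson, _ = stats.pearsonr(x, y)
print(rho_manual, rho_scipy, r_pearson)
```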
When to Use Spearman
- ✓ Data has outliers
- ✓ Relationship is non-linear but monotonic
- ✓ Data is ordinal (ranked)
- ✓ Data is not normally distributed
- ✓ Want a robust measure
✓ Advantages
- • Robust to outliers
- • No normality assumption
- • Detects monotonic (not just linear) relationships
⚖️ Pearson vs Spearman: Quick Comparison
| Feature | Pearson (r) | Spearman (ρ) |
|---|---|---|
| Relationship Type | Linear only | Any monotonic |
| Data Type | Continuous values | Ranks |
| Outlier Sensitivity | High 😟 | Low 😊 |
| Normality Assumption | Yes (for p-values) | No |
| Computational Speed | Fast ⚡ | Slower (ranking step) |
| Best For | Clean, linear data | Robust validation |
🎯 Our Approach: Calculate Both!
Why Calculate Both?
- • Pearson: Primary measure for linear relationships
- • Spearman: Confirms findings, handles outliers
- • If they agree → strong evidence
- • If they disagree → investigate outliers/non-linearity
Code Example
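A minimal sketch of calculating both measures, using synthetic data rather than DepMap values. Adding one extreme point shows the kind of disagreement that should prompt a closer look at outliers.

```python
import numpy as np
from scipy import stats

# Synthetic, well-correlated pair of "genes" (40 cell lines)
rng = np.random.default_rng(0)
atr = rng.normal(size=40)
partner = atr + rng.normal(scale=0.3, size=40)

r, r_p = stats.pearsonr(atr, partner)
rho, rho_p = stats.spearmanr(atr, partner)
print(f"clean data:   Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# Add a single extreme outlier that goes against the trend
atr_out = np.append(atr, 10.0)
partner_out = np.append(partner, -10.0)

r_out, _ = stats.pearsonr(atr_out, partner_out)
rho_out, _ = stats.spearmanr(atr_out, partner_out)
print(f"with outlier: Pearson r = {r_out:.2f}, Spearman rho = {rho_out:.2f}")
```

Pearson changes a lot after adding one point, while Spearman moves much less, because only the rank of the outlier matters.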
Two Methods, Better Validation
Pearson for linear relationships, Spearman for robust validation. When both agree, you have strong evidence of correlation!
Introducing SciPy
Scientific Computing for Python
🔬 What is SciPy?
SciPy is a comprehensive library for scientific and technical computing
Built on NumPy, it provides tools for statistics, optimization, signal processing, and more
📦 SciPy Ecosystem - What's Inside?
📊scipy.stats
Statistical functions & tests
Distributions, t-tests, ANOVA, correlations
🔍scipy.optimize
Optimization algorithms
Curve fitting, root finding, minimization
📈scipy.interpolate
Interpolation tools
1D/2D interpolation, splines
∫scipy.integrate
Integration & ODEs
Numerical integration, differential equations
🧮scipy.linalg
Linear algebra
Matrix operations, eigenvalues, decompositions
📡scipy.signal
Signal processing
Filtering, convolution, spectral analysis
🎯 What We'll Use Today: scipy.stats
pearsonr()
Calculates Pearson correlation coefficient and p-value
Returns: correlation coefficient (r) and significance (p-value)
spearmanr()
Calculates Spearman rank correlation and p-value
Returns: rank correlation (rho) and significance (p-value)
🤔 Why Use SciPy for Statistics?
Validated & Trusted
Industry-standard implementations used in research worldwide
Fast & Optimized
Written in C/Fortran for maximum performance
Comprehensive
Includes p-values, confidence intervals, and more
💻 Installing SciPy
SciPy = Scientific Python Powerhouse
For correlation analysis, scipy.stats provides battle-tested functions that return both correlation coefficients and p-values in one call!
🚀 Practice Correlation Analysis with SciPy
Learn how to use scipy.stats to calculate Pearson and Spearman correlations with real data
Open SciPy Stats Notebook in Colab
The Multiple Testing Problem
Why We Need FDR Correction
⚠️ The Problem
We just tested 17,204 genes for correlation with ATR
With that many tests, we'll get hundreds of false positives by pure chance!
🪙 Understanding the Problem: The Coin Flip Analogy
Single Test (p < 0.05)
Flip a coin 20 times. Get 15+ heads? That's unusual (p < 0.05)
Chance of false positive: 5%
✓ Acceptable risk for one test
17,204 Tests!
Flip 17,204 different coins 20 times each. How many give 15+ heads?
Expected false positives: ~860 genes!
17,204 × 0.05 = 860
💥 With p < 0.05, we'd falsely call 860 genes "correlated" just by chance!
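The arithmetic can be checked with a small simulation. It assumes that no gene is truly correlated, in which case p-values are uniformly distributed between 0 and 1.

```python
import numpy as np

n_genes = 17_204
alpha = 0.05

# Expected number of false positives if every null hypothesis is true
expected_false_positives = n_genes * alpha
print(expected_false_positives)  # about 860

# Simulation: draw one p-value per gene from a uniform distribution
rng = np.random.default_rng(0)
null_p_values = rng.uniform(size=n_genes)
observed_false_positives = int((null_p_values < alpha).sum())
print(observed_false_positives)  # close to 860, varies with the random seed
```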
📊 Our Real Data Example
Genes Tested
Every gene in our dataset (except ATR)
False Positives Expected
Using p < 0.05 without correction
True Positives
How do we know which are real?
✅ The Solution: False Discovery Rate (FDR)
What is FDR?
False Discovery Rate = Expected proportion of false positives among all discoveries
FDR < 0.05 means:
"On average, we expect at most about 5% of our significant genes to be false positives"
Much stricter than using uncorrected p-values!
How Does It Work?
Benjamini-Hochberg method adjusts p-values:
- 1. Sort all p-values from smallest to largest
- 2. Adjust each p-value based on its rank
- 3. Account for the total number of tests
- 4. Control the false discovery rate
Don't worry about the math - the function does it for us!
📉 Impact of FDR Correction
❌ Without FDR Correction
Genes with p < 0.05: ~2,500
False positives: ~860 (34%)
Problem: 1 in 3 is fake!
✓ With FDR < 0.05
Genes with FDR < 0.05: ~450
False positives: ~23 (5%)
Confidence: about 95% are expected to be real!
FDR Correction is Essential for Genome-Wide Studies
Testing thousands of genes means we must correct for multiple testing. FDR gives us confidence that our significant genes are truly biologically meaningful, not just statistical noise!
Applying FDR Correction
Using SciPy
🎉 Good News: SciPy Has FDR Built-In!
No need for another library - scipy.stats includes false_discovery_control() (SciPy 1.11 or newer)
We're already using SciPy for correlations, so this keeps things simple!
🔧 The Function: false_discovery_control()
📥 Inputs
📤 Output
⚠️ Important:
You must compare adjusted p-values to your alpha (e.g., 0.05) to determine significance
📝 Step-by-Step: Adding FDR to Our Results
✓ What We Get
Four new columns in our DataFrame:
- • pearson_p_adjusted: Adjusted p-values
- • spearman_p_adjusted: Adjusted p-values
- • significant_pearson: True/False (adjusted_p < 0.05)
- • significant_spearman: True/False (adjusted_p < 0.05)
💡 Why This Works
- • No extra library to install
- • Adjusted p-values control false discovery rate
- • Same proven Benjamini-Hochberg method
- • Easy to filter: adjusted_p < 0.05
🔍 Finding Significant Genes
Filter by Pearson
Filter by Both Methods
Example Output
Found 412 genes significant by Spearman
Both methods agree on 387 genes ✓
→ Use the 387 genes where both agree for highest confidence!
📊 Quick Count Summary
💡 Pro Tip:
.sum() on a Boolean array counts the True values! This is a quick way to count how many genes passed the FDR threshold.
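Filtering and counting can be sketched with Boolean columns like the ones described above. The table is a made-up example.

```python
import pandas as pd

# Made-up significance flags for four genes
results = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "significant_pearson": [True, True, True, False],
        "significant_spearman": [True, True, False, False],
    }
)

# Filter by Pearson only
pearson_hits = results[results["significant_pearson"]]

# Filter by both methods: & combines two Boolean columns element by element
both_mask = results["significant_pearson"] & results["significant_spearman"]
both_hits = results[both_mask]

# Quick counts: True counts as 1, False as 0
n_pearson = int(results["significant_pearson"].sum())
n_spearman = int(results["significant_spearman"].sum())
n_both = int(both_mask.sum())

print(f"Found {n_pearson} genes significant by Pearson")
print(f"Found {n_spearman} genes significant by Spearman")
print(f"Both methods agree on {n_both} genes")
```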
FDR Correction in One SciPy Function!
false_discovery_control() returns adjusted p-values. Compare them to your alpha (0.05) to determine significance. No extra libraries needed - everything we need is in scipy.stats! Filter for genes where both Pearson and Spearman agree for the most trustworthy results.
Visualizing Results: Volcano Plots
Show Both Effect Size & Significance
🌋 What is a Volcano Plot?
A scatter plot that shows correlation strength (x-axis) vs statistical significance (y-axis)
Named for its shape - significant genes "erupt" from the top like a volcano!
📊 Anatomy of a Volcano Plot
🎨 Color Key
📍 The Four Quadrants
🤔 Why -log₁₀(p-value)?
The Problem with P-values
P-values are tiny numbers (e.g., 0.0001, 0.00000023)
Hard to visualize and compare on a regular axis
The Solution: -log₁₀
Transform p-values to make differences visible
Higher values = more significant (easier to read!)
💡 Key Point:
The negative sign flips small p-values (good) into large numbers (easy to plot). The log₁₀ spreads out the tiny differences in very small p-values.
🐍 Creating a Volcano Plot in Python
✓ What You Get
A clear visualization showing which genes have strong correlations that are also statistically significant - the best candidates for follow-up research!
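A minimal volcano plot sketch with Matplotlib is shown below. The correlation values and p-values are randomly generated placeholders, and the colors and threshold line are one possible styling rather than the notebook's exact figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder results: random correlations with p-values that shrink as |r| grows
rng = np.random.default_rng(0)
n = 200
results = pd.DataFrame({"pearson_r": rng.uniform(-0.8, 0.8, size=n)})
exponent = -8 * results["pearson_r"].abs() * rng.uniform(0.5, 1.0, size=n)
results["pearson_p_adjusted"] = 10.0 ** exponent

# Transform: small p-values become large, easy-to-read numbers (0.0001 -> 4)
results["neg_log10_p"] = -np.log10(results["pearson_p_adjusted"])
significant = results["pearson_p_adjusted"] < 0.05

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(results.loc[~significant, "pearson_r"], results.loc[~significant, "neg_log10_p"],
           color="lightgray", s=12, label="Not significant")
ax.scatter(results.loc[significant, "pearson_r"], results.loc[significant, "neg_log10_p"],
           color="crimson", s=12, label="FDR < 0.05")
ax.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=1)  # significance cutoff
ax.set_xlabel("Pearson r with ATR")
ax.set_ylabel("-log10(adjusted p-value)")
ax.set_title("Volcano plot (placeholder data)")
ax.legend()
```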
Volcano Plots Show the Complete Story
X-axis = effect size (correlation), Y-axis = significance (-log₁₀ p-value). Look for genes in the top corners - they have both strong correlation AND statistical significance!
Visualizing the Top Hit
ATR vs SLU7 Linear Regression
🎯 Our Top Correlation Result
SLU7 shows the strongest correlation with ATR
Let's visualize this relationship with a linear regression plot to understand the trend
📐 What is Linear Regression?
The Linear Model
For Our Data
The regression line shows the best linear fit through our scatter plot of ATR vs SLU7 dependencies.
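A straight-line fit of the form y = slope × x + intercept can be sketched with `scipy.stats.linregress`. The ATR and SLU7 values below are simulated with a known slope of 0.8 and intercept of -0.2. They are not the real dependency scores.

```python
import numpy as np
from scipy import stats

# Simulated dependency scores for 50 cell lines (true slope 0.8, true intercept -0.2)
rng = np.random.default_rng(1)
atr = rng.normal(loc=-0.5, scale=0.3, size=50)
slu7 = 0.8 * atr - 0.2 + rng.normal(scale=0.1, size=50)

fit = stats.linregress(atr, slu7)  # least-squares fit of slu7 = slope * atr + intercept
r_squared = fit.rvalue ** 2        # share of variance explained by the line

# Use the fitted line to predict SLU7 dependency for an ATR score of -1.0
predicted = fit.slope * (-1.0) + fit.intercept

print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, R^2 = {r_squared:.2f}")
print(f"predicted SLU7 at ATR = -1.0: {predicted:.2f}")
```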
🤔 Why Show the Regression Line?
Visual Clarity
The line makes the trend immediately obvious - positive or negative correlation
Prediction
We can predict SLU7 dependency from ATR values using the fitted line
Biological Insight
The slope tells us how strongly SLU7 tracks with ATR across cell lines
Next: Creating This Plot with Seaborn
Seaborn makes it incredibly easy to create beautiful regression plots with confidence intervals. Let's see how to generate this visualization in just a few lines of Python!
Creating Regression Plots with Seaborn
Beautiful Statistical Visualizations
🎨 What is Seaborn?
A high-level Python visualization library built on top of matplotlib
Statistical plots made simple
Beautiful default styles
Automatic confidence intervals
🐍 Creating the ATR vs SLU7 Plot
🎯 Key Parameters
✨ Automatic Features
- ✓ Calculates regression line automatically
- ✓ Adds confidence interval shading
- ✓ Handles missing values gracefully
- ✓ Beautiful default styling
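A regression plot along these lines can be sketched with `sns.regplot`. The data are simulated, and the styling options are just one reasonable choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated dependency scores for 60 cell lines (not real DepMap values)
rng = np.random.default_rng(2)
atr = rng.normal(loc=-0.5, scale=0.3, size=60)
df = pd.DataFrame({"ATR": atr, "SLU7": 0.8 * atr + rng.normal(scale=0.1, size=60)})

fig, ax = plt.subplots(figsize=(6, 5))
sns.regplot(
    data=df,
    x="ATR",                        # column for the x-axis
    y="SLU7",                       # column for the y-axis
    ci=95,                          # width of the shaded confidence interval (percent)
    scatter_kws={"alpha": 0.6},     # styling passed to the scatter points
    line_kws={"color": "crimson"},  # styling passed to the regression line
    ax=ax,
)
ax.set_xlabel("ATR dependency score")
ax.set_ylabel("SLU7 dependency score")
ax.set_title("ATR vs SLU7 (simulated data)")
```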
📊 The Result: ATR vs SLU7 Correlation

Very strong positive correlation
Clear upward slope visible
Shaded region shows uncertainty
Seaborn Makes Statistical Plots Easy
sns.regplot() automatically fits a regression line, calculates confidence intervals, and creates a beautiful visualization - all in one function call! Perfect for exploring relationships in biological data.
🚀 Practice Seaborn Regression Plots
Explore different regression plot styles and customization options with real biological data
Open Seaborn Notebook in Colab
Validating the Discovery: Repeat Analysis
Finding Network Overlap
🔍 What We Found
ATR
DNA damage response kinase
SLU7
mRNA splicing factor
⚠️ Unexpected Correlation!
ATR (DNA repair) and SLU7 (RNA splicing) are in different biological pathways
Why do cells that depend on one also depend on the other?
🤔 The Key Question
Is this correlation spurious (coincidence) or real (biologically meaningful)?
❌ Spurious
Random coincidence with no biological meaning. ATR and SLU7 correlate by chance, not because they work together.
✓ Real
A novel biological interaction! ATR and SLU7 may work in the same pathway or compensate for each other in certain contexts.
🎯 Validation Strategy: Network Overlap Analysis
The Logic:
We found genes that correlate with ATR
Now let's find genes that correlate with SLU7
Check the overlap between the two gene lists
Expected Outcomes:
Low Overlap → Spurious
ATR and SLU7 have different correlation partners → likely random
High Overlap → Real Interaction!
ATR and SLU7 share correlation partners → likely co-regulated
Next: Run the Same Analysis with SLU7
We'll use the exact same correlation pipeline, but this time with SLU7 as our query gene. Then we'll calculate the intersection to see how many genes appear in both lists!
Visualizing Overlap: Venn Diagrams
Finding Set Intersections
⭕ What is a Venn Diagram?
A visual representation of set relationships using overlapping circles
Perfect for showing which genes are unique to ATR, unique to SLU7, or shared by both!
📊 Anatomy of a Venn Diagram
🔵 Three Regions
📐 The Math
Each circle's total = (unique) + (shared)
📊 Is the Overlap Significant?
The Question:
Is 180 overlapping genes more than we'd expect by chance?
Random Overlap
If gene lists were random, we'd expect small overlap just by chance
Our Overlap
180 genes is much larger than expected!
🎲 Hypergeometric Test
Statistical test that calculates: "What's the probability of getting this much overlap by random chance?"
Very significant!
Significant
Not significant
💡 What This Means:
Low p-value (e.g., < 0.01) means the overlap is unlikely to be random - ATR and SLU7 likely share a biological relationship!
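The hypergeometric test can be sketched with `scipy.stats.hypergeom`. Only the overlap of 180 comes from the example above. The total number of genes and both list sizes are hypothetical values chosen for illustration.

```python
from scipy import stats

# Hypothetical numbers for illustration (only the overlap is taken from the example)
total_genes = 17_000  # all genes tested (population size)
atr_hits = 400        # genes significantly correlated with ATR
slu7_hits = 500       # genes significantly correlated with SLU7
overlap = 180         # genes found in both lists

# Overlap expected if the two lists were unrelated random samples
expected_overlap = atr_hits * slu7_hits / total_genes

# P(overlap >= 180): survival function evaluated at overlap - 1
p_value = stats.hypergeom.sf(overlap - 1, total_genes, atr_hits, slu7_hits)

print(f"expected by chance: {expected_overlap:.1f} genes")
print(f"p-value for an overlap of {overlap} or more: {p_value:.3g}")
```

With these made-up list sizes, chance alone would give roughly 12 shared genes, so an overlap of 180 would be extremely unlikely to be random.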
Venn Diagrams Show Set Relationships
The intersection (overlap) tells us which genes are shared. The hypergeometric test tells us if that overlap is statistically significant or just random chance.
Python Sets
The Math Behind Venn Diagrams
🔢 What is a Set?
A collection of unique items with no duplicates or order
Mathematical Set
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
From mathematics - theory of sets
Python Set
Python implementation - curly braces {}
✨ Key Properties of Sets
Unique Elements
Each element appears only once
No Order
Sets are unordered collections
Fast Lookups
Checking membership is very fast, even for large sets
🔄 Converting Lists to Sets
From a List
Why? Lists can have duplicates and are slow for lookups. Sets remove duplicates and are fast!
From a DataFrame Column
Perfect for our analysis! We convert our significant gene lists to sets.
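Both conversions can be sketched in a few lines. The gene list and the small results table are made-up examples, and the column name `significant_pearson` is an assumption.

```python
import pandas as pd

# From a list: duplicates disappear automatically
gene_list = ["CHEK1", "RPA1", "CHEK1", "RPA2", "RPA1"]
gene_set = set(gene_list)
print(len(gene_list), "->", len(gene_set))  # 5 -> 3

# Fast membership check
print("CHEK1" in gene_set)  # True

# From a DataFrame column: keep significant rows, then convert the gene names
results = pd.DataFrame(
    {
        "gene": ["CHEK1", "RPA1", "COL1A1"],
        "significant_pearson": [True, True, False],
    }
)
significant_genes = set(results.loc[results["significant_pearson"], "gene"])
print(sorted(significant_genes))
```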
🔣 Set Operations - The Math Behind Venn Diagrams
⭕ Intersection (AND)
Elements in BOTH sets
⭕ Union (OR)
Elements in EITHER set
⭕ Difference (NOT)
Elements in A but NOT in B
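Using the sets A and B from earlier in this section, the three operations can be sketched as follows.

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

intersection = A & B  # elements in BOTH sets      -> {4, 5}
union = A | B         # elements in EITHER set     -> {1, 2, 3, 4, 5, 6, 7}
difference = A - B    # elements in A but NOT in B -> {1, 2, 3}

# Venn diagram arithmetic: each circle's total = unique part + shared part
assert len(A) == len(A - B) + len(A & B)
assert len(B) == len(B - A) + len(A & B)

print(intersection, union, difference)
```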
Sets Are Perfect for Gene Overlap Analysis
Python sets implement mathematical set theory, making it easy to find intersections (genes in both lists), unions (all unique genes), and differences (unique to one list). This is exactly what we need for Venn diagram analysis!
🚀 Practice Python Sets
Master set operations with hands-on exercises using gene lists and biological data
Open Python Sets Notebook in Colab
Creating Venn Diagrams: matplotlib-venn
Community-Built Open Source Tool
🎨 What is matplotlib-venn?
A small open-source package that extends matplotlib to create Venn diagrams
Community Built
Created by kind developers to help everyone
Free & Open
Available to all researchers worldwide
Simple API
Easy to use, integrates with matplotlib
💚 The Power of Open Source
How matplotlib-venn Came to Be
Someone needed to create Venn diagrams in Python, didn't find a good solution, so they built it and shared it with the world!
One Developer
Had a problem to solve
Built a Solution
Created matplotlib-venn
Shared Freely
Now thousands benefit!
✨ This is the Spirit of Open Source!
Scientists and developers sharing tools helps the entire research community move faster. Today we use matplotlib-venn; tomorrow you might create something others need!
🐍 Creating a Venn Diagram
🎯 Key Points
✨ What You Get
- ✓ Automatic overlap calculation
- ✓ Counts displayed in each region
- ✓ Proportional circle sizes (optional)
- ✓ Customizable colors and styling
📊 Bonus: Getting the Intersection in Python
💡 Pro Tip:
Python's set data structure is perfect for finding overlaps, unions, and differences. It's fast and has intuitive operators like & (intersection), | (union), and - (difference)!
Small Tools, Big Impact
matplotlib-venn is a perfect example of open-source collaboration. A developer created a useful tool and shared it freely, and now researchers worldwide use it to visualize their data. This is how science moves forward together!
Summary & Open Questions
What Have We Discovered?
🔬 Our Analysis: ATR Gene Dependency
We analyzed ATR gene dependency correlations in breast and myeloid cancer cell lines
Genome-wide correlation
Tested all genes vs ATR
Top hit: SLU7
Highly correlated with ATR
Validation analysis
Network overlap check
🧬 Shared Correlation Partners
The intersection of correlations for ATR and SLU7 revealed these genes:
CHEK1
DEFB121
FUBP1
HIGD1A
RPA1
RPA2
U2SURP
These 7 genes correlate with both ATR and SLU7 in our dataset
💡 What Do We Know?
ATR: DNA damage response, replication stress
SLU7: RNA splicing factor
Some shared genes: CHEK1, RPA1, RPA2 are known DNA repair proteins
🤔 Questions to Think About
Is this overlap significant?
We found 7 genes that correlate with both ATR and SLU7. But is this more than we'd expect by chance?
Think about: How could we test if this overlap is statistically significant? What would we compare it to?
Could this point to a new pathway?
ATR is involved in DNA repair, while SLU7 is an RNA splicing factor. Finding them correlated is unexpected!
Think about: Could there be a biological connection between DNA repair and splicing? What would it mean if these pathways interact?
What further analyses could we do?
We've done correlation analysis and checked for overlap. What's the next step?
• Could we test if specific biological pathways are enriched in the shared genes?
• Should we look at protein interactions between ATR and SLU7?
• Would experimental validation help confirm this connection?
• Are there published studies linking DNA repair and splicing?
From Data to Discovery
You've learned how to go from raw data to biological insights:
Load & explore data
Statistical analysis
Visualization
Interpretation
But science doesn't end with code - it ends with questions. The tools we've learned give you the power to ask better questions and design experiments to answer them. Now it's your turn to explore!
📓 Practice Notebooks
Master statistical analysis and visualization with hands-on exercises
Practice SciPy statistics, Seaborn visualization, FDR correction, and complete end-to-end analysis
Next Steps with Python
Your Journey Continues
🎉 You've Completed the Core Python Course!
You now have the fundamental skills to analyze biological data with Python.
But this is just the beginning of your coding journey...
💡 The Best Way to Learn: Apply Your Skills
Reading tutorials won't make you a programmer - solving real problems will.
Start Small
Automate repetitive tasks in your lab work
Build Projects
Analyze your own research data
Keep Challenging
Tackle increasingly complex problems
🧬 Bioinformatics Pathways
🧬Next-Generation Sequencing (NGS) Analysis
RNA-seq, ChIP-seq, ATAC-seq, single-cell sequencing
Biopython
Sequence parsing, alignment, BLAST
pysam
BAM/SAM file manipulation
scanpy
Single-cell RNA-seq analysis
🔬Image Analysis
Microscopy, cell counting, segmentation, feature extraction
scikit-image
Image processing algorithms
CellProfiler
High-throughput cell analysis
napari
Multi-dimensional image viewer
🤖Machine Learning & AI
Predictive modeling, classification, deep learning for biology
scikit-learn
Classical ML algorithms
PyTorch
Deep learning framework
AlphaFold
Protein structure prediction
🛠️ Level Up Your Development Skills
Move Beyond Colab
Terminal Skills
Learn bash commands: cd, ls, grep, find
Version Control
Git Basics
Track changes: git add, commit, push
Essential for any serious project
Why It Matters
Backup your work, collaborate with others, showcase your skills
You're Ready to Code
You've learned the fundamentals. Now the real learning begins through building, breaking, and fixing real projects.
✓ You can load and manipulate data with pandas
✓ You can perform statistical analysis with scipy
✓ You can create visualizations with matplotlib and seaborn
✓ You can read documentation and learn new packages independently
These skills are transferable - whether you pursue academia, industry, or clinical research, Python will amplify your impact.
Keep coding. Keep learning. Keep building. 💚