Data Analysis with Python
for Cancer Research

Master Object-Oriented Programming and Pandas to analyze real cancer genomics data

🏗️ Object-Oriented Programming

Classes & Objects

Create reusable code structures to model biological entities

Methods & Attributes

Add behaviors and properties to your biological data

Code Organization

Build clean, maintainable programs for scientific analysis

🐼 Pandas Data Analysis

DataFrames & Series

Work with structured datasets from real cancer research

Filtering & Selection

Extract specific cancer types and gene subsets efficiently

Statistical Analysis

Calculate means, sort data, and identify essential genes

🧬 Real-World Application

CRISPR-Cas9 Data

Analyze cutting-edge gene knockout experiments from cancer research

Essential Gene Discovery

Identify genes critical for cancer cell survival

Comparative Analysis

Compare breast vs myeloid cancer dependencies

🎯 What You'll Accomplish Today

1.

Build Python Classes

Create Gene and Protein classes with biological methods

2.

Master Pandas Basics

Load, explore, and understand large biological datasets

3.

Analyze Cancer Data

Filter, calculate statistics, and rank essential genes

4.

Discover Biological Insights

Identify common vs cancer-specific gene dependencies

📚 Your Python Journey So Far

✓ Lecture 1: Python Fundamentals
✓ Lecture 2: DNA Analysis & Strings
→ Lecture 3: OOP & Data Analysis
Lecture 4: Statistical Analysis
Lecture 5: Advanced Visualization

Recap: Our Python Journey So Far

📚 Lecture 1: Python Basics

Variables & Data Types

  • int: whole numbers (42, -10)
  • float: decimals (3.14, 2.718)
  • str: text ("ATGCGTA")
  • bool: True/False

Core Operations

  • Arithmetic: +, -, *, /, %
  • Comparisons: ==, !=, <, >
  • String operations: +, *, len()

Biological Application

Calculated GC content, buffer concentrations, and basic sequence analysis

🧬 Lecture 2: DNA Analysis

String Manipulation

  • Slicing: dna[0:3]
  • Finding: dna.find("ATG")
  • Iteration: for base in dna
  • Range: range(0, len(dna), 3)

Functions & Conditionals

  • def function_name():
  • if/elif/else
  • return values
  • Defensive programming

Dictionaries

  • Codon tables: {'ATG': 'M'}
  • Key-value pairs
  • dict.get(key)
  • key in dict

ORF Finder Project

Built a complete Open Reading Frame finder with start codon detection and translation

🎯 Key Programming Concepts Mastered

🔄

Iteration

Process sequences systematically with for loops

🎯

Decision Making

Handle different cases with if statements

📦

Abstraction

Package code into reusable functions

🚀 Today: Data Analysis with Pandas

We'll apply everything we've learned to analyze real biological datasets, rank gene dependencies in cancer data, and discover patterns in experimental results!

The Problem: A Cancer Researcher with Big Data 📊

Sarah, Postdoc studying cancer cell dependencies

The Challenge: DepMap Dataset

Sarah is analyzing the Cancer Dependency Map (DepMap) - a massive dataset showing which genes cancer cells need to survive. But the data is overwhelming!

📊 Dataset dimensions:

  • 1,200+ cancer cell lines (rows)
  • 30,000+ gene dependencies (columns)
  • 36 million data points total!
  • Excel crashes trying to open it

🔬 Her Question:

"Which genes show dependency patterns across different cancer cell lines that are similar to those of the ATR checkpoint kinase?"

💡 Solution: Pandas DataFrames!

Handle millions of data points effortlessly and find patterns in seconds

Today's Learning Journey 🚀

📦

Part 1: Package Ecosystem

Import & Modules

  • import package, import package as alias, from package import name
  • Python's package manager (pip)
  • Finding the right tools
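
The three import forms listed above can be compared side by side. This is a minimal sketch that uses only standard library modules; the GC fractions are made-up illustrative values, not real data.

```python
# Form 1: import the whole module and use its name as a prefix
import math

# Form 2: import the module under a shorter alias
import random as rnd

# Form 3: import a single name directly from a module
from statistics import mean

gc_fractions = [0.41, 0.52, 0.47]  # illustrative values only

root = math.sqrt(16)               # called with the module name as prefix
rnd.seed(42)                       # called with the alias as prefix
picked = rnd.choice(gc_fractions)
average = mean(gc_fractions)       # no prefix needed

print(root, picked, round(average, 3))
```

The alias form is the one used for pandas throughout this lecture (import pandas as pd), because it keeps the prefix short while still showing where each function comes from.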

🎯 Goal: Understand how Python's vast ecosystem helps scientists

🛠️

Part 2: Standard Library

Built-in Power Tools

  • pathlib - File system navigation
  • csv - Data file handling
  • math - Scientific calculations
  • random - Simulations

🎯 Goal: Master Python's built-in tools for scientific computing

🏗️

Part 3: OOP & APIs

Working with Objects

  • Objects have methods: df.mean()
  • Objects have attributes: df.shape
  • Chain operations: df.sort_values('column').head(10)

🎯 Goal: Understand how to use powerful data objects

🐼

Part 4: Pandas Practical

Real Data Analysis

  • Load the DepMap dataset
  • Filter cancer cell lines
  • Calculate correlations
  • Find gene dependencies

🎯 Goal: Solve Sarah's research question with real data!

🎯 Today's Superpower

Transform from working with single sequences to analyzing entire datasets - from 100s of data points to millions with the same ease!

Single DNA → Genome Databases | One Cell → 1000s of Cell Lines

Part 1 & 2

Python Package Ecosystem & Standard Library Modules
Working with Open Source Code

Leverage thousands of scientific tools built by the community

📦 The Power of Packages

Transform complex tasks into simple commands:

Without Packages (100+ lines)

# Manual correlation calculation
for gene1 in genes:
    for gene2 in genes:
        # Complex math...
        # More complex math...
        # Even more math...
        pass  # placeholder so the loop body is valid Python

With Pandas (1 line) ✨

# Calculate all correlations at once
correlations = df.corr()
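
To make the contrast concrete, here is a small sketch that computes one Pearson correlation by hand and then checks it against df.corr(). The gene names GENE_A and GENE_B and their scores are hypothetical values invented for this illustration.

```python
import math
import pandas as pd

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up gene effect scores for two hypothetical genes across five cell lines
scores = pd.DataFrame({
    "GENE_A": [-1.2, -0.8, -0.1, 0.0, -0.5],
    "GENE_B": [-1.0, -0.9, -0.2, 0.1, -0.4],
})

manual = pearson(scores["GENE_A"].tolist(), scores["GENE_B"].tolist())

# corr() returns a gene-by-gene matrix; pick out the one pair we computed by hand
from_pandas = scores.corr().loc["GENE_A", "GENE_B"]

print(round(manual, 4), round(from_pandas, 4))
```

Both numbers should agree. The manual version handles a single pair, while corr() fills in the whole matrix for every pair of columns in one call.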

Let's explore how to tap into this incredible ecosystem!

What Are Python Packages? 🧺

"Imagine you land in Japan and get your only shirt dirty on the plane. You have two choices..."

🏠 Do It Yourself

You need to:

  • Find a laundromat
  • Learn Japanese instructions
  • Get coins & detergent
  • Wait and monitor cycles
  • Handle every step yourself

📝 Writing Your Own Code

# Manual correlation calculation
sum_x = 0
sum_y = 0
sum_xy = 0
# ... 50+ more lines of math
# ... handle edge cases
# ... normalize results

🏨 Hotel Service (API)

You just:

  • Hand dirty clothes to receptionist
  • Say "clean please"
  • Get clean clothes back
  • Don't worry about HOW
  • Trust the service to handle it

📦 Using a Package

import pandas as pd
# Let pandas handle everything
df = pd.read_csv('data.csv')
result = df.corr() # Done!

🎯 The Power of Python Objects & APIs

Objects Are Like Hotel Services

• They have methods (services they provide)
• They have attributes (properties you can check)
• You don't need to know HOW they work inside

Example: DataFrame Object

df.mean() # Method: calculates mean
df.shape # Attribute: (rows, cols)
df.plot() # Method: creates a graph

"Good design is obvious. Great design is transparent." - Joe Sparano

Two Types of Python Packages 📦

Python comes with batteries included, but the real power is in the community

🔋

Standard Library

Built into every Python installation

📖 Browse Documentation

These are already included in your Python installation:

  • os - Operating system interface
  • pathlib - File path handling
  • csv - CSV file reading/writing
  • math - Mathematical functions
  • random - Random number generation
  • json - JSON data handling

✅ Advantages:

  • Always available
  • No installation needed
  • Well-tested & stable
  • Works in any environment

import csv
import math
import random
# Ready to use immediately!

🌐

PyPI Packages

Community-built powerhouses

🔍 Search PyPI

Specialized tools for science:

  • pandas - Data analysis & manipulation
  • numpy - Numerical computing
  • matplotlib - Data visualization
  • biopython - Biological computing
  • scipy - Scientific computing
  • seaborn - Statistical plots

⚡ Supercharged capabilities:

  • Handle millions of data points
  • Advanced scientific algorithms
  • Domain-specific features
  • Active development

# Needs installation first:
# uv add pandas numpy
import pandas as pd
import numpy as np

💻 Local Development (Advanced)

For building real applications with uv:

# Create new project
uv init my-bio-project
cd my-bio-project
# Add scientific packages
uv add pandas numpy matplotlib
# Run your script
uv run analysis.py

uv handles virtual environments & dependencies automatically!

☁️ Google Colab (Recommended)

Most scientific packages pre-installed:

# Just import and use!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# If you need something special:
!pip install biopython

Perfect for learning - no setup required!

🎯 For this course: We'll use both! Standard library for basics, PyPI packages (especially Pandas) for powerful data analysis.

Standard Library Example: Random 🎲

Perfect for biological simulations - no installation needed!

🎯 What is `random`?

Built-in capabilities:

  • Generate random numbers
  • Sample from populations
  • Shuffle sequences
  • Pick random choices
  • Statistical distributions

Perfect for biology:

  • Random DNA sequences
  • Monte Carlo simulations
  • Bootstrap sampling
  • Mutation modeling
  • Population genetics

import random
# Always available in Python!
# No pip install needed

🧬 Example 1: Genetic Drift & Sample Randomization

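A minimal sketch of what a drift simulation and a sample randomization could look like with the random module. The population size, allele labels, number of generations, and sample names are all assumptions chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def next_generation(population):
    """Draw a new population of the same size by sampling alleles with replacement."""
    return random.choices(population, k=len(population))

# Hypothetical starting population: 10 copies each of alleles "A" and "a"
population = ["A"] * 10 + ["a"] * 10

for generation in range(1, 6):
    population = next_generation(population)
    freq_A = population.count("A") / len(population)
    print(f"Generation {generation}: frequency of A = {freq_A:.2f}")

# Sample randomization: split 8 hypothetical samples into two groups of 4
samples = [f"sample_{i}" for i in range(1, 9)]
random.shuffle(samples)  # shuffles the list in place
treatment, control = samples[:4], samples[4:]
print("Treatment:", treatment)
print("Control:  ", control)
```

Sampling with replacement is what lets allele frequencies wander from one generation to the next, even though no allele has an advantage.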

🧪 Example 2: DNA Point Mutations

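One way to model point mutations is to visit each base and substitute it with a small probability. The function name mutate, the example sequence, and the mutation rate below are assumptions made for this sketch.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

BASES = "ATCG"

def mutate(sequence, rate):
    """Return a copy of sequence where each base is substituted with probability rate."""
    mutated = []
    for base in sequence:
        if random.random() < rate:
            # Choose a replacement that differs from the original base
            mutated.append(random.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)

original = "ATGCGTACGTTAGC"          # made-up sequence
mutant = mutate(original, rate=0.2)  # rate chosen only for illustration

differences = [i for i, (a, b) in enumerate(zip(original, mutant)) if a != b]
print(original)
print(mutant)
print("Mutated positions:", differences)
```

Strings are immutable in Python, so the function collects the bases in a list and joins them at the end instead of changing the sequence in place.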

🔧 Most Useful Methods for Biology

Numbers

random.random() # 0.0 to 1.0
random.randint(1,6) # 1 to 6
random.uniform(0,100) # float range

Selections

random.choice(list) # pick one
random.choices(list,k=3) # pick 3 (replacement)
random.sample(list,3) # pick 3 (no replacement)

Shuffling

random.shuffle(list) # shuffle in-place
random.seed(42) # reproducible results

Part 3

Python Classes & Objects
Building Your Own Data Types

Create custom objects to organize complex biological data and functionality

🧬 From Data to Objects

Bundle data with the functions that work on it:

Separate Functions & Data

sequence = "ATCGATCG"
gc = calculate_gc_content(sequence)
rev_comp = reverse_complement(sequence)
# Data and functions are separate

Object-Oriented Approach ✨

dna = DNASequence("ATCGATCG")
gc = dna.gc_content()
rev_comp = dna.reverse_complement()
# Data and methods together!

This is exactly how pandas DataFrames work - let's build our own!

Classes: Storing Data

Classes let you create custom data containers with named attributes

🧬 Creating a DNASequence Class - Try It!

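A minimal sketch of a class that only stores data. The attribute names name and sequence, and the two example sequences, are assumptions made for illustration.

```python
class DNASequence:
    """A container for one named DNA sequence."""

    def __init__(self, name, sequence):
        # __init__ runs automatically when DNASequence(...) is called
        self.name = name          # stored on this particular object
        self.sequence = sequence

gene1 = DNASequence("gene1", "ATGCGTA")
gene2 = DNASequence("gene2", "ATGAAATTT")

print(gene1.name, gene1.sequence)
print(gene2.name, gene2.sequence)

# Changing one object leaves the other untouched
gene1.sequence = "ATGCCC"
print(gene2.sequence)  # still ATGAAATTT
```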

🏗️ __init__ method

Runs automatically when you create a new object

Think of it as the "setup" function that stores your initial data

🔑 self keyword

Refers to "this particular object"

💡 Confused me at first too - just Python's way of saying "this one"!

📦 Each object stores its own copy of the data - gene1 and gene2 are separate!

Classes: Adding Methods

Classes can also include methods - functions that work with the stored data

⚙️ Adding Functionality to Our DNASequence - Try It!

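A sketch of the same kind of class with three methods added. The method names follow the ones used on this slide (length, gc_content, reverse_complement); the implementation details are one reasonable choice and assume the sequence contains only A, T, C, and G.

```python
class DNASequence:
    """A DNA sequence bundled with methods that operate on it."""

    COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence.upper()

    def length(self):
        return len(self.sequence)

    def gc_content(self):
        """Percentage of bases that are G or C (0.0 for an empty sequence)."""
        if not self.sequence:
            return 0.0
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence) * 100

    def reverse_complement(self):
        # Read the sequence backwards and swap each base for its partner
        return "".join(self.COMPLEMENT[base] for base in reversed(self.sequence))

my_gene = DNASequence("example", "ATCGATCG")
print(my_gene.length())              # 8
print(my_gene.gc_content())          # 50.0
print(my_gene.reverse_complement())  # CGATCGAT
```

Every method takes self as its first parameter, which is how it reaches the data stored by __init__.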

🎯 This is Exactly How Pandas Works!

Our DNASequence Class

my_gene.length()
my_gene.gc_content()
my_gene.reverse_complement()

Pandas DataFrame Class

df.head()
df.describe()
df.groupby('gene')

Same pattern: object.method() - the object knows how to work with its own data!

✨ Methods make your code readable: gene.gc_content() vs calculate_gc(gene_sequence)

Part 4

Enter Pandas
The Library That Changed Data Science

From spreadsheets to data science revolution - now with hands-on biological data!

📊 A Brief History: Why Pandas Exists

❌ Before Pandas (pre-2008)

Data analysis meant:

  • Excel for small datasets
  • R for statistics
  • SQL for databases
  • MATLAB for matrices
  • Separate tools = messy workflows!

🚀 Pandas Revolution (2008-now)

One library to rule them all:

  • Read any file format
  • Clean messy data
  • Statistical analysis
  • Visualization
  • All in Python!

🧬 For Biology Today

Perfect for biological data:

  • Gene expression matrices
  • Clinical trial data
  • DNA sequence analysis
  • Protein structures
  • Publication-ready plots

🎯 Remember Our Classes?

# Classes prepare you for this!
df.head() # DataFrame method - just like our DNASequence.length()
df.groupby('gene') # DataFrame method - objects that know their data
df.describe() # DataFrame method - built-in functionality

Wes McKinney
Creator of Pandas (2008)

📚 Learn More About Pandas

Official documentation with comprehensive guides and API reference

Pandas Documentation

Our Dataset: DepMap CRISPR

🧬 The Cancer Dependency Map (DepMap)

A massive project by the Broad Institute to find cancer's weaknesses across 1000+ cancer cell lines

🔗 depmap.org

🔬 How CRISPR Works Here

1. Take cancer cell lines

2. Use CRISPR to knock out each gene

3. Measure: Do cells die or survive?

4. Repeat for ~20,000 genes!

📊 Gene Effect Scores

Negative score: Gene is essential

~0 score: Gene not important

Positive: Gene inhibits growth

Scale: -1.0 = typical essential gene

📋 What Our Data Looks Like

# Our dataset: Breast vs Myeloid cancers
# Metadata columns + gene effect scores
model_id    cell_line_name  oncotree_lineage  oncotree_primary_disease   A1BG    A1CF    A2M
ACH-000004  HEL             Myeloid           Acute Myeloid Leukemia     0.005   -0.069  -0.098
ACH-000017  SK-BR-3         Breast            Invasive Breast Carcinoma  -0.032  -0.102  -0.013
ACH-000019  MCF7            Breast            Invasive Breast Carcinoma  0.036   0.018   0.095
ACH-000028  KPL-1           Breast            Invasive Breast Carcinoma  -0.188  -0.149  0.077
...         ...             ...               ...                        ...     ...     ...
# We'll compare: Are certain genes more essential in breast vs myeloid cancers?
# Perfect for groupby analysis!
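
As a small preview of the groupby idea, the sketch below rebuilds the four rows shown above by hand and averages each gene per lineage. The real file has far more rows and gene columns, so the numbers here only illustrate the mechanics.

```python
import pandas as pd

# The four rows shown above, typed in by hand
df = pd.DataFrame({
    "model_id": ["ACH-000004", "ACH-000017", "ACH-000019", "ACH-000028"],
    "cell_line_name": ["HEL", "SK-BR-3", "MCF7", "KPL-1"],
    "oncotree_lineage": ["Myeloid", "Breast", "Breast", "Breast"],
    "oncotree_primary_disease": [
        "Acute Myeloid Leukemia",
        "Invasive Breast Carcinoma",
        "Invasive Breast Carcinoma",
        "Invasive Breast Carcinoma",
    ],
    "A1BG": [0.005, -0.032, 0.036, -0.188],
    "A1CF": [-0.069, -0.102, 0.018, -0.149],
    "A2M": [-0.098, -0.013, 0.095, 0.077],
})

gene_columns = ["A1BG", "A1CF", "A2M"]

# One row per lineage, one column per gene, each cell the mean gene effect
lineage_means = df.groupby("oncotree_lineage")[gene_columns].mean()
print(lineage_means.round(3))
```

Selecting the gene columns before calling mean() keeps the text columns (IDs and names) out of the calculation.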

🎯 Why This Matters for Drug Discovery

Find Targets

Genes essential in cancer but not normal cells

Personalize Treatment

Different cancers = different vulnerabilities

Save Lives

Turn data into new cancer drugs

🚀 Let's explore this data with pandas and uncover cancer's secrets!

First Steps: Loading & Inspecting Data

📥 Step 1: Load the Data

import pandas as pd
# Load our DepMap CRISPR data
df = pd.read_csv('depmap_breast_myeloid.csv')
print("Data loaded successfully!")
print(f"Shape: {df.shape}") # (rows, columns)

🔍 Step 2: Inspect Your Data

Essential DataFrame Methods

# Get basic info
df.shape # (rows, columns)
df.columns # column names
df.dtypes # data types
df.info() # comprehensive overview
# Peek at the data
df.head() # first 5 rows
df.head(10) # first 10 rows
df.tail() # last 5 rows

Statistical Summary

# Get statistics
df.describe() # numerical summaries
df.nunique() # unique values per column
# Check for missing data
df.isnull().sum() # count NaN values
df.isna().sum() # same as above
# Sample some rows
df.sample(5) # 5 random rows

👀 What You'll Discover

Dataset Size

How many cell lines? How many genes?

Data Quality

Any missing values? Clean data?

Cancer Types

How many breast vs myeloid cancers?
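
Each of the three questions above maps onto one or two of the inspection methods. The sketch below uses a hand-typed four-row excerpt of the dataset (a subset of the columns shown earlier), so the answers it prints describe only that excerpt.

```python
import pandas as pd

# Hand-typed excerpt: four cell lines, two gene columns
df = pd.DataFrame({
    "cell_line_name": ["HEL", "SK-BR-3", "MCF7", "KPL-1"],
    "oncotree_lineage": ["Myeloid", "Breast", "Breast", "Breast"],
    "A1BG": [0.005, -0.032, 0.036, -0.188],
    "A1CF": [-0.069, -0.102, 0.018, -0.149],
})

# Dataset size: rows are cell lines, columns are metadata plus genes
n_rows, n_cols = df.shape
print(f"{n_rows} cell lines, {n_cols} columns")

# Data quality: total number of missing values in the whole table
n_missing = int(df.isna().sum().sum())
print(f"{n_missing} missing values")

# Cancer types: number of cell lines per lineage
lineage_counts = df["oncotree_lineage"].value_counts()
print(lineage_counts)
```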

🎯 Remember: DataFrame = Powerful Class!

Our DNASequence class:

my_dna.length()

Pandas DataFrame class:

df.head(), df.describe()

Same pattern - objects that know how to work with their data!

🚀 Ready to Get Hands-On?

Time to explore real cancer data with Pandas!

Open in Colab

Our Research Question: Essential Genes

🧬 What are the top 10 most essential genes in breast and myeloid cancers?

Let's discover which genes are critical for cancer cell survival!

🎯 Why This Question Matters

🎯 Drug Targets

Essential genes = potential therapeutic targets

🧬 Cancer Biology

Understand what keeps cancer cells alive

⚕️ Precision Medicine

Different cancers = different vulnerabilities

🗺️ Our Analysis Roadmap

1

Filter Breast Cancer Data

Extract only breast cancer cell lines from the dataset

2

Calculate Mean Gene Effects

Average gene scores across all breast cancer cell lines

3

Sort & Select Top 10

Find the most negative scores (most essential genes)

4

Repeat for Myeloid Cancer

Same process: filter → mean → sort → top 10

5

Compare & Visualize

Compare top 10 lists - what's different between cancer types?

🚀 Advanced Option:

Later we'll learn df.groupby() to analyze both cancer types at once!

⚙️ Pandas Techniques We'll Learn

🔍 Data Selection:

  • df.loc[condition] - filter rows by condition
  • df.loc[:, 'column'] - select columns by name
  • df.iloc[0:5] - select rows by position

📊 Analysis:

  • .mean() - calculate averages
  • .sort_values() - ranking data
  • .head(10) - top results
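
The selection patterns above can be tried on a tiny table first. This sketch uses a hand-typed excerpt of the dataset with a single gene column; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cell_line_name": ["HEL", "SK-BR-3", "MCF7", "KPL-1"],
    "oncotree_lineage": ["Myeloid", "Breast", "Breast", "Breast"],
    "A1BG": [0.005, -0.032, 0.036, -0.188],
})

# Rows by condition: a boolean Series inside .loc keeps the rows marked True
breast = df.loc[df["oncotree_lineage"] == "Breast"]

# Columns by name: ":" means "all rows"
a1bg = df.loc[:, "A1BG"]

# Rows by position: like list slicing, the end position is excluded
first_two = df.iloc[0:2]

# Rows and columns together in one call
breast_a1bg = df.loc[df["oncotree_lineage"] == "Breast", "A1BG"]
print(breast_a1bg.mean())
```

.loc works with labels and conditions, while .iloc works with integer positions. Mixing them up is a common source of confusion.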

🚀 Let's Discover Cancer's Secrets!

Time to dive into Google Colab and analyze real cancer dependency data

Open DepMap Analysis Notebook

Step 1: Filtering DataFrames

🎯 The Power of Boolean Filtering

Select rows that meet specific conditions - like SQL WHERE or Excel filters, but more powerful!

📖 General Filtering Patterns

Single Condition

# Basic pattern: df[condition]
# Equals
df[df['column'] == 'value']
# Greater than
df[df['age'] > 30]
# String contains
df[df['gene'].str.contains('BRCA')]

Multiple Conditions

# Use & (and), | (or), ~ (not)
# Note: Need parentheses!
# AND condition
df[(df['age'] > 30) & (df['sex'] == 'F')]
# OR condition
df[(df['type'] == 'A') | (df['type'] == 'B')]
# NOT condition
df[~df['column'].isnull()]

🧬 Filtering Our Cancer Data

# Filter for breast cancer cell lines
breast_df = df[df['oncotree_lineage'] == 'Breast']
print(f"Found {len(breast_df)} breast cancer cell lines")
# Filter for myeloid cancer cell lines
myeloid_df = df[df['oncotree_lineage'] == 'Myeloid']
print(f"Found {len(myeloid_df)} myeloid cancer cell lines")
# Advanced: Get both types in one DataFrame
both_types = df[(df['oncotree_lineage'] == 'Breast') |
                (df['oncotree_lineage'] == 'Myeloid')]
# Alternative using .isin()
cancer_types = ['Breast', 'Myeloid']
both_types = df[df['oncotree_lineage'].isin(cancer_types)]

🔧 Useful Filtering Methods

String Methods

  • .str.contains('text')
  • .str.startswith('A')
  • .str.upper()

Null Handling

  • .isnull()
  • .notna()
  • .dropna()

Value Checks

  • .isin(['A', 'B'])
  • .between(0, 100)
  • .duplicated()
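
A few of these methods in action, as a sketch on a hand-typed excerpt of the dataset. The range passed to between() is an arbitrary choice for illustration, not a biological threshold.

```python
import pandas as pd

df = pd.DataFrame({
    "cell_line_name": ["HEL", "SK-BR-3", "MCF7", "KPL-1"],
    "oncotree_lineage": ["Myeloid", "Breast", "Breast", "Breast"],
    "A1CF": [-0.069, -0.102, 0.018, -0.149],
})

# String method: cell line names starting with "K"
starts_with_k = df[df["cell_line_name"].str.startswith("K")]

# Value check: lineage is one of several allowed values
selected = df[df["oncotree_lineage"].isin(["Breast", "Myeloid"])]

# Value check: score within a range (both ends are included by default)
in_range = df[df["A1CF"].between(-0.11, 0.0)]

# Null handling: keep only rows where the score is present
present = df[df["A1CF"].notna()]

print(list(in_range["cell_line_name"]))
```

Each method returns a boolean Series, so any of them can be dropped into df[...] or combined with & and |.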

⚠️ Common Gotcha

❌ Wrong (Python and/or)

df[df['x'] > 5 and df['y'] < 10]

✅ Correct (Pandas &/|)

df[(df['x'] > 5) & (df['y'] < 10)]

Always use & | ~ with parentheses!
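
The sketch below reproduces the gotcha on a tiny made-up table, catching the error so the script keeps running, and then shows the working version.

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 6, 9], "y": [12, 8, 4]})

# Python's "and" needs a single True/False, but each comparison yields a whole Series
try:
    df[df["x"] > 5 and df["y"] < 10]
    error_message = None
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

# "&" combines the two boolean Series element by element
filtered = df[(df["x"] > 5) & (df["y"] < 10)]
print(filtered)
```

The parentheses matter because & binds more tightly than > and <. Without them, Python would try to evaluate 5 & df["y"] first.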

🚀 Practice Time!

Head to our Google Colab notebook to practice filtering with real cancer data

Practice Pandas Filtering

Step 2: Calculating Statistics

📊 From Individual Scores to Summary Statistics

Transform hundreds of cell line measurements into meaningful averages for each gene

🔢 Essential Statistical Methods

Column-wise Statistics

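# Calculate mean for all numeric columns
# (numeric_only=True skips text columns such as cell_line_name)
df.mean(numeric_only=True)
# Mean for specific columns
df[['A1BG', 'A1CF', 'A2M']].mean()
# Other useful statistics
df.median(numeric_only=True) # Middle value
df.std(numeric_only=True) # Standard deviation
df.var(numeric_only=True) # Variance
df.min(numeric_only=True) # Minimum values
df.max(numeric_only=True) # Maximum values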

Selecting Gene Columns

# Get all columns except metadata
metadata_cols = ['model_id', 'cell_line_name',
                 'oncotree_lineage', 'oncotree_primary_disease']
gene_columns = [col for col in df.columns
                if col not in metadata_cols]
# Or use column slicing
gene_columns = df.columns[4:] # Skip first 4 metadata columns
# Calculate means for just gene columns
gene_means = df[gene_columns].mean()

🧬 Our Gene Analysis Workflow

# Step 1: Filter for breast cancer (we already learned this!)
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
print(f"Found {len(breast_df)} breast cancer cell lines")
# Step 2: Select only gene effect columns
metadata_cols = ['model_id', 'cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease']
gene_columns = [col for col in df.columns if col not in metadata_cols]
print(f"Analyzing {len(gene_columns)} genes")
# Step 3: Calculate mean gene effect for each gene across all breast cancer lines
breast_gene_means = breast_df[gene_columns].mean()
print("Sample of gene means:")
print(breast_gene_means.head())
# Example output:
# A1BG -0.021
# A1CF -0.088
# A2M -0.013
# A2ML1 0.041
# A4GALT -0.003

💡 Understanding Your Results

🎯 Negative Means

Example: A1CF = -0.088

Knocking the gene out reduces cell survival on average. At -0.088 the effect is weak; a typical essential gene scores around -1.0

⚪ Near Zero

Example: A4GALT = -0.003

Gene has little effect - not essential for survival

📈 Positive Means

Example: A2ML1 = 0.041

Gene may inhibit growth - knocking out helps cells grow

🔄 What's Next?

Now we have mean gene effect scores for breast cancer

Next: Sort these means to find the top 10 most essential genes!

🧮 Practice Statistics!

Master pandas statistics with hands-on calculations using real data

Practice Statistical Analysis with Pandas

Step 3: Sorting Data

📊 Finding the Most Important Data

Sort your data to discover patterns, rankings, and extremes - perfect for finding essential genes!

🔧 Sorting with .sort_values()

Basic Sorting

# Sort by one column (ascending by default)
df.sort_values('gene_effect')
# Sort descending (highest to lowest)
df.sort_values('gene_effect', ascending=False)
# Sort ascending and take the first 10 (lowest, most negative values)
df.sort_values('gene_effect').head(10)
# Sort ascending and take the last 10 (highest values)
df.sort_values('gene_effect').tail(10)

Advanced Sorting

# Sort by multiple columns
df.sort_values(['lineage', 'gene_effect'])
# Mixed sort directions
df.sort_values(['lineage', 'gene_effect'],
               ascending=[True, False])
# Sort Series (for our gene means)
gene_means.sort_values(ascending=False)
# Reset index after sorting
df.sort_values('gene_effect').reset_index(drop=True)

🧬 Finding Essential Genes in Our Data

# After filtering and calculating means for breast cancer
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
gene_columns = df.columns[4:] # All gene columns (skips the 4 metadata columns)
breast_means = breast_df[gene_columns].mean()
# Sort to find most essential genes (most negative scores)
most_essential = breast_means.sort_values(ascending=True)
print("Top 10 most essential genes in breast cancer:")
print(most_essential.head(10))
# Or find genes that inhibit growth (most positive)
growth_inhibitors = breast_means.sort_values(ascending=False)
print("Top 10 growth inhibitor genes:")
print(growth_inhibitors.head(10))

💡 Key Sorting Concepts

📈 Ascending vs Descending

  • ascending=True: 1, 2, 3...
  • ascending=False: 3, 2, 1...
  • Essential genes = most negative!

🎯 Getting Top Results

  • .head(n): first n rows
  • .tail(n): last n rows
  • .nlargest(n): top n values
  • .nsmallest(n): bottom n values (most negative)

⚙️ Sorting Tips

  • Sort doesn't modify original data
  • Use inplace=True to modify
  • Can sort any column type

🎯 Our Goal: Top 10 Essential Genes

Remember: In CRISPR data, negative scores = essential genes

Most essential = most negative = sort ascending + head(10)
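
A short sketch of that rule, using the five mean gene effects from the example output in Step 2. With only five genes it takes the top 3 instead of the top 10.

```python
import pandas as pd

# Mean gene effects copied from the example output in Step 2
gene_means = pd.Series({
    "A1BG": -0.021,
    "A1CF": -0.088,
    "A2M": -0.013,
    "A2ML1": 0.041,
    "A4GALT": -0.003,
})

# Ascending sort puts the most negative (most essential) genes first
most_essential = gene_means.sort_values(ascending=True).head(3)

# nsmallest(n) is a shortcut that returns the same n lowest values
same_result = gene_means.nsmallest(3)

print(most_essential)

# The original Series keeps its order because sort_values returns a new object
print(list(gene_means.index))
```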

🚀 Practice Sorting!

Master data sorting with hands-on practice using real cancer data

Practice Pandas Sorting

Analysis Summary & Results

🔬 Our Analysis Workflow

📋

Step 1: Filter Data

Selected breast & myeloid cancer cell lines from CRISPR dataset

📊

Step 2: Calculate Statistics

Computed mean gene effect scores for each cancer type

🏆

Step 3: Sort & Rank

Identified top 10 most essential genes per cancer type

🔬 Try the Full Analysis

Run all the steps yourself with the complete DepMap notebook

Open in Colab

🎗️ Breast Cancer - Top Essential Genes

 1. RAN     -4.1840
 2. HSPE1   -3.4315
 3. SNRPF   -3.1414
 4. SMU1    -3.0940
 5. PSMA6   -3.0273
 6. SNRPA1  -2.9927
 7. RRM1    -2.9468
 8. PCNA    -2.9238
 9. PLK1    -2.9126
10. SF3B5   -2.9118

🩸 Myeloid Cancer - Top Essential Genes

 1. RAN     -3.9426
 2. HSPE1   -3.5099
 3. RPL17   -3.2110
 4. RPS8    -2.9228
 5. RPS29   -2.8929
 6. RRM1    -2.8591
 7. PLK1    -2.8287
 8. RPS19   -2.7984
 9. UBL5    -2.7908
10. PSMA6   -2.7539

🔍 Key Observations

🤝 Common Essential Genes

  • RAN - Nuclear transport, cell division
  • HSPE1 - Protein folding chaperone
  • RRM1 - DNA replication
  • PLK1 - Cell cycle regulation
  • PSMA6 - Protein degradation

🎯 Cancer-Specific Patterns

Breast Cancer:

RNA splicing genes (SNRPF, SNRPA1, SF3B5)

Myeloid Cancer:

Ribosomal proteins (RPL17, RPS8, RPS29, RPS19)

💡 Different cancer types depend on distinct cellular pathways!
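
The comparison of the two top-10 lists can be done with Python sets. This sketch types in the gene names from the two result lists above; the variable names are chosen for illustration.

```python
# Gene names copied from the two top-10 lists above
breast_top10 = ["RAN", "HSPE1", "SNRPF", "SMU1", "PSMA6",
                "SNRPA1", "RRM1", "PCNA", "PLK1", "SF3B5"]
myeloid_top10 = ["RAN", "HSPE1", "RPL17", "RPS8", "RPS29",
                 "RRM1", "PLK1", "RPS19", "UBL5", "PSMA6"]

common = set(breast_top10) & set(myeloid_top10)        # in both lists
breast_only = set(breast_top10) - set(myeloid_top10)   # only in the breast list
myeloid_only = set(myeloid_top10) - set(breast_top10)  # only in the myeloid list

print("Common:      ", sorted(common))
print("Breast only: ", sorted(breast_only))
print("Myeloid only:", sorted(myeloid_only))
```

Sets ignore order and rank, so this answers which genes are shared but not how strongly each one scores.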

🤔 Research Questions & Next Steps

❓ Questions to Explore

  • Why are 5 genes common across cancer types?
  • What pathways are breast-specific vs myeloid-specific?
  • Do these genes interact in regulatory networks?
  • Are there druggable targets among these genes?
  • How do these relate to clinical outcomes?

🚀 Advanced Analysis Techniques

  • GSEA - Gene Set Enrichment Analysis
  • Network Analysis - Protein-protein interactions
  • Correlation Analysis - Gene expression patterns
  • Visualization - Heatmaps, networks, volcano plots
  • Machine Learning - Predictive models

🧠 Biological Insights

Common essential genes represent fundamental cellular processes required by both cancer types analyzed here

Cancer-specific genes reveal unique vulnerabilities that could be targeted therapeutically

🎯 This analysis provides a roadmap for precision cancer therapy development!

📈 Coming Next

Build on today's analysis with advanced statistical methods

📊

Linear Regression Analysis

📈 Statistical Modeling

  • Gene expression correlations
  • Predictive modeling
  • R² and significance testing

🎨 Data Visualization

  • Scatter plots & regression lines
  • Heatmaps & clustering
  • Interactive plots with Plotly

Start Lecture 4: Statistical Analysis & Visualization

What We Learned Today

🎯 Today's Key Achievements

🏗️ Object-Oriented Programming

  • Understanding classes as data containers
  • Creating methods to process data
  • Building reusable, organized code structures

🐼 Pandas Data Analysis

  • Loading and exploring DataFrames
  • Filtering data with boolean indexing
  • Computing statistics and sorting results

🏗️ Object-Oriented Programming Mastery

Classes as Data Containers

class Gene:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    def get_length(self):
        return len(self.sequence)

    def get_gc_content(self):
        gc_count = self.sequence.count('G') + self.sequence.count('C')
        return gc_count / len(self.sequence) * 100

# Create and use objects
my_gene = Gene("BRCA1", "ATCGATCG")
print(f"Length: {my_gene.get_length()}")
print(f"GC%: {my_gene.get_gc_content():.1f}%")

Key OOP Benefits

📦 Organization

Group related data and functions together

🔄 Reusability

Create multiple instances with same behavior

🎯 Clarity

Self-documenting, intuitive code structure

🐼 Pandas Data Analysis Pipeline

1. Load & Explore

import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Explore structure
print(df.head())
print(df.shape)
print(df.columns)
df.info() # prints its own summary and returns None, so no print() needed

2. Filter & Select

# Boolean filtering
breast_df = df.loc[
    df['oncotree_lineage'] == 'Breast'
]
# Column selection
gene_columns = df.columns[4:]
gene_data = breast_df[gene_columns]

3. Analyze & Sort

# Calculate statistics
means = gene_data.mean()
# Sort results
top_genes = means.sort_values(
    ascending=True
).head(10)
print(top_genes)

💡 Key Takeaways

🏗️ OOP Mindset

  • Think in terms of objects with properties and behaviors
  • Use classes to model real-world entities (genes, proteins, cells)
  • Methods make your objects smart and interactive

📊 Data Analysis Workflow

  • Always explore your data first
  • Use boolean indexing for precise filtering
  • Combine statistics + sorting to find patterns

🧬 Real-World Impact: From Code to Cancer Research

Today we identified essential genes in breast and myeloid cancers using the same techniques used in real cancer research labs worldwide. Your code can now analyze datasets with millions of data points!

🎯 You're now equipped to tackle real biological big data challenges!

🎉

Congratulations!

You've mastered Object-Oriented Programming and Pandas data analysis

Practice Your Skills! 🧪

Further Resources

Continue your Pandas journey with these excellent resources

🎓

Interactive Course

Data Manipulation with Pandas

DataCamp's hands-on course with interactive exercises and real-world datasets. Perfect for learning by doing!

  • Interactive coding exercises
  • Video tutorials & immediate feedback
  • Certificate upon completion

Start DataCamp Course

📚

Practice Book

Pandas Workout

By Reuven M. Lerner. 200+ exercises to build your Pandas skills through practice.

  • Real-world data scenarios
  • Progressive difficulty levels
  • Detailed solutions included

View on Manning

🌟 Other Excellent Resources

📖 Official Docs

Comprehensive reference and tutorials

pandas.pydata.org

🎥 YouTube Tutorials

Video series by Corey Schafer

Watch playlist

📊 Kaggle Learn

Free micro-courses with real datasets

kaggle.com/learn

🎯 Keep Practicing!

The key to mastering Pandas is regular practice with real biological datasets. Start with our course notebooks and gradually work your way through these resources.