Data Analysis with Python
for Cancer Research
Master Object-Oriented Programming and Pandas to analyze real cancer genomics data
🏗️ Object-Oriented Programming
Classes & Objects
Create reusable code structures to model biological entities
Methods & Attributes
Add behaviors and properties to your biological data
Code Organization
Build clean, maintainable programs for scientific analysis
🐼 Pandas Data Analysis
DataFrames & Series
Work with structured datasets from real cancer research
Filtering & Selection
Extract specific cancer types and gene subsets efficiently
Statistical Analysis
Calculate means, sort data, and identify essential genes
🧬 Real-World Application
CRISPR-Cas9 Data
Analyze cutting-edge gene knockout experiments from cancer research
Essential Gene Discovery
Identify genes critical for cancer cell survival
Comparative Analysis
Compare breast vs myeloid cancer dependencies
🎯 What You'll Accomplish Today
Build Python Classes
Create Gene and Protein classes with biological methods
Master Pandas Basics
Load, explore, and understand large biological datasets
Analyze Cancer Data
Filter, calculate statistics, and rank essential genes
Discover Biological Insights
Identify common vs cancer-specific gene dependencies
📚 Your Python Journey So Far
Recap: Our Python Journey So Far
📚 Lecture 1: Python Basics
Variables & Data Types
- • int: whole numbers (42, -10)
- • float: decimals (3.14, 2.718)
- • str: text ("ATGCGTA")
- • bool: True/False
Core Operations
- • Arithmetic: +, -, *, /, %
- • Comparisons: ==, !=, <, >
- • String operations: +, *, len()
Biological Application
Calculated GC content, buffer concentrations, and basic sequence analysis
🧬 Lecture 2: DNA Analysis
String Manipulation
- • Slicing: dna[0:3]
- • Finding: dna.find("ATG")
- • Iteration: for base in dna
- • Range: range(0, len(dna), 3)
Functions & Conditionals
- • def function_name():
- • if/elif/else
- • return values
- • Defensive programming
Dictionaries
- • Codon tables: {'ATG': 'M'}
- • Key-value pairs
- • dict.get(key)
- • key in dict
ORF Finder Project
Built a complete Open Reading Frame finder with start codon detection and translation
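As a quick refresher, the pieces from Lecture 2 (find, slicing, range with a step of 3, and dict.get) fit together roughly like this. This is a minimal sketch, not the ORF finder from the lecture: the four-entry codon table and the translate function are assumptions made for illustration.

```python
# Tiny codon table for illustration only; a real table has 64 entries.
CODON_TABLE = {"ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*"}

def translate(dna):
    """Translate a DNA string codon by codon, stopping at a stop codon."""
    protein = ""
    # range(start, stop, step) walks the sequence three bases at a time
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" marks unknown codons
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein

dna = "CCATGGCCAAATAAGG"
start = dna.find("ATG")          # index of the first start codon, or -1
protein = translate(dna[start:]) if start != -1 else ""
print(start, protein)            # 2 MAK
```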
🎯 Key Programming Concepts Mastered
Iteration
Process sequences systematically with for loops
Decision Making
Handle different cases with if statements
Abstraction
Package code into reusable functions
🚀 Today: Data Analysis with Pandas
We'll apply everything we've learned to analyze real biological datasets, compute summary statistics on cancer data, and discover patterns in experimental results!
The Problem: A Cancer Researcher with Big Data 📊

Sarah, Postdoc studying cancer cell dependencies
The Challenge: DepMap Dataset
Sarah is analyzing the Cancer Dependency Map (DepMap) - a massive dataset showing which genes cancer cells need to survive. But the data is overwhelming!
📊 Dataset dimensions:
- • 1,200+ cancer cell lines (rows)
- • 30,000+ gene dependencies (columns)
- • 36 million data points total!
- • Excel crashes trying to open it
🔬 Her Question:
"Which genes show similar dependencies across different cancer cell lines compared to the ATR checkpoint kinase?"
💡 Solution: Pandas DataFrames!
Handle millions of data points effortlessly and find patterns in seconds
Today's Learning Journey 🚀
Part 1: Package Ecosystem
Import & Modules
- • import package, import package as alias, from package import name
- • Python's package manager (pip)
- • Finding the right tools
🎯 Goal: Understand how Python's vast ecosystem helps scientists
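The three import styles can be compared side by side. The sketch below uses only standard-library modules so it runs anywhere; the gc_fractions values are made up for the example.

```python
import math                      # import the whole module
import statistics as stats       # import with a shorter alias
from random import choice        # import one name directly

gc_fractions = [0.41, 0.52, 0.47]     # hypothetical values

log_value = math.log10(1000)          # module.function
mean_gc = stats.mean(gc_fractions)    # alias.function
base = choice("ACGT")                 # bare name, no prefix needed

print(log_value, round(mean_gc, 3), base)
```

The alias form is the one used for pandas throughout this lecture (import pandas as pd).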
Part 2: Standard Library
Built-in Power Tools
- • pathlib - File system navigation
- • csv - Data file handling
- • math - Scientific calculations
- • random - Simulations
🎯 Goal: Master Python's built-in tools for scientific computing
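Here is one way the four modules might work together in a single small script. It is a sketch under assumptions: the file name od_readings.csv, the column names, and the simulated readings are all invented for the example, and tempfile (also in the standard library) is used so nothing is written to your working folder.

```python
import csv
import math
import random
import tempfile
from pathlib import Path

random.seed(0)  # make the simulated values reproducible

# Write a small made-up table to a temporary folder
folder = Path(tempfile.mkdtemp())
csv_path = folder / "od_readings.csv"
with csv_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["sample", "od600"])
    for i in range(5):
        writer.writerow([f"S{i + 1}", round(random.uniform(0.1, 1.0), 3)])

# Read it back; csv returns every value as a string
with csv_path.open(newline="") as handle:
    rows = list(csv.DictReader(handle))

log_od = [math.log10(float(row["od600"])) for row in rows]
print(csv_path.name, len(rows), [round(x, 2) for x in log_od])
```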
Part 3: OOP & APIs
Working with Objects
- • Objects have methods: df.mean()
- • Objects have attributes: df.shape
- • Chain operations: df.sort_values('col').head(10)
🎯 Goal: Understand how to use powerful data objects
Part 4: Pandas Practical
Real Data Analysis
- • Load the DepMap dataset
- • Filter cancer cell lines
- • Calculate correlations
- • Find gene dependencies
🎯 Goal: Solve Sarah's research question with real data!
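To preview where this is heading, Sarah's question could be approached with DataFrame.corrwith, which correlates every column with one reference column. The sketch below uses hypothetical toy scores and placeholder gene names (GENE_A, GENE_B, GENE_C), not DepMap values, and is only one possible approach.

```python
import pandas as pd

# Hypothetical toy scores; rows are cell lines, columns are genes
scores = pd.DataFrame(
    {
        "ATR":    [-1.2, -0.8, -1.0, -0.5, -0.9],
        "GENE_A": [-1.1, -0.7, -0.9, -0.4, -0.8],   # tracks ATR closely
        "GENE_B": [0.1, -0.3, 0.2, 0.0, -0.1],      # weak relationship
        "GENE_C": [0.2, -0.2, 0.0, -0.5, -0.1],     # opposite trend
    },
    index=["line1", "line2", "line3", "line4", "line5"],
)

# Pearson correlation of every gene column with the ATR column
atr_corr = scores.drop(columns="ATR").corrwith(scores["ATR"])

# Most similar dependency profiles first
ranked = atr_corr.sort_values(ascending=False)
print(ranked)
```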
🎯 Today's Superpower
Transform from working with single sequences to analyzing entire datasets - from 100s of data points to millions with the same ease!
Part 1 & 2
Python Package Ecosystem & Standard Library Modules
Working with Open Source Code
Leverage thousands of scientific tools built by the community
📦 The Power of Packages
Transform complex tasks into simple commands:
Without Packages (100+ lines)
# Manual correlation calculation
for gene1 in genes:
    for gene2 in genes:
        # Complex math...
        # More complex math...
        # Even more math...
With Pandas (1 line) ✨
# Calculate all correlations at once
correlations = df.corr()
Let's explore how to tap into this incredible ecosystem!
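To make the comparison concrete, here is a hand-written Pearson correlation next to the pandas call. The pearson helper and the two short score lists are illustrative assumptions; the point is only that both routes give the same number.

```python
import math
import pandas as pd

def pearson(x, y):
    """Pearson correlation of two equal-length lists, written out by hand."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for two genes across five cell lines
gene1 = [-1.0, -0.5, 0.0, -0.8, -0.2]
gene2 = [-0.9, -0.4, 0.1, -0.6, -0.3]

manual = pearson(gene1, gene2)

df = pd.DataFrame({"gene1": gene1, "gene2": gene2})
with_pandas = df.corr().loc["gene1", "gene2"]   # full matrix, then one cell

print(round(manual, 4), round(with_pandas, 4))
```

With thousands of gene columns, the manual version would need the nested loop shown above, while df.corr() computes every pair in one call.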
What Are Python Packages? 🧺
"Imagine you land in Japan and realize you got your only shirt dirty on the plane. You have two choices..."
🏠 Do It Yourself

You need to:
- • Find a laundromat
- • Learn Japanese instructions
- • Get coins & detergent
- • Wait and monitor cycles
- • Handle every step yourself
📝 Writing Your Own Code
# Manual correlation calculation
sum_x = 0
sum_y = 0
sum_xy = 0
# ... 50+ more lines of math
# ... handle edge cases
# ... normalize results
🏨 Hotel Service (API)

You just:
- • Hand dirty clothes to receptionist
- • Say "clean please"
- • Get clean clothes back
- • Don't worry about HOW
- • Trust the service to handle it
📦 Using a Package
import pandas as pd
# Let pandas handle everything
df = pd.read_csv('data.csv')
result = df.corr()  # Done!
🎯 The Power of Python Objects & APIs
Objects Are Like Hotel Services
• They have methods (services they provide)
• They have attributes (properties you can check)
• You don't need to know HOW they work inside
Example: DataFrame Object
df.mean()   # Method: calculates mean
df.shape    # Attribute: (rows, cols)
df.plot()   # Method: creates a graph
"Good design is obvious. Great design is transparent." - Joe Sparano
Two Types of Python Packages 📦
Python comes with batteries included, but the real power is in the community
These are already included in your Python installation:
- • os - Operating system interface
- • pathlib - File path handling
- • csv - CSV file reading/writing
- • math - Mathematical functions
- • random - Random number generation
- • json - JSON data handling
✅ Advantages:
- • Always available
- • No installation needed
- • Well-tested & stable
- • Works in any environment
import csv
import math
import random
# Ready to use immediately!
Specialized tools for science:
- • pandas - Data analysis & manipulation
- • numpy - Numerical computing
- • matplotlib - Data visualization
- • biopython - Biological computing
- • scipy - Scientific computing
- • seaborn - Statistical plots
⚡ Supercharged capabilities:
- • Handle millions of data points
- • Advanced scientific algorithms
- • Domain-specific features
- • Active development
# Needs installation first:
# uv add pandas numpy
import pandas as pd
import numpy as np
💻 Local Development (Advanced)
For building real applications with uv:
# Create new project
uv init my-bio-project
cd my-bio-project
# Add scientific packages
uv add pandas numpy matplotlib
# Run your script
uv run analysis.py
uv handles virtual environments & dependencies automatically!
☁️ Google Colab (Recommended)
Most scientific packages pre-installed:
# Just import and use!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# If you need something special:
!pip install biopython
Perfect for learning - no setup required!
🎯 For this course: We'll use both! Standard library for basics, PyPI packages (especially Pandas) for powerful data analysis.
Standard Library Example: Random 🎲
Perfect for biological simulations - no installation needed!
🎯 What is `random`?
Built-in capabilities:
- • Generate random numbers
- • Sample from populations
- • Shuffle sequences
- • Pick random choices
- • Statistical distributions
Perfect for biology:
- • Random DNA sequences
- • Monte Carlo simulations
- • Bootstrap sampling
- • Mutation modeling
- • Population genetics
import random
# Always available in Python!
# No pip install needed
🧬 Example 1: Genetic Drift & Sample Randomization
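A possible version of this example is sketched below. It makes two assumptions: samples are randomized by shuffling a copy of the list and splitting it in half, and drift is modeled by resampling the whole population with replacement each generation (a simple Wright-Fisher style model). The sample names and population size are invented.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Sample randomization: assign 8 samples to two equal groups
samples = [f"mouse_{i}" for i in range(1, 9)]
shuffled = samples[:]            # copy, because shuffle works in place
random.shuffle(shuffled)
treatment, control = shuffled[:4], shuffled[4:]

# Genetic drift: each generation resamples alleles from the previous one
population = ["A"] * 10 + ["a"] * 10     # start at 50% allele A
frequencies = []
for generation in range(20):
    population = random.choices(population, k=len(population))
    frequencies.append(population.count("A") / len(population))

print(treatment, control)
print(frequencies)
```

In a population this small, the allele frequency wanders from generation to generation purely by chance, which is the effect the simulation is meant to show.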
🧪 Example 2: DNA Point Mutations
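One way to model point mutations is to visit each base and substitute it with a small probability. The mutate function, its rate parameter, and the test sequence below are assumptions for illustration, not a model of any particular mutational process.

```python
import random

def mutate(dna, rate, rng):
    """Return a copy of dna where each base is substituted with probability rate."""
    bases = "ACGT"
    result = []
    for base in dna:
        if rng.random() < rate:
            # pick a replacement that differs from the current base
            result.append(rng.choice([b for b in bases if b != base]))
        else:
            result.append(base)
    return "".join(result)

rng = random.Random(1)           # private generator with a fixed seed
original = "ATGGCCAAATAA" * 5
mutated = mutate(original, rate=0.1, rng=rng)
n_changes = sum(a != b for a, b in zip(original, mutated))
print(mutated, n_changes)
```

Passing a random.Random instance instead of calling the module-level functions keeps the simulation reproducible without affecting other code that uses random.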
🔧 Most Useful Methods for Biology
Numbers
random.random()            # 0.0 to 1.0
random.randint(1, 6)       # 1 to 6
random.uniform(0, 100)     # float range
Selections
random.choice(list)        # pick one
random.choices(list, k=3)  # pick 3 (replacement)
random.sample(list, 3)     # pick 3 (no replacement)
Shuffling
random.shuffle(list)       # shuffle in-place
random.seed(42)            # reproducible results
Part 3
Python Classes & Objects
Building Your Own Data Types
Create custom objects to organize complex biological data and functionality
🧬 From Data to Objects
Bundle data with the functions that work on it:
Separate Functions & Data
sequence = "ATCGATCG"
gc = calculate_gc_content(sequence)
rev_comp = reverse_complement(sequence)
# Data and functions are separate
Object-Oriented Approach ✨
dna = DNASequence("ATCGATCG")
gc = dna.gc_content()
rev_comp = dna.reverse_complement()
# Data and methods together!
This is exactly how pandas DataFrames work - let's build our own!
Classes: Storing Data
Classes let you create custom data containers with named attributes
🧬 Creating a DNASequence Class - Try It!
🏗️ __init__ method
Runs automatically when you create a new object
Think of it as the "setup" function that stores your initial data
🔑 self keyword
Refers to "this particular object"
💡 Confused me at first too - just Python's way of saying "this one"!
📦 Each object stores its own copy of the data - gene1 and gene2 are separate!
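A minimal DNASequence class that only stores data might look like the sketch below. The single sequence attribute and the upper-casing step are assumptions; the class built in the lecture notebook may differ.

```python
class DNASequence:
    """A DNA sequence (attributes only, no methods yet)."""

    def __init__(self, sequence):
        # self is the object being created; store the data on it
        self.sequence = sequence.upper()

gene1 = DNASequence("atgcgta")
gene2 = DNASequence("GGGCCC")

# Each object keeps its own data
print(gene1.sequence)   # ATGCGTA
print(gene2.sequence)   # GGGCCC
```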
Classes: Adding Methods
Classes can also include methods - functions that work with the stored data
⚙️ Adding Functionality to Our DNASequence - Try It!
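Adding methods could look like this. The method names length, gc_content, and reverse_complement match the calls used in this lecture; their bodies are a sketch, and returning "N" for unexpected characters is an assumption.

```python
class DNASequence:
    """A DNA sequence bundled with the methods that work on it."""

    def __init__(self, sequence):
        self.sequence = sequence.upper()

    def length(self):
        return len(self.sequence)

    def gc_content(self):
        """Percentage of bases that are G or C (0.0 for an empty sequence)."""
        if not self.sequence:
            return 0.0
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / len(self.sequence) * 100

    def reverse_complement(self):
        pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
        # read the sequence backwards and swap each base for its partner
        return "".join(pairs.get(base, "N") for base in reversed(self.sequence))

my_gene = DNASequence("ATCGATCG")
print(my_gene.length(), my_gene.gc_content(), my_gene.reverse_complement())
# 8 50.0 CGATCGAT
```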
🎯 This is Exactly How Pandas Works!
Our DNASequence Class
my_gene.length()
my_gene.gc_content()
my_gene.reverse_complement()
Pandas DataFrame Class
df.head()
df.describe()
df.groupby('gene')
Same pattern: object.method() - the object knows how to work with its own data!
✨ Methods make your code readable: gene.gc_content() vs calculate_gc(gene_sequence)
Part 4
Enter Pandas
The Library That Changed Data Science
From spreadsheets to data science revolution - now with hands-on biological data!
📊 A Brief History: Why Pandas Exists
❌ Before Pandas (2008)
Data analysis meant:
- • Excel for small datasets
- • R for statistics
- • SQL for databases
- • MATLAB for matrices
- • Separate tools = messy workflows!
🚀 Pandas Revolution (2008-now)
One library to rule them all:
- • Read any file format
- • Clean messy data
- • Statistical analysis
- • Visualization
- • All in Python!
🧬 For Biology Today
Perfect for biological data:
- • Gene expression matrices
- • Clinical trial data
- • DNA sequence analysis
- • Protein structures
- • Publication-ready plots
🎯 Remember Our Classes?
# Classes prepare you for this!
df.head()           # DataFrame method - just like our DNASequence.length()
df.groupby('gene')  # DataFrame method - objects that know their data
df.describe()       # DataFrame method - built-in functionality
Wes McKinney
Creator of Pandas (2008)
📚 Learn More About Pandas
Official documentation with comprehensive guides and API reference
Pandas Documentation
Our Dataset: DepMap CRISPR
🧬 The Cancer Dependency Map (DepMap)
A massive project by the Broad Institute to find cancer's weaknesses across 1000+ cancer cell lines
🔬 How CRISPR Works Here
1. Take cancer cell lines
2. Use CRISPR to knock out each gene
3. Measure: Do cells die or survive?
4. Repeat for ~20,000 genes!
📊 Gene Effect Scores
Negative score: Gene is essential
~0 score: Gene not important
Positive: Gene inhibits growth
Scale: -1.0 = typical essential gene
📋 What Our Data Looks Like
# Our dataset: Breast vs Myeloid cancers
# Metadata columns + gene effect scores
model_id    cell_line_name  oncotree_lineage  oncotree_primary_disease    A1BG    A1CF    A2M
ACH-000004  HEL             Myeloid           Acute Myeloid Leukemia      0.005   -0.069  -0.098
ACH-000017  SK-BR-3         Breast            Invasive Breast Carcinoma   -0.032  -0.102  -0.013
ACH-000019  MCF7            Breast            Invasive Breast Carcinoma   0.036   0.018   0.095
ACH-000028  KPL-1           Breast            Invasive Breast Carcinoma   -0.188  -0.149  0.077
...         ...             ...               ...                         ...     ...     ...
# We'll compare: Are certain genes more essential in breast vs myeloid cancers?
# Perfect for groupby analysis!
🎯 Why This Matters for Drug Discovery
Find Targets
Genes essential in cancer but not normal cells
Personalize Treatment
Different cancers = different vulnerabilities
Save Lives
Turn data into new cancer drugs
🚀 Let's explore this data with pandas and uncover cancer's secrets!
First Steps: Loading & Inspecting Data
📥 Step 1: Load the Data
import pandas as pd
# Load our DepMap CRISPR data
df = pd.read_csv('depmap_breast_myeloid.csv')
print("Data loaded successfully!")
print(f"Shape: {df.shape}")  # (rows, columns)
🔍 Step 2: Inspect Your Data
Essential DataFrame Methods
# Get basic info
df.shape           # (rows, columns)
df.columns         # column names
df.dtypes          # data types
df.info()          # comprehensive overview
# Peek at the data
df.head()          # first 5 rows
df.head(10)        # first 10 rows
df.tail()          # last 5 rows
Statistical Summary
# Get statistics
df.describe()      # numerical summaries
df.nunique()       # unique values per column
# Check for missing data
df.isnull().sum()  # count NaN values
df.isna().sum()    # same as above
# Sample some rows
df.sample(5)       # 5 random rows
👀 What You'll Discover
Dataset Size
How many cell lines? How many genes?
Data Quality
Any missing values? Clean data?
Cancer Types
How many breast vs myeloid cancers?
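These three questions map onto three short pandas calls. The sketch below rebuilds the four sample rows from the dataset preview by hand so it runs without the CSV file; on the real file you would start from pd.read_csv instead.

```python
import pandas as pd

# The four preview rows, typed in by hand
df = pd.DataFrame(
    {
        "model_id": ["ACH-000004", "ACH-000017", "ACH-000019", "ACH-000028"],
        "cell_line_name": ["HEL", "SK-BR-3", "MCF7", "KPL-1"],
        "oncotree_lineage": ["Myeloid", "Breast", "Breast", "Breast"],
        "oncotree_primary_disease": [
            "Acute Myeloid Leukemia",
            "Invasive Breast Carcinoma",
            "Invasive Breast Carcinoma",
            "Invasive Breast Carcinoma",
        ],
        "A1BG": [0.005, -0.032, 0.036, -0.188],
        "A1CF": [-0.069, -0.102, 0.018, -0.149],
        "A2M": [-0.098, -0.013, 0.095, 0.077],
    }
)

n_rows, n_cols = df.shape                                # dataset size
n_missing = int(df.isna().sum().sum())                   # total missing values
lineage_counts = df["oncotree_lineage"].value_counts()   # rows per cancer type

print(n_rows, n_cols, n_missing)
print(lineage_counts)
```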
🎯 Remember: DataFrame = Powerful Class!
Our DNASequence class:
my_dna.length()
Pandas DataFrame class:
df.head(), df.describe()
Same pattern - objects that know how to work with their data!
🚀 Ready to Get Hands-On?
Time to explore real cancer data with Pandas!
Our Research Question: Essential Genes
🧬 What are the top 10 most essential genes in breast and myeloid cancers?
Let's discover which genes are critical for cancer cell survival!
🎯 Why This Question Matters
🎯 Drug Targets
Essential genes = potential therapeutic targets
🧬 Cancer Biology
Understand what keeps cancer cells alive
⚕️ Precision Medicine
Different cancers = different vulnerabilities
🗺️ Our Analysis Roadmap
Filter Breast Cancer Data
Extract only breast cancer cell lines from the dataset
Calculate Mean Gene Effects
Average gene scores across all breast cancer cell lines
Sort & Select Top 10
Find the most negative scores (most essential genes)
Repeat for Myeloid Cancer
Same process: filter → mean → sort → top 10
Compare & Visualize
Compare top 10 lists - what's different between cancer types?
🚀 Advanced Option:
Later we'll learn df.groupby() to analyze both cancer types at once!
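As a preview, a groupby version of the same analysis might look like the sketch below. The scores and the gene names GENE_X and GENE_Y are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical toy values; GENE_X and GENE_Y are placeholder names
df = pd.DataFrame(
    {
        "oncotree_lineage": ["Breast", "Breast", "Myeloid", "Myeloid"],
        "GENE_X": [-1.0, -0.8, -0.1, 0.1],
        "GENE_Y": [0.0, 0.2, -1.2, -1.0],
    }
)

# One row per cancer type, one column per gene
means_by_type = df.groupby("oncotree_lineage")[["GENE_X", "GENE_Y"]].mean()

# Most essential gene (lowest mean score) within each cancer type
most_essential = means_by_type.idxmin(axis=1)

print(means_by_type)
print(most_essential)
```

The filter, mean, and sort steps still apply; groupby simply runs the mean for every cancer type in one pass.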
⚙️ Pandas Techniques We'll Learn
🔍 Data Selection:
- • df.loc[condition] - filter rows by condition
- • df.loc[:, 'column'] - select columns by name
- • df.iloc[0:5] - select rows by position
📊 Analysis:
- • .mean() - calculate averages
- • .sort_values() - ranking data
- • .head(10) - top results
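The selection and analysis techniques listed above can be tried on a tiny table first. Everything in this sketch (cell line names, the GENE_X column, the scores) is hypothetical.

```python
import pandas as pd

# Hypothetical toy table
df = pd.DataFrame(
    {
        "cell_line_name": ["L1", "L2", "L3", "L4"],
        "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid"],
        "GENE_X": [-0.9, 0.1, -0.7, 0.0],
    }
)

breast_rows = df.loc[df["oncotree_lineage"] == "Breast"]   # rows by condition
gene_x = df.loc[:, "GENE_X"]                               # one column by name
first_two = df.iloc[0:2]                                   # rows by position

# Analysis steps: mean of a column, then the lowest scores
breast_mean = breast_rows["GENE_X"].mean()
lowest = df.sort_values("GENE_X").head(2)

print(breast_mean)
print(lowest)
```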
🚀 Let's Discover Cancer's Secrets!
Time to dive into Google Colab and analyze real cancer dependency data
Open DepMap Analysis Notebook
Step 1: Filtering DataFrames
🎯 The Power of Boolean Filtering
Select rows that meet specific conditions - like SQL WHERE or Excel filters, but more powerful!
📖 General Filtering Patterns
Single Condition
# Basic pattern: df[condition]
# Equals
df[df['column'] == 'value']
# Greater than
df[df['age'] > 30]
# String contains
df[df['gene'].str.contains('BRCA')]
Multiple Conditions
# Use & (and), | (or), ~ (not)
# Note: Need parentheses!
# AND condition
df[(df['age'] > 30) & (df['sex'] == 'F')]
# OR condition
df[(df['type'] == 'A') | (df['type'] == 'B')]
# NOT condition
df[~df['column'].isnull()]
🧬 Filtering Our Cancer Data
# Filter for breast cancer cell lines
breast_df = df[df['oncotree_lineage'] == 'Breast']
print(f"Found {len(breast_df)} breast cancer cell lines")
# Filter for myeloid cancer cell lines
myeloid_df = df[df['oncotree_lineage'] == 'Myeloid']
print(f"Found {len(myeloid_df)} myeloid cancer cell lines")
# Advanced: Get both types in one DataFrame
both_types = df[(df['oncotree_lineage'] == 'Breast') | (df['oncotree_lineage'] == 'Myeloid')]
# Alternative using .isin()
cancer_types = ['Breast', 'Myeloid']
both_types = df[df['oncotree_lineage'].isin(cancer_types)]
🔧 Useful Filtering Methods
String Methods
- • .str.contains('text')
- • .str.startswith('A')
- • .str.upper()
Null Handling
- • .isnull()
- • .notna()
- • .dropna()
Value Checks
- • .isin(['A', 'B'])
- • .between(0, 100)
- • .duplicated()
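A few of these methods in action, as a sketch on a hypothetical four-row table (the scores are invented, and one is deliberately missing):

```python
import pandas as pd

# Hypothetical toy table with one missing score
df = pd.DataFrame(
    {
        "gene": ["BRCA1", "BRCA2", "ATR", "TP53"],
        "score": [-0.4, None, -1.1, 0.3],
    }
)

brca = df[df["gene"].str.startswith("BRCA")]        # string method
measured = df[df["score"].notna()]                  # keep rows that have a score
moderate = df[df["score"].between(-0.5, 0.5)]       # inclusive range check
chosen = df[df["gene"].isin(["ATR", "TP53"])]       # membership check

print(len(brca), len(measured), len(moderate), len(chosen))   # 2 3 2 2
```

Note that a missing value never passes a .between() check, so the BRCA2 row is excluded from moderate.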
⚠️ Common Gotcha
❌ Wrong (Python and/or)
df[df['x'] > 5 and df['y'] < 10]
✅ Correct (Pandas &/|)
df[(df['x'] > 5) & (df['y'] < 10)]
Always use & | ~ with parentheses!
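You can see the gotcha for yourself with a three-row table. This sketch catches the error so the script keeps running; the column names x and y follow the example above and the values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"x": [2, 6, 8], "y": [1, 12, 5]})

# Python's `and` asks for a single True/False from a whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    df[df["x"] > 5 and df["y"] < 10]
    error_message = None
except ValueError as err:
    error_message = str(err)

# `&` compares the two boolean Series element by element
filtered = df[(df["x"] > 5) & (df["y"] < 10)]

print(error_message)
print(filtered)
```

The parentheses matter because & binds more tightly than > and <; without them Python would try to evaluate 5 & df["y"] first.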
🚀 Practice Time!
Head to our Google Colab notebook to practice filtering with real cancer data
Practice Pandas Filtering
Step 2: Calculating Statistics
📊 From Individual Scores to Summary Statistics
Transform hundreds of cell line measurements into meaningful averages for each gene
🔢 Essential Statistical Methods
Column-wise Statistics
# Calculate mean for all numeric columns
# (numeric_only=True skips text columns such as cell_line_name)
df.mean(numeric_only=True)
# Mean for specific columns
df[['A1BG', 'A1CF', 'A2M']].mean()
# Other useful statistics
df.median(numeric_only=True)  # Middle value
df.std(numeric_only=True)     # Standard deviation
df.var(numeric_only=True)     # Variance
df.min(numeric_only=True)     # Minimum values
df.max(numeric_only=True)     # Maximum values
Selecting Gene Columns
# Get all columns except metadata
metadata_cols = ['model_id', 'cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease']
gene_columns = [col for col in df.columns if col not in metadata_cols]
# Or use column slicing
gene_columns = df.columns[4:]  # Skip first 4 metadata columns
# Calculate means for just gene columns
gene_means = df[gene_columns].mean()
🧬 Our Gene Analysis Workflow
# Step 1: Filter for breast cancer (we already learned this!)
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
print(f"Found {len(breast_df)} breast cancer cell lines")
# Step 2: Select only gene effect columns
metadata_cols = ['model_id', 'cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease']
gene_columns = [col for col in df.columns if col not in metadata_cols]
print(f"Analyzing {len(gene_columns)} genes")
# Step 3: Calculate mean gene effect for each gene across all breast cancer lines
breast_gene_means = breast_df[gene_columns].mean()
print("Sample of gene means:")
print(breast_gene_means.head())
# Example output:
# A1BG     -0.021
# A1CF     -0.088
# A2M      -0.013
# A2ML1     0.041
# A4GALT   -0.003
💡 Understanding Your Results
🎯 Negative Means
Example: A1CF = -0.088
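Knocking the gene out reduces cell survival on average. The more negative the mean, the stronger the dependency: about -1.0 is typical for an essential gene, so -0.088 is only a mild effect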
⚪ Near Zero
Example: A4GALT = -0.003
Gene has little effect - not essential for survival
📈 Positive Means
Example: A2ML1 = 0.041
Gene may inhibit growth - knocking it out helps cells grow
🔄 What's Next?
Now we have mean gene effect scores for breast cancer
Next: Sort these means to find the top 10 most essential genes!
🧮 Practice Statistics!
Master pandas statistics with hands-on calculations using real data
Practice Statistical Analysis with Pandas
Step 3: Sorting Data
📊 Finding the Most Important Data
Sort your data to discover patterns, rankings, and extremes - perfect for finding essential genes!
🔧 Sorting with .sort_values()
Basic Sorting
# Sort by one column (ascending by default)
df.sort_values('gene_effect')
# Sort descending (highest to lowest)
df.sort_values('gene_effect', ascending=False)
# Sort ascending and take the first 10 rows (lowest, most negative scores)
df.sort_values('gene_effect').head(10)
# Sort ascending and take the last 10 rows (highest, most positive scores)
df.sort_values('gene_effect').tail(10)
Advanced Sorting
# Sort by multiple columns
df.sort_values(['lineage', 'gene_effect'])
# Mixed sort directions
df.sort_values(['lineage', 'gene_effect'], ascending=[True, False])
# Sort Series (for our gene means)
gene_means.sort_values(ascending=False)
# Reset index after sorting
df.sort_values('gene_effect').reset_index(drop=True)
🧬 Finding Essential Genes in Our Data
# After filtering and calculating means for breast cancer
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
gene_columns = df.columns[4:]  # All gene columns (skip the 4 metadata columns)
breast_means = breast_df[gene_columns].mean()
# Sort to find most essential genes (most negative scores)
most_essential = breast_means.sort_values(ascending=True)
print("Top 10 most essential genes in breast cancer:")
print(most_essential.head(10))
# Or find genes that inhibit growth (most positive)
growth_inhibitors = breast_means.sort_values(ascending=False)
print("Top 10 growth inhibitor genes:")
print(growth_inhibitors.head(10))
💡 Key Sorting Concepts
📈 Ascending vs Descending
- • ascending=True: 1, 2, 3...
- • ascending=False: 3, 2, 1...
- • Essential genes = most negative!
🎯 Getting Top Results
- • .head(n): first n rows
- • .tail(n): last n rows
- • .nlargest(n): top n values
⚙️ Sorting Tips
- • Sort doesn't modify original data
- • Use inplace=True to modify
- • Can sort any column type
🎯 Our Goal: Top 10 Essential Genes
Remember: In CRISPR data, negative scores = essential genes
Most essential = most negative = sort ascending + head(10)
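The rule "sort ascending + head" can be checked on a small Series. The scores and gene names below are hypothetical placeholders; nsmallest is shown as an equivalent shortcut.

```python
import pandas as pd

# Hypothetical mean gene effect scores; gene names are placeholders
means = pd.Series(
    {"GENE_A": -1.3, "GENE_B": 0.2, "GENE_C": -0.1, "GENE_D": -0.9, "GENE_E": 0.05}
)

top_sorted = means.sort_values(ascending=True).head(2)   # sort, then take the first rows
top_direct = means.nsmallest(2)                          # same result in one call

print(top_sorted)
print(list(means.index))   # original order is unchanged: sorting returned a new Series
```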
🚀 Practice Sorting!
Master data sorting with hands-on practice using real cancer data
Practice Pandas Sorting
Analysis Summary & Results
🔬 Our Analysis Workflow
Step 1: Filter Data
Selected breast & myeloid cancer cell lines from CRISPR dataset
Step 2: Calculate Statistics
Computed mean gene effect scores for each cancer type
Step 3: Sort & Rank
Identified top 10 most essential genes per cancer type
🔬 Try the Full Analysis
Run all the steps yourself with the complete DepMap notebook
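Once each cancer type has its own top list, comparing them is a set operation. This sketch uses hypothetical mean scores and placeholder gene names with a top 2 instead of a top 10; it illustrates the comparison step only and does not reproduce the results reported in this section.

```python
import pandas as pd

# Hypothetical mean scores per cancer type; gene names are placeholders
breast_means = pd.Series({"GENE_A": -1.4, "GENE_B": -1.2, "GENE_C": -0.2, "GENE_D": -1.0})
myeloid_means = pd.Series({"GENE_A": -1.3, "GENE_B": -0.1, "GENE_C": -1.1, "GENE_D": -0.9})

top_n = 2
breast_top = set(breast_means.nsmallest(top_n).index)
myeloid_top = set(myeloid_means.nsmallest(top_n).index)

common = breast_top & myeloid_top          # essential in both
breast_only = breast_top - myeloid_top     # only in the breast list
myeloid_only = myeloid_top - breast_top    # only in the myeloid list

print(sorted(common), sorted(breast_only), sorted(myeloid_only))
```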
🎗️ Breast Cancer - Top Essential Genes
🩸 Myeloid Cancer - Top Essential Genes
🔍 Key Observations
🤝 Common Essential Genes
- ● RAN - Nuclear transport, cell division
- ● HSPE1 - Protein folding chaperone
- ● RRM1 - DNA replication
- ● PLK1 - Cell cycle regulation
- ● PSMA6 - Protein degradation
🎯 Cancer-Specific Patterns
RNA splicing genes (SNRPF, SNRPA1, SF3B5)
Ribosomal proteins (RPL17, RPS8, RPS29, RPS19)
🤔 Research Questions & Next Steps
❓ Questions to Explore
- • Why are 5 genes common across cancer types?
- • What pathways are breast-specific vs myeloid-specific?
- • Do these genes interact in regulatory networks?
- • Are there druggable targets among these genes?
- • How do these relate to clinical outcomes?
🚀 Advanced Analysis Techniques
- • GSEA - Gene Set Enrichment Analysis
- • Network Analysis - Protein-protein interactions
- • Correlation Analysis - Gene expression patterns
- • Visualization - Heatmaps, networks, volcano plots
- • Machine Learning - Predictive models
🧠 Biological Insights
Common essential genes represent fundamental cellular processes required by all cancer cells
Cancer-specific genes reveal unique vulnerabilities that could be targeted therapeutically
🎯 This analysis provides a roadmap for precision cancer therapy development!
📈 Coming Next
Build on today's analysis with advanced statistical methods
Linear Regression Analysis
📈 Statistical Modeling
- • Gene expression correlations
- • Predictive modeling
- • R² and significance testing
🎨 Data Visualization
- • Scatter plots & regression lines
- • Heatmaps & clustering
- • Interactive plots with Plotly
What We Learned Today
🎯 Today's Key Achievements
🏗️ Object-Oriented Programming
- • Understanding classes as data containers
- • Creating methods to process data
- • Building reusable, organized code structures
🐼 Pandas Data Analysis
- • Loading and exploring DataFrames
- • Filtering data with boolean indexing
- • Computing statistics and sorting results
🏗️ Object-Oriented Programming Mastery
Classes as Data Containers
class Gene:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    def get_length(self):
        return len(self.sequence)

    def get_gc_content(self):
        gc_count = self.sequence.count('G') + self.sequence.count('C')
        return gc_count / len(self.sequence) * 100

# Create and use objects
my_gene = Gene("BRCA1", "ATCGATCG")
print(f"Length: {my_gene.get_length()}")
print(f"GC%: {my_gene.get_gc_content():.1f}%")
Key OOP Benefits
📦 Organization
Group related data and functions together
🔄 Reusability
Create multiple instances with same behavior
🎯 Clarity
Self-documenting, intuitive code structure
🐼 Pandas Data Analysis Pipeline
1. Load & Explore
# Load data
df = pd.read_csv('data.csv')
# Explore structure
print(df.head())
print(df.shape)
print(df.columns)
df.info()  # prints its own summary and returns None, so no print() needed
2. Filter & Select
# Boolean filtering
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
# Column selection
gene_columns = df.columns[4:]
gene_data = breast_df[gene_columns]
3. Analyze & Sort
# Calculate statistics
means = gene_data.mean()
# Sort results
top_genes = means.sort_values(ascending=True).head(10)
print(top_genes)
💡 Key Takeaways
🏗️ OOP Mindset
- • Think in terms of objects with properties and behaviors
- • Use classes to model real-world entities (genes, proteins, cells)
- • Methods make your objects smart and interactive
📊 Data Analysis Workflow
- • Always explore your data first
- • Use boolean indexing for precise filtering
- • Combine statistics + sorting to find patterns
🧬 Real-World Impact: From Code to Cancer Research
Today we identified essential genes in breast and myeloid cancers using the same techniques used in
real cancer research labs worldwide. Your code can now analyze datasets with millions of data points!
🎯 You're now equipped to tackle real biological big data challenges!
Congratulations!
You've mastered Object-Oriented Programming and Pandas data analysis
Practice Your Skills! 🧪
Further Resources
Continue your Pandas journey with these excellent resources
Interactive Course
Data Manipulation with Pandas
DataCamp's hands-on course with interactive exercises and real-world datasets. Perfect for learning by doing!
Practice Book
Pandas Workout

By Reuven M. Lerner. 200+ exercises to build your Pandas skills through practice.
🌟 Other Excellent Resources
🎯 Keep Practicing!
The key to mastering Pandas is regular practice with real biological datasets. Start with our course notebooks and gradually work your way through these resources.