DNA Analysis with Python
for Biologists
Build an Open Reading Frame finder to translate DNA sequences into proteins
Today's Focus
String Manipulation
Learn to work with text data, perfect for DNA and protein sequences
DNA Sequence Analysis
Apply Python skills to real biological problems like GC content calculation
Pattern Recognition
Find motifs, restriction sites, and other important sequence features
Core Python Skills
Python dictionaries
Store and manipulate biological data efficiently
String Methods
Essential tools for sequence processing and analysis
Conditional Logic
Make decisions in your code based on biological criteria
Additional Topics
Python Development Tools
IDE setup, debugging, and best practices for scientific computing
Open Source Software
Understanding the ecosystem of biological analysis tools
Biopython
Introduction to the most popular Python library for bioinformatics
Today's Goal: Build an ORF Finder
Our Python Program Will:
Find Start Codons
Locate all ATG positions
Find Stop Codons
Scan for TAA, TAG, TGA in-frame
Extract ORFs
Get sequences between start & stop
Find Longest ORF
Identify the most likely protein
Translate to Protein
Convert DNA codons to amino acids
Process Many Files
Automate for 100s of sequences
Skills we'll learn: String operations • Dictionaries • Conditionals • File I/O • Biopython
Meet Darren: A Biologist with a Problem

Darren, PhD student studying gene expression
The Challenge
Darren has sequenced hundreds of mRNA molecules from cancer cells. Each sequence could encode important proteins, but finding them manually takes hours per sequence!
Current situation:
- 500+ mRNA sequence files
- Each needs to be checked for ORFs
- Manual checking takes ~30 min/file
- That's 250 hours of tedious work!
Solution: Automate with Python!
What takes 30 minutes by hand can be done in milliseconds with code
What are Open Reading Frames (ORFs)?

Key Concepts
- DNA/RNA can be read in 3 different frames
- Each frame groups nucleotides into different codons
- An Open frame has no early stop codons
- A Blocked frame hits a stop codon quickly
ORF Requirements
Start:
ATG (codes for Methionine)
Stop:
TAA, TAG, or TGA
Valid ORF:
ATG → ... → Stop (in the same frame!)
In the example above: Only Frame 1 is "open" - it can produce a full protein. Frames 2 & 3 hit stop codons immediately!
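To make the three reading frames concrete, here is a minimal sketch that splits a short DNA string into codons for each frame and reports whether a stop codon appears. The sequence and the helper name `codons_in_frame` are made up for illustration; they are not the example from the slide figure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_in_frame(dna, frame):
    """Split dna into complete codons, starting at offset frame (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


dna = "ATGAAACCCTGA"  # hypothetical sequence, for illustration only

for frame in range(3):
    codons = codons_in_frame(dna, frame)
    has_stop = any(codon in STOP_CODONS for codon in codons)
    print(f"Frame {frame + 1}: {codons} (contains a stop codon: {has_stop})")
```

Shifting the starting offset by a single base produces a completely different list of codons, which is why the start and stop codons of an ORF must sit in the same frame.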
Breaking Down the ORF Problem
To find and translate an Open Reading Frame, we work through three simple steps:
Find First ATG
Scan the DNA string and find the first ATG start codon
→ position 6
Extract ORF
From ATG, collect codons until we hit a STOP codon
Translate to Protein
Convert each codon to its amino acid using the genetic code
Our Learning Path
The Complete Code - Live Demo!
Here's our complete ORF finder - try it with the example DNA sequence!
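As a rough sketch of what such an ORF finder could look like, the block below chains the three steps: find the first ATG, collect codons up to a stop codon, and translate. The `translate` helper, the compact way the standard codon table is built, and the example DNA string are assumptions made for illustration, not necessarily the code used in the live demo.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): amino_acid
    for bases, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_atg(dna_sequence):
    """Step 1: return the index of the first ATG, or None if there is none."""
    for i in range(len(dna_sequence) - 2):
        if dna_sequence[i:i + 3] == "ATG":
            return i
    return None


def find_orf(dna_sequence, atg_index):
    """Step 2: collect codons from the ATG up to and including the first stop."""
    orf = ""
    for i in range(atg_index, len(dna_sequence) - 2, 3):  # step 3 = stay in frame
        codon = dna_sequence[i:i + 3]
        orf += codon
        if CODON_TABLE.get(codon) == "*":
            break
    return orf


def translate(orf):
    """Step 3: convert codons to amino acids, stopping at the stop codon."""
    protein = ""
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE.get(orf[i:i + 3], "X")  # X marks unknown codons
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein


dna = "CCGTTAATGGCCATTGTATAAGGC"  # hypothetical example sequence
atg_index = find_atg(dna)
if atg_index is None:
    print("No ATG found")
else:
    orf = find_orf(dna, atg_index)
    print("ATG position:", atg_index)
    print("ORF:", orf)
    print("Protein:", translate(orf))
```

Building the table from two strings keeps the sketch short; a hand-written dictionary with 64 entries works just as well and is easier to read for beginners.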
Part 1
Python String Fundamentals
for Biology
Working with DNA sequences as strings
Our First Function: Finding the Start Codon
This is what we'll build together in the next few slides:
    def find_atg(dna_sequence):
        """Find the first ATG start codon in the sequence."""
        for i in range(len(dna_sequence) - 2):
            if dna_sequence[i:i+3] == 'ATG':
                return i
        return None  # Return None if no ATG found
Don't worry if this looks complex - we'll build it step by step!
Quick Review: Data Types & Strings
Basic Data Types
int → 42, -17, 1000 (whole numbers)
float → 3.14, -0.5, 2.7e-8 (decimal numbers)
str → "ATCG", 'DNA' (text sequences)
bool → True, False (logical values)
String Operations We Learned
dna = "ATCGATCG"
len(dna) → 8
dna[0] → "A"
dna[0:3] → "ATC"
dna + "TAA" → "ATCGATCGTAA"
dna * 2 → "ATCGATCGATCGATCG"
"AT" in dna → True
Remember: In Python, strings are sequences of characters - perfect for representing DNA, RNA, and protein sequences!
DNA String Slicing
Understanding String Slicing
String slicing lets you extract parts of a string using [start:end] notation:
- string[1:4] gets characters at positions 1, 2, and 3
- Position counting starts from 0
- The end position is not included
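The slicing rules above can be tried on a short sequence. This is a small sketch; the variable names are chosen only for this example.

```python
dna = "ATGCGTAAA"  # example sequence

middle = dna[1:4]        # characters at positions 1, 2, and 3
first_codon = dna[0:3]   # positions 0, 1, and 2; position 3 is not included
same_codon = dna[:3]     # leaving out the start means "from the beginning"
rest = dna[3:]           # leaving out the end means "to the end of the string"

print(middle)
print(first_codon)
print(same_codon)
print(rest)
```

Because the end position is excluded, `dna[:3] + dna[3:]` always rebuilds the original string without repeating or losing a base.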
Try It Yourself!
Complete the challenge: Print the last three bases of the DNA sequence using slicing.
Practice Challenge
Try these basic slicing exercises:
- Extract just the middle 4 bases of "ATGCGTAAA"
- Get the first half of the DNA sequence
- Extract every other base using step slicing [::2]
- Practice using negative indices to get sections from the end
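If you get stuck, here is one possible way to approach a few of these exercises. It is a sketch rather than the only correct answer, and it assumes the same example sequence as the first exercise.

```python
dna = "ATGCGTAAA"

# First half: integer division (//) gives a whole-number midpoint.
first_half = dna[:len(dna) // 2]

# Step slicing: [start:end:step] with step 2 takes every other base.
every_other = dna[::2]

# Negative indices count from the end of the string.
last_base = dna[-1]
last_three = dna[-3:]
second_to_last_codon = dna[-6:-3]

print(first_half, every_other, last_base, last_three, second_to_last_codon)
```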
Practice More in Google Colab!
Open the full string manipulation notebook with exercises and solutions
Finding ATGs: Loops + Slicing
Step 1: Loop Through Every Position
To find ATGs, we need to check every possible position in the DNA string:
- Position 0: Check bases 0-1-2
- Position 1: Check bases 1-2-3
- Position 2: Check bases 2-3-4
- And so on...
Step 2: Extract 3 Bases from Each Position
At each position, slice out exactly 3 bases to check if it could be a start codon:
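The two steps can be sketched as a loop that prints the 3-base window at every position. The sequence is a made-up example.

```python
dna = "CCATGGA"  # hypothetical example sequence

triplets = []
# Stop 2 positions before the end so every slice contains 3 bases.
for i in range(len(dna) - 2):
    triplet = dna[i:i + 3]
    triplets.append(triplet)
    print(i, triplet)
```

For a sequence of 7 bases this visits positions 0 through 4, and the window at position 2 is the one that contains ATG.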
We've Hit a Problem!
What we can do: Extract 3-base sequences from every position
What we can't do yet: Check IF a sequence equals "ATG"
We need: A way to make decisions in our code - Python conditionals!
Coming Up Next: Python Conditionals
To solve our ATG-finding problem, we'll learn:
- if statements for making decisions
- Comparing strings with ==
- elif and else for multiple conditions
- Putting it all together to find ATGs automatically!
Practice String Manipulation in Google Colab!
Try the full string manipulation notebook with more loop and slicing exercises
Part 2
Python Conditionals
Making Decisions
Teaching Python to make choices based on biological data
Focus: The Conditional Logic
Notice the if statement that makes the decision:
    def find_atg(dna_sequence):
        """Find the first ATG start codon in the sequence."""
        for i in range(len(dna_sequence) - 2):
            if dna_sequence[i:i+3] == 'ATG':  # ← The key decision!
                return i
        return None  # Return None if no ATG found
We'll learn how if statements help us find biological patterns!
Conditionals: Making Decisions in Code
Basic if Statement
Use if to make decisions based on conditions:
if-else: Choose Between Two Options
Use else to handle the opposite case:
elif: Multiple Choices
Use elif to test multiple conditions:
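The three forms can be illustrated with codons. This is a minimal sketch; the function name `classify_codon` and the printed messages are invented for this example.

```python
codon = "ATG"

# Basic if: run the indented code only when the condition is True.
if codon == "ATG":
    print("Found a start codon!")

# if-else: choose between two options.
if codon == "ATG":
    message = "start codon"
else:
    message = "not a start codon"
print(message)


# if-elif-else: test several conditions in order; the first match wins.
def classify_codon(codon):
    if codon == "ATG":
        return "start"
    elif codon in ("TAA", "TAG", "TGA"):
        return "stop"
    else:
        return "other"


for example in ["ATG", "TAG", "GCC"]:
    print(example, "is a", classify_codon(example), "codon")
```

Note that `==` compares two values, while a single `=` assigns a value to a variable.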
Key Points to Remember
Practice Conditionals in Google Colab!
Open the comprehensive conditionals notebook with exercises and biological examples
Building Our First Function: find_atg()
Goal: Find the First ATG in a DNA Sequence
Our function needs to:
- Look at every position in the DNA string
- Check if the 3 bases starting at that position are "ATG"
- Return the position when ATG is found
- Return None if no ATG exists
Step 1: Loop Through Each Position
We use range(len(dna_sequence) - 2) to avoid going past the end:
Step 2: Check if Three Bases Equal "ATG"
At each position, we extract 3 bases and compare with "ATG":
Step 3: Return the Position When Found
As soon as we find ATG, we return its position and stop searching:
Function Components Breakdown
def find_atg(dna_sequence): → Define function with one parameter
for i in range(len(dna_sequence) - 2): → Loop through valid positions
if dna_sequence[i:i+3] == 'ATG': → Check if 3 bases equal ATG
return i → Return position and exit
Try It Yourself!
Modify the function to find ALL ATG positions (not just the first):
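One possible solution, offered as a sketch: instead of returning at the first match, collect every matching position in a list. The name `find_all_atgs` is an assumption for this example.

```python
def find_all_atgs(dna_sequence):
    """Return a list with the position of every ATG in the sequence."""
    positions = []
    for i in range(len(dna_sequence) - 2):
        if dna_sequence[i:i + 3] == "ATG":
            positions.append(i)  # remember this position and keep searching
    return positions


print(find_all_atgs("ATGCCATGA"))
print(find_all_atgs("CCCGGG"))
```

Returning an empty list when nothing is found lets the caller loop over the result without a separate check for None.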
Key Concepts We Combined
Part 3
Python Dictionaries
Match Codons with Amino Acids
Using key-value pairs to store and look up biological information
Next Function: The Codon Reader
Notice the dictionary lookup that finds stop codons:
    # Step 2: Extract ORF from ATG to STOP codon
    def find_orf(dna_sequence, atg_index):
        """Find the ORF starting from ATG position until stop codon."""
        orf = ''
        # Step by 3 so we read whole codons in the same frame as the ATG
        for i in range(atg_index, len(dna_sequence) - 2, 3):
            codon = dna_sequence[i:i+3]
            if len(codon) == 3:  # Make sure we have a complete codon
                orf += codon
                if codon in CODON_TABLE and CODON_TABLE[codon] == '*':  # ← Dictionary magic!
                    break
        return orf
We'll learn how CODON_TABLE[codon] looks up amino acids instantly!
Dictionaries: Key-Value Pairs for Biology
Creating a Simple Codon Dictionary
Dictionaries map keys to values. Perfect for codon → amino acid!
Accessing Keys and Values
Get values by key and loop through the dictionary:
Adding and Changing Values
Modify existing entries or add new ones:
Safe Operations: get() and pop()
Handle missing keys safely and remove entries:
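The four dictionary operations above can be sketched with a tiny codon table. Only a few codons are included to keep the example short, and the variable names are chosen for illustration.

```python
# Creating: keys are codons, values are one-letter amino acid codes.
codon_table = {"ATG": "M", "GCC": "A", "TAA": "*"}

# Accessing: look up one value by key, then loop over all pairs.
start_amino_acid = codon_table["ATG"]
for codon, amino_acid in codon_table.items():
    print(codon, "->", amino_acid)

# Adding and changing: assignment works for new and existing keys.
codon_table["TGG"] = "W"    # add a new entry
codon_table["GCC"] = "Ala"  # overwrite an existing entry

# Safe operations: get() returns a default instead of raising KeyError.
unknown = codon_table.get("NNN", "X")

# pop() removes a key and returns its value; a default avoids KeyError.
removed = codon_table.pop("TAA")
already_gone = codon_table.pop("TAA", None)

print(start_amino_acid, unknown, removed, already_gone)
```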
The Two Core Functions - Interactive Demo
Dictionary & Functions Setup
Try It Out!
Debugging & Error Handling
Let's examine our code with line numbers to understand debugging
Understanding Python Error Messages
Anatomy of a Python Error
Example 1: Syntax Error
Missing parentheses - Python can't understand the code structure
Example 2: Type Error
Trying to use incompatible data types together
Example 3: Name Error
Using a variable or function that doesn't exist
Debugging Tips
- Start from the bottom - that's the actual error
- Note the line number and file name
- Look at the exact code line mentioned
- SyntaxError: Check parentheses, quotes, colons
- TypeError: Check if data types match
- NameError: Check spelling and definitions
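To see two of these errors in action, the sketch below triggers a TypeError and a NameError on purpose. The try/except blocks are used here only so that the script keeps running and can print both messages; the misspelled variable name is deliberate. A SyntaxError is not included because Python reports it before any code in the file runs.

```python
dna = "ATGCCC"
caught = []

# TypeError: a string and an integer cannot be joined with +.
try:
    label = "Length: " + len(dna)
except TypeError as error:
    caught.append(type(error).__name__)
    print("TypeError:", error)

# The fix: convert the number to a string first.
label = "Length: " + str(len(dna))
print(label)

# NameError: the variable name below is misspelled on purpose.
try:
    print(dna_sequnce)
except NameError as error:
    caught.append(type(error).__name__)
    print("NameError:", error)
```

In both cases the last line of the error message names the error type and describes what went wrong, which is why reading from the bottom is a good habit.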
Defensive Programming for Biological Data
Expect the Unexpected
Real biological data is messy:
- DNA sequences might not contain ATG start codons
- FASTA files may have ambiguous bases (N, R, Y)
- User input could be empty or invalid
- Sequences might be too short for analysis
Solution: Use if statements to validate data and handle expected scenarios gracefully!
Simple ATG Finder with Data Validation
Two key checks: valid DNA bases and ATG presence
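A minimal sketch of such a validated finder is shown below. The function name `find_atg_validated`, the warning texts, and the decision to reject any base other than A, C, G, or T are assumptions for this example; a real pipeline might clean ambiguous bases instead of rejecting the sequence.

```python
VALID_BASES = set("ACGT")


def find_atg_validated(dna_sequence):
    """Return the position of the first ATG, or None if the input is unusable."""
    # Check for empty or missing input first.
    if not dna_sequence:
        print("Warning: empty sequence")
        return None

    dna_sequence = dna_sequence.upper()

    # Check 1: every character must be a valid DNA base.
    invalid_bases = set(dna_sequence) - VALID_BASES
    if invalid_bases:
        print(f"Warning: invalid bases found: {sorted(invalid_bases)}")
        return None

    # Check 2: the sequence must contain an ATG; find() returns -1 if it does not.
    position = dna_sequence.find("ATG")
    if position == -1:
        print("Warning: no ATG start codon found")
        return None

    return position


for sequence in ["ccatgaa", "CCNATG", "CCCGGG", ""]:
    print(repr(sequence), "->", find_atg_validated(sequence))
```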
Defensive Programming Principles
- Check for empty or None values
- Validate data types (string vs number)
- Verify biological constraints
- No ATG found → return None
- Invalid bases → clean or warn
- Short sequences → inform user
- Print warnings for data issues
- Return meaningful values
- Document what went wrong
- Return None instead of crashing
- Continue processing when possible
- Don't let one bad sequence stop analysis
Alternative: try/except blocks
try/except is useful for building robust software applications:
- File I/O operations
- Network connections
- Database queries
- User interface errors
For biological data analysis:
- Use simple if statements
- Validate data explicitly
- Handle expected scenarios
- Focus on data quality
For data science: Missing ATGs, invalid bases, or empty sequences aren't "exceptions" - they're normal biological scenarios that need explicit handling with if statements.
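The division of labor can be sketched as follows: try/except guards the file operation, while plain if statements handle the expected data scenarios. The function name `read_sequence` and the file name are hypothetical; the file is assumed not to exist so that the error path is shown.

```python
def read_sequence(path):
    """Read a sequence from a text file; return None if the file is missing."""
    try:
        with open(path, "r") as handle:
            return handle.read().strip().upper()
    except FileNotFoundError:
        # A missing file is a genuine I/O problem, so try/except fits here.
        print(f"Could not find file: {path}")
        return None


sequence = read_sequence("no_such_file_for_this_demo.txt")  # hypothetical name

# Expected data scenarios are handled with ordinary if statements.
if sequence is None:
    status = "skipped: file missing"
elif "ATG" not in sequence:
    status = "no start codon"
else:
    status = "ready for ORF search"

print(status)
```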
DNA Sequence File Formats: FASTA
What is a FASTA File?
FASTA is the most common format for storing DNA, RNA, and protein sequences. It's a simple text format that biologists use worldwide.
Fun Fact
FASTA was named after the FASTA software program for sequence alignment, developed in the 1980s at the University of Virginia
FASTA Format Structure
Basic Format Rules
- Header line starts with >
- Sequence ID comes right after >
- Description (optional) after the ID
- Sequence data on following lines
- No line length limit for sequence
Example FASTA File
>NM_000546.6 Homo sapiens tumor protein p53
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCG
>NM_001126115.2 Homo sapiens BRCA1 gene
ATGGATTTCCGTCTGAACAAACAACACCGCCGGCCCCGTGGGTCCGTGTCCCCGGCAAGCCCCACCCGGGCCCTCCCTCCCGGCTGGGGGCCGCCCCCCGACACCAATCAGGCCCCCCACCCCGGCTCTCTACCCCCGCGCCCCCGGACACTACCCCCCGCC
Where to Find FASTA Files
NCBI GenBank
National Center for Biotechnology Information
- Comprehensive gene database
- Download individual genes
- Genome assemblies available
Ensembl
European genome annotation database
- High-quality annotations
- Multiple species genomes
- Easy bulk downloads
UniProt
Protein sequence database
- Protein sequences only
- Functional annotations
- Research-quality curation
Key Takeaways: Working with FASTA Files
FASTA Essentials
- Simple, universal sequence format
- Header starts with >
- Can contain multiple sequences
- Used by all major databases
Python Skills
- File reading with open()
- String manipulation for parsing
- Dictionary storage for multiple sequences
- Always handle the last sequence!
FASTA files are your gateway to analyzing real biological sequences!
FASTA File Parsing: Professional Approach
Here's how bioinformaticians parse FASTA files in the real world - handling multiple sequences and complex structures.
Complete FASTA Parser
    filename = 'sequences.fasta'  # Path to your FASTA file
    sequences = {}
    current_gene_id = None
    current_sequence = ""

    # Use context manager to safely open and read the file
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()  # Remove whitespace

            if line.startswith('>'):
                # Save previous sequence if we have one
                if current_gene_id is not None:
                    sequences[current_gene_id]['sequence'] = current_sequence

                # Parse new header
                header_parts = line[1:].split(' ', 1)  # Split on first space only
                current_gene_id = header_parts[0]

                # Store header info
                sequences[current_gene_id] = {
                    'header': line[1:],  # Full header without >
                    'sequence': ""
                }

                current_sequence = ""  # Reset sequence
                print(f"Found sequence: {current_gene_id}")

            else:
                # Add to current sequence (sequences can span multiple lines)
                current_sequence += line.upper()

    # Don't forget the last sequence!
    if current_gene_id is not None:
        sequences[current_gene_id]['sequence'] = current_sequence

    # Display results
    for gene_id, data in sequences.items():
        print(f"\nGene: {gene_id}")
        print(f"Header: {data['header']}")
        print(f"Length: {len(data['sequence'])} bases")
        print(f"First 50 bases: {data['sequence'][:50]}...")
Why This Approach Works
Handles Multiple Sequences
Real FASTA files often contain multiple sequences. This parser stores each one with its own ID and metadata.
Line-by-Line Processing
Sequences can span multiple lines. This approach reads line by line and concatenates sequence data properly.
Smart Data Structure
Uses nested dictionaries to store both header information and sequence data for easy access.
Edge Case Handling
Don't forget the last sequence! The final sequence needs special handling since there's no next header.
Key Programming Concepts
- Context Managers - with open() safely handles files
- String Methods - .strip(), .startswith(), .split()
- State Management - Tracking current sequence and ID
- Nested Dictionaries - Complex data organization
- Edge Cases - Handling the last sequence properly
- Data Validation - Checking for None values
Ready for More Advanced Practice?
Build complete FASTA parsers and work with real research datasets
Advanced FASTA Analysis
File I/O & the with Statement
Why File Handling Matters in Biology
Biological data lives in files - sequences, experiment results, annotations. Learning proper file handling is essential for any bioinformatics work.
Real Examples
FASTA sequences, CSV experiment data, JSON annotations, XML databases, TSV gene expression data, and many more!
The Problem: Files Can Get "Stuck Open"
The Old Way (Risky)
    # Opening a file the old way
    file = open('sequences.fasta', 'r')
    content = file.read()
    # Process the content...

    # What if an error happens here?
    # The file might never get closed!
    # This can cause problems...

    file.close()  # Might never execute!
What Can Go Wrong
- Resource leaks - Open files keep holding system resources
- File locks - Other programs can't access the file
- Resource exhaustion - System runs out of file handles
- Data loss - Buffered writes might never reach the disk
- Crashes - Program errors leave files open
The Solution: Context Managers & the with Statement
The Safe Way
    # Using the with statement
    with open('sequences.fasta', 'r') as file:
        content = file.read()
        # Process the content...

        # Even if an error happens here,
        # the file will ALWAYS be closed!

    # File is automatically closed here
    # No matter what happened above!
Why It's Better
- Automatic cleanup - Files always close
- Exception safe - Works even if errors occur
- Cleaner code - No need to remember .close()
- Best practice - The idiomatic way to open files in Python
- Resource efficient - Prevents leaked file handles
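The "exception safe" claim can be checked directly. The sketch below writes a small throwaway file, raises an error on purpose inside the with block, and then inspects `file.closed`. The temporary file, its contents, and the simulated error are all made up for this demonstration.

```python
import os
import tempfile

# Create a small throwaway FASTA file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">demo\nATGC\n")
    path = tmp.name

try:
    with open(path, "r") as file:
        first_line = file.readline()
        raise ValueError("simulated processing error")
except ValueError as error:
    print("Caught:", error)

# The with statement closed the file even though an error interrupted the block.
print("File closed?", file.closed)

os.remove(path)  # clean up the throwaway file
```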
Key Takeaways: File I/O Best Practices
Essential Rules
- Always use with open()
- Choose the right file mode for your task
- Handle large files line by line
- Check if files exist before reading
Bioinformatics Tips
- Use .strip() to remove whitespace
- Process files line by line for memory efficiency
- Validate file formats before processing
- Always backup important data files
Proper file handling prevents data loss and makes your code more reliable!
Lecture 2 Summary: What You've Learned Today
1. String Operations & DNA Analysis
String Slicing
- Extract sequence parts: dna[0:3]
- Find reading frames and ORFs
String Methods
- .find(), .upper(), .replace()
- Search for start/stop codons
Biological Context
- Reading frames and translation
- Open Reading Frame analysis
2. Conditionals & Decision Making
If Statements
- if condition:
- Make decisions in code
Logical Operators
- and, or, not
- Complex condition testing
Error Handling
- Validate input sequences
- Defensive programming
3. Dictionaries & Data Organization
Key-Value Pairs
- {'codon': 'amino_acid'}
- Store genetic code tables
Translation
- DNA → RNA → Protein
- Codon table lookups
File Handling
- FASTA file parsing
- with open() best practices
Real Bioinformatics Applications
ORF Finding
Identify potential protein-coding regions in DNA sequences
Sequence Translation
Convert DNA to protein sequences using codon tables
File Processing
Parse and analyze biological data formats like FASTA
Advanced Programming Skills Gained
- String manipulation for sequence analysis
- Data validation with conditionals
- Efficient data lookup using dictionaries
- File I/O operations for real data
- Error handling and defensive coding
- Biological data processing workflows
What's Coming Next
Lecture 3
Data Analysis with Pandas: Tables, Gene Dependencies & Correlation
Pandas DataFrames
Work with tabular biological data, CSV files, and gene expression datasets
Gene Dependencies
Analyze relationships between genes, correlations, and biological networks
Data Visualization
Create plots and charts to visualize biological data patterns
Excellent Progress!
You can now analyze DNA sequences, parse biological files, and make data-driven decisions in code
Next week we'll explore how to work with large datasets and discover gene relationships using Python!
Resources for DNA Pythonistas
Biopython
The essential Python library for biological computation
What Biopython Does
- Parse biological file formats (FASTA, GenBank, PDB, etc.)
- Sequence manipulation and translation
- BLAST searches and alignment tools
- Access NCBI databases (Entrez, PubMed)
- Phylogenetic tree analysis
Getting Started
Install with pip:
    pip install biopython
Quick example:
    from Bio.Seq import Seq
    dna = Seq("ATGGCCATTGTATAA")  # 15 bases = 5 complete codons
    protein = dna.translate()
    print(protein)  # MAIV*
Other Essential Python Libraries for Biology
NumPy & Pandas
Numerical computing and data analysis. Essential for working with gene expression data, experimental results, and large datasets.
numpy.org • pandas.pydata.org
scikit-bio
Bioinformatics library for sequence alignment, diversity analysis, and working with biological data structures.
scikit-bio.org
matplotlib & seaborn
Data visualization libraries for creating publication-quality plots, charts, and figures for your biological data.
matplotlib.org
Learning Resources & Documentation
Online Courses & Tutorials
- Python for Biologists: pythonforbiologists.com - Comprehensive tutorials
- Rosalind: rosalind.info - Learn bioinformatics through problem solving
- Biopython Tutorial: Official tutorial with real-world examples
Databases & APIs
- NCBI Entrez: Access GenBank, PubMed, and other NCBI databases via Python
- UniProt: Protein sequence and functional information database
- Ensembl REST API: Genomic data access through Python requests
Specialized Bioinformatics Tools
PyMOL
3D molecular visualization and analysis of protein structures
DendroPy
Phylogenetic computing library for tree analysis
pysam
Python wrapper for SAM/BAM sequencing data formats
Community & Getting Help
Bioinformatics StackExchange
Ask questions, get answers from the community
GitHub
Explore open-source bioinformatics projects
Python Documentation
Official Python docs - your best friend!
You're Now Part of the DNA Pythonista Community!
These tools will empower you to tackle real biological problems with code
Start exploring Biopython today and see how much time you can save in your research!