📊 Sequence Analysis Assignment

Identify mutations in next-generation sequencing reads

🎯Your Task

Some sequencing reads perfectly match the reference; others contain single-base mismatches.

Write a Python program to identify and analyze these mutations.

✅Essential Functionality

1️⃣Compare & Count Mutations

Your program should:

• Compare each sequencing read to the reference base-by-base
• Record the total number of:
- → Wild-type (WT) reads (without any mutations)
- → Mutated reads (with 1 or more mutations)
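One possible way to approach the comparison and counting is sketched below. The function names and the toy sequences are illustrative assumptions, not part of the assignment files, and the sketch assumes each read has the same length as the reference (zip stops at the shorter of the two).

```python
def count_mismatches(read: str, reference: str) -> int:
    """Return the number of positions where the read differs from the reference."""
    return sum(1 for read_base, ref_base in zip(read, reference) if read_base != ref_base)


def count_read_types(reads: list[str], reference: str) -> tuple[int, int]:
    """Return (wild_type_count, mutated_count) for a list of reads."""
    wild_type = sum(1 for read in reads if count_mismatches(read, reference) == 0)
    return wild_type, len(reads) - wild_type


# Toy data for illustration only (hypothetical, much shorter than the real reads).
reference = "ACGTACGT"
reads = ["ACGTACGT", "ACGTTCGT", "TCGTACGA"]

wt_count, mutated_count = count_read_types(reads, reference)
print(f"WT: {wt_count}, mutated: {mutated_count}")  # WT: 1, mutated: 2
```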

2️⃣Record Mutation Details

For mutated reads, record:

• The position of the first mismatch (1–200)
• The reference base and the mutated base (i.e. A, C, T, or G) for the first mismatch
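As a rough sketch of how the first mismatch might be located, the hypothetical helper below walks the two sequences together and reports a 1-based position, matching the 1–200 numbering above. The function name and return shape are assumptions made for illustration.

```python
from typing import Optional


def find_first_mismatch(read: str, reference: str) -> Optional[tuple[int, str, str]]:
    """Return (position, reference_base, mutated_base) for the first mismatch.

    The position is 1-based. Returns None if the read matches the reference
    at every compared position.
    """
    for position, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base != read_base:
            return position, ref_base, read_base
    return None


# Toy sequences for illustration only.
print(find_first_mismatch("ACGTTCGT", "ACGTACGT"))  # (5, 'A', 'T')
print(find_first_mismatch("ACGTACGT", "ACGTACGT"))  # None
```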

3️⃣Update the DataFrame

Store the results from step 2 as new columns in the pandas DataFrame with the following names:

• wildtype_base - Reference base
• mutated_base - Mutated base
• mutated_position - Position of 1st mutation in the read (1–200)
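The sketch below shows one way the three columns could be added with pandas. The column name "sequence" for the reads is an assumption (check the actual header in the CSV), and both function names are hypothetical. The nullable Int64 dtype is used so that positions stay whole numbers even when some rows have no value.

```python
import pandas as pd


def first_mismatch(read: str, reference: str):
    """Return (position, reference_base, mutated_base) or None if no mismatch."""
    for position, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base != read_base:
            return position, ref_base, read_base
    return None


def add_mutation_columns(reads_df: pd.DataFrame, reference: str,
                         read_column: str = "sequence") -> pd.DataFrame:
    """Return a copy of reads_df with the three mutation columns added.

    read_column is an assumed column name, not one confirmed by the assignment.
    """
    results = [first_mismatch(read, reference) for read in reads_df[read_column]]
    annotated = reads_df.copy()
    annotated["wildtype_base"] = [result[1] if result else None for result in results]
    annotated["mutated_base"] = [result[2] if result else None for result in results]
    # Nullable integer dtype: missing positions become <NA> instead of forcing floats.
    annotated["mutated_position"] = pd.array(
        [result[0] if result else None for result in results], dtype="Int64"
    )
    return annotated


# Toy data for illustration only.
toy_reads = pd.DataFrame({"sequence": ["ACGT", "ACTT"]})
print(add_mutation_columns(toy_reads, "ACGT"))
```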

4️⃣Present the Findings

Present the findings from the mutation analysis by:

📝 Text Summary (print to screen):

• Percentage of WT vs. mutated reads
• Percentages of specific base mutations (e.g. A → T, A → C, etc.), ordered from most common to least common
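A text summary along these lines could be built with collections.Counter, as in the sketch below. The function name is hypothetical, and the sketch assumes the substitution percentages are expressed relative to mutated reads rather than all reads; the brief does not say which is intended.

```python
from collections import Counter


def summarise_mutations(mutations: list[tuple[str, str]], total_reads: int) -> list[str]:
    """Build summary lines from (reference_base, mutated_base) pairs.

    mutations holds one pair per mutated read. Substitution percentages are
    relative to the number of mutated reads (an assumption).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be a positive number")
    mutated = len(mutations)
    lines = [
        f"WT reads: {(total_reads - mutated) / total_reads:.1%}",
        f"Mutated reads: {mutated / total_reads:.1%}",
    ]
    counts = Counter(f"{ref} → {alt}" for ref, alt in mutations)
    # most_common() returns the entries ordered from most to least frequent.
    for change, count in counts.most_common():
        lines.append(f"{change}: {count / mutated:.1%} of mutated reads")
    return lines


# Toy data for illustration only: 10 reads, 3 of them mutated.
for line in summarise_mutations([("A", "T"), ("C", "G"), ("A", "T")], total_reads=10):
    print(line)
```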

📊 Publication-Ready Plots:

Create well-labelled visualizations, using appropriate plot types and good use of style and colour, to represent:

• Percentage of WT vs. mutated reads
• Percentages of specific base mutations
• Mutation positions and/or hotspots
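As a starting point only, the sketch below draws the three views side by side with matplotlib. The function name, colours, and layout are illustrative assumptions, and the sketch assumes there is at least one read and that positions run from 1 to read_length. Simple bar charts and a histogram are used here; other plot types may suit the data better.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_mutation_overview(wt_count, mutated_count, substitutions, positions, read_length=200):
    """Return a figure with read classes, substitution types, and mutation positions.

    substitutions is a list of labels such as "A → T" (one per mutated read);
    positions is a list of 1-based first-mismatch positions.
    """
    fig, (ax_reads, ax_subs, ax_pos) = plt.subplots(1, 3, figsize=(15, 4))

    total = wt_count + mutated_count
    ax_reads.bar(["WT", "Mutated"],
                 [100 * wt_count / total, 100 * mutated_count / total],
                 color=["#4c72b0", "#c44e52"])
    ax_reads.set_ylabel("Reads (%)")
    ax_reads.set_title("Read classification")

    # Order substitution types from most to least common.
    sub_counts = Counter(substitutions).most_common()
    labels = [label for label, _ in sub_counts]
    percentages = [100 * count / len(substitutions) for _, count in sub_counts]
    ax_subs.bar(labels, percentages, color="#55a868")
    ax_subs.set_ylabel("Mutated reads (%)")
    ax_subs.set_title("Substitution types")
    ax_subs.tick_params(axis="x", labelrotation=45)

    # One bin per position, so tall bars point to candidate hotspots.
    ax_pos.hist(positions, bins=range(1, read_length + 2), color="#8172b2")
    ax_pos.set_xlabel("Position in read")
    ax_pos.set_ylabel("Number of mutated reads")
    ax_pos.set_title("First-mismatch positions")

    fig.tight_layout()
    return fig


# Toy data for illustration only.
fig = plot_mutation_overview(7, 3, ["A → T", "A → T", "C → G"], [5, 5, 120])
```

In a Colab notebook the figure is displayed inline when the cell finishes; elsewhere it could be written out with fig.savefig.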

5️⃣Save the Results

Save the updated DataFrame to a new CSV file called analysis.csv for further analysis.

• The CSV should include a row for every read (i.e. 1,000 rows plus the header row)
• Leave the new columns empty for any read in which no mutation was found
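A minimal sketch of the saving step follows, assuming the annotated DataFrame already holds the three new columns. The save_analysis name and the "sequence" column are hypothetical. With pandas defaults, missing values are written as empty fields, and index=False keeps the row index out of the file.

```python
import pandas as pd


def save_analysis(annotated: pd.DataFrame, path: str = "analysis.csv") -> None:
    """Write the annotated reads to CSV without the DataFrame index."""
    annotated.to_csv(path, index=False)


# Toy data for illustration only: one WT read and one mutated read.
annotated = pd.DataFrame({
    "sequence": ["ACGT", "ACTT"],
    "wildtype_base": [None, "G"],
    "mutated_base": [None, "T"],
    "mutated_position": pd.array([None, 3], dtype="Int64"),
})

# Preview the CSV text; calling to_csv without a path returns it as a string.
print(annotated.to_csv(index=False))

# save_analysis(annotated)  # would write analysis.csv to the working directory
```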

💡 Note: You can access analysis.csv via the Files area on Colab (click the folder icon on the left toolbar). To download it, right-click the file and choose Download.

💻Code Quality Requirements

Your program should also:

✅ Handle unexpected input (e.g. truncated reads, unknown characters, upper- or lower-case bases) from the CSV file without crashing, and report any errors in a user-friendly way
Note: You can assume that errors are only ever in the reads.csv file, not in reference.txt
✅ Use functions well to keep the code readable and concise, and to avoid code repetition
✅ Include good comments, including docstrings for functions
✅ Follow good practices in naming variables, functions, etc.
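To illustrate the input-handling requirement, the sketch below checks a single read and returns either a cleaned read or a readable error message, with a docstring in the style asked for above. The clean_read name is hypothetical, and converting lower-case bases to upper case (rather than rejecting them) is one possible design choice, not a stated requirement.

```python
VALID_BASES = set("ACGT")


def clean_read(raw_read, expected_length: int):
    """Validate one read from the CSV.

    Returns (cleaned_read, None) if the read is usable, or (None, message)
    describing the problem in plain language.
    """
    if not isinstance(raw_read, str):
        return None, "read is missing or not text"
    # Assumption: surrounding whitespace and letter case are not real errors.
    read = raw_read.strip().upper()
    if len(read) != expected_length:
        return None, f"expected {expected_length} bases, found {len(read)}"
    unknown = sorted(set(read) - VALID_BASES)
    if unknown:
        return None, f"unknown characters: {', '.join(unknown)}"
    return read, None


# Toy inputs for illustration only.
print(clean_read("acgt", 4))  # ('ACGT', None)
print(clean_read("ACG", 4))   # (None, 'expected 4 bases, found 3')
print(clean_read("ACNT", 4))  # (None, 'unknown characters: N')
```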

⭐Advanced Functionality

Improve your marks further by implementing:

🌟 Advanced plot types that are used in contemporary NGS literature
🌟 Complex statistical results (e.g. frequency of mutation at hotspots, types of mutations)
🌟 Advanced error handling (e.g. outputting informative error messages; logging and reporting skipped reads)
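For the advanced error handling, one option is Python's standard logging module, as sketched below. The logger name, function names, and message wording are assumptions made for illustration. Skipped reads are both logged and collected so they can be reported at the end.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("mutation_analysis")  # hypothetical logger name

VALID_BASES = set("ACGT")


def describe_problem(read, expected_length: int):
    """Return a short description of what is wrong with a read, or None if usable."""
    if not isinstance(read, str):
        return "read is missing or not text"
    if len(read) != expected_length:
        return f"length {len(read)} instead of {expected_length}"
    if set(read.upper()) - VALID_BASES:
        return "contains characters other than A, C, G, T"
    return None


def split_valid_and_skipped(reads, expected_length: int):
    """Return (valid_reads, skipped), logging a warning for each skipped read.

    skipped is a list of (read_number, problem) pairs, numbered from 1.
    """
    valid, skipped = [], []
    for read_number, read in enumerate(reads, start=1):
        problem = describe_problem(read, expected_length)
        if problem is None:
            valid.append(read.upper())
        else:
            logger.warning("Skipping read %d: %s", read_number, problem)
            skipped.append((read_number, problem))
    logger.info("Kept %d reads, skipped %d", len(valid), len(skipped))
    return valid, skipped


# Toy inputs for illustration only.
valid_reads, skipped_reads = split_valid_and_skipped(["ACGT", "ACG", "acgt", "ACNT"], 4)
```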

📤Submitting Your Assignment

Submission Details

Format: Colab notebook (.ipynb file)
Where: E-Submission on Canvas
Assignment: Report T1 Week 7
Deadline: Friday, November 14, 2025 @ 16:00

Getting Help

💡 Dedicated time in weekly workshops to work on your project
💬 Post questions on the Assessment Q&A discussion board
📧 Contact the module team directly

📁Assignment Resources

Complete the assignment by writing your code in the Colab notebook linked below.

📓Open Assignment Notebook in Colab
🎥Watch Assignment Walkthrough Video

Data Files

📄ReadsData.csv

Contains 1,000 sequencing reads (200 bp each) from an NGS experiment

📄Reference Document.txt

Reference sequence to compare reads against

Both files are small and are shared directly via GitHub, so they can be accessed from the Colab notebook.