📊 Sequence Analysis Assignment

Identify mutations in next-generation sequencing reads

🎯 Your Task

Some sequencing reads perfectly match the reference; others contain single-base mismatches.

Write a Python program to identify and analyze these mutations.

Essential Functionality

1️⃣ Compare & Count Mutations

Your program should:

  • Compare each sequencing read to the reference base-by-base
  • Record the total number of:
    → Wild-type (WT) reads (without any mutations)
    → Mutated reads (with 1 or more mutations)
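The comparison and counting step can be sketched as follows. This is only a minimal illustration, not a model answer: the function names and the toy 8-base sequences are hypothetical, and the sketch assumes every read has the same length as the reference (`zip` silently stops at the shorter sequence otherwise).

```python
def count_mismatches(read, reference):
    """Return the number of positions at which read differs from reference."""
    return sum(1 for read_base, ref_base in zip(read, reference)
               if read_base != ref_base)


def classify_reads(reads, reference):
    """Return (wild_type_count, mutated_count) for an iterable of reads."""
    wild_type = 0
    mutated = 0
    for read in reads:
        if count_mismatches(read, reference) == 0:
            wild_type += 1
        else:
            mutated += 1
    return wild_type, mutated


# Toy data (hypothetical): one perfect match and two mutated reads.
reference = "ACGTACGT"
reads = ["ACGTACGT", "ACGAACGT", "TCGTACGA"]
wt_count, mutated_count = classify_reads(reads, reference)
print(f"WT reads: {wt_count}, mutated reads: {mutated_count}")
```

Keeping the per-read comparison in its own function means the same logic can be reused when recording mutation details.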

2️⃣ Record Mutation Details

For mutated reads, record:

  • The position of the first mismatch (1–200)
  • The reference base and the mutated base (i.e. A, C, T, or G) for the first mismatch
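One possible way to capture these details is a helper that stops at the first mismatch and reports a 1-based position. The function name and return shape below are assumptions made for illustration, and the sketch again assumes the read and reference have equal length.

```python
def find_first_mismatch(read, reference):
    """Return (position, reference_base, mutated_base) for the first mismatch.

    The position is 1-based. Returns None when the read matches the reference.
    """
    pairs = zip(reference, read)
    for position, (ref_base, read_base) in enumerate(pairs, start=1):
        if ref_base != read_base:
            return position, ref_base, read_base
    return None


# Toy example (hypothetical sequences): T is replaced by A at position 4.
print(find_first_mismatch("ACGAACGT", "ACGTACGT"))
print(find_first_mismatch("ACGTACGT", "ACGTACGT"))
```

Returning `None` for wild-type reads makes it easy to leave the corresponding DataFrame entries empty later on.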

3️⃣ Update the DataFrame

Store the results from step 2 as new columns in the pandas DataFrame with the following names:

  • wildtype_base - Reference base
  • mutated_base - Mutated base
  • mutated_position - Position of 1st mutation in the read (1–200)
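A hedged sketch of adding the three columns is shown below. The read column name `sequence` is an assumption (check the actual header in the CSV), as are the helper names. The nullable `Int64` dtype is used so that positions stay integers even when some rows are empty.

```python
import pandas as pd


def find_first_mismatch(read, reference):
    """Return (1-based position, reference base, read base) or None."""
    pairs = zip(reference, read)
    for position, (ref_base, read_base) in enumerate(pairs, start=1):
        if ref_base != read_base:
            return position, ref_base, read_base
    return None


def add_mutation_columns(reads_df, reference, read_column="sequence"):
    """Return a copy of reads_df with the three mutation columns added.

    read_column is a hypothetical column name; adjust it to the real data.
    """
    hits = [find_first_mismatch(read, reference)
            for read in reads_df[read_column]]
    annotated = reads_df.copy()
    annotated["wildtype_base"] = [hit[1] if hit else None for hit in hits]
    annotated["mutated_base"] = [hit[2] if hit else None for hit in hits]
    annotated["mutated_position"] = pd.array(
        [hit[0] if hit else None for hit in hits], dtype="Int64"
    )
    return annotated


# Toy data: the second read has G replaced by C at position 3.
toy_reads = pd.DataFrame({"sequence": ["ACGT", "ACCT"]})
print(add_mutation_columns(toy_reads, "ACGT"))
```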

4️⃣ Present the Findings

Present the findings from the mutation analysis in two forms:

📝 Text Summary (print to screen):

  • Percentage of WT vs. mutated reads
  • Percentages of specific base mutations (e.g. A → T, A → C, etc.), ordered from most common to least common
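As an illustration of the ordered text summary, the sketch below tallies substitution types with `collections.Counter` and prints them from most to least common. The function name, the input format (a list of reference/mutated base pairs), and the ASCII `->` label are assumptions chosen for the example.

```python
from collections import Counter


def summarise_substitutions(substitutions):
    """Return [(label, percentage)] ordered from most to least common.

    substitutions is an iterable of (reference_base, mutated_base) pairs,
    one pair per mutated read.
    """
    counts = Counter(f"{ref} -> {alt}" for ref, alt in substitutions)
    total = sum(counts.values())
    return [(label, 100 * count / total)
            for label, count in counts.most_common()]


# Toy data (hypothetical): three A -> T changes and one G -> C change.
toy_substitutions = [("A", "T"), ("A", "T"), ("G", "C"), ("A", "T")]
for label, percentage in summarise_substitutions(toy_substitutions):
    print(f"{label}: {percentage:.1f}%")
```

The same percentage calculation can be applied to the WT and mutated read counts.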

📊 Publication-Ready Plots:

Create well-labelled visualizations that use appropriate plot types and make good use of style and colour to represent:

  • Percentage of WT vs. mutated reads
  • Percentages of specific base mutations
  • Mutation positions and/or hotspots
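For the mutation-position plot, one simple option is a histogram of first-mismatch positions drawn with matplotlib. The sketch below is only one of many reasonable plot types; the function name, bin width, colours, and output filename are all illustrative assumptions.

```python
import matplotlib.pyplot as plt


def plot_mutation_positions(positions, read_length=200, bin_width=10):
    """Draw a histogram of 1-based mutation positions and return (fig, ax)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    # Bin edges 1, 11, 21, ... so that each bin covers bin_width positions.
    bin_edges = list(range(1, read_length + bin_width + 1, bin_width))
    ax.hist(positions, bins=bin_edges, color="steelblue", edgecolor="white")
    ax.set_xlabel("Position in read (bp)")
    ax.set_ylabel("Number of mutated reads")
    ax.set_title("Distribution of first-mismatch positions")
    ax.set_xlim(1, read_length + 1)
    fig.tight_layout()
    return fig, ax


# Toy positions (hypothetical); in practice use the mutated_position column
# with empty entries removed.
fig, ax = plot_mutation_positions([5, 15, 15, 200])
fig.savefig("mutation_positions.png", dpi=300)
```

Returning the figure and axes lets the caller adjust styling or save the plot at publication resolution.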

5️⃣ Save the Results

Save the updated DataFrame to a new CSV file called analysis.csv for further analysis.

  • The CSV should include a row for every read (i.e. 1,000 rows plus the header row)
  • The new columns should be left empty for reads in which no mutation was found

💡 Note: You can access analysis.csv via the Files area on Colab (click the folder icon on the left toolbar). To download it, right-click the file and choose Download.
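A minimal sketch of the save step, using a toy DataFrame in place of the real results (the `sequence` column name is an assumption). By default `DataFrame.to_csv` writes missing values as empty fields, and `index=False` stops pandas adding an extra index column.

```python
import pandas as pd

# Toy results (hypothetical): one wild-type read and one mutated read.
results = pd.DataFrame({
    "sequence": ["ACGT", "ACCT"],
    "wildtype_base": [None, "G"],
    "mutated_base": [None, "C"],
    "mutated_position": pd.array([None, 3], dtype="Int64"),
})

# Missing values become empty fields; the row index is not written.
results.to_csv("analysis.csv", index=False)

with open("analysis.csv") as handle:
    print(handle.read())
```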

💻 Code Quality Requirements

Your program should also:

  • Handle unexpected input (e.g. truncated reads, unknown characters, upper- or lower-case bases) from the CSV file without crashing, and report errors in a user-friendly way

    Note: You can assume that errors are only ever in the reads.csv file, not in reference.txt

  • Use functions well to keep the code readable and concise and to avoid repetition
  • Include good comments, including docstrings for functions
  • Follow good practices in naming variables, functions, etc.
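The unexpected-input requirement above could be approached with a small validation helper like the one below. It is a sketch under stated assumptions: the function name and return convention are hypothetical, lower-case bases are normalised to upper case, and any read that is missing, the wrong length, or contains characters other than A, C, G, and T is rejected with a readable reason.

```python
VALID_BASES = set("ACGT")


def validate_read(read, expected_length):
    """Return (cleaned_read, None) if the read is usable, else (None, reason)."""
    if not isinstance(read, str):
        return None, "read is missing or not text"
    cleaned = read.strip().upper()
    if len(cleaned) != expected_length:
        return None, f"expected {expected_length} bases, found {len(cleaned)}"
    unknown = sorted(set(cleaned) - VALID_BASES)
    if unknown:
        return None, f"unknown characters: {', '.join(unknown)}"
    return cleaned, None


# Toy examples (hypothetical reads checked against a 4-base reference).
for raw_read in ["acgt", "ACG", "ACNT", float("nan")]:
    cleaned_read, problem = validate_read(raw_read, expected_length=4)
    if problem:
        print(f"Skipping read {raw_read!r}: {problem}")
    else:
        print(f"Read accepted: {cleaned_read}")
```

Returning a reason instead of raising an exception lets the program keep going and report all skipped reads at the end.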

Advanced Functionality

Improve your marks further by implementing:

  • 🌟 Advanced plot types that are used in contemporary NGS literature
  • 🌟 Complex statistical results (e.g. frequency of mutation at hotspots, types of mutations)
  • 🌟 Advanced error handling (e.g. outputting informative error messages; logging and reporting skipped reads)
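As one example of a hotspot statistic, the sketch below reports the most frequently mutated positions and the share of all reads affected at each. Treating "hotspot" as simply "the positions with the highest mutation counts" is an assumption made for illustration, as are the function name and the toy numbers.

```python
from collections import Counter


def top_hotspots(positions, total_reads, n=3):
    """Return [(position, count, percent_of_reads)] for the n commonest positions."""
    counts = Counter(positions)
    return [(position, count, 100 * count / total_reads)
            for position, count in counts.most_common(n)]


# Toy data (hypothetical): 6 mutated reads out of 10 reads in total.
toy_positions = [50, 50, 50, 120, 120, 7]
for position, count, percent in top_hotspots(toy_positions, total_reads=10):
    print(f"Position {position}: {count} reads ({percent:.1f}% of all reads)")
```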

📤 Submitting Your Assignment

Submission Details

  • Format: Colab notebook (.ipynb file)
  • Where: E-Submission on Canvas
  • Assignment: Report T1 Week 7
  • Deadline: Friday, November 14, 2025 @ 16:00

Getting Help

  • 💡 Dedicated time in weekly workshops to work on your project
  • 💬 Post questions on the Assessment Q&A discussion board
  • 📧 Contact the module team directly

📁 Assignment Resources

Complete the assignment by writing your code in the Colab notebook linked below

Data Files

📄 ReadsData.csv

Contains 1,000 sequencing reads (200 bp each) from an NGS experiment

📄 Reference Document.txt

Reference sequence to compare reads against

Both files are small and are shared directly via GitHub, so they can be accessed from the Colab notebook