Identify mutations in next-generation sequencing reads
Some sequencing reads perfectly match the reference; others contain single-base mismatches.
Write a Python program to identify and analyze these mutations.
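As a starting point, the core comparison can be sketched in plain Python. This is a minimal sketch, not a full solution: it assumes each read lines up with the reference base for base from position 1, and the function name `first_mismatch` and the toy sequences are made up for illustration.

```python
from typing import Optional, Tuple


def first_mismatch(read: str, reference: str) -> Optional[Tuple[str, str, int]]:
    """Return (reference base, read base, 1-based position) of the first
    mismatch, or None when the read matches the reference everywhere."""
    # Assumption: read and reference are aligned from their first base.
    for position, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base != read_base:
            return ref_base, read_base, position
    return None


reference = "ACGTACGT"  # toy sequence, not the real reference
print(first_mismatch("ACGTACGT", reference))  # None
print(first_mismatch("ACGAACGT", reference))  # ('T', 'A', 4)
```

Positions are counted from 1 rather than 0 so that they fall in the 1–200 range used for the reads.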
Your program should:
For mutated reads, record:
Store the results from step 2 as new columns in the pandas DataFrame with the following names:
wildtype_base - Reference base
mutated_base - Mutated base
mutated_position - Position of 1st mutation in the read (1–200)
Present the findings from the mutation analysis by:
Create visualizations using appropriate plot types, well-labelled, with good use of style/colour to represent:
Save the updated DataFrame to a new CSV file called analysis.csv for further analysis.
💡 Note: You can access analysis.csv via the Files area on Colab (click the folder icon on the left toolbar). To download it, right-click the file and choose Download.
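One possible way to add the three columns and write analysis.csv is sketched below. Several things here are assumptions: the column name `sequence` is hypothetical (check the actual header in reads.csv), the toy reads and reference stand in for the real files, and leaving the three columns empty for reads that match the reference is one choice among several.

```python
import pandas as pd

reference = "ACGTACGT"  # toy stand-in for the contents of reference.txt

# Assumption: the reads sit in a column called "sequence" (hypothetical name).
df = pd.DataFrame({"sequence": ["ACGTACGT", "ACGAACGT", "CCGTACGT"]})


def first_mismatch(read, reference):
    """Return (reference base, read base, 1-based position) of the first
    mismatch, or (None, None, None) for a read that matches perfectly."""
    for position, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base != read_base:
            return ref_base, read_base, position
    return None, None, None


records = [first_mismatch(read, reference) for read in df["sequence"]]

df["wildtype_base"] = [record[0] for record in records]
df["mutated_base"] = [record[1] for record in records]
# The nullable "Int64" dtype keeps positions as integers while allowing
# missing values for reads without a mutation.
df["mutated_position"] = pd.array([record[2] for record in records], dtype="Int64")

# index=False keeps the row index out of the saved file.
df.to_csv("analysis.csv", index=False)
print(df)
```

With the real data, the toy DataFrame would be replaced by `pd.read_csv` on reads.csv and the toy reference by the text read from reference.txt.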
Your program should also:
Note: You can assume that errors only ever occur in the reads.csv file, never in reference.txt.
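Since errors are confined to reads.csv, a per-read check before the comparison may be useful. The sketch below is only an illustration: the text does not say which kinds of errors occur, so the two checks shown (wrong length, characters other than A, C, G, T) are assumptions, and `read_problems` is a hypothetical helper name.

```python
VALID_BASES = set("ACGT")


def read_problems(read, expected_length=200):
    """Return a list of human-readable problems found in one read.

    An empty list means the read passed both checks.
    """
    # Missing values loaded from a CSV are often not strings at all.
    if not isinstance(read, str):
        return ["not a string"]

    problems = []
    if len(read) != expected_length:
        problems.append(f"length {len(read)} instead of {expected_length}")

    # Assumption: lowercase bases are acceptable, so compare in upper case.
    unexpected = sorted(set(read.upper()) - VALID_BASES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    return problems


print(read_problems("ACGT", expected_length=4))  # []
print(read_problems("ACGN", expected_length=4))  # ['unexpected characters: N']
```

Returning a list of messages instead of raising on the first problem makes it possible to report every faulty read in one pass and then decide whether to skip or repair it.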
Improve your marks further by implementing:
Complete the assignment by writing your code in the Colab notebook linked below
reads.csv - Contains 1,000 sequencing reads (200 bp each) from an NGS experiment
reference.txt - Reference sequence to compare reads against
Both files are small and shared directly via GitHub - accessible from the Colab notebook