Strategies for Optimisation of Variant Prediction in Yeast Genomes

Sritharan, Prithika (2021) Strategies for Optimisation of Variant Prediction in Yeast Genomes. Doctoral thesis, University of East Anglia.

Abstract

Linear reference genomes have guided the alignment of short sequence reads prior to variant prediction. This approach is, however, fundamentally limited when studying species with high levels of sequence diversity. Variation graphs can overcome such limitations by incorporating several genomes within a bi-directed reference structure. This dissertation explores methodologies that could be utilised to optimise variant prediction within yeast genomes, particularly Saccharomyces cerevisiae.
Variation graphs constructed from NCYC and third-party strains were found to increase the ability of reads to align in comparison to the S288c reference graph and linear reference genome. The novel FAT-CIGAR toolkit was developed to obtain exact read alignment information from linear and graph-based mappers, in the form of the FAT-CIGAR string. Sequence identity scores calculated from the FAT-CIGAR string showed that the vg variation graph produced a greater proportion of reads with perfect mapping (75.3%), whilst the SevenBridges variation graph mapped a higher number of reads with greater identity scores (96.4%). The accuracy of variant calling was compared across four graph genome software tools, determining that the SevenBridges variation graph produced the most accurate variant calls (F1 score = 0.995), with the greatest recall (0.991), followed by FreeBayes (F1 score = 0.995). The vg software produced the least accurate variant calls (F1 score = 0.972 to 0.986) because it called a greater number of false positive variants.
The FAT-CIGAR toolkit also enabled the identification of a novel method of variant filtration, removing aligned reads likely to lead to false positive variant calls. SNP calls from reads filtered on the FAT-CIGAR string by 10 bases and indel calls from reads filtered on the CIGAR string by 30 bases removed the highest proportions of false positives in real and simulated datasets. Consequently, the use of the FAT-CIGAR toolkit as a standard methodology in future genomic analyses is recommended.
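
The two metrics quoted in the abstract, sequence identity and F1 score, can be illustrated with a minimal Python sketch. It is not the FAT-CIGAR toolkit and does not reproduce the FAT-CIGAR string format, which the abstract does not specify. It assumes instead an extended CIGAR string as defined in the SAM specification, where '=' marks matching bases and 'X' marks mismatches. The identity definition used here (matches divided by aligned columns, with insertions and deletions counted as non-matching) is an assumption and may differ from the one used in the thesis. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_from_cigar(cigar: str) -> float:
    """Return matches / aligned columns for an extended CIGAR string.

    Assumption: '=' is a match, 'X' a mismatch, 'I' and 'D' are counted as
    non-matching columns; clipping (S, H), padding (P) and skips (N) are ignored.
    """
    matches = 0
    columns = 0
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
            columns += n
        elif op in ("X", "I", "D"):
            columns += n
        elif op == "M":
            # 'M' hides whether a base matched, so identity cannot be derived.
            raise ValueError("'M' operations do not distinguish match from mismatch")
    return matches / columns if columns else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, `identity_from_cigar("8=1X1=")` scores a read with nine matching bases out of ten aligned columns, and `precision_recall_f1(90, 10, 10)` scores a call set with 90 true positives, 10 false positives and 10 false negatives. These inputs are illustrative values only and are not taken from the thesis.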

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Biological Sciences
Depositing User: Nicola Veasy
Date Deposited: 08 Jun 2022 10:19
Last Modified: 08 Jun 2022 10:19
URI: https://ueaeprints.uea.ac.uk/id/eprint/85464
