Strategies for Optimisation of Variant Prediction in Yeast Genomes

Sritharan, Prithika (2021) Strategies for Optimisation of Variant Prediction in Yeast Genomes. Doctoral thesis, University of East Anglia.

Abstract

Linear reference genomes have guided the alignment of short sequence reads prior
to variant prediction. This approach is, however, fundamentally limited when studying
species with high levels of sequence diversity. Variation graphs can overcome such limitations
by incorporating several genomes within a bi-directed reference structure. This
dissertation explores methodologies that could be utilised to optimise variant prediction
within yeast genomes, particularly Saccharomyces cerevisiae.

Variation graphs constructed from NCYC and third-party strains were found to
improve read alignment relative to both the S288c reference graph and the
linear reference genome. The novel FAT-CIGAR toolkit was developed to obtain exact
read alignment information from linear and graph-based mappers, in the form of the
FAT-CIGAR string. Sequence identity scores calculated from the FAT-CIGAR string
showed that the vg variation graph produced a greater proportion of reads with perfect
mapping (75.3%) whilst the SevenBridges variation graph mapped a higher number
of reads with greater identity scores (96.4%). Variant calling accuracy was
compared across four graph genome software tools; the SevenBridges variation
graph produced the most accurate variant calls (F1 score = 0.995), with the greatest
recall (0.991), followed by FreeBayes (F1 score = 0.995). The vg software produced the
least accurate variant calls (F1 score = 0.972 to 0.986) because it called a greater
number of false positive variants.

The FAT-CIGAR toolkit also enabled the identification of a novel method of variant
filtration, which removes aligned reads likely to lead to false positive variant calls.
Filtering reads on the FAT-CIGAR string by 10 bases for SNP calls, and on the CIGAR
string by 30 bases for indel calls, removed the highest proportions of false
positives in real and simulated datasets. Consequently, the use of the FAT-CIGAR
toolkit as a standard methodology in future genomic analyses is recommended.
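
The F1 and recall figures quoted in the abstract are standard summary metrics for comparing a set of variant calls against a truth set. As a minimal sketch of how such scores are conventionally derived from true positive, false positive and false negative counts, the following may help; the function name and the example counts are hypothetical and are not taken from the thesis, and the thesis may have used a dedicated benchmarking tool rather than code of this kind.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from variant-call counts.

    tp: called variants that are present in the truth set
    fp: called variants that are absent from the truth set
    fn: truth-set variants that were not called
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a caller that reports many false positives loses precision and therefore F1 even when its recall is high, which is consistent with the explanation given above for the lower vg scores.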

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Biological Sciences
Depositing User: Jackie Webb
Date Deposited: 29 Apr 2022 11:48
Last Modified: 29 Apr 2022 11:48
URI: https://ueaeprints.uea.ac.uk/id/eprint/84857