SEPATH: Benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines

Gihawi, Abraham ORCID: https://orcid.org/0000-0002-3676-5561, Rallapalli, Ghanasyam, Hurst, Rachel, Cooper, Colin ORCID: https://orcid.org/0000-0003-2013-8042, Leggett, Richard M. and Brewer, Daniel ORCID: https://orcid.org/0000-0003-4753-9794 (2019) SEPATH: Benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines. Genome Biology, 20. ISSN 1474-760X

Files (all available under License Creative Commons Attribution):
PDF (Manuscript) - Accepted Version (1MB)
Archive (ZIP) (Additional_files) (70kB)
PDF (Supplementary_Figures) (1MB)
PDF (Gihawi_etal_2019_GenomeBiology) - Published Version (1MB)

Abstract

Background: Human tissue is increasingly being whole genome sequenced as we transition into an era of genomic medicine. With this arises the potential to detect sequences originating from microorganisms, including pathogens, amid the plethora of human sequencing reads. In cancer research, the tumorigenic ability of pathogens is being recognized, for example Helicobacter pylori and human papillomavirus in the cases of gastric non-cardia and cervical carcinomas, respectively. As yet, no benchmark has been carried out on the performance of computational approaches for bacterial and viral detection within host-dominated sequence data.

Results: We present the results of benchmarking over 70 distinct combinations of tools and parameters on 100 simulated cancer datasets spiked with realistic proportions of bacteria. mOTUs2 and Kraken are the highest performing individual tools, achieving median genus-level F1-scores of 0.90 and 0.91, respectively. mOTUs2 demonstrates high performance in estimating bacterial proportions. Employing Kraken on unassembled sequencing reads produces good but variable performance, depending on post-classification filtering parameters. These approaches are investigated on a selection of cervical and gastric cancer whole genome sequences, where Alphapapillomavirus and Helicobacter are detected in addition to a variety of other interesting genera.

Conclusions: We provide the top performing pipelines from this benchmark in a unifying tool called SEPATH, which is amenable to high-throughput sequencing studies across a range of high-performance computing clusters. SEPATH provides a benchmarked and convenient approach to detect pathogens in tissue sequence data, helping to determine the relationship between metagenomics and disease.
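The abstract reports genus-level F1-scores and notes that Kraken's performance varies with post-classification filtering. As a rough illustration of what those two ideas mean, the following Python sketch computes an F1-score over sets of genera and applies a simple minimum-read-count filter. It is not taken from SEPATH or the paper: the function names `genus_f1` and `filter_by_min_reads`, the read-count threshold, and the example genera and counts are all hypothetical, and the paper's exact scoring and filtering definitions may differ.

```python
def genus_f1(true_genera, predicted_genera):
    """F1-score (harmonic mean of precision and recall) over sets of genus names.

    Assumption: a genus counts as a true positive if it appears in both sets.
    """
    truth, predicted = set(true_genera), set(predicted_genera)
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def filter_by_min_reads(genus_read_counts, min_reads):
    """Keep only genera supported by at least `min_reads` classified reads.

    Hypothetical stand-in for a post-classification filter.
    """
    return {genus for genus, reads in genus_read_counts.items() if reads >= min_reads}


if __name__ == "__main__":
    # Made-up example: three genera truly present, one spurious low-count call.
    truth = {"Helicobacter", "Escherichia", "Streptococcus"}
    counts = {"Helicobacter": 950, "Escherichia": 400, "Streptococcus": 12, "Bacillus": 3}

    for threshold in (1, 10, 100):
        kept = filter_by_min_reads(counts, threshold)
        print(threshold, sorted(kept), round(genus_f1(truth, kept), 3))
```

In this toy example a low threshold lets a spurious genus through (lower precision), while a high threshold discards a genuine low-abundance genus (lower recall), which is one way a filtering parameter can make the score vary.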

Item Type: Article
Uncontrolled Keywords: sdg 3 - good health and well-being
Faculty \ School: Faculty of Medicine and Health Sciences > Norwich Medical School
UEA Research Groups: Faculty of Medicine and Health Sciences > Research Groups > Cancer Studies
Faculty of Medicine and Health Sciences > Research Centres > Metabolic Health
Depositing User: LivePure Connector
Date Deposited: 16 Sep 2019 07:30
Last Modified: 09 May 2024 10:31
URI: https://ueaeprints.uea.ac.uk/id/eprint/72206
DOI: 10.1186/s13059-019-1819-8
