Seqenv: Linking sequences to environments through text mining

Sinclair, Lucas, Ijaz, Umer Z., Jensen, Lars Juhl, Coolen, Marco J. L., Gubry-Rangin, Cecile, Chroňáková, Alica, Oulas, Anastasis, Pavloudi, Christina, Schnetzer, Julia, Weimann, Aaron, Ijaz, Ali, Eiler, Alexander, Quince, Christopher and Pafilis, Evangelos (2016) Seqenv: Linking sequences to environments through text mining. PeerJ, 2016 (12). ISSN 2167-8359

Full text not available from this repository.


Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes, leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, out of every hit, extracts, if it is available, the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv: to a survey of ammonia-oxidizing archaea and to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to:
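The last two steps described in the abstract (tagging isolation-source text with environment terms, then summarizing a sample by weighted summation) can be illustrated with a minimal sketch. This is not seqenv's code or API: the toy vocabulary, the function names, the word-boundary matching, and the choice of sequence abundance as the weight are all assumptions made for illustration. The similarity search against "nt" is not shown; the hits are taken as given.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabulary: lowercase phrase -> environment label.
# The real EnvO vocabulary is far larger; these entries are illustrative only.
TOY_VOCABULARY = {
    "soil": "soil",
    "sea water": "sea water",
    "seawater": "sea water",
    "lake sediment": "lake sediment",
    "sediment": "sediment",
}


def tag_isolation_source(text, vocabulary=TOY_VOCABULARY):
    """Return the set of environment labels whose phrase occurs in the text."""
    lowered = text.lower()
    return {
        label
        for phrase, label in vocabulary.items()
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    }


def sequence_profile(isolation_sources):
    """Normalized environment frequencies over all hits of one sequence."""
    counts = defaultdict(float)
    for text in isolation_sources:
        for label in tag_isolation_source(text):
            counts[label] += 1.0
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def sample_profile(abundances, profiles):
    """Abundance-weighted sum of per-sequence profiles (weights are assumed)."""
    summary = defaultdict(float)
    for seq_id, abundance in abundances.items():
        for label, freq in profiles.get(seq_id, {}).items():
            summary[label] += abundance * freq
    return dict(summary)


# Made-up isolation sources for two hypothetical sequences.
hits = {
    "otu1": ["forest soil", "agricultural soil"],
    "otu2": ["Black Sea seawater", "lake sediment"],
}
profiles = {seq: sequence_profile(sources) for seq, sources in hits.items()}
sample = sample_profile({"otu1": 3, "otu2": 1}, profiles)
```

In this sketch a nested phrase such as "lake sediment" is credited to both "lake sediment" and "sediment". An ontology-aware tagger could resolve such overlaps differently.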

Item Type: Article
Additional Information: Seqenv was originally conceived in a series of hackathons supported by the European Union's Earth System Science and Environmental Management COST Action. This project was titled "Microbial ecology & the earth system: collaborating for insight and success with the new generation of sequencing tools" and can be viewed at We would like to thank the LifeWatchGreece project ( for their generous support in the organization of these meetings. Lucas Sinclair and Alexander Eiler were funded by the Swedish Foundation for Strategic Research (ICA10-0015). Umer Zeeshan Ijaz was funded by NERC IRF (NE/L011956/1). Lars Juhl Jensen was funded by the Novo Nordisk Foundation (NNF14CC0001). Evangelos Pafilis was supported by the European Commission FP7-REGPOT project MARBIGEN (grant agreement #264089) and the LifeWatchGreece Research Infrastructure (384676-94/GSRT/NSRF C&E). Christopher Quince is funded through the MRC Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project (MR/L015080/1) through fellowship (MR/M50161X/1). Cecile Gubry was funded by the Environment Research Council Fellowship (NE/J019151/1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Uncontrolled Keywords: bioinformatics, ecology, genomics, microbiology, open source software, pipeline, sequence analysis, statistics, text processing, neuroscience (all), biochemistry, genetics and molecular biology (all), agricultural and biological sciences (all)
Faculty \ School: Faculty of Science > School of Biological Sciences
Depositing User: LivePure Connector
Date Deposited: 09 Sep 2022 08:30
Last Modified: 21 Oct 2022 01:39
DOI: 10.7717/peerj.2690
