Coelho, Luis Pedro, Alves, Renato, del Río, Álvaro Rodríguez, Myers, Pernille Neve, Cantalapiedra, Carlos P., Giner-Lamia, Joaquín, Schmidt, Thomas Sebastian, Mende, Daniel R., Orakov, Askarbek, Letunic, Ivica, Hildebrand, Falk, Van Rossum, Thea, Forslund, Sofia K., Khedkar, Supriya, Maistrenko, Oleksandr M., Pan, Shaojun, Jia, Longhao, Ferretti, Pamela, Sunagawa, Shinichi, Zhao, Xing Ming, Nielsen, Henrik Bjørn, Huerta-Cepas, Jaime and Bork, Peer (2022) Towards the biogeography of prokaryotic genes. Nature, 601 (7892). pp. 252-256. ISSN 0028-0836
Full text not available from this repository.

Abstract
Microbial genes encode the majority of the functional repertoire of life on earth. However, despite increasing efforts in metagenomic sequencing of various habitats [1–3], little is known about the distribution of genes across the global biosphere, with implications for human and planetary health. Here we constructed a non-redundant gene catalogue of 303 million species-level genes (clustered at 95% nucleotide identity) from 13,174 publicly available metagenomes across 14 major habitats and use it to show that most genes are specific to a single habitat. The small fraction of genes found in multiple habitats is enriched in antibiotic-resistance genes and markers for mobile genetic elements. By further clustering these species-level genes into 32 million protein families, we observed that a small fraction of these families contain the majority of the genes (0.6% of families account for 50% of the genes). The majority of species-level genes and protein families are rare. Furthermore, species-level genes, and in particular the rare ones, show low rates of positive (adaptive) selection, supporting a model in which most genetic variability observed within each protein family is neutral or nearly neutral.
| Item Type: | Article |
|---|---|
| Additional Information: | Data availability: All data analysed during the current study are publicly available. Supplementary Table 1 contains the accession numbers for all the metagenomes used. GMGCv1 is available for download at https://gmgc.embl.de. The full catalogue is available for download as are sub-catalogues specialized to individual habitats and the subset derived only from sequenced genomes (which can be further subset to obtain the pangenome of a species of interest). Both the full catalogue and a version containing only complete ORFs are available as they represent different tradeoffs: the complete catalogue achieves higher coverage, while the version with only complete ORFs may be more appropriate for analyses that require the whole gene. Similarly, protein families are available at different amino acid identity thresholds (see ‘Protein family cluster calculation’). In addition to being available for download, the catalogue can be queried with an amino acid sequence. We developed and use a novel k-mer based algorithm (see ‘k-mer based homology search’) to enable fast queries over the complete 303 million protein database and allow interactive use. |
| Uncontrolled Keywords: | general, SDG 3 - good health and well-being |
| Faculty \ School: | Faculty of Science > School of Biological Sciences |
| Depositing User: | LivePure Connector |
| Date Deposited: | 26 Jun 2026 08:24 |
| Last Modified: | 28 Jun 2026 23:02 |
| URI: | https://ueaeprints.uea.ac.uk/id/eprint/103499 |
| DOI: | 10.1038/s41586-021-04233-4 |