Winder, Johanna C., Poulton, Simon, Wu, Taoyang, Mock, Thomas and van Oosterhout, Cock (2025) Environmental adaptations in metagenomes revealed by deep learning. BMC Biology, 23. ISSN 1741-7007
PDF (Winder_etal_2025_BMCBiology)
- Published Version
Available under License Creative Commons Attribution.
Abstract
Background: Deep learning has emerged as a powerful tool for analysing biological data, including large metagenome datasets. However, its application remains limited by high computational costs, model complexity, and the difficulty of extracting biological insights from artificial neural networks (ANNs). In this study, we applied a transfer learning approach, using the ESM-2 protein structure prediction model and our own smaller ANN, to classify proteins containing the domain of unknown function 3494 (DUF3494) by their source environments. DUF3494 is found in a diverse group of putative ice-binding and substrate-binding proteins across a range of environments in prokaryotic and eukaryotic microorganisms. These proteins present a compelling test case for exploring the balance between prediction accuracy and interpretability in sequence classification.
Results: Our ANN analysed 50,669 DUF3494 sequences from publicly available metagenomes and successfully classified a large proportion of sequences by source environment (polar marine, glacier ice, frozen sediment, rock, subsurface). We identified environment-specific features that appear to drive classification. Our best-performing ANN classified between 75.9 and 97.8% of sequences correctly. To enhance the biological interpretability of these predictions, we compared this model with a genetic algorithm (GA), which had lower predictive ability but provided transparent classification rules and predictors. Further in silico mutagenesis of key residues uncovered a vertically aligned column of amino acids on the b-face of the protein that was important for environmental differentiation, suggesting that both methods captured distinct evolutionary and ecological aspects of the sequences. Feature importance analysis identified that steric and electronic properties of the protein were associated with predictive ability.
Conclusions: Our findings highlight the utility of deep learning for classification of diverse biological sequences and provide a framework for combining methods to improve model interpretability and ecological insights.
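The in silico mutagenesis mentioned in the Results can be illustrated with a minimal sketch: substitute every residue with each alternative amino acid, re-score the variant, and record how far the prediction moves. This is not the authors' pipeline (their code is in the linked GitHub repository). The names `mutagenesis_scan`, `positional_sensitivity` and `toy_score`, and the example sequence, are hypothetical, and `toy_score` merely stands in for a trained classifier's score for one environment class.

```python
from typing import Callable, Dict, List, Tuple

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutagenesis_scan(
    sequence: str, score: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Score every single-residue substitution of `sequence`.

    Returns (position, wild_type, mutant, delta) tuples, where delta is
    the variant's score minus the unmutated sequence's score.
    """
    baseline = score(sequence)
    effects = []
    for pos, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue  # skip the no-op substitution
            variant = sequence[:pos] + mutant + sequence[pos + 1:]
            effects.append((pos, wild_type, mutant, score(variant) - baseline))
    return effects


def positional_sensitivity(
    effects: List[Tuple[int, str, str, float]]
) -> Dict[int, float]:
    """Mean absolute score change per position (0-based)."""
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for pos, _, _, delta in effects:
        totals[pos] = totals.get(pos, 0.0) + abs(delta)
        counts[pos] = counts.get(pos, 0) + 1
    return {pos: totals[pos] / counts[pos] for pos in totals}


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model, NOT the paper's ANN.

    The score depends only on whether position 2 is threonine.
    """
    return 0.9 if len(seq) > 2 and seq[2] == "T" else 0.1


# Assumed toy sequence; a real scan would use a DUF3494 protein sequence.
effects = mutagenesis_scan("GSTAV", toy_score)
sensitivity = positional_sensitivity(effects)
most_sensitive = max(sensitivity, key=sensitivity.get)
```

With a real classifier substituted for `toy_score`, positions with high sensitivity would be candidates for residues that drive environmental differentiation. Each scan costs 19 model evaluations per residue, so long sequences or large models may call for batching.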
Item Type: | Article |
---|---|
Additional Information: | Data availability: This manuscript used publicly available metagenomics datasets, whose NCBI accessions are available in Additional file 1: Table 2. Sequences for the MOSAiC samples used are available at https://doi.org/10.6084/m9.figshare.25765707.v1 [85]. Code for the bioinformatics pipeline, data preprocessing and artificial neural network (ANN) can be found at https://github.com/jcwinder/Deep-learning-insights-into-ice-binding-protein-ecology (https://doi.org/10.5281/zenodo.16266611) [86]. Processed datasets used for building the model are also in the GitHub repository. Funding: J.C.W. was supported by the UKRI Biotechnology and Biological Sciences Research Council Norwich Research Park Biosciences Doctoral Training Partnership [grant number BB/T008717/1]. |
Faculty \ School: | Faculty of Science > School of Computing Sciences; Faculty of Science > School of Environmental Sciences; University of East Anglia Research Groups/Centres > Theme - ClimateUEA; Faculty of Science; Faculty of Science > School of Biological Sciences |
UEA Research Groups: | Faculty of Science > Research Centres > Centre for Ecology, Evolution and Conservation; Faculty of Science > Research Groups > Computational Biology; Faculty of Science > Research Groups > Data Science and AI; Faculty of Science > Research Groups > Wolfson Centre for Advanced Environmental Microbiology; Faculty of Science > Research Groups > Environmental Biology; Faculty of Science > Research Groups > Centre for Ocean and Atmospheric Sciences |
Depositing User: | LivePure Connector |
Date Deposited: | 14 Aug 2025 08:32 |
Last Modified: | 15 Aug 2025 01:24 |
URI: | https://ueaeprints.uea.ac.uk/id/eprint/100139 |
DOI: | 10.1186/s12915-025-02361-1 |