Accurate plant pathogen effector protein classification ab initio with deepredeff: An ensemble of convolutional neural networks

Kristianingsih, Ruth and MacLean, Dan (2021) Accurate plant pathogen effector protein classification ab initio with deepredeff: An ensemble of convolutional neural networks. BMC Bioinformatics, 22. ISSN 1471-2105

PDF (s12859-021-04293-3) - Published Version
Available under License Creative Commons Attribution.

Abstract

Background: Plant pathogens cause billions of dollars of crop loss every year and are a major threat to global food security. Effector proteins are the tools such pathogens use to infect the cell. Predicting effectors de novo from sequence is difficult because of the heterogeneity of the sequences. We hypothesised that deep learning classifiers based on Convolutional Neural Networks would be able to identify effectors and deliver new insights.

Results: We created a training set of manually curated effector sequences from PHI-Base and used these to train a range of model architectures for classifying bacterial, fungal and oomycete sequences. The best-performing classifiers had accuracies ranging from 84% to 93%. The models were tested against popular effector detection software on our own test data and on data provided with those models. We observed better performance from our models. Specifically, our models showed greater accuracy, a lower tendency to call false positives on a secreted protein negative test set, and greater generalisability. We used GRAD-CAM activation map analysis to identify the sequences that activated our CNN-LSTM models and found short but distinct N-terminal regions in each taxon that were indicative of effector sequences. No motifs could be observed in these regions, but an analysis of amino acid types indicated patterns of enrichment and depletion that varied between taxa.

Conclusions: Small training sets can be used effectively to train highly accurate and sensitive deep learning models without the need for the operator to know anything other than sequence, and without arbitrary decisions about which sequence features or physico-chemical properties are important. Biological insight into the subsequences important for classification can be gained by examining the activations in the model.
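
The evaluation terms used in the abstract (accuracy, false positives on a negative test set, and an ensemble of classifiers) can be illustrated with a minimal sketch. This is not the authors' code and does not use the deepredeff package. The function names, the 0/1 label convention (1 = effector, 0 = non-effector) and the 0.5 threshold are assumptions made for illustration. The record does not state how the ensemble combines its member models, so simple averaging of per-model probabilities is assumed here.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true non-effectors (label 0) that were called effectors (label 1)."""
    calls_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not calls_on_negatives:
        raise ValueError("test set contains no negative examples")
    return sum(p == 1 for p in calls_on_negatives) / len(calls_on_negatives)


def ensemble_call(probabilities: Sequence[float], threshold: float = 0.5) -> int:
    """Hypothetical ensemble rule: average the per-model effector
    probabilities and call an effector (1) if the mean reaches the threshold.
    Plain averaging is an assumption, not the published method."""
    if not probabilities:
        raise ValueError("need at least one model probability")
    mean_probability = sum(probabilities) / len(probabilities)
    return int(mean_probability >= threshold)


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the study).
    y_true = [1, 1, 0, 0, 0]
    y_pred = [1, 0, 1, 0, 0]
    print("accuracy:", accuracy(y_true, y_pred))
    print("false positive rate:", false_positive_rate(y_true, y_pred))
    print("ensemble call:", ensemble_call([0.9, 0.4, 0.6]))
```

On a test set made up only of negatives, such as the secreted-protein set mentioned above, the false positive rate equals one minus the accuracy, which is why that set isolates the tendency to over-call effectors.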

Item Type: Article
Additional Information: Availability of data and materials: The datasets generated and/or analysed during the current study are available in the ‘ruth-effectors-prediction’ repository, https://github.com/TeamMacLean/ruth-effectors-prediction. The location of each dataset within this repository is listed in the Results section. The R package created is available at https://ruthkr.github.io/deepredeff. Funding Information: RK and DM were supported by The Gatsby Charitable Foundation core grant to The Sainsbury Laboratory. The funding body did not play any role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
Uncontrolled Keywords: ai,deep learning,effector protein,structural biology,biochemistry,molecular biology,computer science applications,applied mathematics,sdg 2 - zero hunger
Faculty \ School: Faculty of Science > The Sainsbury Laboratory
Faculty of Science > School of Biological Sciences
Faculty of Science > School of Computing Sciences
Depositing User: LivePure Connector
Date Deposited: 28 Oct 2024 11:30
Last Modified: 04 Nov 2024 13:30
URI: https://ueaeprints.uea.ac.uk/id/eprint/97228
DOI: 10.1186/s12859-021-04293-3
