Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation

Cawley, Gavin C. ORCID: https://orcid.org/0000-0002-4118-9095 and Talbot, Nicola L. C. (2006) Gene selection in cancer classification using sparse logistic regression with Bayesian regularisation. Bioinformatics, 22 (19). pp. 2348-2355. ISSN 1367-4803

Full text not available from this repository.

Abstract

Motivation: Gene selection algorithms for cancer classification, based on the expression of a small number of biomarker genes, have been the subject of considerable research in recent years. Shevade and Keerthi propose a gene selection algorithm based on sparse logistic regression (SLogReg) incorporating a Laplace prior to promote sparsity in the model parameters, and provide a simple but efficient training procedure. The degree of sparsity obtained is determined by the value of a regularization parameter, which must be carefully tuned in order to optimize performance. This normally involves a model selection stage, based on a computationally intensive search for the minimizer of the cross-validation error. In this paper, we demonstrate that a simple Bayesian approach can be taken to eliminate this regularization parameter entirely, by integrating it out analytically using an uninformative Jeffreys prior. The improved algorithm (BLogReg) is then typically two or three orders of magnitude faster than the original algorithm, as there is no longer a need for a model selection step. The BLogReg algorithm is also free from selection bias in performance estimation, a common pitfall in the application of machine learning algorithms in cancer classification.

Results: The SLogReg, BLogReg and Relevance Vector Machine (RVM) gene selection algorithms are evaluated over the well-studied colon cancer and leukaemia benchmark datasets. The leave-one-out estimates of the probability of test error and cross-entropy of the BLogReg and SLogReg algorithms are very similar; however, the BLogReg algorithm is found to be considerably faster than the original SLogReg algorithm. Using nested cross-validation to avoid selection bias, performance estimation for SLogReg on the leukaemia dataset takes almost 48 h, whereas the corresponding result for BLogReg is obtained in only 1 min 24 s, making BLogReg by far the more practical algorithm. BLogReg also gives better estimates of conditional probability than the RVM, at similar computational expense; such estimates are of great importance in medical applications.
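
The marginalisation mentioned under Motivation can be sketched as follows. This is a reconstruction of the standard calculation, not text taken from this record, and the notation is assumed: $\boldsymbol{\alpha}$ is the vector of model parameters, $N$ the number of parameters covered by the prior, $E_D$ the data misfit (negative log-likelihood), $E_\alpha = \sum_i |\alpha_i|$ the Laplace penalty, and $\lambda$ the regularization parameter.

```latex
% Penalised training criterion with an explicit regularization parameter
M = E_D + \lambda E_\alpha, \qquad E_\alpha = \sum_{i=1}^{N} |\alpha_i|

% Laplace prior over the parameters and Jeffreys prior over the scale \lambda
p(\boldsymbol{\alpha} \mid \lambda) = \left(\frac{\lambda}{2}\right)^{N} \exp(-\lambda E_\alpha),
\qquad p(\lambda) \propto \frac{1}{\lambda}

% Integrating \lambda out analytically (a Gamma integral)
p(\boldsymbol{\alpha}) = \int_0^\infty p(\boldsymbol{\alpha} \mid \lambda)\, p(\lambda)\, d\lambda
\propto \frac{1}{2^{N}} \int_0^\infty \lambda^{N-1} e^{-\lambda E_\alpha}\, d\lambda
= \frac{\Gamma(N)}{2^{N} E_\alpha^{N}}

% Negative log-prior, and the resulting criterion with no free \lambda
-\log p(\boldsymbol{\alpha}) = N \log E_\alpha + \text{const}
\quad\Longrightarrow\quad
Q = E_D + N \log E_\alpha
```

Under these assumptions the criterion $Q$ contains no parameter left to tune, which is consistent with the abstract's statement that the model selection step becomes unnecessary.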

Item Type: Article
Uncontrolled Keywords: SDG 3 - Good Health and Well-being
Faculty \ School: Faculty of Science > School of Computing Sciences

UEA Research Groups: Faculty of Science > Research Groups > Data Science and Statistics
Faculty of Science > Research Groups > Computational Biology
Faculty of Science > Research Groups > Centre for Ocean and Atmospheric Sciences
Depositing User: Vishal Gautam
Date Deposited: 10 Mar 2011 11:00
Last Modified: 21 Apr 2023 20:31
URI: https://ueaeprints.uea.ac.uk/id/eprint/21598
DOI: 10.1093/bioinformatics/btl386
