Benchmarking the Semi-Supervised Naïve Bayes Classifier

Saeed, Awat; Cawley, Gavin; Bagnall, Anthony

doi:10.1109/IJCNN.2015.7280665

Saeed, Awat, Cawley, Gavin ORCID: https://orcid.org/0000-0002-4118-9095 and Bagnall, Anthony (2015) Benchmarking the Semi-Supervised Naïve Bayes Classifier. In: 2015 International Joint Conference on Neural Networks, 2015-07-12 - 2015-07-17.

Abstract

Semi-supervised learning involves constructing predictive models with both labelled and unlabelled training data. The need for semi-supervised learning is driven by the fact that unlabelled data are often easy and cheap to obtain, whereas labelling data requires costly and time-consuming human intervention and expertise. Semi-supervised methods commonly use self-training, which involves using the labelled data to predict labels for the unlabelled data, then iteratively reconstructing classifiers using the predicted labels. Our aim is to determine whether self-training actually improves classifier performance. Expectation maximization is a commonly used self-training scheme. We investigate whether an expectation maximization scheme improves a naïve Bayes classifier through experimentation with 30 discrete and 20 continuous real-world benchmark UCI datasets. Rather surprisingly, we find that in practice self-training actually makes the classifier worse. The cause of this detrimental effect on performance could lie either with the self-training scheme itself, or with how self-training works in conjunction with the classifier. Our hypothesis is that it is the latter: the violation of the naïve Bayes assumption that attributes are independent means that predictive errors propagate through the self-training scheme. To test whether this is the case, we generate simulated data with the same attribute distribution as the UCI data, but where the attributes are independent. Experiments with these data demonstrate that semi-supervised learning does improve performance, leading to significantly more accurate classifiers. These results demonstrate that semi-supervised learning cannot be applied blindly without considering the nature of the classifier, because the assumptions implicit in the classifier may result in a degradation in performance.
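
As a rough illustration of the kind of scheme the abstract describes, the following is a minimal sketch of expectation maximization wrapped around a naïve Bayes classifier for discrete attributes. It is not the authors' implementation: the function names (`fit_nb`, `posterior`, `em_naive_bayes`, `predict`), the use of soft posterior weights rather than hard predicted labels, the Laplace smoothing constant `alpha`, and the fixed iteration count `n_iter` are all assumptions made for the example. Attribute values and class labels are assumed to be integers starting at 0.

```python
import math

def fit_nb(X, W, n_classes, n_values, alpha=1.0):
    """Fit a categorical naive Bayes model from soft class weights W[i][c].

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    n_attrs = len(X[0])
    prior = [alpha] * n_classes
    # cond[c][j][v]: smoothed weighted count of value v for attribute j in class c
    cond = [[[alpha] * n_values for _ in range(n_attrs)] for _ in range(n_classes)]
    for x, w in zip(X, W):
        for c in range(n_classes):
            prior[c] += w[c]
            for j, v in enumerate(x):
                cond[c][j][v] += w[c]
    total = sum(prior)
    prior = [p / total for p in prior]
    for c in range(n_classes):
        for j in range(n_attrs):
            s = sum(cond[c][j])
            cond[c][j] = [p / s for p in cond[c][j]]
    return prior, cond

def posterior(model, x):
    """Class posterior for x, treating attributes as independent given the class."""
    prior, cond = model
    logp = [math.log(prior[c]) + sum(math.log(cond[c][j][v]) for j, v in enumerate(x))
            for c in range(len(prior))]
    m = max(logp)  # subtract the maximum before exponentiating, for stability
    p = [math.exp(l - m) for l in logp]
    s = sum(p)
    return [q / s for q in p]

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_values, n_iter=10):
    """Train on labelled data, then alternate E- and M-steps over all data."""
    # Labelled examples keep fixed one-hot weights throughout.
    W_lab = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    model = fit_nb(X_lab, W_lab, n_classes, n_values)
    for _ in range(n_iter):
        # E-step: predict soft labels for the unlabelled examples.
        W_unl = [posterior(model, x) for x in X_unl]
        # M-step: refit on labelled plus softly labelled examples.
        model = fit_nb(X_lab + X_unl, W_lab + W_unl, n_classes, n_values)
    return model

def predict(model, x):
    """Return the index of the most probable class for x."""
    p = posterior(model, x)
    return max(range(len(p)), key=p.__getitem__)
```

The `posterior` function multiplies per-attribute probabilities, which is where the independence assumption enters. When attributes are in fact correlated, the soft labels from the E-step can be overconfident, and refitting on them in the M-step gives one plausible route by which errors could propagate as the abstract hypothesises.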

Item Type:	Conference or Workshop Item (Paper)
Faculty \ School:	Faculty of Science > School of Computing Sciences
UEA Research Groups:	Faculty of Science > Research Groups > Machine learning in computational biology (former - to 2018); Faculty of Science > Research Groups > Computational Biology; Faculty of Science > Research Groups > Data Science and AI; Faculty of Science > Research Groups > Centre for Ocean and Atmospheric Sciences; Faculty of Science > Research Groups > Statistics
Depositing User:	Pure Connector
Date Deposited:	28 Jul 2015 12:00
Last Modified:	18 Jun 2026 21:12
URI:	https://ueaeprints.uea.ac.uk/id/eprint/53484
DOI:	10.1109/IJCNN.2015.7280665
