Benchmarking the Semi-Supervised Naïve Bayes Classifier

Saeed, Awat, Cawley, Gavin and Bagnall, Anthony (2015) Benchmarking the Semi-Supervised Naïve Bayes Classifier. In: 2015 International Joint Conference on Neural Networks, 2015-07-12 - 2015-07-17.

PDF (IJCNN_Camera-ready submission) - Accepted Version


Semi-supervised learning involves constructing predictive models with both labelled and unlabelled training data. The need for semi-supervised learning is driven by the fact that unlabelled data are often easy and cheap to obtain, whereas labelling data requires costly and time-consuming human intervention and expertise. Semi-supervised methods commonly use self-training, which involves using the labelled data to predict labels for the unlabelled data, then iteratively reconstructing classifiers using the predicted labels. Our aim is to determine whether self-training classifiers actually improves performance. Expectation maximization is a commonly used self-training scheme. We investigate whether an expectation maximization scheme improves a naïve Bayes classifier through experimentation with 30 discrete and 20 continuous real-world benchmark UCI datasets. Rather surprisingly, we find that in practice the self-training actually makes the classifier worse. The cause of this detrimental effect on performance could lie either with the self-training scheme itself or with how self-training works in conjunction with the classifier. Our hypothesis is that it is the latter: the violation of the naïve Bayes model assumption of independence of attributes means predictive errors propagate through the self-training scheme. To test whether this is the case, we generate simulated data with the same attribute distribution as the UCI data, but where the attributes are independent. Experiments with these data demonstrate that semi-supervised learning does improve performance, leading to significantly more accurate classifiers. These results demonstrate that semi-supervised learning cannot be applied blindly without considering the nature of the classifier, because the assumptions implicit in the classifier may result in a degradation in performance.
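
The abstract does not give the details of the expectation maximization scheme, so the following is only a minimal sketch of one common form of it for a naïve Bayes classifier on discrete attributes, not the authors' implementation. It assumes attribute values are integer codes in the range 0 to n_values - 1 and uses Laplace smoothing; the function names (fit_nb, posterior, em_naive_bayes, predict) and the parameters alpha and n_iter are illustrative choices that do not come from the paper.

```python
import math

def fit_nb(X, weights, n_classes, n_values, alpha=1.0):
    """M-step: estimate log priors and log conditionals from class weights.

    weights[i][c] is the (possibly fractional) membership of instance i in
    class c. alpha is an assumed Laplace smoothing constant.
    """
    n_attrs = len(X[0])
    class_tot = [alpha] * n_classes
    counts = [[[alpha] * n_values for _ in range(n_attrs)]
              for _ in range(n_classes)]
    for x, w in zip(X, weights):
        for c in range(n_classes):
            class_tot[c] += w[c]
            for j, v in enumerate(x):
                counts[c][j][v] += w[c]
    total = sum(class_tot)
    log_prior = [math.log(t / total) for t in class_tot]
    log_cond = [[[math.log(cnt / sum(row)) for cnt in row]
                 for row in counts[c]] for c in range(n_classes)]
    return log_prior, log_cond

def posterior(model, x):
    """E-step for one instance: P(class | x), treating attributes as independent."""
    log_prior, log_cond = model
    scores = [lp + sum(log_cond[c][j][v] for j, v in enumerate(x))
              for c, lp in enumerate(log_prior)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_values, n_iter=10):
    """Fit on labelled data, then alternate E- and M-steps over all data."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    model = fit_nb(X_lab, hard, n_classes, n_values)  # supervised starting point
    for _ in range(n_iter):
        soft = [posterior(model, x) for x in X_unl]   # E-step: soft labels
        model = fit_nb(X_lab + X_unl, hard + soft,    # M-step: refit on all data
                       n_classes, n_values)
    return model

def predict(model, x):
    """Return the index of the most probable class for x."""
    p = posterior(model, x)
    return p.index(max(p))

# Toy usage with two binary attributes (made-up data, not from the paper).
X_lab, y_lab = [[0, 0], [1, 1]], [0, 1]
X_unl = [[0, 0], [0, 0], [1, 1], [1, 1], [0, 1]]
model = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2, n_values=2)
```

In this sketch the labelled instances keep their known labels in every iteration and only the unlabelled instances receive soft labels. Because the soft labels come from a product of per-attribute probabilities, a violation of the independence assumption can make those labels overconfident, which is one way the error propagation hypothesised above could arise.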

Item Type: Conference or Workshop Item (Paper)
Faculty \ School: Faculty of Science > School of Computing Sciences

UEA Research Groups: Faculty of Science > Research Groups > Computational Biology
Faculty of Science > Research Groups > Data Science and Statistics
Faculty of Science > Research Groups > Centre for Ocean and Atmospheric Sciences
Depositing User: Pure Connector
Date Deposited: 28 Jul 2015 12:00
Last Modified: 19 Apr 2023 01:31
DOI: 10.1109/IJCNN.2015.7280665

