Benchmarking the Semi-Supervised Naïve Bayes Classifier

Saeed, Awat, Cawley, Gavin and Bagnall, Anthony (2015) Benchmarking the Semi-Supervised Naïve Bayes Classifier. In: The International Joint Conference on Neural Networks, 2015-07-12 - 2015-07-17.

PDF (IJCNN_Camera-ready submission) - Submitted Version


    Semi-supervised learning involves constructing predictive models with both labelled and unlabelled training data. The need for semi-supervised learning is driven by the fact that unlabelled data are often easy and cheap to obtain, whereas labelling data requires costly and time-consuming human intervention and expertise. Semi-supervised methods commonly use self-training, which involves using the labelled data to predict labels for the unlabelled data, then iteratively rebuilding the classifier using the predicted labels. Our aim is to determine whether self-training actually improves classifier performance. Expectation maximization is a commonly used self-training scheme. We investigate whether an expectation maximization scheme improves a naïve Bayes classifier through experimentation with 30 discrete and 20 continuous real-world benchmark UCI datasets. Rather surprisingly, we find that in practice self-training actually makes the classifier worse. The cause of this detrimental effect on performance could lie either with the self-training scheme itself or with how self-training works in conjunction with the classifier. Our hypothesis is that it is the latter: the violation of the naïve Bayes assumption that attributes are independent means that predictive errors propagate through the self-training scheme. To test whether this is the case, we generate simulated data with the same attribute distribution as the UCI data, but where the attributes are independent. Experiments with these data demonstrate that semi-supervised learning does improve performance, leading to significantly more accurate classifiers. These results demonstrate that semi-supervised learning cannot be applied blindly without considering the nature of the classifier, because the assumptions implicit in the classifier may result in a degradation in performance.
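
    The self-training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes integer-coded discrete attributes, Laplace smoothing with a hypothetical parameter `alpha`, soft (probabilistic) labels in the E-step, and a fixed number of iterations. The function names `fit_nb`, `predict_proba`, `em_naive_bayes` and `make_independent` are invented for this illustration. The last helper shows one possible way to obtain data with the same per-class attribute distributions but independent attributes (permuting each attribute within each class); the paper may generate its simulated data differently.

```python
import numpy as np

def fit_nb(X, R, n_values, alpha=1.0):
    """Fit a categorical naive Bayes model from soft class weights R (n x k)."""
    n, d = X.shape
    k = R.shape[1]
    # Smoothed class priors from the (possibly fractional) class counts.
    log_prior = np.log((R.sum(axis=0) + alpha) / (n + alpha * k))
    log_lik = []  # log_lik[j][c, v] = log P(x_j = v | class c)
    for j in range(d):
        counts = np.zeros((k, n_values))
        for v in range(n_values):
            counts[:, v] = R[X[:, j] == v].sum(axis=0)
        counts += alpha  # Laplace smoothing (an assumption of this sketch)
        log_lik.append(np.log(counts / counts.sum(axis=1, keepdims=True)))
    return log_prior, log_lik

def predict_proba(X, log_prior, log_lik):
    """Posterior class probabilities under the independence assumption."""
    log_post = np.tile(log_prior, (X.shape[0], 1))
    for j, table in enumerate(log_lik):
        log_post += table[:, X[:, j]].T
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_values, n_iter=20):
    """EM-style self-training: labelled rows keep their labels throughout."""
    R_lab = np.eye(n_classes)[y_lab]          # one-hot weights for labelled data
    model = fit_nb(X_lab, R_lab, n_values)    # initial supervised model
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = predict_proba(X_unl, *model)  # E-step: soft labels
        model = fit_nb(X_all, np.vstack([R_lab, R_unl]), n_values)  # M-step
    return model

def make_independent(X, y, rng):
    """Permute each attribute within each class: per-class marginals are
    preserved, dependence between attributes within a class is broken."""
    X_ind = X.copy()
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        for j in range(X.shape[1]):
            X_ind[idx, j] = X[rng.permutation(idx), j]
    return X_ind
```

    Under these assumptions, a comparison in the spirit of the abstract would train `em_naive_bayes` once on the original attributes and once on the output of `make_independent`, and compare each against a purely supervised `fit_nb` on held-out data.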

    Item Type: Conference or Workshop Item (Paper)
    Faculty \ School: Faculty of Science > School of Computing Sciences
    University of East Anglia > Faculty of Science > Research Groups > Computational Biology > Machine learning in computational biology
    Depositing User: Pure Connector
    Date Deposited: 28 Jul 2015 13:00
    Last Modified: 17 Jul 2018 18:08
