Multiple Imputation for Classification: Dealing with Missing Data and Uncertainty

Aleryani, Aliya (2021) Multiple Imputation for Classification: Dealing with Missing Data and Uncertainty. Doctoral thesis, University of East Anglia.



Dealing with missing data poses a challenge, as data quality is a significant factor when applying machine learning classification algorithms. A few methods have therefore been used to deal with this issue prior to building classification models. Multiple imputation has emerged as a more advanced technique for data recovery, as it better reflects the uncertainty inherent in missing data.
This research develops methods to integrate multiple imputed data with ensembles of classifiers for standard data and time series. It further proposes a new method for evaluating imputation for standard data based on a dissimilarity measure, and a novel multiple imputation method for univariate time series. The study investigates how the performance of chosen standard and time series classifiers changes as the amount of missing data increases. For both types of data, we initially simulate a series of increasing levels of data missing completely at random. The missing data are then recovered using single and multiple imputation methods. After that, the multiple imputed data are used to build our bagging and stacking ensembles. Various ensemble approaches are implemented, then compared and tested against other competitive approaches.
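The general workflow described above (simulate data missing completely at random, impute several times, train one classifier per imputed dataset, combine the predictions) can be illustrated with a minimal sketch. This is not the thesis's implementation: the random-draw imputation, the nearest-centroid classifier, the majority vote, and every function name below are assumptions chosen only to keep the example self-contained, and the bagging and stacking ensembles of the thesis are not reproduced here.

```python
import random
from collections import Counter

def make_mcar(X, rate, rng):
    """Copy X, setting each cell to None with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in X]

def impute_once(X_miss, rng):
    """One stochastic imputation: fill each gap with a randomly drawn
    observed value from the same column (illustrative stand-in method)."""
    n_cols = len(X_miss[0])
    observed = [[row[j] for row in X_miss if row[j] is not None]
                for j in range(n_cols)]
    return [[v if v is not None
             else (rng.choice(observed[j]) if observed[j] else 0.0)
             for j, v in enumerate(row)]
            for row in X_miss]

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class feature means."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def mi_ensemble(X_miss, y, m, rng):
    """Train one model on each of m independently imputed datasets."""
    return [fit_centroids(impute_once(X_miss, rng), y) for _ in range(m)]

def ensemble_predict(models, x):
    """Combine the m models by majority vote."""
    votes = Counter(predict_one(model, x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated Gaussian clusters.
rng = random.Random(0)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
     + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(30)])
y = [0] * 30 + [1] * 30

X_miss = make_mcar(X, rate=0.2, rng=rng)   # 20% of cells removed at random
models = mi_ensemble(X_miss, y, m=5, rng=rng)
print(ensemble_predict(models, [0.0, 0.0]), ensemble_predict(models, [6.0, 6.0]))
```

Because each of the m imputed datasets fills the gaps differently, the disagreement between the resulting models carries some of the uncertainty about the missing values into the final vote, which is the motivation for pairing multiple imputation with an ensemble.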
The results show that the proposed methods improve the classification accuracy for most algorithms tested for both standard data and time series. One of the key findings is that even with a higher level of missing data, the ensemble approaches can obtain good performance, comparable to complete data or even better in some cases. The empirical evaluation shows that, for most algorithms except Random Forest, the ensemble approaches outperform the competitive methods in most scenarios of increasing uncertainty. Our methods and statistical analysis are evaluated on data missing completely at random, but the same experimental scenarios could be used for other types of missing data.

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Jackie Webb
Date Deposited: 27 May 2022 09:46
Last Modified: 27 May 2022 09:46

