Imputation through Clustering of Time Series Data: a case study in air pollution

Alahamade, Wedad (2021) Imputation through Clustering of Time Series Data: a case study in air pollution. Doctoral thesis, University of East Anglia.


Air pollution is a global problem, and air pollution concentration assessment plays an essential role in evaluating the associated risk to human health. Unfortunately, air pollution monitoring stations often have periods of missing data.

In this thesis, we investigated the problem of missing values in air quality data by examining the hourly pollutant concentration time series (TS) of the four main pollutants included in air quality assessment: O3, NO2, PM2.5, and PM10. The research aims to reduce the uncertainty of air quality assessment by proposing methods for imputing values in TS that are partially or completely missing. Our approach uses clustering of stations based on measured pollutants to inform the imputation.

We started by testing univariate clustering and then developed a multivariate time series (MVTS) clustering method that considers all pollutants measured at a station by aggregating the similarity between those pollutants through a fused distance. Clustering was followed by imputation models for the whole TS. We developed various imputation models, including ensemble models that aggregate the temporal similarity obtained from clustering and the spatial similarity obtained from the geographical correlation between stations.
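The idea of a fused distance can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the root-mean-square difference used as the per-pollutant distance, the max-scaling, and the plain averaging across pollutants are all assumptions chosen for illustration. A station that does not measure a pollutant is represented by a row of NaN values.

```python
import numpy as np

def pairwise_rms_distance(series):
    """Root-mean-square difference between every pair of stations.

    series: array of shape (n_stations, n_hours) for ONE pollutant.
    Only hours observed at both stations are compared; pairs with no
    common hours get NaN. (Hypothetical helper, not from the thesis.)
    """
    n = series.shape[0]
    dist = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            both = ~np.isnan(series[i]) & ~np.isnan(series[j])
            if both.any():
                diff = series[i, both] - series[j, both]
                dist[i, j] = np.sqrt(np.mean(diff ** 2))
    return dist

def fused_distance(pollutants):
    """Fuse per-pollutant distances into one station-to-station matrix.

    pollutants: dict mapping pollutant name -> (n_stations, n_hours) array.
    Each distance matrix is scaled to [0, 1] and the matrices are averaged,
    ignoring pollutants that a pair of stations does not share.
    (Assumed fusion rule; the thesis may weight pollutants differently.)
    """
    scaled = []
    for series in pollutants.values():
        d = pairwise_rms_distance(np.asarray(series, dtype=float))
        max_d = np.nanmax(d)
        scaled.append(d / max_d if max_d > 0 else d)
    return np.nanmean(np.stack(scaled), axis=0)
```

The resulting matrix could then be passed to any clustering routine that accepts precomputed distances. Because pairs are compared only on the pollutants they share, a station with an unmeasured pollutant can still be assigned to a cluster, which is what makes cluster-based imputation of that pollutant possible.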

Our experimental results show that MVTS clustering enables imputation of unmeasured pollutants at any station and produces plausible imputed values for all pollutants. The ensemble imputation models (Models 8 and 9) gave the lowest root mean square error (RMSE) and the highest index of agreement (IOA) between imputed and real values, and met the minimum requirement for air quality modelling under the FAC2 criterion (the fraction of predictions within a factor of two of the observations).
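For readers unfamiliar with these evaluation measures, the following Python sketch shows one common way to compute them. It is an illustration under stated assumptions, not the thesis's evaluation code: the function names are hypothetical, the IOA follows Willmott's original formulation (the thesis may use a refined variant), and FAC2 is taken as the fraction of imputed values within a factor of two of the observed values.

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error between observed and imputed values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def index_of_agreement(obs, pred):
    """Willmott's original index of agreement; 1 means perfect agreement.

    Assumed variant: 1 - sum((P - O)^2) / sum((|P - mean(O)| + |O - mean(O)|)^2).
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    obs_mean = obs.mean()
    numerator = np.sum((pred - obs) ** 2)
    denominator = np.sum((np.abs(pred - obs_mean) + np.abs(obs - obs_mean)) ** 2)
    return float(1.0 - numerator / denominator)

def fac2(obs, pred):
    """Fraction of imputed values with 0.5 <= pred / obs <= 2.

    Hours with non-positive observations are skipped to avoid division by zero.
    """
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    valid = obs > 0
    ratio = pred[valid] / obs[valid]
    return float(np.mean((ratio >= 0.5) & (ratio <= 2.0)))
```

A lower RMSE and an IOA and FAC2 closer to 1 all indicate that the imputed series tracks the observed series more closely.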

The imputation models reproduce high pollution episodes at stations within clusters where these episodes possibly occurred but were not measured, because some of the episodes were captured by the cluster centroids. We also found that two pollutants, PM2.5 and O3, are associated with those episodes; these pollutants may require more measurements or should be imputed at different locations for more realistic air quality monitoring.

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Chris White
Date Deposited: 26 Jul 2022 10:48
Last Modified: 26 Jul 2022 10:48

