An integrated clustering analysis framework for heterogeneous data

Mojahed, Aalaa (2016) An integrated clustering analysis framework for heterogeneous data. Doctoral thesis, University of East Anglia.


Big data is a growing area of research with some important research challenges that motivate
our work. We focus on one such challenge, the variety aspect. First, we introduce
our problem by defining heterogeneous data as data about objects that are described by
different data types, e.g., structured data, text, time-series, images, etc. Throughout our work
we use five datasets for experimentation: a real dataset of prostate cancer data and four
synthetic datasets that we have created and made publicly available. Each dataset
covers different combinations of data types that are used to describe objects. Our strategy
for clustering is based on fusion approaches. We compare intermediate and late fusion
schemes. We propose an intermediate fusion approach, Similarity Matrix Fusion (SMF),
where the integration process takes place at the level of calculating similarities. SMF produces
a single distance fusion matrix and two uncertainty expression matrices. We then
propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids
algorithm that utilises uncertainty calculations to improve on the clustering performance.
We evaluate our results by comparing them to clustering produced using individual elements
and show that the fusion approach produces equal or significantly better results.
Also, we show that there are advantages in utilising the uncertainty information, as Hk-medoids
does. In addition, from a theoretical point of view, our proposed Hk-medoids
algorithm has lower computational complexity than the popular PAM implementation of the
k-medoids algorithm. We then employ late fusion, which aggregates the results of clustering
by individual elements by combining cluster labels using an object co-occurrence
matrix technique. The final clustering is then derived by a hierarchical clustering algorithm.
We show that intermediate fusion for clustering of heterogeneous data is a feasible and
efficient approach using our proposed Hk-medoids algorithm.
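
The late fusion step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy labels and the choice of average linkage are assumptions made for illustration, since the abstract does not state which hierarchical algorithm is used. Each element (data type) contributes one vector of cluster labels, the co-occurrence matrix records how often two objects share a cluster, and an agglomerative pass over that matrix gives the final clustering.

```python
from itertools import combinations


def co_occurrence(labelings):
    """Fraction of labelings in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    m = len(labelings)
    return [[c / m for c in row] for row in counts]


def average_linkage(co, k):
    """Greedily merge the two clusters with the highest mean co-occurrence
    until k clusters remain (assumed linkage; hypothetical helper)."""
    clusters = [[i] for i in range(len(co))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(co[i][j] for i in clusters[a] for j in clusters[b])
            sim = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or sim > best[0]:
                best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(co)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Toy cluster labels for six objects, one vector per element (made-up data).
structured = [0, 0, 0, 1, 1, 1]
text = [1, 1, 0, 0, 0, 0]  # label values are arbitrary; object 2 disagrees
time_series = [0, 0, 0, 1, 1, 1]

co = co_occurrence([structured, text, time_series])
fused = average_linkage(co, k=2)
print(fused)
```

Because the matrix only records whether two objects share a label, the label values themselves need not agree across elements, which is what makes this kind of aggregation convenient for combining independent clusterings.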

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Jackie Webb
Date Deposited: 03 Oct 2016 12:53
Last Modified: 03 Oct 2016 12:53
