A Novel Machine Learning Approach For File Fragments Classification

Algurashi, Alia (2022) A Novel Machine Learning Approach For File Fragments Classification. Doctoral thesis, University of East Anglia.

[thumbnail of AA 270923 Final PhD thesis.pdf]
Download (4MB) | Preview


Identifying types of manipulated or corrupted file fragments in isolation from their context is an essential task in digital forensics. In traditional file type identification, metadata, such as file extensions and header and footer signatures, is used. Traditional metadata-based approaches do not work where metadata is missing or altered, therefore some alternative strategies and approaches need to be applied or developed to solve the problem.

One approach is to apply some statistical techniques to extract features from the binary contents of file fragments and then use them as inputs for classification algorithms. This results in high dimensionality, causing learning and classification to be time-consuming. Another approach is deep learning neural networks, which extract features automatically. File fragment classification is further complicated by the high number of possible file classes. Also, some container file types, such as Powerpoint (PPT) include data belonging to other file types, such as JPEG, which can confuse the classification algorithms.

In this thesis, we developed a hybrid method to address high feature dimensionality. We use filters and wrappers to reduce the number of features. We explored the possible hierarchical relationships between file classes and we represent them with a hierarchy tree to help narrow the uncertainties for challenging file types. We proposed a novel hybrid approach that combines hierarchical models with feature selection to improve the accuracy of file fragment classification. We also explored the use of deep learning techniques for this task.

We test our methods using a benchmark dataset - GovDocs. The results from hybrid feature selection show a reduction in the number of features from 66,313 to 11–32, and provide improved accuracy compared to methods using all features. The accuracy increased from 69% using random forest to 75% using the DAG tree. We incorporate the hybrid feature selection into hierarchical modelling to generate trees that use only the most discriminative features. We find that these models outperformed classical machine-learning approaches. Finally, using deep learning for file fragment classification provided the highest accuracy of all techniques explored, obtaining accuracies of 86%.

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Chris White
Date Deposited: 26 Oct 2023 08:34
Last Modified: 26 Oct 2023 08:34
URI: https://ueaeprints.uea.ac.uk/id/eprint/93475


Downloads per month over past year

Actions (login required)

View Item View Item