URL-Based Web Page Classification: With n-Gram Language Models

Abdallah, Tarek Amr and De La Iglesia, Beatriz (2015) URL-Based Web Page Classification: With n-Gram Language Models. In: Knowledge Discovery, Knowledge Engineering and Knowledge Management. Communications in Computer and Information Science, 553. Springer, pp. 19-33. ISBN 978-3-319-25839-3

Full text not available from this repository.

Abstract

It is increasingly important to classify a web page efficiently and reliably from the information contained in its Uniform Resource Locator (URL) alone, without visiting the page itself. For example, a social media website may need to quickly identify status updates that link to malicious websites in order to block them. A URL is very concise and may be composed of concatenated words, so classification from this information alone is a very challenging task. Methods proposed for this task, such as the all-grams approach, which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification: an n-gram language model that provides competitive accuracy and scales to larger datasets. Our method allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is competitive in accuracy with the best known methods and also scales well to larger datasets.
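As a rough illustration of the general idea in the abstract, the sketch below trains one character n-gram language model per class and assigns a URL to the class whose model gives it the highest log-probability. It is not the authors' implementation: the choice of character trigrams, add-gamma smoothing, the names `CharNgramLM` and `classify`, and the toy URLs are all assumptions made for this example. Smoothing is what lets the models score URLs containing unseen sub-sequences.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-gamma smoothing (illustrative only)."""

    def __init__(self, n=3, gamma=0.5):
        self.n = n              # n-gram order (assumed value, not from the paper)
        self.gamma = gamma      # smoothing constant (assumed value, not from the paper)
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def _ngrams(self, text):
        # Pad with start markers and one end marker so every character has a context.
        padded = "^" * (self.n - 1) + text.lower() + "$"
        for i in range(len(padded) - self.n + 1):
            yield padded[i:i + self.n]

    def fit(self, urls):
        for url in urls:
            for gram in self._ngrams(url):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_prob(self, url, vocab_size):
        # Sum of smoothed conditional log-probabilities; unseen n-grams get
        # a small non-zero probability instead of zero.
        total = 0.0
        for gram in self._ngrams(url):
            num = self.ngram_counts[gram] + self.gamma
            den = self.context_counts[gram[:-1]] + self.gamma * vocab_size
            total += math.log(num / den)
        return total


def classify(url, models):
    # Shared vocabulary size across classes, plus one slot for unseen characters.
    vocab_size = len(set().union(*(m.vocab for m in models.values()))) + 1
    return max(models, key=lambda label: models[label].log_prob(url, vocab_size))


# Toy training data, invented for this example.
training = {
    "sport": ["bbc.co.uk/sport/football", "espn.com/soccer/scores",
              "skysports.com/football/news"],
    "shop": ["amazon.com/shop/cart", "ebay.com/buy/checkout",
             "etsy.com/shop/listing"],
}
models = {label: CharNgramLM().fit(urls) for label, urls in training.items()}

print(classify("skysports.com/football/scores", models))
print(classify("etsy.com/shop/cart", models))
```

Each class model stores only n-gram counts, so training is a single pass over the URLs and scoring a URL is linear in its length. This contrasts with building a feature for every possible sub-string, as in the all-grams approach the abstract mentions.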

Item Type: Book Section
Uncontrolled Keywords: language models, information retrieval, web classification, web mining, machine learning
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Pure Connector
Date Deposited: 07 Dec 2015 13:00
Last Modified: 09 Aug 2019 13:31
URI: https://ueaeprints.uea.ac.uk/id/eprint/55687
DOI: 10.1007/978-3-319-25840-9_2
