URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models

Abdallah, Tarek Amr and De La Iglesia, Beatriz ORCID: https://orcid.org/0000-0003-2675-5826 (2014) URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models. In: SCITEPRESS Digital Library - KDIR 2014 - International Conference on Knowledge Discovery and Information Retrieval, 2014-11-13, Italy.

Full text not available from this repository.

Abstract

This paper is concerned with the classification of web pages using only their Uniform Resource Locators (URLs). There are a number of contexts in which it is important to have an efficient and reliable classification of a web page from its URL, without the need to visit the page itself. For example, emails or messages sent in social media may contain URLs and require automatic classification. The URL is very concise and may be composed of concatenated words, so classification with only this information is a very challenging task. Much of the current research on URL-based classification has achieved reasonable accuracy, but the current methods do not scale well to large datasets. In this paper, we propose a new solution based on the use of an n-gram language model. Our solution shows good classification performance and is scalable to larger datasets. It also allows us to tackle the problem of classifying new URLs with unseen sub-sequences.
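
The abstract does not spell out the model, so the following is only a minimal sketch of how an n-gram language model could be applied to URL classification, not the authors' implementation. It assumes character trigrams, one count table per class, add-one smoothing (so that unseen sub-sequences still receive a non-zero probability) and uniform class priors. The class name `NGramURLClassifier`, the helper `char_ngrams`, and the toy URLs and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n=3):
    """Return the overlapping character n-grams of a lowercased URL."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramURLClassifier:
    """One character n-gram model per class, with add-one smoothing.

    Illustrative assumption only; the paper may use a different
    n, smoothing scheme, or scoring rule.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # class label -> n-gram counts
        self.totals = Counter()             # class label -> total n-grams seen
        self.vocab = set()                  # every n-gram seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            grams = char_ngrams(url, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def log_score(self, url, label):
        # Add-one smoothing: an n-gram never seen for this class still
        # gets a small non-zero probability instead of zeroing the score.
        v = len(self.vocab) + 1  # the extra 1 reserves mass for unseen n-grams
        return sum(
            math.log((self.counts[label][g] + 1) / (self.totals[label] + v))
            for g in char_ngrams(url, self.n)
        )

    def predict(self, url):
        # Uniform class priors are assumed, so only the likelihood is compared.
        return max(self.counts, key=lambda label: self.log_score(url, label))


# Hypothetical toy training data, for illustration only.
train_urls = [
    "bbc.co.uk/sport/football",
    "espn.com/football/scores",
    "skysports.com/football/news",
    "nature.com/articles/physics",
    "sciencemag.org/physics/news",
    "physicsworld.com/articles",
]
train_labels = ["sport", "sport", "sport", "science", "science", "science"]

clf = NGramURLClassifier(n=3).fit(train_urls, train_labels)
print(clf.predict("example.org/football-results"))
```

Training is a single counting pass over the URLs, and scoring a new URL needs only one table lookup per n-gram per class, which is the kind of property that makes such models attractive for larger datasets.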

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: language models,information retrieval,web classification,web mining,machine learning
Faculty \ School: Faculty of Science > School of Computing Sciences
UEA Research Groups: Faculty of Science > Research Groups > Data Science and Statistics
Faculty of Medicine and Health Sciences > Research Centres > Business and Local Government Data Research Centre (former - to 2023)
Faculty of Science > Research Groups > Norwich Epidemiology Centre
Faculty of Medicine and Health Sciences > Research Groups > Norwich Epidemiology Centre
Depositing User: Pure Connector
Date Deposited: 25 Feb 2015 06:21
Last Modified: 19 Apr 2023 01:31
URI: https://ueaeprints.uea.ac.uk/id/eprint/52359
DOI: 10.5220/0005030500140021
