Website categorization: A formal approach and robustness analysis in the case of e-commerce detection

01 Pubblicazione su rivista
Bruni R., Bianchi G.
ISSN: 0957-4174

Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used for example to accomplish statistical surveys, saving in costs. However, the information of interest for the specific categorization has to be mined among that huge amount. This turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization, based on the automatic generation of data records summarizing the content of each entire website. This is obtained by using web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. When such records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. We compare in this task Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the training set labels are physiologically noisy, we analyze the robustness of each technique with respect to the presence of misclassified training records. We present results on real-world data for the problem of the detection of websites providing e-commerce facilities, however our approach is not structurally limited to this case.

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma