Name and title of the project proponent:
sb_p_2802574
Year:
2021
Abstract: 

Privacy policy pages and third-party trackers are a rich source of information. They reveal what kind of data a website collects, which data it shares with other services, how it uses the collected data, and so on.
Privacy policy pages are updated frequently, either because the website changes its policy or because it must comply with new regulations. Since regulations differ from country to country, the same website often serves different privacy policy pages depending on the user's estimated country of origin.
At present, retrieving and analyzing information from privacy policy pages requires heavy manual effort.
The goal of the St3p project is to build a framework that automatically retrieves, stores, and analyzes privacy policy pages and third-party trackers from over a million websites. For each website, St3p will retrieve privacy policy pages and third-party trackers at regular intervals and from different locations.
The retrieved data will be used to build a longitudinal dataset, useful for understanding how online companies react to regulatory changes and which personal data each specific business use case collects.
Carrying out these analyses requires automating the information extraction process over such a large volume of data. The St3p engine will therefore be powered by machine learning models able to extract meaningful information from the pages.
Finally, we plan to leverage the processed information to automatically verify two things: the compliance of the privacy policy pages with the regulations in force, and the correspondence between the third-party services declared in the privacy policy pages and the third-party services the website actually embeds.
Together, these objectives make up the St3p project proposal.

ERC: 
PE6_10
PE6_5
PE6_11
Research group members:
sb_cp_is_3625111
Innovativeness:

The St3p project aims to build a multilingual longitudinal dataset of privacy policy pages and third-party trackers collected from over a million different websites. This dataset can enable several lines of research, ranging from online measurement to understanding the impact of new data regulation policies.

Previous studies relied on manual annotation of privacy policy pages. Although effective, this approach limits the number of pages that can be considered because of the manual effort involved. The results of the St3p project will make it possible to analyze privacy policy pages and extract information from them automatically, enabling large-scale studies.

Unlike previous studies, which focused exclusively on one dimension, either legal (privacy policy pages) or technical (third-party trackers), this project aims at a joint analysis of both. Indeed, privacy policy pages reveal only the third-party services that the website declares it uses, whereas analyzing third-party trackers unveils which services the website actually embeds.
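
As an illustration of this joint analysis, the comparison between declared and embedded services can be sketched as a set comparison. The Python sketch below is hypothetical and not part of the St3p tooling: the function names, the example hosts, and the exact-hostname matching rule are all assumptions made for illustration. A real pipeline would likely map hostnames to service names and match on registrable domains rather than raw hostnames.

```python
from urllib.parse import urlparse


def third_party_hosts(page_url, request_urls):
    """Return the hosts a page contacts that are not the page's own host
    or one of its subdomains (simplified first-party rule, an assumption)."""
    page_host = urlparse(page_url).hostname
    hosts = set()
    for url in request_urls:
        host = urlparse(url).hostname
        if host and host != page_host and not host.endswith("." + page_host):
            hosts.add(host)
    return hosts


def compare_declared_and_embedded(declared, embedded):
    """Split services into three groups: embedded but not declared,
    declared but not observed, and both declared and embedded."""
    declared, embedded = set(declared), set(embedded)
    return {
        "undeclared": embedded - declared,
        "declared_not_observed": declared - embedded,
        "consistent": declared & embedded,
    }


# Hypothetical data: hosts named in a policy vs. requests seen while loading the page.
declared_hosts = {"ads.example.org", "analytics.example.io"}
observed_requests = [
    "https://example.com/app.js",
    "https://cdn.example.com/style.css",
    "https://tracker.example.net/pixel.gif",
    "https://ads.example.org/ad.js",
]
embedded_hosts = third_party_hosts("https://example.com/", observed_requests)
report = compare_declared_and_embedded(declared_hosts, embedded_hosts)
```

In this sketch, the "undeclared" group is the one of interest for compliance checks, since it lists services the page embeds without mentioning them in its policy.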

Finally, the tools and results from this study are relevant for regulators, industry, and web users. Regulators can leverage our data and analysis to understand how online services adapt to new privacy regulations and to uncover specific business practices. Companies can use the developed tools to self-assess their compliance with existing regulations and avoid expensive fines. Users can be warned when a website adopts risky practices that put their personal data at risk.

With this project, we aim to present groundbreaking work and plan to submit an academic paper to the Web Conference (formerly WWW), a leading conference on the future direction of the World Wide Web. In addition, we plan to release the St3p dataset and the tools we develop as open source.

Call code:
2802574

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma