FedSSL: Federated Self-supervised Learning with applications to Automatic Health Diagnosis.

Year
2021
Proposer: Fabrizio Silvestri - Full Professor
ERC subsector of the project proposer
PE6_11
Research group members
Member - Category
Simone Scardapane - Permanent member of the research group
Enrico Giarnieri - Permanent member of the research group
Federico Siciliano - PhD student / Research fellow / Specialty trainee, non-permanent member of the research group
Abstract

Traditionally, Machine Learning (ML) assumes that training data is centrally available in large data centers. This assumption does not always hold in real-world applications.
Federated Learning (FL) allows ML models to be trained by combining decentralized training data in a privacy-aware fashion, i.e., without physically exchanging the data itself. State-of-the-art FL approaches, however, are designed for supervised learning and thus rely on expertly labeled data, which is costly and time-consuming to obtain in large quantities. For example, public datasets for the automatic detection of carcinoma cells contain only a few hundred images, several orders of magnitude fewer than in other domains. Recently, self-supervised models have gained wide popularity in the literature as a potential solution to this issue: they reduce the reliance on fully labeled data and open up the possibility of training Deep Neural Networks (DNNs) almost exclusively on unlabeled data. However, very little is known about the performance and the theoretical guarantees of self-supervision in FL scenarios. To address these issues, the aims of FedSSL are twofold. First, we plan to design a fully self-supervised FL algorithm for DNNs, so that they can be trained efficiently from large quantities of unlabeled data distributed across multiple actors. Second, the algorithm will be thoroughly analyzed from a theoretical standpoint to understand its convergence guarantees, communication requirements, and generalization performance.
The model will be tested on a realistic medical use case, by training on a newly collected, large, unlabeled dataset on urinary bladder cancer, one of the leading causes of death among men in Italy. Data and models will be publicly released to further foster research in these domains. This application will also allow us to test the feasibility and performance of self-supervised approaches in the medical domain.
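
As a rough illustration of the federated setting described in the abstract, the sketch below runs plain federated averaging (FedAvg) on unlabeled data. It is not the FedSSL algorithm, which the abstract does not specify. The function names (`local_update`, `fed_avg`, `run_rounds`), the learning rate, and the toy label-free objective (each client pulls a shared parameter vector toward the mean of its own samples) are all assumptions made for this example. The point is only that clients send parameters to the server, never their raw data.

```python
from typing import List, Sequence

Vector = List[float]


def local_update(w: Vector, data: Sequence[Vector],
                 lr: float = 0.1, steps: int = 5) -> Vector:
    """Run a few gradient steps on one client's own unlabeled data.

    Toy objective (assumed for illustration): 0.5 * mean ||x - w||^2,
    whose gradient with respect to w is (w - mean(x)).
    """
    dim = len(w)
    mean = [sum(x[j] for x in data) / len(data) for j in range(dim)]
    w = list(w)  # copy, so the global parameters are not modified in place
    for _ in range(steps):
        w = [w[j] - lr * (w[j] - mean[j]) for j in range(dim)]
    return w


def fed_avg(client_weights: Sequence[Vector],
            client_sizes: Sequence[int]) -> Vector:
    """Server step: average client parameters, weighted by sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def run_rounds(clients: Sequence[Sequence[Vector]], dim: int,
               rounds: int = 50) -> Vector:
    """Alternate local training and server-side averaging."""
    w_global = [0.0] * dim
    sizes = [len(c) for c in clients]
    for _ in range(rounds):
        # Only parameter vectors leave the clients; raw samples stay local.
        updates = [local_update(w_global, c) for c in clients]
        w_global = fed_avg(updates, sizes)
    return w_global


if __name__ == "__main__":
    # Two hypothetical clients holding different amounts of local data.
    clients = [
        [[0.0, 0.0], [2.0, 2.0]],
        [[4.0, 6.0]],
    ]
    print(run_rounds(clients, dim=2))
```

With this toy objective, the global parameters approach the mean of all samples across clients, even though no client ever shares a sample. A realistic self-supervised setup would replace the local objective with, for instance, a contrastive or reconstruction loss on a DNN, and would have to deal with non-identically distributed client data.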

ERC
PE6_7, LS7_1, PE6_11
Keywords:
MACHINE LEARNING, PARALLEL AND DISTRIBUTED SYSTEMS, CANCER, CYTOPATHOLOGY, UROLOGY

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma