Accounting for de-duplication uncertainty in population size estimation and small area models

Year
2018
Proposer Andrea Tancredi - Full Professor
ERC subsector of the project proposer
Research group members
Abstract

De-duplication (also known as record linkage or entity resolution) is the process of merging potentially noisy lists, data sets, or databases, often in the absence of a unique identifier, both to remove duplicated information and to increase the informative content of each file. From a statistical perspective, de-duplication is essential for obtaining a more reliable and informative reference data set. On the one hand, identifying duplicate records of the same entity improves the quality of the information associated with it. On the other hand, merging different files, once the common entities have been correctly detected, yields a new, larger, and richer data set. This data set may support accurate model-based statistical analyses that could not be carried out on any single file, because the original files may not each contain all of the model variables. When unique identifiers are known exactly, the linkage process can be accomplished without errors. In practice, however, unique identifiers are rarely available, and the researcher must deal with the uncertainty of the linking step. How to account for this matching uncertainty has therefore become an active line of recent research in the statistics, machine learning, and computer science communities. In practical applications of record linkage, the possibility of wrong matching decisions should be accounted for, especially when the result of the linking step, namely the fused data set, is used for further statistical analyses such as regression, capture-recapture methods, or small area estimation. In this project we will develop a unified Bayesian framework for population size estimation and small area estimation using multiple files that require a preliminary de-duplication process.
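
To make the effect of matching errors concrete, the following minimal sketch uses the classical two-list Lincoln-Petersen capture-recapture estimator. It is only an illustration and not the Bayesian framework proposed in this project. The list sizes and match counts are hypothetical numbers chosen for the example, and the function name `lincoln_petersen` is our own.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-list population size estimate: n1 * n2 / m.

    n1, n2: number of records in each list (after within-list de-duplication)
    m:      number of records declared to be in both lists by the linkage step
    """
    if m <= 0:
        raise ValueError("at least one linked pair is required")
    return n1 * n2 / m


# Hypothetical list sizes and true overlap (illustrative values only).
n1, n2 = 500, 400
true_matches = 200

# Estimate when every match is found and no false match is declared.
estimate_exact = lincoln_petersen(n1, n2, true_matches)

# Missed matches (false non-matches) shrink m and inflate the estimate.
estimate_missed = lincoln_petersen(n1, n2, 180)

# False matches enlarge m and deflate the estimate.
estimate_false = lincoln_petersen(n1, n2, 220)

print(estimate_exact, estimate_missed, estimate_false)
```

Treating the declared matches as if they were exact therefore shifts the population size estimate in a direction that depends on the type of linkage error, which is one reason the uncertainty of the linking step should be carried into the downstream analysis.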

ERC
SH1_6, PE1_14
Keywords:
STATISTICAL INFERENCE, PROBABILITY, STATISTICAL MODELS, COMPUTATIONAL STATISTICS

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma