Partition models

A unified framework for de-duplication and population size estimation (with Discussion)

Data de-duplication is the process of identifying records, in one or more datasets, that refer to the same entity. In this paper we approach de-duplication through a latent entity model, in which the observed data are perturbed versions of a set of key variables drawn from a finite population of $N$ distinct entities. The main novelty of our approach is to treat the population size $N$ as an unknown model parameter. As a result, a salient feature of the proposed method is that the uncertainty arising from de-duplication is accounted for in the estimate of the population size.
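
To make the latent entity idea concrete, the following is a minimal simulation sketch and not the paper's actual model. It assumes categorical key variables, records that pick their entity uniformly at random, and an independent per-field distortion probability; the function name `simulate_records` and all of its parameters are hypothetical, chosen only for illustration.

```python
import random


def simulate_records(n_entities, n_records, n_fields=3, n_levels=5,
                     distortion_prob=0.1, seed=0):
    """Simulate records as perturbed copies of latent entities.

    Illustrative assumptions (not taken from the paper): fields are
    categorical with n_levels values, each record picks its entity
    uniformly at random, and each field is independently replaced by a
    random level with probability distortion_prob.
    """
    rng = random.Random(seed)
    # Latent population: one vector of key variables per entity.
    entities = [
        tuple(rng.randrange(n_levels) for _ in range(n_fields))
        for _ in range(n_entities)
    ]
    records, labels = [], []
    for _ in range(n_records):
        j = rng.randrange(n_entities)  # latent entity behind this record
        observed = tuple(
            rng.randrange(n_levels) if rng.random() < distortion_prob else v
            for v in entities[j]
        )
        records.append(observed)
        labels.append(j)
    return entities, records, labels


entities, records, labels = simulate_records(n_entities=50, n_records=30)
# Number of distinct entities actually represented among the records.
n_distinct = len(set(labels))
```

In this sketch the labels are known because the data are simulated. In the setting described above they are latent, and so is $N$: de-duplication amounts to inferring which records share a label, and the number of distinct entities observed is at most the population size.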

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma