Name and position of the project proponent: 
sb_p_1454277
Year: 
2019
Abstract: 

Human migration is the movement of people from one place to another, with the aim of settling, at least temporarily, at a new location. It is today a relevant topic for researchers in Demography, Economics, Political Science, Public Health, Sociology and Statistics.
The quantitative study of migration is, however, hindered by a frequent lack of available data, especially for what is commonly termed illegal immigration.

Except for a few developed countries, estimates are constructed from the integration of multiple data sources of varying quality and completeness. In addition, in many countries, the main source, namely the periodical population census, faces an uncertain future, because of funding pressures and the general shift of National Statistics Agencies (NSA) from Census data to the use of various integrated Administrative data sources.

Our project is statistical in nature and aims to contribute to the ongoing debate and research on the following points.

1. A more principled methodology in the linkage step, that is, the identification of common records in multiple lists. This step is important to avoid introducing bias into the estimates.
2. The introduction of an explicit model to account for measurement errors in the observed variables. Administrative data are often collected for non-statistical purposes and must be cleaned and re-harmonized before being fed into a learning algorithm.
When data are merged from different sources, the quality of each data set differs, and accounting for this may be decisive.

3. The exploration of more flexible parametric models in order to improve the existing models for the projection and/or reconstruction of migration flows. In particular, we will consider the Conway-Maxwell-Poisson model, which generalizes the Poisson class by allowing both under- and over-dispersion among counts.
Finally, a package implementing the new methodologies will be delivered using the statistical software R.

ERC: 
SH3_8
SH1_3
Research group members: 
sb_cp_is_1980048
sb_cp_is_1983153
sb_cp_is_1811483
sb_cp_is_1819029
sb_cp_is_2241151
sb_cp_es_306136
sb_cp_es_306137
Innovativeness: 

We will introduce at least three innovations to the existing methods described above, namely:
1. A more principled uncertainty quantification in the linkage step, that is, the identification of common records in multiple lists. Since the Bayesian integrated model proposed by Bryant and Graham is based on the integration of multiple surveys and administrative lists, their harmonization at record level is an important step to avoid biasing the estimates of the actual number of people involved in the migration processes. In particular, we plan to adopt the latent approach to entity resolution described in Tancredi et al. (2019), where multiple data sets are combined with the aim of clustering the records into small groups, each corresponding to a single individual. This will be performed using an approach based on a Bayesian latent structure. Data will be provided under a protocol agreement established between ISTAT and the PI of the project (PROTOCOLLO DI RICERCA per la collaborazione sul tema 'Metodi Bayesiani per la demografia' 2017).
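
To illustrate why the linkage step matters for the estimates, the following minimal Python sketch compares a naive head count with a count taken after linking records across two lists. It is a deliberately simplified, deterministic stand-in and not the Bayesian latent approach of Tancredi et al. (2019). The records, the helper names `normalise` and `count_individuals`, and the matching rule (normalised name plus birth year) are all hypothetical.

```python
def normalise(record):
    """Build a comparison key from a (name, birth_year) record.

    Assumption: lower-casing and collapsing whitespace is enough to match
    the same person across lists. Real data would need a probabilistic model.
    """
    name, birth_year = record
    return (" ".join(name.lower().split()), birth_year)


def count_individuals(list_a, list_b):
    """Return (naive_count, linked_count) for two lists of records."""
    keys_a = {normalise(r) for r in list_a}
    keys_b = {normalise(r) for r in list_b}
    naive = len(list_a) + len(list_b)   # ignores overlap between the lists
    linked = len(keys_a | keys_b)       # each matched individual counted once
    return naive, linked


# Hypothetical toy records: two people appear in both lists.
survey = [("Anna Rossi", 1985), ("Luca Bianchi", 1990), ("Maria Verdi", 1978)]
register = [("anna  rossi", 1985), ("Paolo Neri", 1982), ("MARIA VERDI", 1978)]

naive, linked = count_individuals(survey, register)
print(f"naive count: {naive}, count after linkage: {linked}")
```

The naive count treats every row as a distinct person, so any overlap between the lists inflates the estimate. A deterministic rule like the one above gives no measure of linkage uncertainty, which is what a model-based approach is intended to provide.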

2. The introduction of an explicit model to account for measurement errors in the observed variables.
Administrative data are often collected for non-statistical purposes and must be cleaned and re-harmonized before being fed into a learning algorithm.
Also, when using integrated models that combine data from different sources, the quality of each data set may differ considerably. In theory, one should be able to 'weigh' each source of information according to its reliability. Within a hierarchical Bayesian model, this can be done by replacing each observed datum with a random variable centred at the observed value, with a standard error that can either be estimated or introduced exogenously as prior information, following the general approach described in Arima et al. (2017).
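
The idea of weighing sources by reliability can be sketched in its simplest Gaussian form. The Python example below assumes that each source reports the same unknown quantity with independent Normal error and a known standard error, and that the prior on the quantity is flat. Under these assumptions the posterior mean is the precision-weighted average of the reported values. This is an illustrative simplification and not the model of Arima et al. (2017); the function name `combine_sources` and the numbers are hypothetical.

```python
def combine_sources(values, std_errors):
    """Posterior mean and standard deviation of a common true value.

    Assumptions (illustrative only): value_j ~ Normal(theta, std_error_j**2),
    independent across sources, with a flat prior on theta.
    """
    precisions = [1.0 / s ** 2 for s in std_errors]
    total_precision = sum(precisions)
    # Each source is weighted by its precision (inverse variance).
    post_mean = sum(p * v for p, v in zip(precisions, values)) / total_precision
    post_sd = (1.0 / total_precision) ** 0.5
    return post_mean, post_sd


# Hypothetical flow counts for the same corridor from two sources:
# a reliable register (small standard error) and a noisy survey.
mean, sd = combine_sources(values=[1000.0, 1200.0], std_errors=[50.0, 200.0])
print(f"combined estimate: {mean:.1f} (sd {sd:.1f})")
```

The combined estimate stays close to the more reliable source, and its standard deviation is smaller than that of either source taken alone.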

3. The use of a more flexible parametric class.
Options for the data model distribution proposed in the literature include the Poisson, the Normal and the Poisson-Binomial mixture distributions. The last two are typically concentrated around the mean, hence they are a suitable choice for good-quality data where the expected variance is rather small. Even when a significant prior variance is assumed in the Gaussian case, results remain closer to the initial data than under a Poisson distribution. The Poisson model is the leading choice for all other data, whose quality is not very high or is even unknown. Unfortunately, the Poisson distribution has limitations that do not always suit the population it refers to. One of the main ones is that the mean and the variance are equal, which implies equi-dispersion of the data. This is not always the case, as population characteristics may require a variance higher or lower than the mean, depending on whether the population is more heterogeneous or more homogeneous than in the equi-dispersed case.
In this respect, we will propose a more flexible version of the Poisson model, which includes an extra parameter to account for over- and under-dispersion around the mean, namely the Conway-Maxwell-Poisson class of distributions (Kadane et al. 2006).
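
The role of the extra parameter can be seen numerically. In the standard parameterization, the Conway-Maxwell-Poisson probability of a count y is proportional to lam**y / (y!)**nu, so that nu = 1 recovers the Poisson model. The minimal Python sketch below approximates the normalising constant by truncating the infinite sum; the truncation point `max_count` and the function names are choices made for this illustration.

```python
import math


def com_poisson_pmf(lam, nu, max_count=200):
    """Return [P(Y=0), ..., P(Y=max_count)] for a Conway-Maxwell-Poisson law.

    P(Y=y) is proportional to lam**y / (y!)**nu. The normalising constant
    is approximated by truncating the infinite sum at max_count (assumption:
    the mass beyond max_count is negligible for the parameters used).
    """
    # Work on the log scale to avoid overflow in y! for large y.
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1)
             for y in range(max_count + 1)]
    shift = max(log_w)
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def mean_and_variance(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ... given its pmf."""
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var


for nu in (0.5, 1.0, 2.0):
    mean, var = mean_and_variance(com_poisson_pmf(lam=3.0, nu=nu))
    print(f"nu={nu}: mean={mean:.3f}, variance={var:.3f}, ratio={var / mean:.3f}")
```

Values of nu below 1 give a variance larger than the mean (over-dispersion), nu = 1 gives equality as in the Poisson case, and values above 1 give a variance smaller than the mean (under-dispersion).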

Last but not least, we plan to produce an R package, with the goal of complementing the functionality already offered by existing packages.

Call code: 
1454277

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma