A statistical model for binary response data with sample selection, choice based sampling, and misclassification problems

Year
2017
Proposer Giuseppina Guagnano - Associate Professor
ERC subsector of the project proposer
Research group members
Member Category
Maria Felice Arezzo Research group member
Abstract

In the social sciences, the analysis of behaviors or choices often involves statistical models in which the response variable is observed only if a particular (selection) condition is met. The data selection mechanism is therefore not random and sample selection problems arise; the methodology introduced by Heckman (1978, 1979) addresses this problem.
A more complex situation arises when the selection mechanism leads to a vast number of censored observations, so that the amount of data available for estimation may become very small. In this situation random sampling is either inefficient, because it is very costly, or not feasible. To reduce the cost of collecting data on choice behavior, choices rather than decision makers are often sampled, which yields a more balanced sample than random sampling would produce (hence the name of the sampling scheme: response-based or choice-based). In this context, Greene (1992) used the Weighted Endogenous Sampling Maximum Likelihood (WESML) estimator proposed by Manski and Lerman (1977), which, however, requires the true population proportions of cases to be known.
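The weighting idea behind WESML can be sketched as follows. This is a minimal illustration, not the estimator developed in this project: it assumes a plain probit model without the selection equation, and the names `wesml_weights`, `weighted_loglik`, and `pop_share_1` are hypothetical. Each observation is weighted by the ratio between the population share and the sample share of its outcome, which is why the true population proportion must be supplied.

```python
import math

def probit_cdf(z):
    """Standard normal CDF, used as the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def wesml_weights(y, pop_share_1):
    """Weight = population share / sample share of the observed outcome.

    pop_share_1 is the (assumed known) population proportion of y = 1.
    Both outcomes must appear in y, otherwise a sample share is zero.
    """
    sample_share_1 = sum(y) / len(y)
    w1 = pop_share_1 / sample_share_1
    w0 = (1.0 - pop_share_1) / (1.0 - sample_share_1)
    return [w1 if yi == 1 else w0 for yi in y]

def weighted_loglik(beta, X, y, weights):
    """Weighted probit log-likelihood evaluated at coefficient vector beta."""
    total = 0.0
    for xi, yi, wi in zip(X, y, weights):
        z = sum(b * v for b, v in zip(beta, xi))
        # Clip to avoid log(0) for extreme index values.
        p = min(max(probit_cdf(z), 1e-12), 1.0 - 1e-12)
        total += wi * (yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return total

# Hypothetical balanced choice-based sample drawn from a population
# in which only 10% of cases have y = 1.
y = [1, 1, 1, 0, 0, 0]
X = [[1.0]] * 6  # intercept-only design, for illustration
weights = wesml_weights(y, pop_share_1=0.10)
ll_at_zero = weighted_loglik([0.0], X, y, weights)
```

In this toy sample the over-represented outcome (y = 1) is weighted down and the under-represented one is weighted up, so the weights still sum to the sample size. Maximizing `weighted_loglik` over `beta` with any numerical optimizer would give the weighted estimate.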
Our research aims to derive an alternative estimator for a binary choice model with sample selection and choice-based sampling, one that generalizes Greene's proposal.
A subsequent goal is to handle the possible misclassification of the response variable; for example, in fraud detection based on claims data, some claims classified as honest might actually be fraudulent (and vice versa). In particular, we will try to estimate the two misclassification probabilities simultaneously with the parameters of the binary choice model, while still accounting for both kinds of sample bias.
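To make the role of the two misclassification probabilities concrete, the sketch below shows one common way of letting them enter a binary choice model. It is an assumption made for illustration, not the specification adopted in this project: `a0` denotes the probability that a true 0 is recorded as 1, `a1` the probability that a true 1 is recorded as 0, and the probit link and the function name `observed_prob` are hypothetical choices.

```python
import math

def probit_cdf(z):
    """Standard normal CDF, used as the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def observed_prob(z, a0, a1):
    """Probability that the recorded response equals 1.

    z  : linear index x'beta of the binary choice model
    a0 : assumed P(recorded 1 | true 0)
    a1 : assumed P(recorded 0 | true 1)
    """
    p_true = probit_cdf(z)
    # A recorded 1 is either a misrecorded true 0 or a correctly recorded true 1.
    return a0 * (1.0 - p_true) + (1.0 - a1) * p_true

# With no misclassification the model reduces to the ordinary probit.
p_clean = observed_prob(0.0, a0=0.0, a1=0.0)
# With misclassification the recorded probability is pulled away from it.
p_noisy = observed_prob(0.0, a0=0.1, a1=0.2)
```

Under this parameterization `a0` and `a1` appear in the likelihood alongside the model coefficients, which is what makes joint estimation conceivable in principle.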
We will assess the performance of our proposal through simulation studies and then apply the estimation procedure to real data; in particular, we will analyze the data on consumer loan default and credit card expenditure used by Greene (1992), so that our results can be compared with his.


© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma