Advances in ranked data modelling with the Extended Plackett-Luce distribution: handling data censoring, covariates and computational scalability

Submitted by Anonymous (not verified) on Tue, 19/04/2022 - 10:26

Name and position of the project proposer:

sb_p_1646401

Year:

2019

Abstract:

The project concentrates on the model-based analysis of ranking data, with the aim of introducing a series of methodological contributions and new computational strategies for better modelling of, and learning from, ranked observations. The starting point of our research is the Extended Plackett-Luce model (EPL), a recent extension of the Plackett-Luce (PL) distribution based on the relaxation of the forward order assumption, according to which preferences are elicited sequentially from the top to the bottom position. The EPL is characterised by an additional reference order parameter, which indicates the order in which positions are assigned and takes values in the complex discrete space of permutations. The theoretical properties of the EPL have not been explored earlier in the literature, but they are crucial for defining original diagnostics to test the adequacy of the EPL specification. From an inferential perspective, the mixed-type nature of the EPL parameter space (a discrete reference order alongside continuous support parameters) needs to be suitably addressed for an efficient account of estimation uncertainty. In this regard, we are interested in going beyond the existing frequentist approach by adopting the Bayesian paradigm which, through an appropriate MCMC technique linking the mixed parameter components, could offer a valuable inferential alternative. In the EPL setup, the presence of censored (partial) data, unobserved sample heterogeneity and the inclusion of covariates are still open problems that will also be considered for further advances. The project also focuses on the computational aspects related to the multivariate structure of ranked data. The major contribution will be the release of an upgraded version of our PLMIX package for the R environment. The main novelties will be the use of S3 class objects, as a first attempt to standardise ranking data formats in R and promote interoperability among existing software packages, as well as the construction of novel strategies to ensure that estimation procedures scale well computationally.
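To make the role of the reference order concrete, the following minimal Python sketch evaluates the EPL probability of a complete ordering. It assumes the usual formulation in which the EPL probability equals the PL probability of the ordering after its positions are rearranged according to the reference order, so that the forward order (1, ..., K) returns the standard PL. The function names `pl_prob` and `epl_prob`, the item labels and the support values are illustrative assumptions; they are not part of PLMIX.

```python
from itertools import permutations
from math import isclose

def pl_prob(staged_items, support):
    """Plackett-Luce probability of items listed in the order they are selected."""
    weights = [support[item] for item in staged_items]
    prob = 1.0
    for t in range(len(weights)):
        # At stage t, the chosen item competes with all items not yet selected.
        prob *= weights[t] / sum(weights[t:])
    return prob

def epl_prob(ordering, ref_order, support):
    """EPL probability of a full ordering (assumed formulation, see lead-in).

    ordering[r - 1] is the item placed at rank r (ranks are 1-based);
    ref_order[t] is the rank assigned at stage t + 1.
    """
    staged = [ordering[rank - 1] for rank in ref_order]
    return pl_prob(staged, support)

# Hypothetical support parameters for three items.
support = {"A": 0.5, "B": 0.3, "C": 0.2}
ordering = ("B", "A", "C")  # B ranked first, A second, C third

forward = epl_prob(ordering, (1, 2, 3), support)   # top-to-bottom: plain PL
backward = epl_prob(ordering, (3, 2, 1), support)  # bottom-to-top assignment
# For any fixed reference order the probabilities over all orderings sum to 1.
total = sum(epl_prob(o, (3, 1, 2), support) for o in permutations(support))
```

The same ordering receives different probabilities under the forward and backward reference orders, which is why the reference order has to be estimated together with the support parameters.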

ERC:

PE1_14

PE1_18

PE1_13

Research group members:

sb_cp_is_2178481

Innovativeness:

The state of the art in ranking modelling could benefit from our proposals in several directions:

i) exploration of the theoretical properties of the EPL model class;

ii) development of new goodness-of-fit statistics specific to the family of multistage models;

iii) handling of partially ranked observations;

iv) comparative analysis of the novel model adequacy methods with standard diagnostics for ranking models;

v) alternative approaches to estimate the reference order parameter;

vi) generalisation of the finite EPL mixture into the Bayesian domain;

vii) inclusion of subject- and item-specific covariates.

The EPL was introduced to add flexibility to stagewise models, but its characterisation in terms of theoretical properties, beyond those shared with the PL, has not been addressed earlier in the literature (i). Idea (ii) is motivated by the lack of goodness-of-fit tools for the class of multistage models. The peculiarity of this parametric class, which alludes to a sequential preference process, could be exploited to account for the complex multivariate dependence structure of the $K$-dimensional ranking space, rather than focusing only on its marginal (univariate or bivariate) features. In this sense, we expect to invest effort in the appropriate handling of data censoring, as well as in suitable adjustments to accommodate sparse data problems when spanning the multivariate features of the ranking process (iii). Point (iv) aims at assessing the relative merits of the class-specific test statistics in comparison with more general goodness-of-fit diagnostics. Formal EPL properties could suggest ways to go beyond traditional inferential procedures based on the likelihood function, for example through the construction of heuristic methods that could significantly reduce the computational burden (v). Moreover, moving to the Bayesian perspective could alleviate the issue of quantifying estimation uncertainty on the reference order parameter. This would be an advantage over the MLE approach, which does not automatically return measures of inferential precision. Additional flexibility could be gained by extending the model setup to the finite mixture framework (vi), allowing for model-based clustering of partial rankings. The latter method could be combined with the inclusion of covariates to broaden the existing approaches for the prediction of preferences (vii).

In addition, computational improvements can be achieved by working on the PLMIX package through:

(a) the creation of brand new S3 classes for the codification of well-defined top ranking datasets;

(b) the construction of membership and coercion functions for the aforementioned S3 classes;

(c) the use of S3 classes for the output of the PLMIX routines implementing estimation procedures and the definition of related specific methods for the generic R functions;

(d) the expansion of the software to account also for the Bayesian estimation of the reference order parameter;

(e) the computational scalability of Bayesian EPL estimation procedures.

Step (a) reflects the attempt to establish standard structures for storing ranking datasets in computing environments. To our knowledge, the construction of proper objects to host the input data is only at an embryonic stage in R, but it is a necessary step to assist the user in the preliminary phase of a ranking analysis. Working in this direction could help to resolve the ambiguity between the ranking and ordering formats and make it possible to manage the differences among software packages in the coding of possible ties and of various forms of censoring. Point (b) would provide effective tools to verify that data supplied from different sources are consistent with the top ranking requirements, as well as methods to convert, when feasible, the supplied observations into a proper top ordering dataset. This step aims at increasing interoperability with other packages and at promoting the exchange of alternative data formats among packages in a unified and controlled way. S3 classes can also be attached to the output of the inferential procedures, and the flexibility of generic R functions can be exploited by programming class-specific methods (c). The latter step could integrate the well-established classes and methods for MCMC output from the R packages coda and ggmcmc, to help the user with the readability, interpretation and visualisation of the estimation results. Point (d) would extend the software to the more general Bayesian mixtures of EPL to infer alternative paths in the position assignment process. However, the complex nature of both the ranking and the parameter space under the EPL formulation requires new methods to guarantee adequate computational scalability. Recent hybrid strategies, which rely on contaminating the Bayesian paradigm with resampling methods, could also be extended to the estimation of ranking models (e).
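The ranking/ordering ambiguity and the membership and coercion tools mentioned in points (a) and (b) can be illustrated with a small language-agnostic sketch, written here in Python rather than as R S3 code. It assumes items are labelled 1, ..., K and that censored entries are coded as 0; the function names `is_top_ordering`, `ordering_to_ranking` and `ranking_to_ordering` are hypothetical and do not describe the PLMIX interface.

```python
def is_top_ordering(row, n_items):
    """Membership check: distinct items in 1..n_items first, then only trailing 0s."""
    observed = [x for x in row if x != 0]
    return (
        len(row) == n_items
        and list(row[:len(observed)]) == observed      # no gaps before censoring
        and len(set(observed)) == len(observed)        # no item listed twice
        and all(1 <= x <= n_items for x in observed)   # valid item labels only
    )

def ordering_to_ranking(ordering):
    """Coercion: ordering[r - 1] = item at rank r  ->  ranking[i - 1] = rank of item i."""
    ranking = [0] * len(ordering)
    for rank, item in enumerate(ordering, start=1):
        if item != 0:
            ranking[item - 1] = rank
    return ranking

def ranking_to_ordering(ranking):
    """Coercion: ranking[i - 1] = rank of item i  ->  ordering[r - 1] = item at rank r."""
    ordering = [0] * len(ranking)
    for item, rank in enumerate(ranking, start=1):
        if rank != 0:
            ordering[rank - 1] = item
    return ordering

# A top-3 ordering of 5 items: item 4 is first, item 1 second, item 3 third.
top3 = [4, 1, 3, 0, 0]
as_ranking = ordering_to_ranking(top3)
round_trip = ranking_to_ordering(as_ranking)
```

The same numeric row means different things under the two formats, so a check such as `is_top_ordering` is only meaningful once the format of the input has been declared; this is the kind of information a dedicated class can carry.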

Call code:

1646401

Keywords:

STATISTICAL MODELS

STATISTICAL DATA ANALYSIS

PROBABILITY

SYSTEMS AND SOFTWARE