Empowering the flexibility and use of the Mallows model for ranking data analysis

Submitted by Anonymous (not verified) on Mon, 11/04/2022 - 12:48

Name and position of the project proposer:

sb_p_2550202

Year:

2021

Abstract:

The project focuses on parametric modelling of ranking data, aiming to introduce a series of methodological advances and efficient computational strategies to enhance the analysis of preferences from ranked data. Our research interests concern the class of Mallows models (MMs), which rely on a notion of distance among permutations and occupy a central role in the ranking literature. Despite the wide range of available metrics, the choice is typically limited to the Kendall or Cayley metrics, because of the analytical simplifications they offer. Within our project, we intend to go beyond these few conventional options and explore the formal properties of the MM with the Spearman distance (theta-model). The attractive feature of this model is its correspondence with the restriction of the normal distribution to the permutation set, thanks to which it is expected to enjoy convenient closed-form expressions that could solve the critical problem of estimating the modal ranking. This means that, unlike MMs with other metrics, efficient and accurate inferential procedures could be developed in which the computational burden of inferring the discrete parameter is significantly reduced. Reducing the need to explore the permutation space would favour the analysis of rankings of many alternatives, as well as the handling of censored observations via data augmentation. Additionally, efficient estimation within the finite mixture framework is still an open area of research, to be pursued in order to extend the applicability of theta-models to samples characterized by a group structure. We stress that these further novelties require a methodological effort to construct new approximations of the Spearman distance distribution and to study their behaviour under various parameter settings and numbers of items. Another contribution will be the release of a new R package to fill the gap in publicly available software for implementing theta-models and mixtures thereof.
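
To make the link with the normal distribution concrete, the following derivation sketches why the modal ranking could be estimated in closed form under the Spearman distance. The notation is assumed here for illustration and does not come from the project text: n items, N observed full rankings r^(1), ..., r^(N), a modal (consensus) ranking rho, a concentration parameter theta > 0, and a normalising constant Z(theta).

```latex
\begin{align*}
P(r \mid \rho, \theta) &= \frac{\exp\{-\theta\, d_S(r,\rho)\}}{Z(\theta)},
  \qquad d_S(r,\rho) = \sum_{i=1}^{n} (r_i - \rho_i)^2, \\
d_S(r,\rho) &= 2c_n - 2\sum_{i=1}^{n} r_i \rho_i,
  \qquad c_n = \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}, \\
\sum_{j=1}^{N} d_S\bigl(r^{(j)},\rho\bigr) &= 2N c_n - 2N \sum_{i=1}^{n} \bar{r}_i\, \rho_i,
  \qquad \bar{r}_i = \frac{1}{N}\sum_{j=1}^{N} r^{(j)}_i .
\end{align*}
```

The first line is the kernel of a normal density with mean rho, evaluated only at permutations. Because the Spearman distance is invariant under relabelling of the items, Z(theta) does not depend on rho, so for theta > 0 maximising the likelihood over rho amounts to maximising the inner product between the average rank vector and rho. By the rearrangement inequality, and provided the average ranks have no ties, this is achieved by ranking the items according to their average ranks, with no search over the n! permutations.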

ERC:

PE1_14

PE1_13

PE1_18

Research group members:

sb_cp_is_3388029

sb_cp_es_468970

Innovativeness:

The project is intended to enrich the state of the art of parametric ranking data modelling with several impactful innovations. From a methodological perspective, we will work on the following points:

i) exploration of the theoretical properties of the theta-model;

ii) development of specific and efficient inferential procedures;

iii) performance evaluation of estimation algorithms;

iv) comparative assessment of goodness-of-fit improvements relative to other MMs;

v) handling of partially ranked observations;

vi) extension of the theta-model to the finite mixture framework;

vii) construction of approximations of the Spearman distance distribution;

viii) comparisons with existing approximations under alternative population scenarios;

ix) inclusion of subject- and item-specific covariates.

Despite the successful application of MMs to ranking data, the version with the Spearman distance as the metric on the permutation set has received less attention in the literature, which motivates point (i) above. Understanding the theoretical properties of the theta-model could lead to analytical closed-form expressions, similar to those of normal theory. These could be conveniently exploited to develop more efficient and faster estimation procedures, especially for the critical and time-consuming step of estimating the modal ranking parameter (ii). Once the inferential properties have been suitably verified by means of extensive simulation studies (iii), the usefulness of the proposed model and estimation frameworks will be checked in practice through applications to real-world data, and a comparative evaluation with MMs based on other distances will be performed in order to assess the relative merits in terms of goodness-of-fit (iv). Furthermore, we will invest effort in the appropriate handling of partial data affected by various types of censoring by adopting, for instance, a data augmentation approach (v). In the presence of large samples, it is reasonable to expect unobserved heterogeneity in the sampled population. In this regard, an additional research task will be devoted to extending the theta-model setup to the finite mixture framework (vi), allowing for model-based clustering of partial rankings via the construction of an effective EM algorithm. On the other hand, handling ranked sequences with a large number of items is a critical obstacle that hampers the application of ranking-based analysis in several research fields, especially for MMs, whose consensus ranking parameter can take a number of possible values that grows factorially with the number of items.
To overcome this limitation, a possible solution lies in deriving an accurate approximation of the Spearman distance distribution (vii) and in checking its effectiveness against the very few alternatives available in the ranking literature, specifically under multiple model parameter settings (viii). From a methodological point of view, the inclusion of covariates in the theta-model and in its mixture version is still an open issue that we aim to address within our project (ix).
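
As a minimal illustration of points (ii) and (vii), and not the procedures the project will actually develop, the Python sketch below computes the Spearman distance, estimates the consensus ranking by ranking the items on their average ranks, checks that estimate against a brute-force search over all permutations, and tabulates the exact distribution of the Spearman distance by enumeration. All function names and the toy sample are hypothetical. Rankings are assumed to be complete and encoded as tuples where position i holds the rank of item i (1 = most preferred).

```python
from collections import Counter
from itertools import permutations


def spearman_distance(r, s):
    """Sum of squared rank differences between two complete rankings."""
    return sum((a - b) ** 2 for a, b in zip(r, s))


def rank_of_means(rankings):
    """Consensus estimate: rank the items by their average rank.

    Ties between average ranks are broken by item index (an arbitrary choice).
    """
    n_items = len(rankings[0])
    means = [sum(r[i] for r in rankings) / len(rankings) for i in range(n_items)]
    order = sorted(range(n_items), key=lambda i: means[i])
    rho = [0] * n_items
    for position, item in enumerate(order, start=1):
        rho[item] = position
    return tuple(rho)


def brute_force_consensus(rankings):
    """Reference solution: minimise the total Spearman distance over all n! rankings."""
    n_items = len(rankings[0])
    return min(
        permutations(range(1, n_items + 1)),
        key=lambda rho: sum(spearman_distance(r, rho) for r in rankings),
    )


def distance_distribution(n_items):
    """Exact counts of each Spearman distance from the identity ranking.

    Enumerates all n! permutations, so it is only feasible for small n.
    """
    identity = tuple(range(1, n_items + 1))
    return Counter(spearman_distance(p, identity) for p in permutations(identity))


# Toy sample of five rankings of four items (made up for illustration).
sample = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 3, 2, 4), (1, 2, 4, 3), (2, 1, 4, 3)]
print(rank_of_means(sample))          # (1, 2, 3, 4)
print(brute_force_consensus(sample))  # (1, 2, 3, 4)
print(sorted(distance_distribution(3).items()))  # [(0, 1), (2, 2), (6, 2), (8, 1)]
```

The enumeration in `distance_distribution` visits every permutation, which is exactly the cost that becomes prohibitive as the number of items grows and that an accurate approximation of the distance distribution would avoid.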

Computational advances are also among the challenging research goals of the present project. They comprise the following main tasks:

(a) development of a new R package for implementing the methodological innovations;

(b) optimisation of the source code;

(c) interoperability and comparison with other packages;

(d) construction of supporting material for the R package;

(e) maintenance and updating of the R package.

Step (a) addresses the lack of software for the MM with the Spearman distance in the popular, open-source R statistical environment. The aim is to promote a wider use of our statistical proposals among researchers interested in ranked data analysis. Of course, this will require a substantial effort in optimising the implementation of the procedures, both through integration with faster languages, such as C++, and through the possible adoption of parallel execution (b). Point (c) is important to assist users who wish to carry out an in-depth analysis based on fitting alternative competing models to the same set of observations. In this regard, the various aspects of interoperability and comparison with other packages have to be explored, to offer the user a convenient and controlled exchange of data structures among different libraries. Creating supporting material, including a manual and vignettes, is a necessary step (d) to explain the functionalities of the software, which must be maintained and updated over time with the integrations and improvements suggested by future research developments (e).

Call code:

2550202

Keywords:

STATISTICAL MODELS

PROBABILITY

STATISTICAL INFERENCE

STATISTICAL DATA ANALYSIS

SYSTEMS AND SOFTWARE