Clustering methods for complex, high-dimensional data

Year
2018
Principal investigator: Marco Alfo' - Full Professor
ERC subsector of the principal investigator
Research group members
Abstract

Observed social and economic phenomena are increasingly complex, from both a conceptual and an empirical point of view, as new technologies make it possible to gather huge amounts of information. The resulting data are relatively new to statistical analysis, since only during the last decade have enough computational resources become available to start dealing with them. Complexity arises either when the dimension of the observed data is large or when data of a non-standard type are analyzed. One of the main tasks for modern statistical approaches is to develop new methods to cluster complex data structures. Clustering methods can be either model-based or heuristic. In the first case, the data are assumed to be generated from a well-specified probabilistic model. Finite mixture models are a powerful tool to represent a clustering structure, where data arise from groups (also referred to as components) described by homogeneous density functions with cluster-specific parameters. The number of mixture components is an unknown parameter, and several criteria have been proposed for its choice. In a Bayesian approach to parameter estimation, it can be hard to select the number of components, as the corresponding posterior may be (and often is) flat. To avoid overly subjective priors, a solution may be to consider infinite mixture models. Heuristic clustering methods are not based on a proper probability distribution for the observed data; rather, a (penalized) objective function to be minimized is often introduced. Standard clustering approaches must be modified before they can be applied to complex data, because of the high computational cost and the complexity of the data. Our project aims at defining clustering methods for such complex, high-dimensional data; specifically, we will introduce novel data analysis methods and estimation algorithms. The latter will be included in software macros/libraries to be shared with the aim of helping practitioners working in the field.
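
As a rough illustration of the model-based approach mentioned in the abstract, the Python sketch below fits a finite mixture with density f(y) = sum_g pi_g f(y | theta_g) and compares candidate numbers of components using an information criterion. It is not the project's method: the univariate Gaussian components, the EM algorithm, the BIC criterion, the quantile-based initialization, and the names `normal_pdf`, `fit_mixture` and `bic` are all assumptions made for the sake of the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, n_iter=200):
    """Fit a k-component univariate Gaussian mixture by EM (illustrative only).

    Returns (weights, means, variances, log-likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n

    # Assumed initialization: equal weights, means at evenly spaced quantiles.
    weights = [1.0 / k] * k
    means = [ordered[int((g + 0.5) * n / k)] for g in range(k)]
    variances = [overall_var] * k

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])

        # M-step: update the cluster-specific parameters.
        for g in range(k):
            n_g = max(sum(r[g] for r in resp), 1e-12)
            weights[g] = n_g / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / n_g
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / n_g
            variances[g] = max(var, 1e-6)  # floor to avoid degenerate components

    loglik = sum(
        math.log(max(sum(w * normal_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)), 1e-300))
        for x in data
    )
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """BIC with 3k - 1 free parameters (k means, k variances, k - 1 weights)."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


if __name__ == "__main__":
    # Simulated data from two well-separated groups (hypothetical example).
    rng = random.Random(0)
    sample = ([rng.gauss(0.0, 1.0) for _ in range(150)]
              + [rng.gauss(6.0, 1.0) for _ in range(150)])
    for k in range(1, 5):
        _, _, _, ll = fit_mixture(sample, k)
        print(k, round(bic(ll, k, len(sample)), 1))
```

Lower BIC values indicate a better trade-off between fit and model complexity. In the high-dimensional settings the project targets, a scheme this simple would likely be too costly or unstable, which is part of the motivation given in the abstract.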

ERC
PE1_14, PE1_18
Keywords:
CLUSTER ANALYSIS, STATISTICAL DATA ANALYSIS, MULTIVARIATE ANALYSIS, COMPUTATIONAL STATISTICS, SOCIAL NETWORKS

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma