Applications in many domains produce high-dimensional data, raising the challenge of interpreting data sets that often consist of millions of measurements. A first step towards addressing this challenge is the use of data reduction techniques, which are essential in the data mining process to reveal natural structures and to identify interesting patterns in the analyzed data. The research project addresses relevant classes of dimensionality reduction techniques introduced to account for the complexities of high-dimensional data. We assume that this complexity may be captured via two approaches that summarize the two modes (rows and columns) of a data matrix: an asymmetric and a symmetric treatment of the modes. In the asymmetric approach, the two modes play different roles: the first mode represents objects and is summarized by clustering methods, while the other mode refers to variables and is reduced by a factorial technique. In the symmetric approach, the two modes play equal roles and both are summarized by clustering techniques. Both approaches will be considered in a finite mixture framework, given its potential advantages over non-probabilistic clustering techniques. In particular, attention will be focused on developing clustering/biclustering and simultaneous clustering/factorial reduction approaches for continuous or discrete data matrices. The impact of time occasions in the model specification and the similarities between the finite mixture and the fuzzy logic approaches will also be considered.
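As a point of reference for the asymmetric treatment, a minimal tandem sketch in Python is given below: the variable mode is reduced by principal components and the object mode is then clustered with a finite Gaussian mixture. The data, the number of components and the number of factors are arbitrary illustrations; the simultaneous models developed in the project would integrate the two steps into a single probabilistic formulation rather than chaining them as done here.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    # Illustrative data matrix: n objects (rows) by p variables (columns).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))

    # Variable mode: factorial reduction (here, PCA to q components).
    q = 5
    scores = PCA(n_components=q).fit_transform(X)

    # Object mode: model-based clustering via a finite Gaussian mixture.
    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(scores)        # hard partition of the objects
    posterior = gmm.predict_proba(scores)   # posterior membership probabilities

The posterior membership probabilities illustrate the probabilistic flavour of the finite mixture framework; in the simultaneous approaches, the factorial reduction itself becomes part of the mixture model, as in mixtures of factor analyzers.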
The massive size and high dimensionality of observed data introduce unique computational and statistical challenges. Frequently, the structure detected by dimensionality reduction techniques can give first insights into the data generating process. Clustering and factorial reduction approaches can therefore be regarded as central tools for data mining and knowledge discovery in everyday empirical applications, and their application in several domains (marketing, customer satisfaction, psychology, text mining and genomics) is well documented. Dimensionality reduction can be linked to variable selection, where the aim is to discover the most discriminating variables among those observed; to biclustering, where the interest is in defining homogeneous blocks of a data matrix; or to the analysis of relationships between observed features and potential latent constructs, where the interest lies in highlighting possible causal relationships.
Within this general topic, the focus will be on the following sub-topics:
Topic 1
- Extension of finite mixtures of SEMs to longitudinal data
- Extension of finite mixtures of factor analyzers to clustering and dimensionality reduction in a longitudinal setting
- General techniques for simultaneous fuzzy clustering and factorial reduction of a data matrix (see the sketch after this list)
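To fix ideas on the fuzzy side, the following minimal sketch implements standard fuzzy c-means in Python; the function name, the fuzzifier m and the fixed number of iterations are illustrative assumptions, and the techniques envisaged in the project would couple such soft memberships with a factorial reduction step rather than running them on the raw variables.

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
        """Standard fuzzy c-means: memberships u (n x c) and prototypes v (c x p)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        u = rng.dirichlet(np.ones(c), size=n)          # initial fuzzy memberships
        for _ in range(n_iter):
            um = u ** m
            v = (um.T @ X) / um.sum(axis=0)[:, None]   # prototype update
            # Squared Euclidean distances between objects and prototypes.
            d2 = ((X[:, None, :] - v[None, :, :]) ** 2).sum(axis=2) + 1e-12
            # Membership update: u_ik proportional to d2_ik^(-1/(m-1)), normalized per object.
            inv = d2 ** (-1.0 / (m - 1.0))
            u = inv / inv.sum(axis=1, keepdims=True)
        return u, v

Each row of u sums to one, so every object belongs to all clusters with graded membership; this is the main conceptual contact point with the posterior probabilities of a finite mixture and motivates the study of similarities between the two approaches.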
Topic 2
- Extension of finite mixtures of factor analyzers to simultaneous clustering of subjects and features when multivariate discrete outcomes are available
- Hybrid model unifying finite mixtures of factor analyzers and hidden Markov models for biclustering, allowing for homogeneous blocks in time-course experiments
- Fuzzy biclustering for qualitative variables, dealt with either by fuzzy sets or by cluster prototypes based on modes
All the models discussed above not only pose challenging inference problems but also raise issues such as heavy computational cost and algorithmic instability due to the high dimensionality of the data. Estimation is generally based on iterative methods, such as the Expectation-Maximization (EM) algorithm, variational methods, and sampling-based methods such as Monte Carlo, MCMC or sequential Monte Carlo.
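For concreteness, a minimal sketch of the EM algorithm for a finite Gaussian mixture is given below (plain NumPy/SciPy, full covariances, random-responsibility initialization; all names and tuning values are illustrative). In the models listed above, the M-step would additionally handle the factor-analytic, hidden Markov or discrete-outcome structure, and multiple random starts are typically used to mitigate the algorithmic instability just mentioned.

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        """Minimal EM for a K-component Gaussian mixture with full covariances."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        resp = rng.dirichlet(np.ones(K), size=n)        # random initial responsibilities
        for _ in range(n_iter):
            # M-step: weights, means and covariances from the current responsibilities.
            nk = resp.sum(axis=0)
            weights = nk / n
            means = (resp.T @ X) / nk[:, None]
            covs = []
            for k in range(K):
                d = X - means[k]
                covs.append((resp[:, k, None] * d).T @ d / nk[k] + 1e-6 * np.eye(p))
            # E-step: responsibilities from the updated parameters.
            log_dens = np.column_stack([
                np.log(weights[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
                for k in range(K)
            ])
            resp = np.exp(log_dens - logsumexp(log_dens, axis=1, keepdims=True))
        return weights, means, covs, resp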