big data mining | Ricerc@Sapienza

Efficient approaches for solving the large-scale k-medoids problem

In this paper, we propose a novel implementation for solving the large-scale k-medoids clustering problem. Unlike the better-known k-means, k-medoids suffers from a computationally intensive medoid evaluation phase, whose complexity is quadratic in both space and time; solving this task for large datasets and, specifically, for large clusters might therefore be infeasible.
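To make the quadratic cost concrete, here is a minimal sketch of a naive medoid evaluation for a single cluster. It is not the implementation proposed in the paper; the function name `find_medoid`, the Euclidean metric, and the sample points are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_medoid(points, dist=euclidean):
    """Return the index of the point with the smallest sum of distances
    to every other point in the cluster.

    Each of the n candidates is compared against all n points, so the
    function performs n * n distance evaluations.
    """
    best_index, best_cost = None, math.inf
    for i, candidate in enumerate(points):
        cost = sum(dist(candidate, other) for other in points)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical cluster: three nearby points and one outlier.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
medoid_index = find_medoid(cluster)
print(medoid_index, cluster[medoid_index])
```

Because the nested iteration touches every pair of points, doubling the cluster size roughly quadruples the work, which is why a single very large cluster can dominate the running time.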

Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering

In this paper we discuss techniques for potential speedups in k-medoids clustering. Specifically, we address the advantages of pre-caching the pairwise distance matrix, the heart of the k-medoids clustering algorithm, in order to speed up not only the execution of the algorithm itself but also the evaluation of the well-known Silhouette Index and Davies-Bouldin Index for cluster validation. A major disadvantage of such pre-caching is that it might not be suitable for large datasets, since the matrix grows quadratically with the number of objects.
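The idea of reusing a cached distance matrix for validation can be sketched as follows. This is an illustrative, single-machine example and not the distributed procedure discussed in the paper; the helper names `pairwise_distances` and `mean_silhouette`, the Euclidean metric, and the sample data are assumptions. The sketch also assumes at least two clusters and no cluster made entirely of duplicate points.

```python
import math


def pairwise_distances(points):
    """Compute the full symmetric distance matrix once (O(n^2) memory)."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d


def mean_silhouette(dist, labels):
    """Average Silhouette Index computed only from cached distances."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical data: two compact, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
cache = pairwise_distances(points)
score = mean_silhouette(cache, labels)
print(round(score, 3))
```

The same `cache` could be reused across k-medoids iterations and across several validation indices, trading quadratic memory for the avoided recomputation of distances.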

Efficient approaches for solving the large-scale k-medoids problem: Towards structured data

Clustering objects represented by structured data with possibly non-trivial geometry is certainly an interesting task in pattern recognition. Moreover, in the Big Data era, clustering huge amounts of (structured) data challenges computer science and pattern recognition researchers alike. The aim of this paper is to bridge the gap in large-scale structured data clustering.

An evolutionary agents based system for data mining and local metric learning

Discovering regularities in Big Data is nowadays a crucial task in many different applications, from bioinformatics to cybersecurity. To this end, a promising approach consists of performing data clustering with Local Metric Learning, i.e. trying to discover well-formed (compact and populated) clusters and, at the same time, a suitable subset of features corresponding to the subspace where each cluster lies.