Research on hardware and software frameworks for discovering regularities in massive datasets (big data mining) has recently gained a strategic role, with remarkable impact in many fields.
The objectives of the PARADISE project are:
1) Development and implementation of a distributed multi-agent clustering algorithm (hereinafter LS-EABC, Large Scale Evolutive Agent Based Clustering) and its variant for solving classification problems (hereinafter SLS-EABC, Supervised Large Scale Evolutive Agent Based Classifier), both based on evolutive optimization and designed to fully exploit massively parallel (i.e. multicore) hardware systems. Moreover, both algorithms will be able to process data in non-metric spaces and will be designed to perform Local Metric Learning, identifying the subsets of relevant features in which significant clusters exist.
2) Design and realization of a Distributed Computing Platform (hereinafter PDCP) based on 8 low-cost devices (Parallella boards), each equipped with a 16-core co-processor (128 physical cores in total).
3) Test and comparison between PDCP and a workstation (hereinafter XPW) equipped with an Intel Xeon Phi CPU, in running LS-EABC and SLS-EABC.
4) Application of LS-EABC and SLS-EABC on three different relevant topics, aiming at showing the high flexibility of the proposed system. Specifically:
a) Detection and classification of faults in medium-voltage power lines for predictive maintenance systems in Smart Grids.
b) Real-time identification of security attacks in Wi-Fi networks.
c) Mining metabolic networks to characterize pathological gut flora mixtures, aimed at precision medicine.
Two main innovative topics characterize the proposed project:
1) The development of an unsupervised data mining algorithm (LS-EABC) and of a supervised classification system (SLS-EABC), both conceived to deal with Big Data, to address problems defined in both metric and non-metric spaces, and to perform Local Metric Learning.
2) The design and the implementation of a low-cost and massively multi-core computing system based on the Parallella boards.
Moreover, the effectiveness and efficiency of LS-EABC and SLS-EABC in dealing with big data will be evaluated on the following three applications, with the aim of highlighting the level of abstraction of the proposed algorithms and their potential impact on the research topics considered.
a) Predictive maintenance systems in Smart Grids
Under the Framework Agreement between DIET and AReti S.p.A (ex Acea Distribuzione S.p.A) concerning research activities on Smart Grids, a fault detection system for medium-voltage lines has been designed, aimed at predictive maintenance of the network and its elements, as a tool in a Condition Based Maintenance (CBM) system. The goal is to minimize the total cost of inspection and repair by gathering and interpreting heterogeneous data related to the operating condition of the network and its components. In this regard, collecting, storing and managing real-time measurements is of paramount importance for correctly representing a possible breakdown scenario in a suitably structured space. To this end, a sensor network captures a data structure with 20 different features, including categorical and metric features and time series of short breaks. In the first pilot learning system [1], the core clustering procedure was based on a customized k-medoids algorithm. In this application, LS-EABC is adopted as a new clustering approach, with the advantage of identifying the subsets of features that are peculiar to each class of failure.
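To make the role of a dissimilarity measure over heterogeneous records more concrete, the following is a minimal sketch of a plain k-medoids procedure driven by a weighted dissimilarity over mixed categorical and numeric fields. It is not the customized k-medoids of [1] nor LS-EABC: the function names, the two-field toy records, and the per-feature weights are all assumptions made for illustration. The weights only mark the place where a metric learning step could act.

```python
import random


def mixed_dissimilarity(a, b, weights):
    """Weighted dissimilarity between two records with mixed field types.

    Numeric fields (assumed pre-scaled to [0, 1]) contribute |x - y|;
    any other field is treated as categorical and contributes 0 or 1.
    """
    total = 0.0
    for x, y, w in zip(a, b, weights):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += w * abs(x - y)
        else:
            total += w * (0.0 if x == y else 1.0)
    return total / sum(weights)


def assign(records, medoids, weights):
    """Map each medoid index to the indices of the records closest to it."""
    clusters = {m: [] for m in medoids}
    for i, r in enumerate(records):
        nearest = min(
            medoids, key=lambda m: mixed_dissimilarity(r, records[m], weights)
        )
        clusters[nearest].append(i)
    return clusters


def cluster_cost(candidate, members, records, weights):
    """Sum of dissimilarities from a candidate medoid to all cluster members."""
    return sum(
        mixed_dissimilarity(records[candidate], records[j], weights)
        for j in members
    )


def k_medoids(records, k, weights, iterations=20, seed=0):
    """Plain alternating k-medoids (illustrative, not the method of [1])."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(records)), k)
    for _ in range(iterations):
        clusters = assign(records, medoids, weights)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep a medoid whose cluster became empty
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: cluster_cost(c, members, records, weights))
            )
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return medoids, assign(records, medoids, weights)
```

For example, calling `k_medoids(records, 2, [1.0, 1.0])` on a list of `(scaled_value, category)` tuples returns the medoid indices and a mapping from each medoid to the indices of its cluster members. Setting a weight to zero removes the corresponding feature from the comparison, which is the simplest way to see how a feature subset changes the resulting clusters.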
b) Real-time identification of security attacks in Wi-Fi networks
Network Intrusion Detection Systems (NIDS) have become key components of the network security architecture, specifically designed to detect and highlight network traffic anomalies [2]. In 2016, the networking and computational intelligence research groups of DIET started a collaboration to exploit the AWID dataset (a rigorously constructed traffic dataset, encompassing 15 different types of known WiFi attacks [3]) to develop new traffic analysis and anomaly detection algorithms. Starting from this huge dataset, we plan to use SLS-EABC to synthesize an ensemble of classifiers that identifies DoS attacks in real time. A preliminary study showed that most attacks can be identified by analyzing a single frame, while others need to be sensed and represented by (small) sequences of frames. Each classifier is specifically trained to discriminate a given type of attack from all the others. These classifiers can work in parallel, and the same ensemble of classifiers can be replicated to enhance the overall parallelism, exploiting the PDCP hardware. Our aim is to demonstrate that it is possible to build an effective, modular, and inexpensive NIDS for high-speed WiFi networks.
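The one-against-all ensemble described above can be sketched as follows. This is only an illustration of the structure (one detector per attack type, all evaluated independently on the same frame), not the SLS-EABC classifiers themselves. The attack labels, the frame field names, and the threshold detectors are hypothetical placeholders and are not taken from the AWID dataset. A thread pool stands in for the parallel hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def make_threshold_detector(field, threshold):
    """Build a toy one-vs-rest detector (hypothetical stand-in for a trained model).

    The returned score is positive when the frame feature exceeds the
    threshold, and is normalized so that detectors are roughly comparable.
    """
    def detector(frame):
        return (frame.get(field, 0.0) - threshold) / threshold
    return detector


# Hypothetical attack labels and frame features, for illustration only.
DETECTORS = {
    "deauthentication_flood": make_threshold_detector("deauth_per_sec", 50.0),
    "beacon_flood": make_threshold_detector("beacons_per_sec", 200.0),
}


def classify(frame, detectors=DETECTORS, pool=None):
    """Run every detector on the frame and return the best positive label.

    If a pool is given, the detectors are evaluated concurrently; each one
    is independent of the others, so the result does not depend on the order.
    """
    names = list(detectors)
    if pool is None:
        scores = [detectors[name](frame) for name in names]
    else:
        scores = list(pool.map(lambda name: detectors[name](frame), names))
    best_score, best_name = max(zip(scores, names))
    return best_name if best_score > 0 else "normal"


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(classify({"deauth_per_sec": 120.0, "beacons_per_sec": 10.0}, pool=pool))
```

Because each detector only reads the frame, the whole ensemble can also be replicated across workers, with each replica handling a different portion of the traffic.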
c) Mining metabolic networks to characterize pathological gut flora mixtures, aimed at precision medicine
In biology, many systems can be described by means of structured data such as graphs. The need for non-metric machine learning approaches is evident in many biologically relevant cases. Mapping networks into a common feature space spanned by network invariants might lead to a considerable loss of information on connectivity. Metabolic pathways are an example of complex biological systems described by networks of chemical reactions, for which finding motifs endowed with meaningful semantics might play a major role in defining suitable dissimilarity measures. The target is to measure similarities between the metabolic networks of micro-organisms hosted in the human gut, in order to find clusters of (functionally) similar micro-organisms. The similarity can be evaluated by considering either the complete pathway or the portions of the network that are particularly important for the (human) host. Cluster profiles can be evaluated for healthy patients. Sick patients can then be characterized by matching their profiles against those of healthy subjects, fostering the development of low-cost targeted drugs (balanced drugs containing a properly chosen mixture of micro-organisms) in order to restore the optimal gut flora equilibrium, which is crucial to avoid systemic diseases.
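As a concrete, deliberately simple illustration of comparing networks directly on their structure rather than through a vector of invariants, the sketch below computes a Jaccard dissimilarity between the reaction edge sets of two networks, optionally restricted to a portion of interest. This measure is an assumption chosen for brevity and is not the motif-based dissimilarity the project aims to define. The metabolite names and toy networks are made up.

```python
def jaccard_dissimilarity(edges_a, edges_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of directed reaction edges."""
    union = edges_a | edges_b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(edges_a & edges_b) / len(union)


def restrict(edges, metabolites):
    """Keep only the edges whose endpoints both belong to the given subset."""
    return {(s, p) for s, p in edges if s in metabolites and p in metabolites}


# Toy networks: each edge is a (substrate, product) pair (hypothetical data).
organism_a = {("glucose", "pyruvate"), ("pyruvate", "acetate")}
organism_b = {("glucose", "pyruvate"), ("pyruvate", "lactate")}

# Dissimilarity over the complete networks.
full = jaccard_dissimilarity(organism_a, organism_b)

# Dissimilarity over a portion of the network assumed relevant to the host.
portion = {"glucose", "pyruvate"}
partial = jaccard_dissimilarity(
    restrict(organism_a, portion), restrict(organism_b, portion)
)
```

A pairwise matrix of such dissimilarities is all that a medoid-based clustering procedure needs, which is why no explicit vector representation of the networks is required.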
[1] De Santis, et al. Modeling and recognition of smart grid faults by a combined approach of dissimilarity learning and one-class classification. Neurocomputing (2015)
[2] H. Alipour, et al. Wireless Anomaly Detection Based on IEEE 802.11 Behavior Analysis. IEEE Trans. on Information Forensics and Security (2015)
[3] C. Kolias, et al. Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset. IEEE Communications Surveys & Tutorials (2016)