The ability to mine data to extract useful knowledge has become an important challenge and it is widely studied in the data mining and machine learning communities.
This proposal focuses on two basic problems: clustering and graph mining. If the data to be mined represents a set of independent entities and their attributes clustering is the basic problem of classification and data analysis. In particular, given a metric that measures similarity and dissimilarity of items the main goal of clustering is to categorize a set of data items into clusters such that items are grouped in the same cluster when they are similar and items that are in different clusters must be different as much as possible.
However, in many cases, there is interesting knowledge to be mined from the relationships between entities (e.g. friendship relations among members of a social network). The graph representation, that is, a collection of nodes and links between nodes, easily represents entities, their attributes, and their relationships to other entities. Graph mining is the extraction of novel and useful knowledge from a graph representation of data.
Clustering and graph mining have been extensively studied by researchers belonging to different areas. As a result there are many results but many issues are still open. Main open issues are the ability of dealing with big data and with sophisticated models. We will address these problems with a special focus on social networks and bioinformatics networks.
The group has an excellent record in the analysis and design of combinatorial algorithms and especially approximation algorithms and on-line algorithms. Many of the problems that we plan to study are computationally hard and therefore intractable when the input size is large. Also in many cases the input is presented on-line and irrevocable decisions should be taken immediately. For these reasons techniques of approximation algorithms and on-line algorithms are deemed.
The problems that we plan to study are computationally hard and therefore intractable when the input size is large. For this reason we plan to use techniques of algorithmic approximation for which the group has a long standing record. Such algorithms provably achieve near optimal solutions yet are efficient enough to deal with large scale instances.
Also in many cases optimal solutions cannot be achieved because of uncertainty in the input instance or because a model for real-life problems is too complicated for analysis and only an approximation thereof can be studied.
A third issue arises from the fact that due to uncertainty of the data we must be able to enumerate all interesting solutions that need to be verified. This is particularly relevant in bioinformatics when we need to define possible solutions that can be verified in laboratory.
We plan to apply the following methodologies and techniques that have been successfully applied by the team in other areas and that have a limited application in the area.
Streaming and online algorithms
One of the current main challenges in data mining related to big data problems is to find adequate approaches to analysing massive amounts of data that are so large that they might not be analysed directly. Also it is known that sometimes data are presented in a stream and decisione should be taken immediatley. In these case data stream analysis is used that can be classified into two main categories:
Offline analysis: We consider a portion of data (usually large data) and apply an offline clustering algorithm to analyse the data.
Online analysis: The data are analysed in real time. These kinds of algorithms are constantly receiving new data and are not usually able to keep past information.
A new generation of online and streaming algorithms is currently being developed in order to manage social big data challenges, and these algorithms require high scalability in both memory consumption and time computation.
Privacy issues in social networks
In the era of online big data and social media, protecting the privacy of the users on social media has been regarded as an important issue. Many privacy-preserving studies have been proposed to address privacy-related issues; there are two main well-known approaches. The first one is to exploit ¿k-anonymity¿, which is a property possessed by certain anonymised data. Given the private data and a set of specific fields, the system (or service) has to make the data practically useful without identifying the individuals who are the subjects of the data. The second approach is differential privacy, which can provide an efficient way to maximise the accuracy of queries from statistical databases while minimising the chances of identifying its records. However, there are still open issues related to privacy. Social identification is the important issue when social data are merged from available sources, and secure data communication and graph matching are potential research areas.
Enumeration techniques in Bioinformatics
We observe that Issues 1-3 described in the previous section above play a major role in the applications area of social networks and bioinformatics that will be considered in the project.
In fact the number and coverage of public databases that collect experimental data on protein physical bindings of diverse organisms have been increasing with the advancements in high-throughput techniques. Although there is no established standard database of PPIs today, there have been efforts to integrate existing interactions in publicly available data-bases. As of today, Human Protein Reference Database (HPRD) (http://www.hprd.org) includes 41,000 Protein-protein interactions between Human proteins that are derived from a number of platform. Similarly, another freely accessible database BIOGRID includes more than 1,5 M raw interactions that are represented in the form of networks to facilitate the process of knowledge discovery.
We also have robustness issue due to uncertainty in the input data; in fact in many cases proposed interactions are often obtained by in-vitro experiments and therefore data are highly uncertain.
Finally we observe that interactions are complex and modeling interactions among proteins using a graph is not adequate because there are unknown mechanisms that explain complex interactions occurring in reality.
For all above reasons we follow the approach that researchers of the group has successfully applied in the study of metabolic networks. In this area it is important to be able to enumerate all significant solutions that represent hypothesis that should be experimentally verified in the laboratory. For this reason enumeration techniques that are able to enumerate all possible interactions arising in the complex (hyper)-graph that represent reactions in the cell.and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems vol. 40, Springer (2010).