Supervised approaches for protein function prediction by topological data analysis
Topological Data Analysis is a novel approach that is useful whenever data can be described by topological structures such as graphs. The aim of this paper is to investigate whether such a tool can be used to define a set of descriptors suitable for pattern recognition and machine learning tasks. Specifically, we consider a supervised learning problem whose final goal is to predict the physiological function of proteins starting from their respective residue contact networks. Indeed, folded proteins can effectively be described by graphs, making them a useful case study for assessing the effectiveness of Topological Data Analysis in pattern recognition tasks. Experiments conducted on a subset of the Escherichia coli proteome using two different classification systems show that, for the protein function prediction problem, descriptors derived from Topological Data Analysis (namely, the sequence of Betti numbers) lead to classification performance comparable to that of descriptors derived from widely known centrality measures. Further benchmarking tests suggest that some information is retained despite the heavy compression intrinsic to the protein-to-Betti-numbers casting.
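To make the notion of Betti numbers concrete, the following minimal Python sketch computes the first two for a graph treated as a 1-dimensional simplicial complex: b0 counts connected components and b1 = |E| - |V| + b0 counts independent cycles. This is an illustrative assumption, not the construction used in the paper, which this abstract does not describe and which may build a higher-dimensional complex from the residue contact network. The function name `betti_numbers_graph` and the toy graph are hypothetical.

```python
def betti_numbers_graph(n_nodes, edges):
    """Return (b0, b1) for an undirected graph viewed as a 1-dimensional complex.

    Assumption: only nodes and edges are considered, with no higher-order
    simplices, so b1 is the cycle rank of the graph.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Drop self-loops and duplicate edges so the graph is simple.
    unique_edges = {tuple(sorted((u, v))) for u, v in edges if u != v}

    components = n_nodes
    for u, v in unique_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1

    b0 = components                          # connected components
    b1 = len(unique_edges) - n_nodes + b0    # independent cycles
    return b0, b1


# Toy example (not protein data): a triangle plus one isolated node.
toy_betti = betti_numbers_graph(4, [(0, 1), (1, 2), (2, 0)])
```

Such a vector of Betti numbers is a very compact summary of a network, which illustrates the heavy compression mentioned above.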