Year: 
2017
Name and qualification of the project proposer: 
sb_p_769995
Abstract: 

Since the early years of the new millennium, there has been a proliferation of programming models and software frameworks for large-scale data analysis. Ease of programming and the ability to express broad classes of algorithms have been the primary concerns in high-level big-data systems, which have largely succeeded at simplifying the development and execution of applications on massively distributed platforms. However, efficiently provisioning and fine-tuning computations at large scale remains a non-trivial undertaking, even for experienced programmers, as interesting and unique performance issues appear as the volume of data and the degree of parallelization increase. At the core of the performance engineering challenge lies a dichotomy between the high-level programming abstractions exposed to developers and the complexity of the underlying hardware/software stack.

In this project, we propose to design and implement a software infrastructure to address these issues, producing cutting-edge methodologies and toolkits for identifying and optimizing crucial performance and reliability aspects of big-data applications. We will devise methodologies and build software tools that help developers understand the multiple, interdependent effects that their algorithmic, programming, and deployment choices have on an application's performance.

This goal will be achieved through the development of program analysis and profile-driven optimization techniques, exploiting information collected from the application, its workloads, and the underlying runtime system at different levels of granularity in big-data software stacks. By combining static and dynamic analyses and by leveraging novel data streaming methods and compact data sketches, we plan to manage huge volumes of profile data with different time/space/accuracy tradeoffs, enabling analyses and optimizations that are currently infeasible.
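To illustrate the kind of compact data sketch we have in mind, the following minimal Python example uses a standard Count-Min sketch (a sketch of the general technique, not the project's actual toolkit; the class and the profile-event names are hypothetical) to summarize a stream of profile counters in bounded memory, overestimating each count by at most a small, tunable error:

import hashlib

class CountMinSketch:
    """Approximate frequency counter over a stream, using fixed memory.

    Counts are never underestimated; the overestimate is bounded with high
    probability by a term that shrinks as width and depth grow.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width    # counters per row: larger width -> smaller error
        self.depth = depth    # independent hash rows: more rows -> higher confidence
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # Derive one bucket index per row from independent hashes of the key.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        # Record `count` occurrences of a profile event (e.g., a callsite id).
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum across rows is the least-inflated (and never too low) estimate.
        return min(self.table[row][col] for row, col in self._indexes(key))


if __name__ == "__main__":
    sketch = CountMinSketch()
    # Hypothetical stream of profile events: (callsite, observed cost).
    events = [("map@Job.scala:42", 3), ("shuffle@Job.scala:97", 120),
              ("map@Job.scala:42", 5)]
    for callsite, cost in events:
        sketch.add(callsite, cost)
    print(sketch.estimate("map@Job.scala:42"))   # ~8, within the sketch's error bound

In a profiling setting, the appeal of such a sketch is that its memory footprint is fixed in advance, so the accuracy of the collected profiles can be traded against space and time overhead by tuning the width and depth parameters.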

Research group members: 
sb_cp_is_983055
sb_cp_is_999225
sb_cp_is_1003921
sb_cp_es_130335
sb_cp_es_130336
Innovativeness: 

At the core of the performance optimization challenge in big-data systems lies the dichotomy between the high-level programming abstractions exposed to developers and the complexity of the underlying hardware/software stack. Hiding the multiple semantic layers of hardware and software from the programmer simplifies code development, but makes it very difficult to reason about performance. Massive parallelism and distribution of both data and computations further complicate the scenario. A major goal of the project is to enable programmers to understand and tune the performance of large-scale computations. The ability to reason about performance will pave the way for a second main goal: devising automatic optimization techniques that improve the efficiency of big-data applications by exploiting insights from fine-grained profile data and workload analyses.

* Long-term impact. The methodologies devised by the project can provide industry with the wherewithal to deploy big-data frameworks efficiently on massively distributed systems and to make optimal use of these frameworks for the analysis of extreme-scale datasets. Fulfilling this long-term vision requires a breakthrough in performance modeling, enabling software developers to master the complexity of big-data computations across large-scale systems.

Our approach is to devise suitable abstractions and performance analysis tools that act as a bridge between high-level programming models and complex real-world platforms.

The project cuts across three different branches of computer science (programming languages and systems, algorithms, and software engineering) and thus requires a variety of expertise, which the project coordinator has extensively developed in the past. If successful, the project can open up novel scientific opportunities, raising new problems and perspectives in these areas. For instance, we foresee that the use of advanced algorithmic techniques will enable a range of dynamic analyses and optimizations that are currently infeasible, providing the means for collecting accurate, fine-grained profiles with minimal time and space overheads and for building faster, more accurate tools on sound theoretical foundations. In turn, and as a side effect, the vast amounts of data collected by dynamic analysis tools at scale will pose genuinely new algorithmic problems and stimulate further algorithmic research in software analysis.

* Short-term impact. We also expect the project to have a short-term technological impact. The increasingly widespread use of big-data systems in industry, academia, and governmental institutions can lead to a rapid practical uptake of the project's results. Optimizing performance is indeed crucial in big-data applications. In particular, cloud services need performance guarantees to ensure their wide adoption: unpredictable performance can prolong job execution, resulting in higher costs and unmet customer expectations. Even a few percent improvement in large clusters with hundreds to thousands of nodes can save millions of euros a year. Prohibitive running times may also limit the applications that users can run in the cloud: as reported in [D2010], many enterprise-grade applications fail to operate acceptably on current cloud platforms because of the lack of guaranteed performance. Since data analytics is a trial-and-error process, often requiring many iterations, faster applications result in a shorter data analytics cycle, which is a competitive advantage for both software developers' productivity and business analyses.

[D2010] D. Durkee. Why cloud computing will never be free. Commun. ACM, 53(5):62-69, 2010.

Call code: 
769995
Keywords: 
