Year: 
2018
Name and position of the project proponent: 
sb_p_1075974
Abstract: 

Continuous, interactive, and exploratory analysis of extreme data sets lies at the heart of a $42B market in 2018, projected to grow at a 10.48% annual rate through 2027. Yet real-time, interactive, extreme-scale data analytics remains an elusive goal.

Our academic and industrial collaborators report challenges of agility, scale, and complexity. Application logic demands rapid, often exploratory development of new queries, yet agile query development is made difficult by the sheer size of the data sets to be analyzed. While HPC systems are key to the big data promise, their cost, complexity, and steep learning curve for non-specialists are obstacles that inhibit adoption.

Rocket aims to lower the barrier to entry for data analytics on HPC systems by providing language technologies to address key technical research priorities of the ETP4HPC Strategic Research Agenda.

Rocket decouples data scientists, who specify what is to be computed, from data engineers, who define how it is to be computed efficiently. Both interact through a shared, flexible, deployment-agnostic scripting language. Data scientists use it as part of an agile software engineering process to prototype and explore the design space of data analysis solutions. Rocket leverages mainstream distributed infrastructures to run their queries, trading latency for accuracy with a novel model-driven sampling infrastructure. Data engineers use a novel annotation mechanism to specify deployment characteristics of the code and to tune performance for a particular hardware platform.

By increasing productivity, Rocket aims to contribute to expanding the HPC ecosystem towards SMEs, fostering novel opportunities, improved return on investment, and sustainable development in sectors such as intelligence, finance, healthcare, and multimedia.

ERC: 
PE6_3
PE6_2
Innovativeness: 

Rocket aims to innovate in at least three aspects of computing: disk-based data analytics, in-memory data analytics, and high-performance data analytics.

DISK-BASED DATA ANALYTICS

Rooted in the map and reduce primitives of the Lisp programming language, the MapReduce (M/R) paradigm was deployed by Google [Dean08] for large-scale distributed data analysis. Open-source implementations, such as Hadoop, subsequently made the paradigm widely available and popular. In Hadoop toolchains, the Hadoop Distributed File System (HDFS) [Shvachko10] (inspired by Google's GFS [Ghemawat03]) serves as the default store for all data as well as for intermediate results, allowing jobs to scale to massive data sets. Programming low-level M/R code is not always easy or pleasant, so higher-level languages such as Pig Latin [Olston08], FlumeJava [Chambers10], or Cascading [Cascading] have been introduced.
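
The M/R paradigm itself can be illustrated with a minimal, single-process sketch (not Hadoop code, just the map/shuffle/reduce structure): a mapper emits key-value pairs, a shuffle groups them by key, and a reducer aggregates each group. The word-count task below is the paradigm's canonical example.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate (here, sum) the values of each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["rocket hpc analytics", "hpc analytics at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

In a real deployment each phase runs in parallel across many machines, with HDFS holding the inputs and intermediate results; the programming model, however, is exactly this pipeline.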

Innovation: expressiveness and interactivity on extreme data.

Rocket combines the scalability of the first generation disk-based technologies with the expressiveness of later approaches that support iteration or incremental computation. Our new distributed programming abstractions will allow fine-grained control over latency, without compromising accuracy, by exploiting an informed sampling/aggregation phase followed by in-memory computation.

Rocket extends approximate query processing beyond the relational model by defining a probabilistic programming notation and exploiting sampling annotations to manage bias in the sampling process.
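
The latency/accuracy trade-off behind approximate query processing can be sketched as follows (a simplified illustration of the general technique, not Rocket's model-driven sampler): a uniform random sample stands in for the full data set, and the standard error of the estimate quantifies the accuracy given up for speed.

```python
import random
import statistics

def approximate_mean(data, sample_fraction=0.01, seed=0):
    # Estimate the mean from a small uniform sample instead of a full scan,
    # trading accuracy for latency; the standard error bounds the trade-off.
    rng = random.Random(seed)
    n = max(2, int(len(data) * sample_fraction))
    sample = rng.sample(data, n)
    estimate = statistics.mean(sample)
    stderr = statistics.stdev(sample) / (n ** 0.5)
    return estimate, stderr

# One million values whose true mean is 49.5; sample only 0.1% of them.
data = [float(i % 100) for i in range(1_000_000)]
est, err = approximate_mean(data, sample_fraction=0.001)
```

A model-driven sampler, by contrast, would bias the sample towards the regions of the data that most influence the query and then correct for that bias, which is where the sampling annotations mentioned above come in.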

IN-MEMORY DATA ANALYTICS

An alternative to M/R-style computation that remains relatively easy to use for non-experts can be found in partitioned global address space (PGAS) systems such as X10. Other recent programming models offering distributed data structures as primitives include distributed arrays in Presto [Venkataraman12], or resilient distributed datasets (RDDs) in Spark [Zaharia12]. These systems aim to support iterative and incremental computing.
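
The core RDD idea, lazy transformations over an in-memory data set that can be cached for iterative reuse, can be conveyed with a toy, single-process sketch (hypothetical `ToyRDD` class; real Spark distributes partitions across a cluster and recomputes lost ones from lineage):

```python
from functools import reduce as fold

class ToyRDD:
    """Toy sketch of a resilient distributed dataset: transformations are
    recorded lazily, evaluated only when an action runs, and results can
    be cached in memory for iterative reuse."""

    def __init__(self, data, transforms=()):
        self._data = data
        self._transforms = transforms
        self._cache = None

    def map(self, fn):
        # Lazy transformation: record fn, evaluate later.
        return ToyRDD(self._data, self._transforms + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._transforms + (("filter", pred),))

    def cache(self):
        # Materialize once; iterative algorithms then reuse the result.
        self._cache = self._evaluate()
        return self

    def _evaluate(self):
        if self._cache is not None:
            return self._cache
        items = list(self._data)
        for kind, fn in self._transforms:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):   # action: return the materialized data set
        return list(self._evaluate())

    def reduce(self, fn):  # action: aggregate the data set
        return fold(fn, self._evaluate())

squares = ToyRDD(range(5)).map(lambda x: x * x).cache()
total = squares.reduce(lambda a, b: a + b)
```

The separation between lazy transformations (`map`, `filter`) and eager actions (`collect`, `reduce`) is what lets such systems plan and cache across iterations.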

Innovation: scalability without compromising expressiveness.

The performance of in-memory approaches quickly degrades as data-set sizes approach the TB mark. We overcome this limitation by integrating sampling directly into a holistic programming model.

HIGH-PERFORMANCE DATA ANALYTICS

Although huge efforts have been made to make data analytics highly scalable both in terms of data size and computational parallelism, the current state of the art in this field does not offer a "one size fits all" solution. In particular, none of the existing systems makes the underlying parallel or distributed computing platform transparent to the user; achieving efficient and scalable data analysis still requires non-trivial programming effort.

Previous solutions in the HPC setting include: 1) bringing traditional data analytics tools to HPC platforms [pbdR] [RHadoop], 2) enhancing distributed computing frameworks with HPC accelerators [Yan09], and 3) adapting MapReduce to HPC platforms [Elteir11].

Innovation: productivity environments for data analytics on HPC platforms.

Current solutions tightly couple the data analysis logic to the underlying data analysis infrastructure. Rocket breaks this coupling by letting developers express analyses in a high-level analytics language and then orthogonally annotate the code with hints for mapping it onto the underlying resources.
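
One plausible shape for such an annotation mechanism, sketched here purely as an illustration (the `deploy` decorator and its hint names are hypothetical, not Rocket's actual notation), is a decorator that attaches deployment hints to an analysis function without touching its logic:

```python
def deploy(**hints):
    # Hypothetical annotation: record deployment hints (placement,
    # partitioning, sampling rate) on the function object. A scheduler
    # could later read fn.deploy_hints to map the code onto hardware;
    # here we only attach the metadata.
    def decorator(fn):
        fn.deploy_hints = dict(hints)
        return fn
    return decorator

@deploy(target="gpu-cluster", partitions=128, sample_rate=0.05)
def average_latency(records):
    # The analysis logic itself stays deployment-agnostic.
    return sum(r["latency"] for r in records) / len(records)
```

Because the hints live outside the function body, a data engineer can retune them for a new platform without the data scientist changing a line of analysis code, which is precisely the decoupling argued for above.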

REFERENCES

[Cascading] Cascading. http://www.cascading.org/. Retr.: June 2018.

[Chambers10] C. Chambers, A. Raniwala, F. Perry, S. Adams, R.R. Henry, R. Bradshaw, and N. Weizenbaum. "FlumeJava: easy, efficient data-parallel pipelines". In PLDI, 363-375, 2010.

[Dean08] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In Comm. of the ACM, 51(1):107-113, 2008.

[Elteir11] M. Elteir, H. Lin, W.-c. Feng, and T. Scogland. "StreamMR: an optimized MapReduce framework for AMD GPUs". In ICPADS, 364-371, 2011.

[Ghemawat03] S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google file system". In SOSP, 29-43, 2003.

[Olston08] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig latin: a not-so-foreign language for data processing". In SIGMOD, 1099-1110, 2008.

[pbdR] Programming with Big Data in R. http://r-pbd.org/. Retr.: June 2018.

[RHadoop] RHadoop. https://github.com/RevolutionAnalytics/RHadoop/wiki. Retr.: June 2018.

[Shvachko10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. "The Hadoop Distributed File System". In MSST, 1-10, 2010.

[Venkataraman12] S. Venkataraman, I. Roy, A. AuYoung, and R. Schreiber. "Using R for Iterative and Incremental Processing". In HotCloud 2012.

[Yan09] Y. Yan, M. Grossman, and V. Sarkar. "JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA". In Euro-Par, 887-899, 2009.

[Zaharia12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". In NSDI, 2012.

Call Code: 
1075974

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma