
(2010-01-20, IoannisKonstantinou)

Distributed Processing Frameworks

Managing data at large scale through the distributed cooperation of computational and storage resources is a challenging task. Application-specific requirements (e.g. the need to push computation near the data) rule out typical general-purpose job schedulers. To cope with these requirements, "data-aware" distributed data management frameworks have been proposed, with Google's MapReduce [1] being the most prevalent. MapReduce is inspired by the "map" and "reduce" functions found in Lisp and other functional programming languages: a problem is split into two phases, the Map phase and the Reduce phase. In the Map phase, non-overlapping chunks of the input data are assigned to separate processes, called mappers, which process their input and emit a set of intermediate results. In the Reduce phase, these results are fed to a (usually smaller) number of separate processes, called reducers, which "summarize" their input into a smaller number of results that form the solution to the original problem. More complex problems are handled by a workflow of map and reduce steps, in which mappers feed reducers and reducers in turn feed the mappers of the next step.
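The two phases can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop's or Google's API: the function names (split_into_chunks, mapper, shuffle, reducer, word_count) and the word-counting task are assumptions chosen for illustration, and everything runs sequentially, whereas a real framework would run the mappers and reducers as separate processes on different nodes.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, num_chunks):
    """Partition the input into non-overlapping chunks, one per mapper."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def mapper(chunk):
    """Map phase: emit an intermediate (word, 1) pair for every word."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group intermediate values by key so that one reducer sees each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: summarize all values of one key into a single result."""
    return key, sum(values)


def word_count(lines, num_mappers=2):
    """Run the map, shuffle and reduce steps one after the other."""
    chunks = split_into_chunks(lines, num_mappers)
    intermediate = chain.from_iterable(mapper(chunk) for chunk in chunks)
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())
```

Calling word_count(["a b a", "b c", "a"]) assigns the first and third lines to one mapper and the second line to the other, then sums the emitted counts per word. The grouping step between the two phases is what allows the number of reducers to be smaller than the number of mappers.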

The MapReduce framework is typically used for distributed pattern matching, distributed sorting, web-link graph traversal, inverted index construction, web site traffic analysis, etc. Hadoop [2] is an open source Java implementation of the MapReduce architecture. It is used by a large number of research and business organizations, such as Adobe, IBM, Yahoo, Facebook and The New York Times. Hadoop can also be deployed on cloud computing infrastructures such as Amazon Elastic Compute Cloud (EC2); Amazon offers MapReduce job execution via Hadoop on EC2 as a service called Elastic MapReduce [3].
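As an illustration of one of these applications, building an inverted index can be phrased as a map step that emits (term, document id) pairs and a reduce step that merges them into posting lists. The Python sketch below is a sequential simulation under assumed names (map_document, reduce_postings, build_inverted_index) and a naive whitespace tokenizer; it does not use the Hadoop API.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce: merge the document ids of one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Simulate the job: map every document, group by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())
```

Given two documents "d1" and "d2" that both contain the word "hadoop", the resulting index maps "hadoop" to the list of both ids, while a word found only in "d2" maps to that single id.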

Given the efficiency of MapReduce and the simplicity of the SQL query language, there is currently much interest in systems that attempt to combine the two approaches. In [4], the authors compare the Hadoop implementation of MapReduce with commercial parallel SQL databases and identify cases where Hadoop is more efficient. In HadoopDB [5], the authors propose a hybrid system based on Hadoop that integrates the strengths of databases and MapReduce-like systems. In addition, Yahoo has developed the open source Pig project [6]. With Pig, users write their jobs in a scripting language, Pig Latin. These jobs are transparently translated into a workflow of MapReduce jobs that are executed on Hadoop. Likewise, Hive [7], another open source system, offers the ability to run map/reduce tasks through SQL queries: it translates user input (i.e. SQL queries) into MapReduce jobs that are executed on Hadoop. Microsoft's counterpart, SCOPE [8], runs on top of Dryad [9], Microsoft's distributed platform for the execution of data-parallel applications.
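To give a feel for how a declarative query can correspond to a MapReduce job, the Python sketch below expresses a simple aggregation as a map and a reduce function. It shows only the general idea and is not the plan that Hive or Pig actually generates; the table name visits, the columns url and bytes, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical query being emulated:
#   SELECT url, SUM(bytes) FROM visits GROUP BY url;


def map_row(row):
    """Map: the GROUP BY column becomes the key, the summed column the value."""
    yield row["url"], row["bytes"]


def reduce_group(url, byte_counts):
    """Reduce: compute SUM(bytes) for a single group."""
    return url, sum(byte_counts)


def run_group_by_sum(rows):
    """Simulate the job sequentially: map each row, group by key, reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)
    return dict(reduce_group(key, values) for key, values in grouped.items())
```

In this reading, the GROUP BY clause decides the intermediate key and the aggregate function decides the reducer. A query with several such stages (for example a join followed by an aggregation) would correspond to a workflow of several jobs rather than a single one.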

References

[1] "MapReduce: simplified data processing on large clusters", J. Dean and S. Ghemawat, Commun. ACM, vol. 51, pp. 107-113, 2008
[2] Hadoop Project, http://hadoop.apache.org/
[3] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
[4] "A comparison of approaches to large-scale data analysis", Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, and Michael Stonebraker, In SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference. ACM, June 2009.
[5] "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads", Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Alexander Rasin, and Avi Silberschatz, In VLDB'09: Proceedings of the 2009 VLDB Endowment, August 2009.
[6] "Pig Latin: A not-so-foreign language for data processing", C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, SIGMOD, 2008, pp. 1099–1110
[7] "Welcome to Hive!", http://hadoop.apache.org/hive/
[8] "SCOPE: Easy and efficient parallel processing of massive data sets", R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, VLDB, vol. 1, 2008, pp. 1265–1276
[9] "Dryad: Distributed data-parallel programs from sequential building blocks", M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, ACM SIGOPS Operating Systems Review, vol. 41, 2007, p. 72