Difference: DistributedProcessingFrameworks (2 vs. 3)

Revision 32010-01-20 - IoannisKonstantinou

Line: 1 to 1

META TOPICPARENT	name="LargeScaleDataManagement"

Distributed Processing Frameworks

Large scale data management achieved by the distributed cooperation of computational and storage resources is a challenging task. Application specific requirements (e.g. the need to push computation near the data) prohibit the use of typical general purpose job schedulers. To cope with these requirements, "data-aware" distributed data management frameworks have been proposed, with Google's MapRreduce [1] as the most prevalent. MapReduce is inspired by the typical "map" and "reduce" functions found in Lisp and other functional programming languages: a problem is separated in two different phases, the Map and Reduce phase. In the Map phase, non overlapping chunks of the input data get assigned to separate processes, called mappers, which process their input and emit a set of intermediate results. In the Reduce phase, these results are fed to a (usually smaller) number of separate processes called Reducers, that ''summarize'' their input in a smaller number of results that are the solution to the original problem. For more complex situations, a workflow of map and reduce steps is followed, where mappers feed reducers and vice versa.

The MapReduce framework is typically used by applications for distributed pattern matching, distributed sorting, web-links graph traversals, building inverted indexes, web sites data traffic analysis, etc. There is an open source Java implementation of the MapReduce architecture called Hadoop [2]. Hadoop is used by a large number of research and business organizations, such as Adobe, IBM, Yahoo, Facebook and NY Times. Hadoop can also be used in cloud computing infrastructures, such as Amazon Elastic Compute Cloud (EC2), quite easily (Amazon offers MapReduce job execution via Hadoop on EC2 as a service called Elastic MapReduce [3]).

Changed:

<
<

Given the efficiency of MapReduce and the simplicity of the SQL query language, there much interest nowadays in systems that attempt to combine these two approaches. In [4] the authors compare the Hadoop implementation of MapReduce with professional parallel SQL databases and identify cases where Hadoop is more efficient. Whereas, in [5] the authors propose a hybrid system based on Hadoop, which integrates the positive elements of databases. In addition, Yahoo has developed an open source software, the Pig project [6]. Pig uses MapReduce techniques in order to analyse in parallel large scale data, which are simplified and offered to the final user in the form of SQL queries. Likewise, Hive [7], an open source software, offers the ability to run map/reduce tasks through SQL queries. Microsoft's version, Scope [8], is installed on top of Dryad [9], Microsoft's distributed platform for execution of data parallel applications.

>
>

Given the efficiency of MapReduce and the simplicity of the SQL query language, there is much interest nowadays in systems that attempt to combine these two approaches. In [4], the authors compare the Hadoop implementation of MapReduce with professional parallel SQL databases and identify cases where Hadoop is more efficient. In HadoopDB [5] the authors propose a hybrid system based on Hadoop, which integrates the positive elements of databases and MapReduce like systems. In addition, Yahoo has developed an open source software, the Pig project [6]. With Pig, users write their jobs in a scripting language. These jobs are transparently translated in a workflow of MapReduce jobs that are executed on Hadoop. Likewise, Hive [7], an open source software, offers the ability to run map/reduce tasks through SQL queries. Hive also translates user input (i.e. SQL queries) into MapReduce jobs that are executed on Hadoop. Microsoft's version, Scope [8], is installed on top of Dryad [9], Microsoft's distributed platform for execution of data parallel applications.

References

" MapReduce: simplified data processing on large clusters", J. Dean and S. Ghemawat, Commun. ACM, vol. 51, pp. 107-113, 2008 pdf

View topic | History: r3 < r2 < r1 | More topic actions...

No permission to view TWiki.WebBottomBar