
Revision 3 (2010-01-20) - IoannisKonstantinou

META TOPICPARENT name="LargeScaleDataManagement"

Distributed Processing Frameworks

Large scale data management achieved through the distributed cooperation of computational and storage resources is a challenging task. Application-specific requirements (e.g. the need to push computation close to the data) prohibit the use of typical general-purpose job schedulers. To cope with these requirements, "data-aware" distributed data management frameworks have been proposed, with Google's MapReduce [1] as the most prevalent. MapReduce is inspired by the typical "map" and "reduce" functions found in Lisp and other functional programming languages: a problem is split into two phases, the Map phase and the Reduce phase. In the Map phase, non-overlapping chunks of the input data are assigned to separate processes, called mappers, which process their input and emit a set of intermediate results. In the Reduce phase, these results are fed to a (usually smaller) number of separate processes, called reducers, that "summarize" their input into a smaller set of results that form the solution to the original problem. For more complex problems, a workflow of map and reduce steps is followed, where mappers feed reducers and vice versa.
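The two phases can be illustrated with the classic word-count problem. The following is a minimal single-process sketch, not a distributed implementation: in a real framework the mappers and reducers run as separate processes on different machines, and the grouping step ("shuffle") is performed by the framework itself.

```python
from collections import defaultdict

def mapper(chunk):
    # Map phase: each mapper processes one non-overlapping input chunk
    # and emits intermediate (key, value) pairs -- here (word, 1).
    return [(word, 1) for word in chunk.split()]

def reducer(word, counts):
    # Reduce phase: summarize all intermediate values for one key.
    return word, sum(counts)

def map_reduce(chunks):
    # The framework groups intermediate pairs by key ("shuffle")
    # before handing each group to a reducer.
    groups = defaultdict(list)
    for chunk in chunks:  # executed in parallel by a real framework
        for word, count in mapper(chunk):
            groups[word].append(count)
    return dict(reducer(k, v) for k, v in groups.items())

# Non-overlapping input chunks, as assigned to separate mappers.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(chunks))
# -> {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```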

The MapReduce framework is typically used for applications such as distributed pattern matching, distributed sorting, web-link graph traversal, inverted index construction, and web site traffic analysis. Hadoop [2] is an open source Java implementation of the MapReduce architecture, used by a large number of research and business organizations, including Adobe, IBM, Yahoo, Facebook and the NY Times. Hadoop can also be deployed quite easily on cloud computing infrastructures such as Amazon Elastic Compute Cloud (EC2); Amazon offers MapReduce job execution via Hadoop on EC2 as a service called Elastic MapReduce [3].

Given the efficiency of MapReduce and the simplicity of the SQL query language, there is much interest nowadays in systems that attempt to combine the two approaches. In [4], the authors compare the Hadoop implementation of MapReduce with commercial parallel SQL databases and identify cases where Hadoop is more efficient. In HadoopDB [5], the authors propose a hybrid system based on Hadoop that integrates the strengths of databases and MapReduce-like systems. In addition, Yahoo has developed the open source Pig project [6]. With Pig, users write their jobs in a scripting language; these jobs are transparently translated into a workflow of MapReduce jobs that are executed on Hadoop. Likewise, Hive [7], another open source system, offers the ability to run map/reduce tasks through SQL-like queries, which it also translates into MapReduce jobs executed on Hadoop. Microsoft's counterpart, Scope [8], runs on top of Dryad [9], Microsoft's distributed platform for the execution of data-parallel applications.
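The core idea behind such translations can be sketched for a simple aggregation query: the grouping column becomes the intermediate key emitted by the map phase, and the aggregate function runs per key in the reduce phase. The table and column names below are hypothetical, and this is a single-process illustration of the translation concept, not how Pig or Hive are actually implemented.

```python
from collections import defaultdict

# Hypothetical rows for: SELECT dept, SUM(salary) FROM emp GROUP BY dept
rows = [("sales", 100), ("eng", 200), ("sales", 50), ("eng", 150)]

def map_phase(row):
    dept, salary = row
    return dept, salary              # GROUP BY column becomes the key

def reduce_phase(dept, salaries):
    return dept, sum(salaries)       # the SUM(...) aggregate runs per key

# The framework's shuffle step: group intermediate pairs by key.
groups = defaultdict(list)
for row in rows:
    key, value = map_phase(row)
    groups[key].append(value)

result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # -> {'sales': 150, 'eng': 350}
```

Other relational operators translate similarly: selection becomes a filter inside the mapper, and joins use the join key as the intermediate key so that matching rows meet at the same reducer.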
 

References

  1. "MapReduce: simplified data processing on large clusters", J. Dean and S. Ghemawat, Commun. ACM, vol. 51, pp. 107-113, 2008
  2. Hadoop Project, http://hadoop.apache.org/
  3. Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
  4. "A comparison of approaches to large-scale data analysis", A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference, ACM, June 2009
  5. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads", A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Rasin, and A. Silberschatz, VLDB '09: Proceedings of the 2009 VLDB Endowment, August 2009
  6. "Pig Latin: A not-so-foreign language for data processing", C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, SIGMOD 2008, pp. 1099-1110
  7. Hive Project, http://hadoop.apache.org/hive/
  8. "SCOPE: Easy and efficient parallel processing of massive data sets", R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, VLDB, vol. 1, 2008, pp. 1265-1276
  9. "Dryad: Distributed data-parallel programs from sequential building blocks", M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, ACM SIGOPS Operating Systems Review, vol. 41, 2007, p. 72



 