[Repost] A First Look at MapReduce
Quoted from "Introduction to MapReduce"
MapReduce = functional programming meets distributed processing on steroids
Dates back to Lisp (in the 1950s)
Lisp:
- Focus on a particular dialect: “Scheme”
- Functions written in prefix notation
(+ 1 2) -> 3
(* 3 4) -> 12
(sqrt (+ (* 3 3) (* 4 4))) -> 5
(define x 3) -> x
(* x 5) -> 15
Functional Programming:
- Computation as application of functions
- Data flows are implicit in the program
- Different orders of execution are possible
Lisp -> MapReduce?
Lisp is about processing lists
Two important concepts in functional programming
- Map: do something to everything in a list
- Fold: combine results of a list in some way
Map:
Map is a higher-order function
How map works:
- Function is applied to every element in a list
- Result is a new list
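For example, squaring every element of a list (the built-in map is standard Scheme; my-map is a hypothetical name used here only to make the recursion explicit):
(map (lambda (x) (* x x)) '(1 2 3 4 5)) -> (1 4 9 16 25)

; One way map could be written by hand:
(define (my-map f lst)
  (if (null? lst)
      '()                            ; empty list -> empty result
      (cons (f (car lst))            ; apply f to the first element
            (my-map f (cdr lst)))))  ; and recurse on the rest
(my-map (lambda (x) (+ x 1)) '(10 20 30)) -> (11 21 31)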
Fold:
Fold is also a higher-order function
How fold works:
- Accumulator set to initial value
- Function applied to list element and the accumulator
- Result stored in the accumulator
- Repeated for every item in the list
- Result is the final value in the accumulator
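For example, summing or multiplying a list by folding over it (my-fold is a hypothetical name for a plain left fold written out by hand so the accumulator is visible; many Schemes ship an equivalent as fold, fold-left, or foldl):
(define (my-fold f init lst)
  (if (null? lst)
      init                           ; no items left: the accumulator is the result
      (my-fold f
               (f (car lst) init)    ; combine the next element with the accumulator
               (cdr lst))))          ; repeat for the rest of the list
(my-fold + 0 '(1 2 3 4 5)) -> 15
(my-fold * 1 '(1 2 3 4 5)) -> 120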
Let’s assume a long list of records: imagine if...
- We can distribute the execution of map operations to multiple nodes
- We have a mechanism for bringing map results back together in the fold operation
That’s MapReduce! (and Hadoop)
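In miniature, that is just map followed by fold (reusing the my-fold sketched above): each (* x x) call is independent, so the map step could run on different nodes, and the fold step brings the partial results back together.
(define squares (map (lambda (x) (* x x)) '(1 2 3 4)))  ; map: independent per-element work
(my-fold + 0 squares) -> 30                              ; fold: combine the results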
Implicit parallelism:
- We can parallelize the execution of map operations, since each element is processed independently
- We can reorder the folding if the fold function is commutative and associative (see the small check below)
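A quick check of the second point, again with my-fold: + is commutative and associative, so the order of the elements does not matter, while - is neither, so reordering changes the answer.
(my-fold + 0 '(1 2 3 4)) -> 10
(my-fold + 0 '(4 3 2 1)) -> 10    ; same answer: reordering the fold is safe
(my-fold - 0 '(1 2 3 4)) -> 2
(my-fold - 0 '(4 3 2 1)) -> -2    ; different answer: reordering is not safe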
Steps:
- Iterate over a large number of records
- Map: extract something of interest from each
- Shuffle and sort intermediate results
- Reduce: aggregate intermediate results
- Generate final output
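A toy, single-machine walk-through of these steps in Scheme: word count over a few already-tokenized records. The names mapper, distinct-keys, values-for, and reducer, and the sample data, are made up here for illustration; the "cluster" is just ordinary list processing, but the data flow follows the steps above.
; Map: each record (a list of words) is turned into (word . 1) pairs
(define (mapper record)
  (map (lambda (word) (cons word 1)) record))

; Shuffle/sort: find the distinct keys, then gather every value emitted for a key
(define (distinct-keys pairs)
  (let loop ((ps pairs) (keys '()))
    (cond ((null? ps) (reverse keys))
          ((member (car (car ps)) keys) (loop (cdr ps) keys))
          (else (loop (cdr ps) (cons (car (car ps)) keys))))))

(define (values-for key pairs)
  (cond ((null? pairs) '())
        ((equal? (car (car pairs)) key)
         (cons (cdr (car pairs)) (values-for key (cdr pairs))))
        (else (values-for key (cdr pairs)))))

; Reduce: aggregate the values for one key (here, summing the counts)
(define (reducer key values)
  (cons key (apply + values)))

(define records '(("the" "quick" "fox") ("the" "lazy" "dog") ("the" "fox")))

(define intermediate (apply append (map mapper records)))     ; map phase
(define keys (distinct-keys intermediate))                    ; shuffle phase
(map (lambda (k) (reducer k (values-for k intermediate)))     ; reduce phase
     keys)
-> (("the" . 3) ("quick" . 1) ("fox" . 2) ("lazy" . 1) ("dog" . 1))
In a real Hadoop job, only the map and reduce functions are written by the programmer; the framework handles the shuffle/sort, assigns tasks to workers, and re-executes tasks that fail, which is exactly what the questions below are about.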
Problems?
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?
Skew Problem
Issue: the reduce phase cannot finish until the slowest map task has finished, so a single straggling node delays the whole job
Solution: speculatively (redundantly) execute map operations and use the results of whichever copy finishes first
All of this depends on a storage system for managing all the data…
Everything happens on top of GFS (and, by extension, HDFS in Hadoop)