

Quoted from Introduction to MapReduce

MapReduce = functional programming meets distributed processing on steroids 

Functional programming dates back to Lisp (in the 1950s)


  • Focus on a particular dialect: “Scheme”
  • Functions written in prefix notation
(+ 1 2) -> 3
(* 3 4) -> 12
(sqrt (+ (* 3 3) (* 4 4))) -> 5
(define x 3) -> x
(* x 5) -> 15

Functional Programming:

  • Computation as application of functions
  • Data flows are implicit in the program
  • Different orders of execution are possible

Lisp -> MapReduce?

Lisp is about processing lists
Two important concepts in functional programming:
- Map: do something to everything in a list
- Fold: combine results of a list in some way
Map is a higher-order function
How map works:
- Function is applied to every element in a list
- Result is a new list
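
A minimal Python sketch of map, offered as an illustration rather than as part of the original notes (which use Scheme). The helper name `my_map` is made up for this example:

```python
def my_map(fn, items):
    """Apply fn to every element of items and return a new list."""
    result = []
    for item in items:
        result.append(fn(item))  # each element is handled independently
    return result

squares = my_map(lambda x: x * x, [1, 2, 3, 4, 5])
print(squares)  # [1, 4, 9, 16, 25]

# The built-in map does the same thing lazily; list() forces the result.
assert list(map(lambda x: x * x, [1, 2, 3, 4, 5])) == squares
```

The input list is left untouched; map only ever produces a new list.
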
Fold is also a higher-order function
How fold works:
- Accumulator set to initial value
- Function applied to list element and the accumulator
- Result stored in the accumulator
- Repeated for every item in the list
- Result is the final value in the accumulator
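
The fold steps above can be sketched in Python as follows. The helper name `fold` is an assumption for illustration; `functools.reduce` is the standard-library equivalent:

```python
from functools import reduce

def fold(fn, initial, items):
    """Left fold: thread an accumulator through the list."""
    acc = initial              # accumulator set to initial value
    for item in items:
        acc = fn(acc, item)    # function applied to accumulator and element
    return acc                 # final value of the accumulator

total = fold(lambda acc, x: acc + x, 0, [1, 2, 3, 4, 5])
print(total)  # 15

sum_of_squares = fold(lambda acc, x: acc + x * x, 0, [1, 2, 3, 4, 5])
print(sum_of_squares)  # 55

# functools.reduce takes the initial value as its last argument.
assert reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0) == total
```
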

Let’s assume a long list of records: imagine if...
- We can distribute the execution of map operations to multiple nodes
- We have a mechanism for bringing map results back together in the fold operation

That’s MapReduce! (and Hadoop)
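
A toy, single-machine sketch of that idea, assuming threads stand in for cluster nodes. The function `record_length` and the sample records are invented for the example; real MapReduce and Hadoop distribute the work across machines:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def record_length(record):
    """Map step: depends only on its own input, so any worker can run it."""
    return len(record)

records = ["alpha", "beta", "gamma", "delta", "epsilon"]

# Distribute the map operations over several workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(record_length, records))  # results return in input order

# Bring the map results back together in a fold.
total_chars = reduce(lambda acc, n: acc + n, mapped, 0)
print(mapped)       # [5, 4, 5, 5, 7]
print(total_chars)  # 26
```
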

Implicit parallelism:
- We can parallelize the execution of map operations since they are isolated
- We can reorder folding if the fold function is commutative and associative
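
To see why the fold function needs those properties, here is a small sketch that folds chunks separately and then combines the partial results in a different order. The helper `fold_in_chunks` is hypothetical; addition survives the reordering, subtraction does not:

```python
from functools import reduce

def fold_in_chunks(fn, initial, items, chunk_size):
    """Fold each chunk separately, then fold the partial results."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [reduce(fn, chunk, initial) for chunk in chunks]
    # Reversing the partials simulates chunks finishing in a different order.
    return reduce(fn, reversed(partials), initial)

numbers = [1, 2, 3, 4, 5, 6, 7, 8]

def add(a, b):
    return a + b   # commutative and associative

def sub(a, b):
    return a - b   # neither commutative nor associative

sequential_sum = reduce(add, numbers, 0)
chunked_sum = fold_in_chunks(add, 0, numbers, 3)
print(sequential_sum, chunked_sum)    # 36 36

sequential_diff = reduce(sub, numbers, 0)
chunked_diff = fold_in_chunks(sub, 0, numbers, 3)
print(sequential_diff, chunked_diff)  # -36 36
```
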

Typical large-data problem:
- Iterate over a large number of records
- Map: extract something of interest from each
- Shuffle and sort intermediate results
- Reduce: aggregate intermediate results
- Generate final output
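
Word count is the usual illustration of these steps. The sketch below runs everything in one process; the function names `map_phase`, `shuffle_and_sort`, and `reduce_phase` are chosen for this example and are not Hadoop's API:

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_and_sort(pairs):
    """Group intermediate values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "The fox"]

intermediate = [pair for record in records for pair in map_phase(record)]
grouped = shuffle_and_sort(intermediate)
output = [reduce_phase(key, values) for key, values in grouped]
print(output)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```
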

Parallelization challenges:
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?

Skew Problem
Issue: reduce cannot finish until every map has finished, so it is only as fast as the slowest map
Solution: redundantly execute map operations and use the results of whichever copy finishes first
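
A rough sketch of redundant execution, assuming threads play the role of nodes and `time.sleep` simulates a slow machine. The names `map_task` and `run_speculatively` are hypothetical, and this is a simplification: it launches every copy up front and does not model how a real scheduler decides when to start a backup.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def map_task(record, delay):
    """Hypothetical map task; delay simulates a slow or fast node."""
    time.sleep(delay)
    return record.upper()

def run_speculatively(record, delays):
    """Launch one copy of the task per delay and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(map_task, record, d) for d in delays]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the stragglers; their results are simply ignored.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

start = time.monotonic()
result = run_speculatively("record-42", delays=[1.0, 0.05])
elapsed = time.monotonic() - start
print(result)  # RECORD-42
```

This is only safe because map operations are isolated: running the same one twice yields the same result, so the duplicate can be discarded.
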

All of this depends on a storage system for managing all the data…
In Google's MapReduce everything happens on top of GFS; in Hadoop the counterpart is HDFS

