My naive MapReduce notes

Challenges in parallel computing

1. Node failures on huge clusters are frequent (~1000 nodes fail per day), and data stored only on a failed node is lost.

2. Network bottlenecks: with a network bandwidth of 1 Gbps, moving 10 TB of data takes approximately 1 day (see the quick check after this list).

3. Distributed programming is hard!
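
(Quick check on the 1-day figure in item 2: 10 TB is about 8 × 10^13 bits; at 1 Gbps = 10^9 bits/s, transferring it takes about 8 × 10^4 seconds ≈ 22 hours, i.e. roughly one day.)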

MapReduce addresses all three of these challenges:

1. MapReduce addresses data loss from node failures by storing data redundantly: each piece of data is replicated on multiple nodes, so even if one of those nodes fails, the data is still available on the others.

2. MapReduce addresses the network bottleneck by moving computation close to the data, which avoids copying large datasets around the network.

3. MapReduce provides a very simple programming model that hides the complexity of parallel computing: the programmer only writes a map function and a reduce function (see the sketch after this list), and the system takes care of scheduling, data movement, and fault tolerance.
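
To make item 3 concrete, here is a minimal sketch of the programming model in plain Python (no framework; the function names are my own, and the "shuffle" is simulated with a dictionary). The user code is only the map and reduce functions; in a real system those run in parallel across many machines. The classic example is word count:

    from collections import defaultdict

    def map_fn(document):
        # Map: emit a (word, 1) pair for every word in one input document.
        for word in document.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: sum the partial counts for a single word.
        return word, sum(counts)

    def word_count(documents):
        # Simulated shuffle: group all map outputs by key (the word).
        grouped = defaultdict(list)
        for doc in documents:
            for word, one in map_fn(doc):
                grouped[word].append(one)
        # Reduce phase: one reduce call per distinct key.
        return dict(reduce_fn(w, counts) for w, counts in grouped.items())

    print(word_count(["the quick brown fox", "the lazy dog jumps over the fox"]))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'jumps': 1, 'over': 1}

Everything outside map_fn and reduce_fn (partitioning the input, shuffling by key, retrying failed tasks) is the framework's job, which is exactly what makes the model simple for the programmer.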

 

posted @ 2016-06-01 10:25  Rui Yan