Hadoop MapReduce: Common Tuning Options in Practice
1. Number of Reduce Tasks
As a rule, each block is processed by one map task, while the number of reduce tasks defaults to one unless you set it explicitly. You can set the reduce count in the driver with job.setNumReduceTasks(n), in the configuration file via mapreduce.job.reduces (default 1), or programmatically on the Configuration object:
Configuration configuration = new Configuration();
configuration.set("mapreduce.job.reduces", "2"); // choose this number based on actual testing and profiling
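In a full driver the same setting is usually applied on the Job object. A minimal sketch, assuming a word-count-style job; MyMapper, MyReducer, and the input/output paths are hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TuningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "tuning-example");
        job.setJarByClass(TuningDriver.class);
        job.setMapperClass(MyMapper.class);    // hypothetical mapper class
        job.setReducerClass(MyReducer.class);  // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Equivalent to configuration.set("mapreduce.job.reduces", "2");
        // tune based on testing: each reducer produces one output part file
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Too few reducers underuse the cluster; too many produce lots of small output files and scheduling overhead, so the right count depends on data volume and cluster size.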
2. Compressing Map Task Output
By default one map task runs per block and that mapping is hard to change, so instead you can optimize what each map task emits by compressing its output. Note that map-output compression must be enabled explicitly before a codec takes effect:
Configuration configuration = new Configuration();
configuration.set("mapreduce.map.output.compress", "true");
configuration.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
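The same settings are often made cluster-wide instead of per job; a sketch of the corresponding mapred-site.xml fragment (Snappy additionally requires the Hadoop native libraries to be installed on the nodes):

```xml
<!-- mapred-site.xml: enable map-output compression for all jobs -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Compressing intermediate map output trades some CPU for less disk and network I/O during the shuffle, which is usually a win with a fast codec like Snappy.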
3. Shuffle-Phase Parameters
The defaults below are the values from mapred-default.xml:
| Parameter | Default | Description |
| --- | --- | --- |
| mapreduce.task.io.sort.factor | 10 | The number of streams to merge at once while sorting files. This determines the number of open file handles. |
| mapreduce.task.io.sort.mb | 100 | The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1 MB, which should minimize seeks. |
| mapreduce.map.sort.spill.percent | 0.80 | The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than 0.5. |
| mapreduce.map.cpu.vcores | 1 | The number of virtual cores to request from the scheduler for each map task. |
| mapreduce.reduce.memory.mb | 1024 | The amount of memory to request from the scheduler for each reduce task. |
| mapreduce.reduce.cpu.vcores | 1 | The number of virtual cores to request from the scheduler for each reduce task. |
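To see how the first three parameters interact, a small arithmetic sketch using the default values from the table; the 400 MB map-output size is a hypothetical example:

```java
public class SpillThreshold {
    public static void main(String[] args) {
        int sortBufferMb = 100;     // mapreduce.task.io.sort.mb
        double spillPercent = 0.80; // mapreduce.map.sort.spill.percent
        int mergeFactor = 10;       // mapreduce.task.io.sort.factor

        // A background spill starts once the sort buffer is 80% full:
        double spillAtMb = sortBufferMb * spillPercent;
        System.out.println("spill begins at " + spillAtMb + " MB");

        // A map task emitting 400 MB of intermediate output spills roughly
        // ceil(400 / 80) = 5 times; those 5 spill files then fit in a
        // single merge pass because 5 <= mergeFactor.
        int mapOutputMb = 400; // hypothetical map output size
        int spills = (int) Math.ceil(mapOutputMb / spillAtMb);
        System.out.println(spills + " spills, single merge pass: " + (spills <= mergeFactor));
    }
}
```

The practical takeaway: raising mapreduce.task.io.sort.mb (together with the map task heap) reduces the number of spills, and raising mapreduce.task.io.sort.factor reduces merge passes when many spill files accumulate.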