Hadoop & Hive Performance Tuning
2012-07-30 13:02 smilingleo

Hadoop cluster performance tuning is a little hectic, because the Hadoop framework uses every type of resource for processing and analyzing data, so there is no single static set of good parameter values. Parameter values should be adjusted, for better performance, based on the following characteristics of the cluster:
- Operating system
- Processor and its number of cores
- Memory (RAM)
- Number of nodes in the cluster
- Storage capacity of each node
- Network bandwidth
- Amount of input data
- Number of jobs in the business logic
The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run a lot of graphical (GUI) processes that occupy most of the memory.
Each node's storage capacity should leave at least 5 GB free after the distributed HDFS input data is stored. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB × 3 (replication factor)) / 1000 nodes ≈ 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because every DataNode also writes logs and needs some space for memory swapping.
Network bandwidth of at least 100 Mbps is recommended: as is well known, Hadoop moves a lot of data over the network while processing and loading data into HDFS, and a lower-bandwidth channel degrades the performance of the whole cluster.
The number of nodes required for a cluster depends on the amount of data to be processed and the capacity of each node. For example, a node with 2 GB of memory and a 2-core processor can process 1 GB of data in average time, and it can process two data blocks (of 256 MB or 512 MB) simultaneously. To process 5 TB of data and produce results in a few minutes, it is recommended to have 1000 nodes with 4-to-8-core processors and 8-to-10 GB of memory each.
Hadoop Parameters:
Data block size (Chunk size):
The dfs.block.size parameter lives in the hdfs-site.xml file, and its value is given in bytes. The block size should be chosen based on each node's memory capacity: if memory is limited, set a smaller block size, because a TaskTracker brings a whole block of data into memory while processing it. So with 512 MB of RAM, it is advisable to set the block size to 64 MB or 128 MB. On a dual-core processor the TaskTracker can process two blocks at the same time, meaning two blocks will be brought into memory at once; plan for that, and set the TaskTracker's concurrent-task parameter accordingly.
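For illustration, a minimal hdfs-site.xml sketch setting a 128 MB block size (134217728 = 128 × 1024 × 1024 bytes):

```xml
<!-- hdfs-site.xml: HDFS block size in bytes (here 128 MB) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```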
Number of Maps and Reducers:
The mapred.map.tasks and mapred.reduce.tasks parameters live in the mapred-site.xml file. By default, the number of maps equals the number of data blocks: for example, with 2 GB of input data and a 256 MB block size, 8 maps will run, with no regard for memory capacity or number of processors. So we need to tune this parameter toward (number of nodes) × (number of cores per node):
Number of Maps = total number of processor cores available in the cluster.
Per the example above, 8 maps run; if the cluster has only 4 processor cores, multiple threads start running and keep swapping data in and out of memory, which degrades the cluster's performance. In the same way, set the number of reducers according to the number of cores in the cluster. After the map phase, most nodes go idle while a few nodes work to finish the reduce phase; to make the reduce phase complete quickly, set its value to the number of nodes or the number of processor cores.
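As a sketch, for a hypothetical cluster of 4 nodes with 4 cores each (16 cores total), the mapred-site.xml entries would be:

```xml
<!-- mapred-site.xml: match task counts to the 16 cores assumed above -->
<property>
  <name>mapred.map.tasks</name>
  <value>16</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>16</value>
</property>
```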
Logging Level:
Set HADOOP_ROOT_LOGGER=ERROR in the hadoop script file. By default it is set to INFO, in which mode Hadoop logs all information: every event, completed jobs and tasks, I/O details, warnings, and errors. Changing this won't bring a huge performance improvement, but it reduces the number of log-file I/Os and yields a small gain.
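A minimal sketch of that setting, assuming it goes in conf/hadoop-env.sh (the exact script file may vary by distribution):

```sh
# hadoop-env.sh: log only errors instead of the default INFO level
export HADOOP_ROOT_LOGGER="ERROR,console"
```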
I am testing these parameters with the Hadoop and Hive frameworks using SQL-style queries: to check the performance improvement from each configuration parameter, I use a sample data set of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this second part we will see a few more Hadoop configuration parameters for getting the maximum performance improvement out of a Hadoop cluster.
Map Output Compression (mapred.compress.map.output)
By default this value is set to false; it is recommended to set it to true for clusters with a large amount of input data to process, because compression makes data transfer between nodes faster. Map output does not move directly to the reducers; it is first written to disk, so this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true for small amounts of input data, because the time spent compressing and decompressing would increase the processing time; for big data, however, compression and decompression time is small compared to the time saved in transfers and disk I/O.
Once this parameter is set to true, its dependent parameters become active: the compression technique (codec) and the compression type.
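A minimal mapred-site.xml sketch that turns the flag on:

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```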
Compression method or codec (mapred.map.output.compression.codec)
The default value of this parameter is org.apache.hadoop.io.compress.DefaultCodec; other available codecs include org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but gives more compression; the LZO method takes less time to compress, but the amount of compression is smaller. Your own codec can also be added: use whichever codec or compression library is best suited to your input data type.
The mapred.map.output.compression.type parameter determines the granularity at which data is compressed. It can be set to either RECORD or BLOCK. RECORD is the default type, in which each individual value is compressed on its own. BLOCK is the recommended type, in which data is compressed in blocks of key-value pairs; this helps with sorting on the reducer side. In Cloudera's Hadoop distribution the default is BLOCK, for better performance.
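Building on the flag above, the two dependent parameters might be set like this (GzipCodec is chosen purely as an example; use whatever codec suits your data):

```xml
<!-- mapred-site.xml: codec and granularity for map output compression -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.map.output.compression.type</name>
  <value>BLOCK</value>
</property>
```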
There are three more related configuration parameters:
1. mapred.output.compress
2. mapred.output.compression.type
3. mapred.output.compression.codec
The same rules as above apply here, but these parameters govern the output of the whole MapReduce job, whereas the first three parameters cover the map output alone. These three specify whether all job output should be compressed, and if so with which type and codec.
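A hedged sketch of compressing the final job output in mapred-site.xml (type and codec values again chosen only for illustration):

```xml
<!-- mapred-site.xml: compress the final MapReduce job output -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```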
More configuration parameters for Hadoop and Hive performance tuning will be discussed in upcoming posts.
Before looking at more configuration parameters for performance tuning, I'd like to ask you a question: have you ever watched the JobTracker and TaskTracker web UI? There you can see a lot of tasks being killed a few seconds or minutes before completion. Why so? Have you ever thought about it? Of course, a few of you already know; those who do, please skip the next paragraph.
[NOTE: To check the web UI of a Hadoop cluster, open a browser and go to http://masternode-machine-ip (or localhost):portnumber. The port number can be changed by setting the corresponding configuration parameter to the port you want; the defaults are listed in the table below.]
These suggestions were observed on a Hadoop cluster with Hive querying; if any information discussed here is misinterpreted, please leave a suggestion in the comments.
| Name | Port | Configuration parameter |
| --- | --- | --- |
| JobTracker | 50030 | mapred.job.tracker.http.address |
| TaskTrackers | 50060 | mapred.task.tracker.http.address |
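For example, a sketch of moving the JobTracker web UI to a different port in mapred-site.xml (50031 is an arbitrary choice; the value format is host:port):

```xml
<!-- mapred-site.xml: bind the JobTracker web UI to a custom port -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>
```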
As you know, the data to be processed is replicated across multiple nodes, so while processing, Hadoop may also start running the same task over the same (replicated) data chunk on several nodes. Whichever node completes the task first wins, and the duplicate attempts still running on the other nodes are killed. The advantage is that the job completes sooner: the output of the first attempt to finish is the one used, and once a duplicate attempt is killed, that node starts processing the next task (the next data chunk). Hadoop works like this by default, because the following parameter values are set to true:
mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution
Is this behavior suitable for all types of applications? No.
If the data to be processed involves complex calculations, each task takes a long time, and executing duplicates of it on multiple nodes wastes time that could instead be spent processing unique data chunks. Hence, for applications that perform more complex operations, it is recommended to set these parameter values to false.
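A sketch of disabling both in mapred-site.xml:

```xml
<!-- mapred-site.xml: disable duplicate (speculative) task attempts -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```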
io.sort.mb (buffer size for sorting)
The default value is 100, meaning 100 MB of buffer memory for sorting. After the map tasks finish, Hadoop sorts the map outputs for the reducers; if the map outputs are large, it is recommended to increase this value. Consider your memory size when increasing this buffer, since whatever you set is taken from RAM. This parameter can give a good performance improvement: with a larger buffer, less data spills to disk, which reduces the read/write operations on spilled data.
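For instance, raising the sort buffer to 200 MB in mapred-site.xml (200 is only an illustrative value; the task JVM heap must be able to accommodate it):

```xml
<!-- mapred-site.xml: sort buffer size in MB -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```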
io.sort.factor (stream merging factor)
The default value of this parameter is 10; as with the previous parameter, it is recommended to increase it for jobs with large output. The value tells Hadoop how many streams can be merged at once while sorting. Increasing it improves performance because it reduces the number of intermediate merge passes.
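And similarly for the merge factor (50 is just an example value for jobs with large map output):

```xml
<!-- mapred-site.xml: number of streams merged at once while sorting -->
<property>
  <name>io.sort.factor</name>
  <value>50</value>
</property>
```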