Hadoop & Hive Performance Tuning
2012-07-30 13:02 smilingleo

Hadoop cluster performance tuning is a little hectic, because the Hadoop framework uses every type of resource for processing and analyzing data, so there is no single static set of good parameter values. Parameter values should be adjusted, for better performance, based on the following characteristics of the cluster:
- Operating system
- Processor and its number of cores
- Memory (RAM)
- Number of nodes in the cluster
- Storage capacity of each node
- Network bandwidth
- Amount of input data
- Number of jobs in the business logic
The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run a lot of graphical (GUI) processes that occupy most of the memory.
Each node's storage capacity should leave at least 5 GB free after the distributed HDFS input data is stored. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB × 3 (replication factor)) / 1000 nodes ≈ 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because every DataNode also writes logs and needs some space for memory swapping.
Network bandwidth of at least 100 Mbps is recommended: as is well known, Hadoop moves a lot of data over the network while processing and loading data into HDFS, and a lower-bandwidth channel degrades the performance of the whole cluster.
The number of nodes required for a cluster depends on the amount of data to be processed and the capacity of each node. For example, a node with 2 GB of memory and a 2-core processor can process 1 GB of data in average time, and it can process two data blocks (of 256 MB or 512 MB) simultaneously. To process 5 TB of data and produce results in a few minutes, it is recommended to have 1000 nodes with 4-to-8-core processors and 8-to-10 GB of memory each.
Hadoop Parameters:
Data block size (Chunk size):
The dfs.block.size parameter lives in the hdfs-site.xml file, and its value is given in bytes. The block size should be chosen based on each node's memory capacity: if memory is limited, set a smaller block size, because a TaskTracker brings a whole block of data into memory while processing it. So with 512 MB of RAM, it is advisable to set the block size to 64 MB or 128 MB. On a dual-core processor the TaskTracker can process two blocks at the same time, meaning two blocks will be brought into memory at once; plan for that, and set the TaskTracker's concurrent-task parameter accordingly.
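For illustration, a minimal hdfs-site.xml sketch setting a 128 MB block size (134217728 = 128 × 1024 × 1024 bytes):

```xml
<!-- hdfs-site.xml: HDFS block size in bytes (here 128 MB) -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
```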
Number of Maps and Reducers:
The mapred.map.tasks and mapred.reduce.tasks parameters live in the mapred-site.xml file. By default, the number of maps equals the number of data blocks: for example, with 2 GB of input data and a 256 MB block size, 8 maps will run, with no regard for memory capacity or number of processors. So we need to tune this parameter toward (number of nodes) × (number of cores per node):
Number of Maps = total number of processor cores available in the cluster.
Per the example above, 8 maps run; if the cluster has only 4 processor cores, multiple threads start running and keep swapping data in and out of memory, which degrades the cluster's performance. In the same way, set the number of reducers according to the number of cores in the cluster. After the map phase, most nodes go idle while a few nodes work to finish the reduce phase; to make the reduce phase complete quickly, set its value to the number of nodes or the number of processor cores.
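As a sketch, for a hypothetical cluster of 4 nodes with 4 cores each (16 cores total), the mapred-site.xml entries would be:

```xml
<!-- mapred-site.xml: match task counts to the 16 cores assumed above -->
<property>
  <name>mapred.map.tasks</name>
  <value>16</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>16</value>
</property>
```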
Logging Level:
Set HADOOP_ROOT_LOGGER=ERROR in the hadoop script file. By default it is set to INFO, in which mode Hadoop logs all information: every event, completed jobs and tasks, I/O details, warnings, and errors. Changing this won't bring a huge performance improvement, but it reduces the number of log-file I/Os and yields a small gain.
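A minimal sketch of that setting, assuming it goes in conf/hadoop-env.sh (the exact script file may vary by distribution):

```sh
# hadoop-env.sh: log only errors instead of the default INFO level
export HADOOP_ROOT_LOGGER="ERROR,console"
```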
I am testing these parameters with the Hadoop and Hive frameworks using SQL-style queries: to check the performance improvement from each configuration parameter, I use a sample data set of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this second part we will see a few more Hadoop configuration parameters for getting the maximum performance improvement out of a Hadoop cluster.
Map Output Compression (mapred.compress.map.output)
By default this value is set to false; it is recommended to set it to true for clusters with a large amount of input data to process, because compression makes data transfer between nodes faster. Map output does not move directly to the reducers; it is first written to disk, so this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true for small amounts of input data, because the time spent compressing and decompressing would increase the processing time; for big data, however, compression and decompression time is small compared to the time saved in transfers and disk I/O.
Once this parameter is set to true, its dependent parameters become active: the compression technique (codec) and the compression type.
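A minimal mapred-site.xml sketch that turns the flag on:

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```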
Compression method or codec (mapred.map.output.compression.codec)
The default value of this parameter is org.apache.hadoop.io.compress.DefaultCodec; other available codecs include org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but gives more compression; the LZO method takes less time to compress, but the amount of compression is smaller. Your own codec can also be added: use whichever codec or compression library is best suited to your input data type.
The mapred.map.output.compression.type parameter determines the granularity at which data is compressed. It can be set to either RECORD or BLOCK. RECORD is the default type, in which each individual value is compressed on its own. BLOCK is the recommended type, in which data is compressed in blocks of key-value pairs; this helps with sorting on the reducer side. In Cloudera's Hadoop distribution the default is BLOCK, for better performance.
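Building on the flag above, the two dependent parameters might be set like this (GzipCodec is chosen purely as an example; use whatever codec suits your data):

```xml
<!-- mapred-site.xml: codec and granularity for map output compression -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.map.output.compression.type</name>
  <value>BLOCK</value>
</property>
```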
There are three more related configuration parameters:
1. mapred.output.compress
2. mapred.output.compression.type
3. mapred.output.compression.codec
The same rules as above apply here, but these parameters govern the output of the whole MapReduce job, whereas the first three parameters cover the map output alone. These three specify whether all job output should be compressed, and if so with which type and codec.
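A hedged sketch of compressing the final job output in mapred-site.xml (type and codec values again chosen only for illustration):

```xml
<!-- mapred-site.xml: compress the final MapReduce job output -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```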
More configuration parameters for Hadoop and Hive performance tuning will be discussed in upcoming posts.
Before looking at more configuration parameters for performance tuning, I'd like to ask you a question: have you ever watched the JobTracker and TaskTracker web UI? There you can see a lot of tasks being killed a few seconds or minutes before completion. Why so? Have you ever thought about it? Of course, a few of you already know; those who do, please skip the next paragraph.
[NOTE: To check the web UI of a Hadoop cluster, open a browser and go to http://masternode-machine-ip (or localhost):portnumber. The port number can be changed by setting the corresponding configuration parameter to the port you want; the defaults are listed in the table below.]
These suggestions were observed on a Hadoop cluster with Hive querying; if any information discussed here is misinterpreted, please leave a suggestion in the comments.
| Name | Port | Configuration parameter |
| --- | --- | --- |
| JobTracker | 50030 | mapred.job.tracker.http.address |
| TaskTrackers | 50060 | mapred.task.tracker.http.address |
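For example, a sketch of moving the JobTracker web UI to a different port in mapred-site.xml (50031 is an arbitrary choice; the value format is host:port):

```xml
<!-- mapred-site.xml: bind the JobTracker web UI to a custom port -->
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50031</value>
</property>
```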
As you know, the data to be processed is replicated across multiple nodes, so while processing, Hadoop may also start running the same task over the same (replicated) data chunk on several nodes. Whichever node completes the task first wins, and the duplicate attempts still running on the other nodes are killed. The advantage is that the job completes sooner: the output of the first attempt to finish is the one used, and once a duplicate attempt is killed, that node starts processing the next task (the next data chunk). Hadoop works like this by default, because the following parameter values are set to true:
mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution
Is this behavior suitable for all types of applications? No.
If the data to be processed involves complex calculations, each task takes a long time, and executing duplicates of it on multiple nodes wastes time that could instead be spent processing unique data chunks. Hence, for applications that perform more complex operations, it is recommended to set these parameter values to false.
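A sketch of disabling both in mapred-site.xml:

```xml
<!-- mapred-site.xml: disable duplicate (speculative) task attempts -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```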
io.sort.mb (buffer size for sorting)
The default value is 100, meaning 100 MB of buffer memory for sorting. After the map tasks finish, Hadoop sorts the map outputs for the reducers; if the map outputs are large, it is recommended to increase this value. Consider your memory size when increasing this buffer, since whatever you set is taken from RAM. This parameter can give a good performance improvement: with a larger buffer, less data spills to disk, which reduces the read/write operations on spilled data.
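For instance, raising the sort buffer to 200 MB in mapred-site.xml (200 is only an illustrative value; the task JVM heap must be able to accommodate it):

```xml
<!-- mapred-site.xml: sort buffer size in MB -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```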
io.sort.factor (stream merging factor)
The default value of this parameter is 10; as with the previous parameter, it is recommended to increase it for jobs with large output. The value tells Hadoop how many streams can be merged at once while sorting. Increasing it improves performance because it reduces the number of intermediate merge passes.
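And similarly for the merge factor (50 is just an example value for jobs with large map output):

```xml
<!-- mapred-site.xml: number of streams merged at once while sorting -->
<property>
  <name>io.sort.factor</name>
  <value>50</value>
</property>
```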