Hadoop 3.1.3 + HBase 2.2.4: Enabling Snappy Compression
HBase can only use Snappy if the underlying Hadoop supports it, so we start at the bottom layer and add Snappy to Hadoop first.
Once Snappy is set up, it is worth running a stress test for peace of mind, to see how the cluster behaves in terms of storage compression and performance; the performance test report is linked here.
Install the Snappy native library
Download Snappy:
hadoop@hadoop1$ wget https://src.fedoraproject.org/repo/pkgs/snappy/snappy-1.1.4.tar.gz/sha512/\
873f655713611f4bdfc13ab2a6d09245681f427fbd4f6a7a880a49b8c526875dbdd623e203905450268f542be24a2dc9dae50e6acc1516af1d2ffff3f96553da/\
snappy-1.1.4.tar.gz
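Since the expected sha512 is embedded in the download URL itself, you can optionally verify the tarball before building:

hadoop@hadoop1$ echo "873f655713611f4bdfc13ab2a6d09245681f427fbd4f6a7a880a49b8c526875dbdd623e203905450268f542be24a2dc9dae50e6acc1516af1d2ffff3f96553da  snappy-1.1.4.tar.gz" | sha512sum -c -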
Build and install Snappy:
hadoop@hadoop1$ mkdir -p /tmp/snappy
hadoop@hadoop1$ tar zxvf snappy-1.1.4.tar.gz -C /tmp/snappy
hadoop@hadoop1$ cd /tmp/snappy/snappy-1.1.4
hadoop@hadoop1$ ./autogen.sh
hadoop@hadoop1$ ./configure
hadoop@hadoop1$ make
hadoop@hadoop1$ sudo make install
The build installs to /usr/local/lib by default; copy the libraries to /usr/lib64:
hadoop@hadoop1$ sudo cp -dr /usr/local/lib/* /usr/lib64
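A quick sanity check that the shared objects actually landed:

hadoop@hadoop1$ ls -l /usr/lib64/libsnappy*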
Install hadoop-snappy
Install the build dependencies for hadoop-snappy:
hadoop@hadoop1$ sudo apt-get install pkg-config libtool automake maven -y
Clone and package hadoop-snappy:
hadoop@hadoop1$ git clone https://github.com/electrum/hadoop-snappy.git
hadoop@hadoop1$ cd hadoop-snappy && mvn package
Configuring Snappy in Hadoop
Copy the Snappy native libraries into $HADOOP_HOME/lib/native/:
hadoop@hadoop1$ cp -dr /usr/local/lib/* /opt/hadoop-3.1.3/lib/native
Copy hadoop-snappy-0.0.1-SNAPSHOT.jar into $HADOOP_HOME/lib, and the hadoop-snappy native libraries into $HADOOP_HOME/lib/native/:
hadoop@hadoop1$ cp -r /home/hadoop/snappy/hadoop-snappy/target/hadoop-snappy-0.0.1-SNAPSHOT.jar $HADOOP_HOME/lib
hadoop@hadoop1$ cp /home/hadoop/snappy/hadoop-snappy/target/hadoop-snappy-0.0.1-SNAPSHOT-tar/hadoop-snappy-0.0.1-SNAPSHOT/lib/native/Linux-amd64-64/* $HADOOP_HOME/lib/native/
Add the following to hadoop-env.sh:
export LD_LIBRARY_PATH=/opt/hadoop-3.1.3/lib/native:/usr/local/lib/
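With the libraries in place and hadoop-env.sh updated, Hadoop's built-in native-library checker is a handy way to confirm the result; the snappy line of its output should read true and point at the library you just copied:

hadoop@hadoop1$ hadoop checknative -a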
Add the following to core-site.xml:
<!-- Enable compression codecs -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Add the following to mapred-site.xml:
<!-- Set to true to compress final job output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<!-- Compress intermediate map output as well -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Codec to use for compression -->
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
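As an aside, these settings apply cluster-wide. If you would rather try Snappy on a single job first, the bundled example jobs parse -D options through GenericOptionsParser, so the same properties can be passed per-job; a sketch (with /output-snappy-test as a hypothetical output path that must not exist yet):

hadoop@hadoop1$ hadoop jar /opt/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /input /output-snappy-test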
This completes the Snappy configuration. The command below verifies it (/input is an HDFS directory; drop a few text files into it first. The output directory must not already exist, or the job will fail):
hadoop@hadoop1$ hadoop jar /opt/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
If the job succeeds, list the files in the output directory; the listing below is from a run that wrote to /output5:
hadoop@hadoop1:~$ hadoop fs -ls /output5
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2020-08-02 06:11 /output5/_SUCCESS
-rw-r--r--   2 hadoop supergroup       6994 2020-08-02 06:11 /output5/part-r-00000.snappy
For comparison, the same /input processed without Snappy (output in /output4) gives the listing below: 6994 bytes (Snappy) versus 23635 bytes (uncompressed), so the compression effect is quite noticeable.
hadoop@hadoop1:~$ hadoop fs -ls /output4
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2020-08-02 06:06 /output4/_SUCCESS
-rw-r--r--   2 hadoop supergroup      23635 2020-08-02 06:06 /output4/part-r-00000
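Because SnappyCodec is registered in io.compression.codecs, hadoop fs -text can decompress the .snappy part file transparently if you want to eyeball the word counts:

hadoop@hadoop1$ hadoop fs -text /output5/part-r-00000.snappy | head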
Configuring Snappy in HBase
Copy hadoop-snappy-0.0.1-SNAPSHOT.jar into $HBASE_HOME/lib, and symlink $HADOOP_HOME/lib/native into $HBASE_HOME/lib/native/ (create the native directory first if it does not exist):
hadoop@hadoop1$ cp /home/hadoop/hadoop-snappy/target/hadoop-snappy-0.0.1-SNAPSHOT.jar $HBASE_HOME/lib
hadoop@hadoop1$ ln -s /opt/hadoop-3.1.3/lib/native /opt/hbase-2.2.4/lib/native/Linux-amd64-64
Add the following to hbase-env.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hadoop-3.1.3/lib/native/:/usr/local/lib
export HBASE_LIBRARY_PATH=$HBASE_LIBRARY_PATH:/opt/hbase-2.2.4/lib/native/Linux-amd64-64/:/usr/local/lib/
export CLASSPATH=$CLASSPATH:$HBASE_LIBRARY_PATH
Add the following to hbase-site.xml (with this property set, a regionserver refuses to start unless the listed codecs are usable, so a broken Snappy setup is caught at startup rather than at write time):
<property>
  <name>hbase.regionserver.codecs</name>
  <value>snappy</value>
</property>
Then verify that Snappy works:
hbase org.apache.hadoop.hbase.util.CompressionTest file:///home/hadoop/ouput snappy
Output like the following indicates success:
hadoop@hadoop1:~$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///home/hadoop/ouput snappy
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase-2.2.4/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-08-02 10:01:34,858 INFO [main] metrics.MetricRegistries: Loaded MetricRegistries class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
2020-08-02 10:01:34,921 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2020-08-02 10:01:34,924 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2020-08-02 10:01:34,983 INFO [main] compress.CodecPool: Got brand-new decompressor [.snappy]
SUCCESS
Now open the hbase shell and create a table with Snappy enabled. One point deserves emphasis: pre-split the table into multiple regions when you create it. A new table otherwise starts with a single region, so a stress test hammers only the regionserver hosting that region; the load concentrates on one machine and the results no longer reflect the performance of the cluster.
hbase(main):004:0> create 'snappy-test', {NUMREGIONS => 10, SPLITALGO => 'HexStringSplit' }, { NAME => 'data', COMPRESSION => 'snappy'}
Created table snappy-test
Took 1.2345 seconds
=> Hbase::Table - snappy-test
hbase(main):005:0> put 'snappy-test', '001', 'data:addr', 'beijing'
Took 0.0078 seconds
hbase(main):006:0> put 'snappy-test', '001', 'data:comp', 'baidu'
Took 0.0036 seconds
hbase(main):007:0> describe 'snappy-test'
Table snappy-test is ENABLED
snappy-test
COLUMN FAMILIES DESCRIPTION
{NAME => 'data', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'snappy', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
1 row(s)
QUOTAS
0 row(s)
Took 0.0963 seconds
hbase(main):008:0>
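For the stress test itself, HBase ships org.apache.hadoop.hbase.PerformanceEvaluation, which can both pre-split its test table and write Snappy-compressed data. A minimal sketch (run the class with no arguments to see all options; the row and client counts here are arbitrary):

hadoop@hadoop1$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=10 --compress=SNAPPY --rows=100000 randomWrite 10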
The create statement above shows that "creating a Snappy-compressed table" is not quite an accurate description: Snappy is a compression option granted to the 'data' column family, not an attribute of the 'snappy-test' table itself. The COMPRESSION attribute reported by describe 'snappy-test' therefore belongs to the column family, and with multiple column families you can selectively enable Snappy per family.
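Because compression is just a column-family attribute, it can also be enabled on an existing family with alter; note that HFiles already on disk are only rewritten with the new codec when they are compacted. A sketch with the placeholder names 'my-table' and 'cf1':

hbase(main):001:0> alter 'my-table', {NAME => 'cf1', COMPRESSION => 'snappy'}
hbase(main):002:0> major_compact 'my-table'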
As shown below, I created another Snappy table, this one with two column families; you can compress both, neither, or just one of them, whichever you like:
hbase(main):010:0> create 'snappy-test3', {NUMREGIONS => 10, SPLITALGO => 'HexStringSplit' }, {NAME => 'data', COMPRESSION => 'snappy'}, {NAME=> 'data1'}
Created table snappy-test3
Took 2.3475 seconds
=> Hbase::Table - snappy-test3
hbase(main):012:0> desc 'snappy-test3'
Table snappy-test3 is ENABLED
snappy-test3
COLUMN FAMILIES DESCRIPTION
{NAME => 'data', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
{NAME => 'data1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
2 row(s)
QUOTAS
0 row(s)
Took 0.2907 seconds
hbase(main):013:0>
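To see the per-family effect on disk, you can compare sizes under the table's directory in HDFS (this assumes the default hbase.rootdir of /hbase; adjust the path to your deployment):

hadoop@hadoop1$ hadoop fs -du -h /hbase/data/default/snappy-test3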
That completes the Snappy setup for Hadoop and HBase. If you run into any problems or want to discuss, feel free to comment or contact me; I am online every day and happy to talk.