201707121200 Review - HBase - Summary

一、The data model of hbase:meta
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:regioninfo, timestamp=1499761583361, value={ENCODED => bf7b572351276fe99d22e53a3675b846, NAME => 'ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846.', STARTKEY => '', ENDKEY => ''}
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:seqnumDuringOpen, timestamp=1499761583361, value=\x00\x00\x00\x00\x00\x00\x00$
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:server, timestamp=1499761583361, value=s129:16020
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:serverstartcode, timestamp=1499761583361, value=1499761550410


The meta table records metadata: one row per region.
A meta row has a rowkey and one column family, info; the info family in turn has four columns: regioninfo, seqnumDuringOpen, server, and serverstartcode.
1、First the rowkey: ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846.
It splits on commas into 3 parts:
Part 1: ns1:test1, namespace:table (meta records every table of every namespace, so both names are needed to tell them apart).
Part 2: the start row of this region within the table. Empty here, which means the region starts from the very first row of the table (this part distinguishes the different regions of one and the same table).
Part 3: a timestamp, a dot, and an encoded id derived from the region name. Note: the dot after the timestamp and the dot after the encoded id are part of the region name, not sentence-ending periods.
In short: the rowkey is the primary key of the meta table and must be unique, so it is the combination "namespace, table, region start key, unique encoded id"; only together do these identify one row.
2、Next, the info column family:
2.1、The regioninfo column, a timestamp and a value:
  timestamp=1499761583361  the cell timestamp
  value={ENCODED => bf7b572351276fe99d22e53a3675b846,  the encoded id, identical to the one in the rowkey
        NAME => 'ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846.',  the region name, i.e. the rowkey
        STARTKEY => '', ENDKEY => ''}  the range of table rows this region manages;
        both empty means this region covers the whole table: the table has never been split.
2.2、The seqnumDuringOpen column, a timestamp and a value:
  timestamp=1499761583361  the cell timestamp
  value=\x00\x00\x00\x00\x00\x00\x00$  an 8-byte binary sequence number, shown with hex escapes (the trailing $ is just the printable form of byte 0x24)
2.3、The server column, a particularly important one:
  value=s129:16020 says which region server currently manages this region, reachable on port 16020.
2.4、The serverstartcode column: the start timestamp of the owning region server process; it mainly distinguishes different incarnations of the same server.
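
The same information can be read with the Java client. A minimal sketch (assuming an HBase 1.x client and a reachable cluster) that scans hbase:meta and prints which server each region lives on:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanMeta {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table meta = conn.getTable(TableName.META_TABLE_NAME)) { // the hbase:meta table
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("info")); // the single column family described above
            try (ResultScanner rs = meta.getScanner(scan)) {
                for (Result r : rs) {
                    byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + (server == null ? "(unassigned)" : Bytes.toString(server)));
                }
            }
        }
    }
}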


二、The meta data model while a table is being split: split 'ns1:test1'
Note: the split is not carried out immediately; a split plan is created first and recorded in meta as follows:
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:regioninfo, timestamp=1499765283100, value={ENCODED => bf7b572351276fe99d22e53a3675b846, NAME => 'ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846.', STARTKEY => '', ENDKEY => '', OFFLINE => true, SPLIT => true}
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:seqnumDuringOpen, timestamp=1499761583361, value=\x00\x00\x00\x00\x00\x00\x00$
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:server, timestamp=1499761583361, value=s129:16020
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:serverstartcode, timestamp=1499761583361, value=1499761550410
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:splitA, timestamp=1499765283100, value={ENCODED => a08936005520d400e9adb6f2626a7608, NAME => 'ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608.', STARTKEY =>'', ENDKEY => '14917'}
ns1:test1,,1495607792432.bf7b572351276fe99d22e53a3675b846. column=info:splitB, timestamp=1499765283100, value={ENCODED => 3e42e28a2b0816105bef657dfa325867, NAME => 'ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867.', STARTKEY => '14917', ENDKEY => ''}

ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608. column=info:regioninfo, timestamp=1499765284165, value={ENCODED => a08936005520d400e9adb6f2626a7608, NAME => 'ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608.', STARTKEY => '', ENDKEY => '14917'}
ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608. column=info:seqnumDuringOpen, timestamp=1499765284165, value=\x00\x00\x00\x00\x00\x00\x00\x1F
ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608. column=info:server, timestamp=1499765284165, value=s129:16020
ns1:test1,,1499765280842.a08936005520d400e9adb6f2626a7608. column=info:serverstartcode, timestamp=1499765284165, value=1499761550410

ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867. column=info:regioninfo, timestamp=1499765284096, value={ENCODED => 3e42e28a2b0816105bef657dfa325867, NAME => 'ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867.', STARTKEY => '14917', ENDKEY => ''}
ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867. column=info:seqnumDuringOpen, timestamp=1499765284096, value=\x00\x00\x00\x00\x00\x00\x00
ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867. column=info:server, timestamp=1499765284096, value=s129:16020
ns1:test1,14917,1499765280842.3e42e28a2b0816105bef657dfa325867. column=info:serverstartcode, timestamp=1499765284096, value=1499761550410
First, two new columns, splitA and splitB, are added to the parent region's info family, together with full rows for the two daughter regions.
When the split actually completes, the plan is cleaned up: only the two daughter regions remain, i.e. only the last 8 cells above are kept, and the parent's 6 cells are deleted.
Q1: When does a split start?
  A1: When you trigger it manually; you can split a whole table or a single region (see the Admin sketch below).
  A2: Automatically, when a region's store files grow past the configured maximum file size after a compaction (see the state walkthrough in section 十二).
Q2: What are the detailed steps of a split?
  A: In outline, exactly what the meta rows above show: write a split plan (splitA/splitB) against the parent region, open the two daughter regions, then remove the parent's entries once the daughters are serving.
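
The manual trigger in A1 can come from the shell (split 'ns1:test1') or from Java. A hedged sketch against the 1.x Admin API; the explicit split point in the comment is illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Ask every region of the table to split at its midpoint,
            // the same as `split 'ns1:test1'` in the shell.
            admin.split(TableName.valueOf("ns1:test1"));
            // Or split at an explicit row key instead:
            // admin.split(TableName.valueOf("ns1:test1"), Bytes.toBytes("14917"));
        }
    }
}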

三、How some important classes relate to one another:

HMaster
HRegionServer
WALKey
HRegion
HeapSize
MemStore --> DefaultMemStore
Store --> HStore
StoreFile
HFile


3.1、HMaster: class description
org.apache.hadoop.hbase.master.HMaster

@LimitedPrivate(value={"Tools"})
@SuppressWarnings(value={"deprecation"})

HMaster is the "master server" for HBase. An HBase cluster has one active master. If many masters are started, all compete. Whichever wins goes on to run the cluster. All others park themselves in their constructor until master or cluster shutdown or until the active master loses its lease in zookeeper. Thereafter, all running masters jostle to take over the master role.

The Master can be asked to shut down the cluster. See shutdown(). In this case it will tell all regionservers to go down and then wait on them all reporting in that they are down. This master will then shut itself down.

You can also shutdown just this master. Call stopMaster().

See Also:
org.apache.zookeeper.Watcher

3.2、HRegionServer: class description
org.apache.hadoop.hbase.regionserver.HRegionServer


@LimitedPrivate(value={"Tools"})
@SuppressWarnings(value={"deprecation"})

HRegionServer makes a set of HRegions available to clients. It checks in with the HMaster. There are many HRegionServers in a single HBase deployment.
An HRegionServer manages a set of HRegions; one HRegionServer corresponds to one machine (or one server process on a machine) and keeps track of the regions it is currently serving.

3.3、WALKey: class description
org.apache.hadoop.hbase.wal.WALKey


@LimitedPrivate(value={"Replication"})

A Key for an entry in the change log. The log intermingles edits to many tables and rows, so each log entry identifies the appropriate table and row. Within a table and row, they're also sorted.

Some Transactional edits (START, COMMIT, ABORT) will not have an associated row. Note that protected members marked @InterfaceAudience.Private are only protected to support the legacy HLogKey class, which is in a different package.

3.4、HRegion carries no Javadoc of its own; see its parent interface, Region:
org.apache.hadoop.hbase.regionserver.Region


@LimitedPrivate(value={"Coprocessor"})
@Evolving

Regions store data for a certain region of a table. It stores all columns for each row. A given table consists of one or more Regions.

A Region is defined by its table and its key extent.

Locking at the Region level serves only one purpose: preventing the region from being closed (and consequently split) while other operations are ongoing. Each row level operation obtains both a row lock and a region read lock for the duration of the operation. While a scanner is being constructed, getScanner holds a read lock. If the scanner is successfully constructed, it holds a read lock until it is closed. A close takes out a write lock and consequently will block for ongoing operations and will block new operations from starting while the close is in progress.


3.5、HeapSize: class description
org.apache.hadoop.hbase.io.HeapSize

@Private

Implementations can be asked for an estimate of their size in bytes.

Useful for sizing caches. It's a given that implementation approximations do not account for 32 vs 64 bit nor for different VM implementations.

An Object's size is determined by the non-static data members in it, as well as the fixed Object overhead.

For example:

public class SampleObject implements HeapSize {
  int[] numbers;
  int x;

  // a plausible heapSize() body (not in the quoted Javadoc): fixed object
  // overhead plus the non-static members, using org.apache.hadoop.hbase.util.ClassSize
  public long heapSize() {
    return ClassSize.align(ClassSize.OBJECT + ClassSize.REFERENCE + Bytes.SIZEOF_INT)
         + ClassSize.align(ClassSize.ARRAY + numbers.length * Bytes.SIZEOF_INT);
  }
}

3.6、MemStore: class description
org.apache.hadoop.hbase.regionserver.MemStore


@Private

The MemStore holds in-memory modifications to the Store. Modifications are Cells.

The MemStore functions should not be called in parallel. Callers should hold write and read locks. This is done in HStore.

3.7、DefaultMemStore: class description
org.apache.hadoop.hbase.regionserver.DefaultMemStore

@Private

The MemStore holds in-memory modifications to the Store. Modifications are Cells. When asked to flush, current memstore is moved to snapshot and is cleared. We continue to serve edits out of new memstore and backing snapshot until flusher reports in that the flush succeeded. At this point we let the snapshot go.

The MemStore functions should not be called in parallel. Callers should hold write and read locks. This is done in HStore.

TODO: Adjust size of the memstore when we remove items because they have been deleted. TODO: With new KVSLS, need to make sure we update HeapSize with difference in KV size.

3.8、Store: class description
org.apache.hadoop.hbase.regionserver.Store


@LimitedPrivate(value={"Coprocessor"})
@Evolving

Interface for objects that hold a column family in a Region. It's a memstore and a set of zero or more StoreFiles, which stretch backwards over time.

3.9、HStore: class description
org.apache.hadoop.hbase.regionserver.HStore

@Private
An HStore holds one column family of one region.
A Store holds a column family in a Region. It's a memstore and a set of zero or more StoreFiles, which stretch backwards over time.

There's no reason to consider append-logging at this level; all logging and locking is handled at the HRegion level. Store just provides services to manage sets of StoreFiles. One of the most important of those services is compaction services where files are aggregated once they pass a configurable threshold.

The only thing having to do with logs that Store needs to deal with is the reconstructionLog. This is a segment of an HRegion's log that might NOT be present upon startup. If the param is NULL, there's nothing to do. If the param is non-NULL, we need to process the log to reconstruct a TreeMap that might not have been written to disk before the process died.

It's assumed that after this constructor returns, the reconstructionLog file will be deleted (by whoever has instantiated the Store).

Locking and transactions are handled at a higher level. This API should not be called directly but by an HRegion manager.


3.10、StoreFile: class description
org.apache.hadoop.hbase.regionserver.StoreFile

@LimitedPrivate(value={"Coprocessor"})

A Store data file. Stores usually have one or more of these files. They are produced by flushing the memstore to disk. To create, instantiate a writer using StoreFile.WriterBuilder and append data. Be sure to add any metadata before calling close on the Writer (Use the appendMetadata convenience methods). On close, a StoreFile is sitting in the Filesystem. To refer to it, create a StoreFile instance passing filesystem and path. To read, call createReader().

StoreFiles may also reference store files in another Store. The reason for this weird pattern where you use a different instance for the writer and a reader is that we write once but read a lot more.

3.11、HFile: class description
org.apache.hadoop.hbase.io.hfile.HFile

@Private

File format for hbase. A file of sorted key/value pairs. Both keys and values are byte arrays.

The memory footprint of a HFile includes the following (below is taken from the TFile documentation but applies also to HFile):

Some constant overhead of reading or writing a compressed block.
Each compressed block requires one compression/decompression codec for I/O.
Temporary space to buffer the key.
Temporary space to buffer the value.
HFile index, which is proportional to the total number of Data Blocks. The total amount of memory needed to hold the index can be estimated as (56+AvgKeySize)*NumBlocks.
Suggestions on performance optimization.
Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB.
The current implementation does not offer true multi-threading for reading. The implementation uses FSDataInputStream seek()+read(), which is shown to be much faster than positioned-read call in single thread mode. However, it also means that if multiple threads attempt to access the same HFile (using multiple scanners) simultaneously, the actual I/O is carried out sequentially even if they access different DFS blocks (Reexamine! pread seems to be 10% faster than seek+read in my testing -- stack).
Compression codec. Use "none" if the data is not very compressible (by compressible, I mean a compression ratio at least 2:1). Generally, use "lzo" as the starting point for experimenting. "gz" offers a slightly better compression ratio than "lzo" but requires 4x the CPU to compress and 2x the CPU to decompress.
For more on the background behind HFile, see HBASE-61.
File is made of data blocks followed by meta data blocks (if any), a fileinfo block, data block index, meta data block index, and a fixed size trailer which records the offsets at which file changes content type.

<data blocks><meta blocks><fileinfo><data index><meta index><trailer>
Each block has a bit of magic at its start. Blocks are comprised of key/values. In data blocks, they are both byte arrays. Metadata blocks are a String key and a byte array value. An empty file looks like this:
<fileinfo><trailer>
That is, no data or meta blocks are present.
TODO: Do scanners need to be able to take a start and end row? TODO: Should BlockIndex know the name of its file? Should it have a Path that points at its file say for the case where an index lives apart from an HFile instance?

四、Read path

五、Write path

六、Why reads and writes are so fast

七、Optimizations

  1、Add a client-side write buffer. Without it, every Put object costs one RPC to write a single row; with the buffer enabled, Puts accumulate on the client until the buffer passes 2M (the default) and are then shipped to HBase in a single RPC.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

@Test
public void insertData() throws Exception {
    // conn is an org.apache.hadoop.hbase.client.Connection set up elsewhere in the test class
    HTable table = (HTable) conn.getTable(TableName.valueOf("ns1:test1"));
    // enable the client-side write buffer (old HTable API)
    table.setAutoFlush(false, true);
    // generate puts in a loop; they accumulate in the write buffer
    for (int i = 10000; i < 20000; i++) {
        Put put = new Put(Bytes.toBytes(i + ""));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(i + ""));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("tom" + i));
        put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("age"), Bytes.toBytes(i + ""));
        put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("name"), Bytes.toBytes("tomas" + i));
        table.put(put);
    }
    // some puts are still sitting in the buffer; flush them before the test ends
    table.flushCommits();
    table.close();
}
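
setAutoFlush and flushCommits belong to the old HTable API and were removed in later clients. From HBase 1.0 on, the intended replacement is BufferedMutator; a sketch of the same buffered insert under that assumption:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedInsert {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("ns1:test1"))) {
            for (int i = 10000; i < 20000; i++) {
                Put put = new Put(Bytes.toBytes(String.valueOf(i)));
                put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("tom" + i));
                mutator.mutate(put); // buffered; sent automatically once the write buffer fills
            }
            mutator.flush(); // push anything still buffered before closing
        }
    }
}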

  2、Use filters, to cut down the amount of data the server has to send over the network.

  3、Coprocessors: let the user move computation to where the data is stored.

  4、Garbage collection tuning.

  5、Compression: LZO as the first choice, then Snappy.

  6、Tune splitting and compaction:

    Disable automatic splitting: raise hbase.hregion.max.filesize high enough (or use a DisabledRegionSplitPolicy) so regions never reach the split threshold, and split manually at quiet times.

    Pre-split: supply split points at table-creation time so load is spread across region servers from the start.

  7、Client-side tuning (a sketch follows after this list):

    table.setAutoFlush(false) plus table.flushCommits(): buffer on the client, then commit manually.

    Use scan caching: scan.setCaching().

    Never scan the whole table blindly; restrict the scan to the column families you need.

    Set a start key and a stop key.

    Remember to close the ResultScanner, to release server-side resources.

    Use filters.

    Use coprocessors.

    Disable the WAL (skip the write-ahead log on Puts) to raise write throughput, at the cost of losing unflushed data on a crash.
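
A compact sketch of those client-side knobs, assuming the 1.x client API (the table name and row-key bounds are illustrative):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientTuning {
    static void scanAndPut(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("ns1:test1"))) {
            Scan scan = new Scan();
            scan.setCaching(500);                     // rows fetched per RPC
            scan.addFamily(Bytes.toBytes("cf1"));     // scan only the family we need
            scan.setStartRow(Bytes.toBytes("10000")); // bound the scan instead of a full-table pass
            scan.setStopRow(Bytes.toBytes("20000"));
            try (ResultScanner rs = table.getScanner(scan)) { // try-with-resources closes the scanner
                for (Result r : rs) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }

            Put put = new Put(Bytes.toBytes("10001"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes("tom"));
            put.setDurability(Durability.SKIP_WAL); // skip the WAL: faster, but unsafe on a crash
            table.put(put);
        }
    }
}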

八、Failure handling

  Master down:

    1、ZooKeeper fails over to another (backup) master,

    2、data reads and writes continue normally,

    3、but table splitting and load balancing cannot proceed until a master is back.

  Region server down:

    the master hands the regions that server was responsible for over to other region servers.

  A ZooKeeper outage is unlikely, since it runs as a quorum.

九、Integrating HBase with MapReduce
An MR job can read its input from many kinds of sources (see the implementations of InputFormat): a file on HDFS, Hive data, MySQL data, and by the same token HBase:
for input from HBase, use TableInputFormat;
for output to HBase, use TableOutputFormat.

The mapper is then no longer the plain Mapper but TableMapper, and likewise the reducer is TableReducer; override their map and reduce methods as usual.

A frequently used helper class: TableMapReduceUtil (a sketch follows).
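
A minimal map-only job wired up with TableMapReduceUtil, modeled on the copy-table pattern. The table names and the CopyMapper class are illustrative assumptions; the utility methods are the real mapreduce-package API:

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTableJob {
    // Identity mapper: re-emits every cell of each row as a Put keyed by the row.
    static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            Put put = new Put(row.get());
            for (Cell cell : value.rawCells()) {
                put.add(cell);
            }
            ctx.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "copy ns1:test1");
        job.setJarByClass(CopyTableJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scan batches for a batch job
        scan.setCacheBlocks(false);  // don't churn the region server block cache
        TableMapReduceUtil.initTableMapperJob("ns1:test1", scan,
                CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
        // Map-only copy into the target table; pass a TableReducer instead of null for aggregation.
        TableMapReduceUtil.initTableReducerJob("ns1:test1_copy", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}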


十、A few small points

  If the HBase cluster is restarted (without restarting ZK), HMaster does its job at startup and assigns regions across the region servers; restart several times and you will see the same region managed by one machine this time and by another machine the next.

  Q1: How should rowkeys be designed?

  A: See the design principles in section 十四.

  Q2: A table with a single column family cf1 starts out empty. We put a record with rowkey=1 into the column family's file, keep adding up to 100000, and the file reaches 128M. If we add one more record, where does it go: a new file, or the existing one? And if we then add a record with rowkey 500, where does that one land?

  A: New writes always go to the memstore first and are flushed into new store files; even the put with the small rowkey 500 lands in the newest file, and the global rowkey ordering is restored at compaction time (see states 7 to 9 in section 十二).

  Q3: As soon as a table is created it already has a region name, the corresponding directory exists on HDFS, and the column-family subdirectories exist too, but they contain no files yet (see state 1 in section 十二).

  Q4: What thresholds make the memstore flush to HDFS? Memory size? Time? What are the defaults, and which properties control them?

    <property>
       <name>hbase.hregion.memstore.flush.size</name>
       <value>134217728</value>
       <description>Memstore will be flushed to disk if size of the memstore exceeds this number of bytes (i.e. 128M). Value is checked by a thread that runs every hbase.server.thread.wakefrequency.</description>
    </property>
    <property>
       <name>hbase.regionserver.global.memstore.lowerLimit</name>
       <value>0.35</value>
       <description>Maximum size of all memstores in a region server before flushes are forced. Defaults to 35% of heap. This value equal to hbase.regionserver.global.memstore.upperLimit causes the minimum possible flushing to occur when updates are blocked due to memstore limiting.</description>
    </property>
    A region server hosts many regions, a region has one memstore per column family, and once all memstores on the server together reach this fraction of the heap they are force-flushed to disk, starting from the memstore using the most memory and working down.
 
  Q5: When does a column family get a new store file, what triggers a compaction, and how does the compaction proceed? (The log walkthrough in section 十三 answers this: one new file per flush, and compaction once 3 files accumulate.)

 

 十一、Some good HBase write-ups:

    http://blog.csdn.net/woshiwanxin102213/article/details/17584043

    http://blog.csdn.net/frankiewang008/article/details/41965543

    http://blog.csdn.net/u010270403/article/details/51648462

    http://www.blogjava.net/DLevin/archive/2015/08/22/426877.html  (the best written of these)

十二、From an empty table to a region split, state by state

  State 1: a table is created. It already has a region name and a matching directory on HDFS; inside it are the column-family subdirectories, which are still empty. Assume the table has two column families, cf1 and cf2.

  State 2: put one row. It is held in the memstore (one memstore per column family); the cf1 and cf2 directories on HDFS still contain no files.

  State 3: put again; the data again goes to the memstores, and cf1 and cf2 on HDFS still have no files.

  State 4: keep putting. Now: if one of the memstores grows past 128M, that memstore is flushed to disk. If neither reaches 128M, their combined size is checked; once they exceed 40% of the heap, both memstores are flushed, largest first, from the memstore using the most memory down to the smallest. Once the data is persisted the WAL is no longer needed; it becomes an oldWAL and is deleted shortly after. Note: in-memory size does not equal on-disk size. Memstore data carries in-memory structure, so 100M of memstore may flush to only about 25M on disk.

  State 5: by the end of state 4, cf1 and cf2 on HDFS each have their first file.

  State 6: a second file appears. Some data now lives in memory and some on disk; querying the most recently inserted data is faster than querying the oldest, because the newest is served from memory. This is an optimization point: give the memstore more memory.

  State 7: a third file appears. If we now modify early data, or insert a row whose small rowkey by sort order belongs in the first file, it is still written into the third file; the ordering is fixed up when the files are compacted.

  State 8: a fourth file appears. Suppose the HRegionServer process has 500M of heap (it is a Java program, so you can set its JVM options); then the two column families' memstores together may not exceed 500 * 40% = 200M, i.e. each memstore is flushed at roughly 100M. Because of the memory-vs-disk size difference, each flush yields a file of about 30M, so by now there are 4 small files.

  State 9: with 4 files present, compaction begins, merging the 4 small files into one big file:

    State 9.1: if the merged file is still under 256M, this region is not holding much data yet and the region server can still cope. Puts continue, 3 more small files accumulate, together with the big file that makes 4 again, they are compacted again, the result is still under 256M, and so on... until a merged file finally exceeds 256M; then we enter state 10.

  State 10: the merged file exceeds 256M, meaning this region now holds enough data; it is time to split the region.

  State 11: the region is split.

  State 12: after the split, puts continue; each put is routed by its rowkey to a different region (and hence region server). A freshly split region has not reached 256M, so the whole cycle repeats: memstores fill, flush to disk, 4 small files compact into a big one, the region eventually splits again, returning to state 12, and so on indefinitely.

  In short: tuning the memstore, tuning file compaction, and tuning region splitting are all worthwhile HBase optimizations.
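
Those three levers map onto hbase-site.xml properties. A sketch with illustrative values (the property names are real, but defaults differ across HBase versions, so check yours before copying):

    <property>
       <name>hbase.hregion.memstore.flush.size</name>
       <value>134217728</value> <!-- flush a memstore once it reaches 128M -->
    </property>
    <property>
       <name>hbase.hstore.compactionThreshold</name>
       <value>3</value> <!-- compact once a store has accumulated this many HFiles -->
    </property>
    <property>
       <name>hbase.hregion.max.filesize</name>
       <value>268435456</value> <!-- split a region once a store grows past 256M -->
    </property>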

       State 9, the compaction of files, and states 10 to 12, the splitting and reassignment of regions, are the parts to focus on.

十三、Logs: an annotated region server log

# The table has just been created and is still empty
2017-07-15 10:49:35,938 INFO [PriorityRpcServer.handler=15,queue=1,port=16020] regionserver.RSRpcServices: Open ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9.
2017-07-15 10:49:35,989 INFO [StoreOpener-0ad1bba3bfd843923eb4f03c66e111a9-1] hfile.CacheConfig: Created cacheConfig for cf1: blockCache=LruBlockCache{blockCount=0, currentSize=103664, freeSize=99668344, maxSize=99772008, heapSize=103664, minSize=94783408, minFactor=0.95, multiSize=47391704, multiFactor=0.5, singleSize=23695852, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
2017-07-15 10:49:35,990 INFO [StoreOpener-0ad1bba3bfd843923eb4f03c66e111a9-1] compactions.CompactionConfiguration: size [134217728, 9223372036854775807, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
2017-07-15 10:49:36,010 INFO [StoreOpener-0ad1bba3bfd843923eb4f03c66e111a9-1] hfile.CacheConfig: Created cacheConfig for cf2: blockCache=LruBlockCache{blockCount=0, currentSize=103664, freeSize=99668344, maxSize=99772008, heapSize=103664, minSize=94783408, minFactor=0.95, multiSize=47391704, multiFactor=0.5, singleSize=23695852, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
2017-07-15 10:49:36,010 INFO [StoreOpener-0ad1bba3bfd843923eb4f03c66e111a9-1] compactions.CompactionConfiguration: size [134217728, 9223372036854775807, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
2017-07-15 10:49:36,052 INFO [RS_OPEN_REGION-s130:16020-1] regionserver.HRegion: Onlined 0ad1bba3bfd843923eb4f03c66e111a9; next sequenceid=2
2017-07-15 10:49:36,059 INFO [PostOpenDeployTasks:0ad1bba3bfd843923eb4f03c66e111a9] regionserver.HRegionServer: Post open deploy tasks for ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9.
2017-07-15 10:49:36,073 INFO [PostOpenDeployTasks:0ad1bba3bfd843923eb4f03c66e111a9] hbase.MetaTableAccessor: Updated row ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. with server=s130,16020,1500086847611
2017-07-15 10:49:40,971 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 10:49:40,974 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 10:49:41,476 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 10:49:41,483 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 10:49:41,483 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started

# A memstore reaches the threshold and is about to be flushed
2017-07-15 10:52:35,104 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=101.23 KB, freeSize=95.05 MB, max=95.15 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=29, evicted=0, evictedPerRun=0.0
2017-07-15 10:53:12,807 INFO [MemStoreFlusher.0] regionserver.MemStoreFlusher: Flush of region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. due to global heap pressure. Total Memstore size=92.2 M, Region memstore size=92.2 M  <- the threshold was hit: all memstores on this server together exceed 40% of the heap
2017-07-15 10:53:12,831 INFO [MemStoreFlusher.0] regionserver.HRegion: Flushing 2/2 column families, memstore=92.19 MB
2017-07-15 10:53:15,520 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=47, memsize=46.7 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/da700a92cfcb4c8995b28b70bd6c70da
2017-07-15 10:53:16,870 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=47, memsize=46.7 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/7048713f584c41f8849dd5d3a2cb9570
2017-07-15 10:53:16,933 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf1/da700a92cfcb4c8995b28b70bd6c70da, entries=280000, sequenceid=47, filesize=11.6 M
2017-07-15 10:53:16,968 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf2/7048713f584c41f8849dd5d3a2cb9570, entries=280000, sequenceid=47, filesize=12.1 M
2017-07-15 10:53:16,973 INFO [MemStoreFlusher.0] regionserver.HRegion: Finished memstore flush of ~93.46 MB/98000000, currentsize=0 B/0 for region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. in 4142ms, sequenceid=47, compaction requested=false
2017-07-15 10:54:33,578 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 10:54:33,581 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 10:54:34,090 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 10:54:34,098 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 10:54:34,098 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started

# The memstore reaches the threshold again and is flushed again
2017-07-15 10:57:35,098 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=101.23 KB, freeSize=95.05 MB, max=95.15 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
2017-07-15 10:59:33,577 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 10:59:33,580 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 10:59:34,089 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 10:59:34,104 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 10:59:34,104 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2017-07-15 11:02:35,100 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=359.91 KB, freeSize=94.80 MB, max=95.15 MB, blockCount=4, accesses=8, hits=4, hitRatio=50.00%, , cachingAccesses=8, cachingHits=4, cachingHitsRatio=50.00%, evictions=89, evicted=0, evictedPerRun=0.0
2017-07-15 11:03:27,021 INFO [MemStoreFlusher.0] regionserver.MemStoreFlusher: Flush of region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. due to global heap pressure. Total Memstore size=90.8 M, Region memstore size=90.8 M
2017-07-15 11:03:27,024 INFO [MemStoreFlusher.0] regionserver.HRegion: Flushing 2/2 column families, memstore=90.79 MB
2017-07-15 11:03:27,117 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Above memory mark but there are no flushable regions!
2017-07-15 11:03:28,118 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Above memory mark but there are no flushable regions!
2017-07-15 11:03:29,023 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=96, memsize=45.4 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/10b01f62d4d54fed8af91022cb500e6c
2017-07-15 11:03:29,119 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Above memory mark but there are no flushable regions!
2017-07-15 11:03:30,122 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: Above memory mark but there are no flushable regions!
2017-07-15 11:03:30,437 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=96, memsize=45.4 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/b735b5279ff94b5997f1366266816c14
2017-07-15 11:03:30,512 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf1/10b01f62d4d54fed8af91022cb500e6c, entries=272005, sequenceid=96, filesize=11.3 M
2017-07-15 11:03:30,539 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf2/b735b5279ff94b5997f1366266816c14, entries=272000, sequenceid=96, filesize=11.8 M
2017-07-15 11:03:30,543 INFO [MemStoreFlusher.0] regionserver.HRegion: Finished memstore flush of ~90.79 MB/95200880, currentsize=0 B/0 for region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. in 3521ms, sequenceid=96, compaction requested=false
2017-07-15 11:04:33,578 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 11:04:33,578 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 11:04:34,086 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 11:04:34,106 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 11:04:34,106 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started

# The memstore reaches the threshold a third time and prepares to flush
2017-07-15 11:07:35,107 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=489.26 KB, freeSize=94.67 MB, max=95.15 MB, blockCount=6, accesses=14, hits=8, hitRatio=57.14%, , cachingAccesses=14, cachingHits=8, cachingHitsRatio=57.14%, evictions=119, evicted=0, evictedPerRun=0.0
2017-07-15 11:09:33,579 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 11:09:33,583 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 11:09:34,086 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 11:09:34,090 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 11:09:34,090 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2017-07-15 11:09:52,941 INFO [MemStoreFlusher.0] regionserver.MemStoreFlusher: Flush of region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. due to global heap pressure. Total Memstore size=91.7 M, Region memstore size=91.7 M
2017-07-15 11:09:52,941 INFO [MemStoreFlusher.0] regionserver.HRegion: Flushing 2/2 column families, memstore=91.67 MB
2017-07-15 11:09:54,886 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=140, memsize=45.8 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/a19fefa177284d9bac07536e9b5000ab
2017-07-15 11:09:56,030 INFO [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=140, memsize=45.8 M, hasBloomFilter=true, into tmp file hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp/a3dc23d6a820456c80618f2ba5d26289
2017-07-15 11:09:56,084 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf1/a19fefa177284d9bac07536e9b5000ab, entries=274640, sequenceid=140, filesize=11.5 M
2017-07-15 11:09:56,118 INFO [MemStoreFlusher.0] regionserver.HStore: Added hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/cf2/a3dc23d6a820456c80618f2ba5d26289, entries=274640, sequenceid=140, filesize=11.9 M
2017-07-15 11:09:56,121 INFO [MemStoreFlusher.0] regionserver.HRegion: Finished memstore flush of ~91.67 MB/96124000, currentsize=1.79 MB/1876000 for region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. in 3179ms, sequenceid=140, compaction requested=true

# After 3 flushes there are 3 small files; compaction into one big file begins, starting with the cf1 family
2017-07-15 11:09:56,129 INFO [regionserver/s130/192.168.40.130:16020-shortCompactions-1500086969377] regionserver.HRegion: Starting compaction on cf1 in region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9.
2017-07-15 11:09:56,133 INFO [regionserver/s130/192.168.40.130:16020-shortCompactions-1500086969377] regionserver.HStore: Starting compaction of 3 file(s) in cf1 of ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. into tmpdir=hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp, totalSize=34.4 M  <- first the 3 small files are merged into a temporary file
2017-07-15 11:10:07,076 INFO [regionserver/s130/192.168.40.130:16020-shortCompactions-1500086969377] regionserver.HStore: Completed compaction of 3 (all) file(s) in cf1 of ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. into 74e1c9ba1e0144eb93ed6e045973e036(size=34.3 M), total size for store is 34.3 M. This selection was in queue for 0sec, and took 10sec to execute.  <- compaction complete; the merged file is named "74e1c9ba1e0144eb93ed6e045973e036" and is 34.3M in size
2017-07-15 11:10:07,108 INFO [regionserver/s130/192.168.40.130:16020-shortCompactions-1500086969377] regionserver.CompactSplitThread: Completed compaction: Request = regionName=ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9., storeName=cf1, fileCount=3, fileSize=34.4 M, priority=7, time=11255336725475; duration=10sec

# Next, the cf2 family is compacted

2017-07-15 11:09:56,130 INFO [regionserver/s130/192.168.40.130:16020-longCompactions-1500086853919] regionserver.HRegion: Starting compaction on cf2 in region ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9.
2017-07-15 11:09:56,130 INFO [regionserver/s130/192.168.40.130:16020-longCompactions-1500086853919] regionserver.HStore: Starting compaction of 3 file(s) in cf2 of ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. into tmpdir=hdfs://s128:8020/hbase/data/ns1/t1/0ad1bba3bfd843923eb4f03c66e111a9/.tmp, totalSize=35.8 M  <- merge the 3 small files into a temporary file
2017-07-15 11:10:07,081 INFO [regionserver/s130/192.168.40.130:16020-longCompactions-1500086853919] regionserver.HStore: Completed compaction of 3 (all) file(s) in cf2 of ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9. into a2aa7398e1d74258bd7bb08bf79933e9(size=35.7 M), total size for store is 35.7 M. This selection was in queue for 0sec, and took 10sec to execute.  <- compaction complete; the merged file is named "a2aa7398e1d74258bd7bb08bf79933e9" and is 35.7M in size
2017-07-15 11:10:07,096 INFO [regionserver/s130/192.168.40.130:16020-longCompactions-1500086853919] regionserver.CompactSplitThread: Completed compaction: Request = regionName=ns1:t1,,1500086975414.0ad1bba3bfd843923eb4f03c66e111a9., storeName=cf2, fileCount=3, fileSize=35.8 M, priority=7, time=11255337860129; duration=10sec

# Compaction finished
2017-07-15 11:12:35,101 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=101.23 KB, freeSize=95.05 MB, max=95.15 MB, blockCount=0, accesses=1136, hits=15, hitRatio=1.32%, , cachingAccesses=18, cachingHits=12, cachingHitsRatio=66.67%, evictions=149, evicted=6, evictedPerRun=0.04026845470070839
2017-07-15 11:14:33,579 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2017-07-15 11:14:33,581 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2017-07-15 11:14:34,083 INFO [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2017-07-15 11:14:34,087 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2017-07-15 11:14:34,087 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2017-07-15 11:17:35,099 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=101.23 KB, freeSize=95.05 MB, max=95.15 MB, blockCount=0, accesses=1136, hits=15, hitRatio=1.32%, , cachingAccesses=18, cachingHits=12, cachingHitsRatio=66.67%, evictions=179, evicted=6, evictedPerRun=0.03351955488324165


In summary: once the number of flushed small files reaches 3 (the hbase.hstore.compactionThreshold default), compaction begins.


十四、Rowkey design is critical: reads are only get or scan, driven by rowkey ranges.

     Principle 1, length: a rowkey is a binary byte stream and may be any string up to 64KB; in practice 10 to 100 bytes, stored as a byte[]. Design it fixed-length and as short as possible, ideally no more than 16 bytes.

        HBase stores everything as key/value pairs, and the key is repeated for every cell of a row, so longer keys directly inflate file size.

        Long keys also waste memstore space: less data fits in memory, more reads have to go to disk, and HBase's response time degrades.

     Principle 2, hashing: if the rowkey embeds an increasing timestamp, do not put the timestamp in the high-order position; move it toward the end, into the low-order position.

        With the timestamp leading, rows written during a busy period cluster together, concentrating load on one region server and hurting query performance.

   Principle 3, uniqueness: rowkeys must be unique. RowKeys are stored in lexicographic order, so exploit that: keep data that is read together stored together, and keep data likely to be accessed soon adjacent.

    Example: the rowkey for a user's order-list query.

         Users mostly look up their own order history, so userNum comes first to cluster one user's rows, then the order time, then the order number. The RowKey can be designed as userNum$orderTime$seriaNum (a sketch of building such a key follows).
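
A small sketch of building that composite key. The fixed field widths and the reversed timestamp (so newer orders sort first) are illustrative assumptions, not part of the original design:

import org.apache.hadoop.hbase.util.Bytes;

public class OrderRowKey {
    // Builds the userNum$orderTime$seriaNum key from the example above.
    // Fixed-width, zero-padded fields keep the lexicographic order meaningful;
    // (Long.MAX_VALUE - timestamp) makes newer orders sort first.
    static byte[] build(long userNum, long orderTimeMillis, String seriaNum) {
        String key = String.format("%010d", userNum)
                + "$" + String.format("%019d", Long.MAX_VALUE - orderTimeMillis)
                + "$" + seriaNum;
        return Bytes.toBytes(key);
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(build(42L, System.currentTimeMillis(), "SN0001")));
    }
}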

十五、Column family design principles:

    1、Multiple column families are generally discouraged; 1 or 2 is good, and 3 already counts as many.

     Column family optimizations (a sketch follows after this list):

    1、enable block caching

    2、Bloom filters

    3、compression

    4、keep fewer versions

    5、block size: if you mostly get, smaller blocks work better; if you mostly scan, larger blocks work better.
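
All five knobs are set on the column family descriptor when the table is created. A hedged sketch with the 1.x admin API; the table name and concrete values are illustrative:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class FamilyTuning {
    static void createTunedTable(Admin admin) throws Exception {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("ns1:test2"));
        HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
        cf1.setBlockCacheEnabled(true);                        // 1. cache data blocks on read
        cf1.setBloomFilterType(BloomType.ROW);                 // 2. row-level Bloom filter
        cf1.setCompressionType(Compression.Algorithm.SNAPPY);  // 3. compress store files
        cf1.setMaxVersions(1);                                 // 4. keep only one version
        cf1.setBlocksize(16 * 1024);                           // 5. smaller blocks favour gets
        table.addFamily(cf1);
        admin.createTable(table);
    }
}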

