[Repost] Using Huawei's HBase index module: an evaluation of hindex, the HBase secondary index module. October 16, 2014

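hindex is a secondary index solution for HBase. It provides declarative indexes and uses coprocessors to create and maintain the index tables automatically, so clients do not have to write data twice. hindex also arranges its rowkeys so that index data and the corresponding user data sit in the same region, which gives good query performance. Introduction: huawei-hbase-secondary-secondary-index-implementations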
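To make the rowkey idea concrete, here is a minimal sketch of how an index rowkey could be composed so that it begins with the start key of the user region. The layout (region start key, index name, fixed-width value, user rowkey), the class `IndexRowKeySketch`, and its method names are assumptions for illustration only; they are not taken from the hindex source and may differ from its real byte format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: this layout is an assumption, not hindex's actual byte format.
final class IndexRowKeySketch {

    // Zero-pads (or truncates) a value to a fixed width so every key has the same shape.
    static byte[] fixedWidth(byte[] value, int width) {
        return Arrays.copyOf(value, width);
    }

    // Concatenates [region start key][index name][fixed-width value][user rowkey].
    static byte[] buildIndexRowKey(byte[] regionStartKey, String indexName,
                                   byte[] value, int maxValueLength, byte[] userRowKey) {
        byte[] name = indexName.getBytes(StandardCharsets.UTF_8);
        byte[] padded = fixedWidth(value, maxValueLength);
        byte[][] parts = {regionStartKey, name, padded, userRowKey};
        int total = 0;
        for (byte[] p : parts) total += p.length;
        byte[] key = new byte[total];
        int pos = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, key, pos, p.length);
            pos += p.length;
        }
        return key;
    }

    // True when key begins with prefix.
    static boolean startsWith(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] start = "m".getBytes(StandardCharsets.UTF_8);
        byte[] key = buildIndexRowKey(start, "idx1",
                "bob".getBytes(StandardCharsets.UTF_8), 10,
                "row42".getBytes(StandardCharsets.UTF_8));
        System.out.println(key.length + " bytes, shares region prefix: " + startsWith(key, start));
    }
}
```

Because every index key carries the user region's start key as its prefix, an index table whose regions use the same start keys can keep each index entry next to the rows it points to. Keeping the user rowkey at the end lets a scan over the index recover the original row directly.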

Source code: https://github.com/Huawei-Hadoop/hindex

Source code overview

hindex is built on HBase 0.94.8 and runs on the HBase server side. It uses coprocessors to maintain and query the index tables:

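org.apache.hadoop.hbase.index.coprocessor.master.IndexMasterObserver
Intercepts DDL operations. When a user table is created, deleted, enabled, disabled, or dropped, it creates, alters, or drops the index table to match. It also intercepts region balancing and updates the index table when HFiles are merged or split, so that index entries and the data rows with the same rowkey always live on the same region server, which speeds up queries.
org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver
Intercepts Put/Delete/Get/Scan/Flush and similar operations on the user table and updates the index table data to match.
org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver
Hooks into WAL operations. When the region server's write-ahead log is written, it decides whether the index table needs a matching operation and submits the write-ahead operation to the region server.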
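
The division of labour above can be illustrated with a toy in-memory model: a table calls a hook after every put, and the hook keeps a value-to-rowkey index current, so the client never writes the index itself. None of the types below (`ToyTable`, `PutObserver`, `ColumnIndex`) are HBase or hindex classes; they are hypothetical stand-ins, and the real IndexRegionObserver works against regions, HFiles, and the WAL rather than Java maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: none of these types are HBase or hindex classes.
final class ObserverIndexSketch {

    // Hook called by the table after each write, standing in for a region observer.
    interface PutObserver {
        void postPut(String rowKey, String column, String oldValue, String newValue);
    }

    static final class ToyTable {
        private final Map<String, Map<String, String>> rows = new TreeMap<>();
        private final List<PutObserver> observers = new ArrayList<>();

        void register(PutObserver observer) {
            observers.add(observer);
        }

        // Stores the cell, then notifies every observer with the previous value.
        void put(String rowKey, String column, String value) {
            String old = rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
            for (PutObserver o : observers) {
                o.postPut(rowKey, column, old, value);
            }
        }
    }

    // Maintains value -> rowkeys for a single indexed column.
    static final class ColumnIndex implements PutObserver {
        private final String indexedColumn;
        private final Map<String, Set<String>> valueToRows = new TreeMap<>();

        ColumnIndex(String indexedColumn) {
            this.indexedColumn = indexedColumn;
        }

        @Override
        public void postPut(String rowKey, String column, String oldValue, String newValue) {
            if (!indexedColumn.equals(column)) return;
            if (oldValue != null) {
                // Remove the stale entry so the index never points at an old value.
                Set<String> stale = valueToRows.get(oldValue);
                if (stale != null) {
                    stale.remove(rowKey);
                    if (stale.isEmpty()) valueToRows.remove(oldValue);
                }
            }
            valueToRows.computeIfAbsent(newValue, k -> new TreeSet<>()).add(rowKey);
        }

        Set<String> lookup(String value) {
            return valueToRows.getOrDefault(value, Collections.emptySet());
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        ColumnIndex index = new ColumnIndex("info:city");
        table.register(index);
        table.put("r1", "info:city", "Paris");
        table.put("r1", "info:city", "Rome"); // overwrite moves r1 in the index
        System.out.println(index.lookup("Rome"));
    }
}
```

The client only ever calls `put` on the table; the index is a side effect of the hook, which is the same contract the post describes for hindex's declarative indexes.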

The repository contains a complete hbase-0.94.8 tree. All code related to the secondary index is under /secondaryindex/src/main/java.

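Implemented so far: index table synchronization, index table balancing, index data synchronization, and index scans. The release notes list the following features as still to be implemented: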

Dynamically add/drop index
Integrate Secondary Index Management in the HBase Shell
Optimize range scan scenarios
HBCK tool support for Secondary index tables
WAL Optimizations for Secondary index table entries
Make Scan Evaluation Intelligence Pluggable

Usage

Download the source and build it with Maven:

mvn package -DskipTests=true

Upload the build artifact to the HBase server:

scp target/hbase-0.94.8.jar user@server:$HBASE_HOME/conf/

HBase configuration (hbase-env.sh):

export HBASE_CLASSPATH=$HBASE_HOME/conf/hbase-0.94.8.jar

HBase configuration (hbase-site.xml):

<property>
 <name>hbase.coprocessor.master.classes</name>
 <value>org.apache.hadoop.hbase.index.coprocessor.master.IndexMasterObserver</value>
</property>
<property>
 <name>hbase.coprocessor.region.classes</name>
 <value>org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver</value>
</property>
<property>
 <name>hbase.coprocessor.wal.classes</name>
 <value>org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver</value>
</property>

Restart HBase and open the web UI. The coprocessors are listed as loaded:

Coprocessors [IndexMasterObserver]
Coprocessors [IndexRegionObserver, IndexWALObserver]

Use the HBase and hindex Java APIs to create a table and define an index on it:

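 // Table descriptor that can also carry index definitions
 IndexedHTableDescriptor htd = new IndexedHTableDescriptor(usertableName);
 IndexSpecification iSpec = new IndexSpecification(indexName);
 HColumnDescriptor hcd = new HColumnDescriptor(columnFamily);
 // Index one String column of this family; 10 is passed as the value length argument
 iSpec.addIndexColumn(hcd, indexColumnQualifier, ValueType.String, 10);
 htd.addFamily(hcd);
 htd.addIndex(iSpec);
 // admin is assumed to be an HBaseAdmin for the target cluster
 admin.createTable(htd);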

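As expected, both the data table and the index table should appear, and inserting rows into the data table should automatically produce the reverse index entries defined by the index. But none of this happened. Why?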

Version compatibility

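The cause is a version incompatibility between hindex and the HBase deployed on site. hindex is developed against hbase-0.94.8, but the site runs hbase-0.94.6-cdh4.3.0, which Cloudera built on top of hbase-0.94.6, and the coprocessor interfaces of the two versions differ. Taking MasterObserver as an example, the hooks around create table in hbase-0.94.8 are: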

postCreateTable(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])
postCreateTableHandler(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])
preCreateTable(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])
preCreateTableHandler(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])

In hbase-0.94.6 they are:

postCreateTable(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])
preCreateTable(ObserverContext<MasterCoprocessorEnvironment>, HTableDescriptor, HRegionInfo[])

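So the coprocessors implemented in hindex cannot create the index table through the intended flow. Tracing the HBase master log shows only calls to preCreateTable and postCreateTable; the code in preCreateTableHandler and postCreateTableHandler never runs.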
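
A mismatch like this can be detected before deployment by checking, via reflection, whether the observer interface on the target cluster declares the hooks a coprocessor relies on. The sketch below uses two hypothetical stand-in interfaces (`Observer0946`, `Observer0948`) that mimic the method lists shown above with simplified signatures; it does not load the real HBase classes.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: the nested interfaces are stand-ins, not the real HBase MasterObserver.
final class HookCheckSketch {

    // Mimics the create-table hooks listed for hbase-0.94.6 (signatures simplified).
    interface Observer0946 {
        void preCreateTable();
        void postCreateTable();
    }

    // Mimics hbase-0.94.8, which adds the two *Handler hooks.
    interface Observer0948 extends Observer0946 {
        void preCreateTableHandler();
        void postCreateTableHandler();
    }

    // Returns the required hook names that the interface does not declare, in input order.
    static List<String> missingHooks(Class<?> observerInterface, String... required) {
        Set<String> declared = new HashSet<>();
        for (Method m : observerInterface.getMethods()) {
            declared.add(m.getName());
        }
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            if (!declared.contains(name)) missing.add(name);
        }
        return missing;
    }

    public static void main(String[] args) {
        String[] needed = {"preCreateTableHandler", "postCreateTableHandler"};
        System.out.println("0.94.6-like: missing " + missingHooks(Observer0946.class, needed));
        System.out.println("0.94.8-like: missing " + missingHooks(Observer0948.class, needed));
    }
}
```

On a real cluster the same check could be pointed at `org.apache.hadoop.hbase.coprocessor.MasterObserver` loaded from the deployed HBase jar. A hook that the host interface does not declare is never invoked by the master, which matches the log trace described above.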

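I tried changing the IndexMasterObserver methods to match the hbase-0.94.6 interface names and redeployed. After creating the data table and the index, the index table was created automatically: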

User Table Online Regions Description
test 1 {NAME => 'test', FAMILIES => [{NAME => 'info'}]}
test_idx 1 {NAME => 'test_idx', SPLIT_POLICY => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy', MAX_FILESIZE => '9223372036854775807', FAMILIES => [{NAME => 'd'}]}

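But the index table is empty; no index data has been written to it yet. IndexRegionObserver and IndexWALObserver would need changes too, and HFile splits and merges would have to be handled. In short, the whole secondary index feature would have to be ported to hbase-0.94.6-cdh4.3.0. I'll pass on that for now.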

Next I checked whether any CDH release is based on hbase-0.94.8. CDH has currently reached 5.0.0-beta-1, which is based on hbase-0.95.2:
CDH-Version-and-Packaging-Information.

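So neither cdh4.5.0 nor cdh5.0.0-beta1 is based directly on hbase-0.94.8. Using hindex on either would require porting and testing. Another option is stock Apache hbase-0.94.8, but that version has no build for Hadoop 2 and can only be deployed on Hadoop 1.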

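When the query requirements are fairly fixed and the index structure can be planned up front, I believe hindex is a good solution and worth pursuing. I'll keep looking into it when I have time.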

posted on 2015-05-07 15:20 by 南馨
