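A From-Scratch Hadoop Practice Diary (HBase in Action)

I set up a pseudo-distributed cluster; the installation page also covers the other deployment modes. Installation reference:

Creating the table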

create 'dailystats','uid','sTime','eTime','calories','steps','activeValue','pm25suck','runDist','runDura','cycDist','cycDura','walkDist','walkDura','runCal','cycCal','walkCal','goadCal','goalSteps','goalActiveVal','locations','day'

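Importing data

Input fields are tab-separated (\t) by default. Below is one way to bulk-load a large amount of data: ImportTsv followed by LoadIncrementalHFiles (also invoked as completebulkload).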
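
Since ImportTsv reads one record per line with tab-separated fields, an input file for it can be produced with a few lines of plain Java. The following is only a minimal sketch and not part of my actual setup: the class name TsvLine and its join helper are hypothetical, and the sample values are made up. It rejects fields that contain a tab or a newline, since those would shift the columns.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one tab-separated line for an ImportTsv input file.
final class TsvLine {
    // Joins the fields with '\t'; by convention the first field is the row key.
    static String join(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            String f = fields.get(i);
            if (f.indexOf('\t') >= 0 || f.indexOf('\n') >= 0) {
                throw new IllegalArgumentException("field " + i + " contains a tab or newline");
            }
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(f);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample record: row key, then sTime, eTime, calories.
        System.out.println(join(Arrays.asList("100044", "1394985600.0", "1395072000.0", "203.08")));
    }
}
```

Each line of the file would be one such record, with the fields in the same order as the column list passed to ImportTsv.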

Step 1: convert the data to HFiles
sudo -su hdfs hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,sTime,eTime,calories,steps,activeValue,pm25suck,runDist,runDura,cycDist,cycDura,walkDist,walkDura,runCal,cycCal,walkCal,goadCal,goalSteps,goalActiveVal,locations,day -Dimporttsv.bulk.output=/user/hdfs/hbase_day_uid_file_head dailystats /user/hdfs/day_uid_file_head
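
Each entry in -Dimporttsv.columns is either the special name HBASE_ROW_KEY or a column written as family:qualifier; an entry without a colon, as in the command above, names a column family with an empty qualifier. The sketch below only illustrates that mapping in plain Java. ColumnSpec and its methods are hypothetical names, the stats:steps entry is made up, and this is not ImportTsv's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how an importtsv.columns entry maps to family and qualifier.
final class ColumnSpec {
    final String family;
    final String qualifier;

    ColumnSpec(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // Splits "family:qualifier"; without a colon the qualifier is empty.
    static ColumnSpec parseOne(String entry) {
        int colon = entry.indexOf(':');
        if (colon < 0) {
            return new ColumnSpec(entry, "");
        }
        return new ColumnSpec(entry.substring(0, colon), entry.substring(colon + 1));
    }

    // Parses the comma-separated list, skipping the row-key marker.
    static List<ColumnSpec> parse(String columns) {
        List<ColumnSpec> out = new ArrayList<ColumnSpec>();
        for (String entry : columns.split(",")) {
            if (!entry.equals("HBASE_ROW_KEY")) {
                out.add(parseOne(entry));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "stats:steps" is a made-up entry that shows the family:qualifier form.
        for (ColumnSpec c : parse("HBASE_ROW_KEY,sTime,stats:steps")) {
            System.out.println("family=" + c.family + " qualifier='" + c.qualifier + "'");
        }
    }
}
```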

Step 2: load the HFiles into HBase
sudo -su hdfs hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hdfs/hbase_day_uid_file_head dailystats

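I imported 150,000 users, 1.4 million rows in total. Step 1 requires the input file to be uploaded to HDFS beforehand and took about 10 minutes; step 2 finished within 2 seconds.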

Step 2 can alternatively be run like this (untested):

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
Example:

hadoop jar ${HBASE_HOME}/hbase-0.92.1.jar completebulkload /user/hfile/test_hfile.log table_name

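HBase also provides its own Export and Import utilities. Unfortunately they do not handle plain text files: they work with sequence files produced by HBase itself, so they are only suitable for moving data from one HBase instance to another. Reference: the official HBase guide (Chinese edition).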

14.1.7. Export
Export is a utility that will dump the contents of table to HDFS in a sequence file. Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration. 
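
The nested brackets in the Export usage line mean that each optional argument can only be given if the ones before it are given too: there is no way to pass a start time without also passing a version count. The sketch below spells out that rule in plain Java. ExportArgs is a hypothetical helper, not part of HBase, and the output directory in main is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds the positional arguments for the Export utility.
final class ExportArgs {
    // null means "not given"; a later option requires all earlier ones.
    static List<String> build(String table, String outputDir,
                              Integer versions, Long startTime, Long endTime) {
        if (startTime != null && versions == null) {
            throw new IllegalArgumentException("starttime requires versions");
        }
        if (endTime != null && startTime == null) {
            throw new IllegalArgumentException("endtime requires starttime");
        }
        List<String> args = new ArrayList<String>();
        args.add(table);
        args.add(outputDir);
        if (versions != null) {
            args.add(String.valueOf(versions));
        }
        if (startTime != null) {
            args.add(String.valueOf(startTime));
        }
        if (endTime != null) {
            args.add(String.valueOf(endTime));
        }
        return args;
    }

    public static void main(String[] args) {
        // Made-up output directory; exports one version of every cell.
        System.out.println(build("dailystats", "/user/hdfs/dailystats_export", 1, null, null));
    }
}
```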

14.1.8. Import
Import is a utility that will load data that has been exported back into HBase. Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

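The data file location can be either a local directory or a path on the distributed file system (HDFS).

A local directory can be given with the file:/// prefix.

An HDFS location can be given as a full URI, for example hdfs://mymaster:9000/path. Note that a path without a scheme is resolved against Hadoop's configured default file system, which is why the bare /user/hdfs/... paths in the commands above refer to HDFS.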
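
A quick way to see which file system a path argument names explicitly is to look at its URI scheme; a path with no scheme carries no such hint, so the sketch reports it separately. This is only an illustration using java.net.URI: the PathKind helper is hypothetical, and the file:/// path in main is made up.

```java
import java.net.URI;

// Hypothetical helper: reports which file system a path argument names explicitly.
final class PathKind {
    static String describe(String path) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null) {
            return "no scheme";
        }
        if (scheme.equals("file")) {
            return "local";
        }
        if (scheme.equals("hdfs")) {
            return "hdfs";
        }
        return "other: " + scheme;
    }

    public static void main(String[] args) {
        System.out.println(describe("file:///tmp/export"));          // made-up local path
        System.out.println(describe("hdfs://mymaster:9000/path"));   // example URI from the text
        System.out.println(describe("/user/hdfs/day_uid_file_head")); // bare path, no scheme
    }
}
```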

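Problem: reformatting Hadoop and restarting without dropping the table first

Possibly because of a hardware or software problem, importing data into HBase was extremely slow. After I reformatted Hadoop and started HBase again, I ran into the following: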

ERROR: Table already exists: dailystats!

Here is some help for this command:
Create table; pass table name, a dictionary of specifications per
column family, and optionally a dictionary of table configuration.
Dictionaries are described below in the GENERAL NOTES section.
Examples:

  hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  hbase> # The above in shorthand would be the following:
  hbase> create 't1', 'f1', 'f2', 'f3'
  hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
  hbase> create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
  hbase> create 't1', 'f1', {SPLITS_FILE => 'splits.txt'}
  hbase> # Optionally pre-split the table into NUMREGIONS, using
  hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

  You can also keep around a reference to the created table:

  hbase> t1 = create 't1', 'f1'

  Which gives you a reference to the table named 't1', on which you can then
  call methods.


hbase(main):005:0> enable 'dailystats'

ERROR: Table dailystats does not exist.'

Here is some help for this command:
Start enable of named table: e.g. "hbase> enable 't1'"


hbase(main):006:0> disable 'dailystats'

ERROR: Table dailystats does not exist.'

Here is some help for this command:
Start disable of named table: e.g. "hbase> disable 't1'"

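The first time I hit this I wanted to scream: a textbook case of split personality. Does the table exist or not? The cause is that the table state kept in ZooKeeper is out of sync. ZooKeeper apparently only gets updated when you explicitly drop the table, so reformatting and restarting Hadoop directly, as I did, leaves stale state behind. Anyway, deleting ZooKeeper's data fixes it; with Cloudera, the ZooKeeper data lives in /var/lib/zookeeper/.

Delete it, start everything again, and you're done!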

HBase Java API

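First, a rant: neither the web nor the books give you a complete example that shows how to get this working. They jump straight to the core code, which is pretty irresponsible. Or else they walk you through a series of Eclipse operations, when production environments basically only have a command line, okay?

If you have got as far as setting up HBase, Java is certainly installed already. The Hadoop, Hive, ZooKeeper and HBase I use all come from Cloudera.

Here is a complete piece of test code (from 三劫散仙's blog on cnblogs):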

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * @author 三劫散仙
 * 
 * **/
public class Test {
    
    static Configuration conf=null;
    static{
        conf=HBaseConfiguration.create(); // HBase configuration
        conf.set("hbase.zookeeper.quorum", "127.0.0.1"); // ZooKeeper address; change to match your cluster
    }
    
    public static void main(String[] args)throws Exception {
        
        Test t=new Test();
        //t.createTable("temp", new String[]{"name","age"});
        //t.insertRow("temp", "2", "age", "myage", "100");
        //t.getOneDataByRowKey("temp", "2");
        t.showAll("temp"); // change "temp" to your own table name
    }
    
    /**
     * Creates a table with the given column families.
     */
    public void createTable(String tableName,String cols[])throws Exception{
        HBaseAdmin admin=new HBaseAdmin(conf); // client-side admin utility
        try{
            if(admin.tableExists(tableName)){
                System.out.println("Table already exists.");
            }else{
                HTableDescriptor table=new HTableDescriptor(tableName);
                for(String c:cols){
                    HColumnDescriptor col=new HColumnDescriptor(c); // column family name
                    table.addFamily(col); // add the family to the table
                }
                admin.createTable(table); // create the table
                System.out.println("Table created.");
            }
        }finally{
            admin.close(); // released on every path, including "already exists"
        }
    }
    
    /**
     * Inserts a single cell. For many rows, batched puts are recommended.
     * @param tableName table name
     * @param row row key
     * @param columnFamily column family
     * @param column column qualifier
     * @param value cell value
     */
    public void insertRow(String tableName, String row,
            String columnFamily, String column, String value) throws Exception {
        HTable table = new HTable(conf, tableName);
        try {
            Put put = new Put(Bytes.toBytes(row));
            // arguments: column family, column qualifier, value
            put.add(Bytes.toBytes(columnFamily), Bytes.toBytes(column),
                    Bytes.toBytes(value));
            table.put(put);
        } finally {
            table.close(); // always release the table
        }
        System.out.println("Inserted one row.");
    }
    
    /**
     * Deletes a single row.
     * @param tableName table name
     * @param rowkey row key
     */
    public void deleteByRow(String tableName,String rowkey)throws Exception{
        HTable h=new HTable(conf, tableName);
        try{
            Delete d=new Delete(Bytes.toBytes(rowkey));
            h.delete(d); // delete the row
        }finally{
            h.close(); // release the table
        }
    }
    
    /**
     * Deletes several rows in one call.
     * @param tableName table name
     * @param rowkey row keys to delete
     */
    public void deleteByRow(String tableName,String rowkey[])throws Exception{
        HTable h=new HTable(conf, tableName);
        try{
            List<Delete> list=new ArrayList<Delete>();
            for(String k:rowkey){
                Delete d=new Delete(Bytes.toBytes(k));
                list.add(d);
            }
            h.delete(list); // delete all listed rows
        }finally{
            h.close(); // release the table
        }
    }
    
    /**
     * Fetches one row and prints each of its cells.
     * @param tableName table name
     * @param rowkey row key
     */
    public void getOneDataByRowKey(String tableName,String rowkey)throws Exception{
        HTable h=new HTable(conf, tableName);
        try{
            Get g=new Get(Bytes.toBytes(rowkey));
            Result r=h.get(g);
            for(KeyValue k:r.raw()){
                // labels: 行号 = row key, 时间戳 = timestamp, 列簇 = column family, 列 = qualifier, 值 = value
                System.out.println("行号:  "+Bytes.toStringBinary(k.getRow()));
                System.out.println("时间戳:  "+k.getTimestamp());
                System.out.println("列簇:  "+Bytes.toStringBinary(k.getFamily()));
                System.out.println("列:  "+Bytes.toStringBinary(k.getQualifier()));
                //if(Bytes.toStringBinary(k.getQualifier()).equals("myage")){
                //    System.out.println("值:  "+Bytes.toInt(k.getValue()));
                //}else{
                String ss=Bytes.toString(k.getValue());
                System.out.println("值:  "+ss);
                //}
            }
        }finally{
            h.close(); // release the table
        }
    }
    
    /**
     * Scans and prints all rows, or a specific row range.
     * @param tableName table name
     */
    public void showAll(String tableName)throws Exception{
        HTable h=new HTable(conf, tableName);
        try{
            Scan scan=new Scan();
            // to scan a specific range instead:
            //Scan scan=new Scan(Bytes.toBytes("start row key"),Bytes.toBytes("stop row key"));
            ResultScanner scanner=h.getScanner(scan);
            try{
                for(Result r:scanner){
                    System.out.println("==================================");
                    for(KeyValue k:r.raw()){
                        // labels: 行号 = row key, 时间戳 = timestamp, 列簇 = column family, 列 = qualifier, 值 = value
                        System.out.println("行号:  "+Bytes.toStringBinary(k.getRow()));
                        System.out.println("时间戳:  "+k.getTimestamp());
                        System.out.println("列簇:  "+Bytes.toStringBinary(k.getFamily()));
                        System.out.println("列:  "+Bytes.toStringBinary(k.getQualifier()));
                        //if(Bytes.toStringBinary(k.getQualifier()).equals("myage")){
                        //    System.out.println("值:  "+Bytes.toInt(k.getValue()));
                        //}else{
                        String ss=Bytes.toString(k.getValue());
                        System.out.println("值:  "+ss);
                        //}
                    }
                }
            }finally{
                scanner.close(); // release the scanner
            }
        }finally{
            h.close(); // release the table
        }
    }

}

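The ZooKeeper address and the table name in the code above must be changed to match your own setup. Mine is pseudo-distributed, so the ZooKeeper address is simply 127.0.0.1; also change the table name in main to the table you want to use. I have tested showAll and it works. The other methods are untested so far.

I could not find a list of which jars need to be loaded. Hunting for them one by one in the hadoop and hbase directories based on the error messages (with jar -tf) was far too tedious, and it has been a long time since I compiled Java directly on the command line, so I did not remember how to put a whole directory of jars on the classpath. In the end I simply did this: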

javac -cp .:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hbase/* Test.java

This produces Test.class. Then run:

java -cp .:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hbase/* Test
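
The dir/* entries in the two commands above are the JVM's classpath wildcard, which picks up every .jar file directly inside that directory (subdirectories are not searched). If an explicit list is ever needed, for instance to see exactly which jars end up on the path, it could be built along these lines. This is a minimal sketch: ClasspathBuilder is a hypothetical name, and the directories are the ones from my commands, which may differ on other installations.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: expands directories into an explicit classpath string.
final class ClasspathBuilder {
    // Lists the .jar files directly inside each directory (no recursion), sorted per directory.
    static List<String> jarsIn(List<String> dirs) {
        List<String> jars = new ArrayList<String>();
        for (String dir : dirs) {
            File[] files = new File(dir).listFiles();
            if (files == null) {
                continue; // directory is missing or unreadable
            }
            Arrays.sort(files);
            for (File f : files) {
                if (f.isFile() && f.getName().toLowerCase().endsWith(".jar")) {
                    jars.add(f.getPath());
                }
            }
        }
        return jars;
    }

    // Starts with "." (current directory) and appends every jar found.
    static String classpath(List<String> dirs) {
        StringBuilder sb = new StringBuilder(".");
        for (String jar : jarsIn(dirs)) {
            sb.append(File.pathSeparatorChar).append(jar);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(classpath(Arrays.asList(
                "/usr/lib/hadoop", "/usr/lib/hadoop/lib", "/usr/lib/hbase")));
    }
}
```

The printed string could then be passed to javac or java with -cp in place of the wildcard entries.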

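The output for one row looks like this (labels: 行号 = row key, 时间戳 = timestamp, 列簇 = column family, 列 = column qualifier, 值 = value). The qualifier is empty everywhere because each field was imported as its own column family with no qualifier: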

==================================
行号:  100044
时间戳:  1395839067455
列簇:  activeValue
列:  
值:  75561
行号:  100044
时间戳:  1395839067455
列簇:  calories
列:  
值:  203.087463109
行号:  100044
时间戳:  1395839067455
列簇:  cycCal
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  cycDist
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  cycDura
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  day
列:  
值:  20140309
行号:  100044
时间戳:  1395839067455
列簇:  eTime
列:  
值:  1395072000.0
行号:  100044
时间戳:  1395839067455
列簇:  goadCal
列:  
值:  200
行号:  100044
时间戳:  1395839067455
列簇:  goalActiveVal
列:  
值:  0
行号:  100044
时间戳:  1395839067455
列簇:  goalSteps
列:  
值:  7000
行号:  100044
时间戳:  1395839067455
列簇:  locations
列:  
值:  116.352251956_39.9705142606|116.352283224_39.9705169167|116.352249753_39.9705249571|116.352235717_39.9704804303|116.352235717_39.9704804303|116.352255264_39.9705145666|116.352211069_39.9705208877|116.352298561_39.9705498454|116.352174589_39.9705077219|116.352163865_39.9704987024|116.352202454_39.9705452733|116.352271801_39.9705272967|116.352240134_39.9705372404|116.35215525_39.9705303882|116.352256465_39.9705083685|116.352227205_39.9705281169|116.35221338_39.9705648936|116.352255973_39.9705810705|116.352225497_39.9704947125|116.35230949_39.9705885665
行号:  100044
时间戳:  1395839067455
列簇:  pm25suck
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runCal
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runDist
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runDura
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  sTime
列:  
值:  1394985600.0
行号:  100044
时间戳:  1395839067455
列簇:  steps
列:  
值:  3744
行号:  100044
时间戳:  1395839067455
列簇:  walkCal
列:  
值:  187.210574021
行号:  100044
时间戳:  1395839067455
列簇:  walkDist
列:  
值:  3924.92937003
行号:  100044
时间戳:  1395839067455
列簇:  walkDura
列:  
值:  3122.49075413
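
The locations value in the output above packs a list of points into a single string: points are separated by |, and each point is two numbers joined by _. The sketch below splits such a value in plain Java. LocationParser is a hypothetical name, and reading the first number as longitude and the second as latitude is an assumption based on the magnitudes (116.35, 39.97); the data itself does not say.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a "a_b|a_b|..." locations value into numeric pairs.
final class LocationParser {
    // Returns one {first, second} pair per point; assumed to be {longitude, latitude}.
    static List<double[]> parse(String value) {
        List<double[]> points = new ArrayList<double[]>();
        if (value.isEmpty()) {
            return points;
        }
        for (String point : value.split("\\|")) {
            String[] parts = point.split("_");
            if (parts.length != 2) {
                throw new IllegalArgumentException("malformed point: " + point);
            }
            points.add(new double[] {
                Double.parseDouble(parts[0]), Double.parseDouble(parts[1])
            });
        }
        return points;
    }

    public static void main(String[] args) {
        // The first two points of the locations value shown above.
        List<double[]> pts = parse("116.352251956_39.9705142606|116.352283224_39.9705169167");
        for (double[] p : pts) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```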

posted @ 2014-03-25 20:09 aquastar
