parquet-tools使用

使用parquet-tools的方法有2种

1.在安装了CDH的机器上,会自动有parquet-tools命令

lintong@master:/opt/cloudera/parcels/CDH/bin$ ls| grep parquet-tools
parquet-tools
lintong@master:/opt/cloudera/parcels/CDH/bin$ parquet-tools

 

2.自行编辑jar

git clone并指定分支,master分支已经删除了parquet-tools

git clone git@github.com:apache/parquet-mr.git -b apache-parquet-1.10.1

编译

cd parquet-tools && mvn clean package -Plocal

 

parquet-tools可以使用的命令,参考:How to build and use parquet-tools to read parquet files

1.查看parquet文件的schema

由AvroParquet写的parquet文件的schema

lintong@lintongdeMacBook-Pro ~/coding/java/parquet-mr/parquet-tools/target $ java -jar parquet-tools-1.10.1.jar schema /xxx/avro_parquet/part-r-00000.snappy.parquet

message com.linkedin.haivvreo.test_serializer {
  required binary string1 (UTF8);
  required int32 int1;
  required int32 tinyint1;
  required int32 smallint1;
  required int64 bigint1;
  required boolean boolean1;
  required float float1;
  required double double1;
  required group list1 (LIST) {
    repeated binary array (UTF8);
  }
  required group map1 (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (UTF8);
      required int32 value;
    }
  }
  required group struct1 {
    required int32 sInt;
    required boolean sBoolean;
    required binary sString (UTF8);
  }
  required binary enum1 (ENUM);
  optional int32 nullableint;
}

由ThriftParquet写的parquet文件的schema

lintong@lintongdeMacBook-Pro ~/coding/java/parquet-mr/parquet-tools/target $ java -jar parquet-tools-1.10.1.jar schema /xxx/thrift_parquet/part-r-00000.snappy.parquet

message ParquetSchema {
  required binary string1 (UTF8) = 1;
  required int32 int1 = 2;
  required int32 tinyint1 = 3;
  required int32 smallint1 = 4;
  required int64 bigint1 = 5;
  required boolean boolean1 = 6;
  required double float1 = 7;
  required double double1 = 8;
  required group list1 (LIST) = 9 {
    repeated binary list1_tuple (UTF8);
  }
  required group map1 (MAP) = 10 {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (UTF8);
      optional int32 value;
    }
  }
  required group struct1 = 11 {
    required int32 sInt = 1;
    required boolean sBoolean = 2;
    required binary sString (UTF8) = 3;
  }
  required binary enum1 (UTF8) = 12;
  optional int32 nullableint = 13;
}

由hive job写的parquet文件的schema

message hive_schema {
  optional binary appid (UTF8);
  optional int64 ts;
  optional int32 userid;
  optional binary countries (UTF8);
}

  

2.查看parquet文件的head

java -jar parquet-tools-1.10.1.jar head -n 1 /xxx/thrift_parquet/part-r-00000.snappy.parquet
string1 = ecAsz6ca7E
int1 = 64676
tinyint1 = 8
smallint1 = 0
bigint1 = -9081354042296389692
boolean1 = true
float1 = 0.1271180510520935
double1 = 0.011293589263621895
list1:
.list1_tuple = v8gCJFRBIb
.list1_tuple = nfvrI1Rltp
map1:
.map:
..key = v8gCJFRBIb
..value = 428
.map:
..key = nfvrI1Rltp
..value = 1257
struct1:
.sInt = 740564
.sBoolean = true
.sString = RuiVISF2BI
enum1 = BLUE
nullableint = 5559

 

posted @ 2015-08-09 16:29  tonglin0325  阅读(1888)  评论(0编辑  收藏  举报