Exploring How to Generate Parquet Files (Queryable from Both Impala and Hive), Part 1
https://my.oschina.net/skyim/blog/479159
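1. I won't go over the advantages of Parquet here (columnar storage and good compression); for background on columnar storage, see the link below
2. This post is mainly about the storage format we use in our project
3. As a first step, create a table in Hive with the following statement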
create external table parquet_example ( basketid bigint, productid bigint, quantity int, price float, totalbasketvalue float ) stored as parquet location '/user/hive/warehouse/parquet_example';
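The Parquet files placed in this table's location need a schema whose column names, order, and types line up with the DDL above. The following is a minimal sketch of that correspondence for the column types used here. `HiveToParquetSchema`, `parquetType`, and `toMessage` are hypothetical names made up for illustration; they are not part of Hive or Parquet, and the type table covers only a few simple primitive types.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not a Hive or Parquet API.
final class HiveToParquetSchema {

    // Maps a few simple Hive column types to the matching Parquet primitive type.
    static String parquetType(String hiveType) {
        switch (hiveType.toLowerCase(Locale.ROOT)) {
            case "bigint":  return "int64";
            case "int":     return "int32";
            case "float":   return "float";
            case "double":  return "double";
            case "boolean": return "boolean";
            default:
                throw new IllegalArgumentException("unmapped Hive type: " + hiveType);
        }
    }

    // Builds a Parquet message definition with one required field per column,
    // keeping the column order of the map.
    static String toMessage(String messageName, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("message " + messageName + " { ");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            sb.append("required ")
              .append(parquetType(column.getValue()))
              .append(' ')
              .append(column.getKey())
              .append("; ");
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Columns of parquet_example, in DDL order.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("basketid", "bigint");
        columns.put("productid", "bigint");
        columns.put("quantity", "int");
        columns.put("price", "float");
        columns.put("totalbasketvalue", "float");
        System.out.println(toMessage("basket", columns));
    }
}
```

For the five columns of parquet_example this prints the same message definition that the writer code later in this post passes to `MessageTypeParser.parseMessageType`.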
The statement as executed in Hive:
4. Let's check in the web UI: parquet_example now appears at the lower left of the Hive page
5. To see the table in Impala as well, run the following statement in Impala: INVALIDATE METADATA
6. The main task now is to write Parquet files into the table's directory
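import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;

// Assumption: Parquet releases before 1.7.0 use the parquet.* packages;
// from 1.7.0 on, use org.apache.parquet.* instead.
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class BasketWriter {

  public static void main(String[] args) throws IOException {
    // yyyy is the calendar year; the original YYYY is the week year,
    // which can be off by one around New Year.
    DateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmmss");
    new BasketWriter().generateBasketData("part_" + dateFormat.format(new Date()));
  }

  private void generateBasketData(String outFilePath) throws IOException {
    // Parquet schema; it must match the columns of the Hive table.
    final MessageType schema = MessageTypeParser.parseMessageType(
        "message basket { required int64 basketid; required int64 productid; "
        + "required int32 quantity; required float price; required float totalbasketvalue; }");
    Configuration config = new Configuration();
    DataWritableWriteSupport.setSchema(schema, config);

    // HDFS directory of the table, plus the name of the new file
    Path outDirPath = new Path(
        "hdfs://192.168.0.80/user/hive/warehouse/parquet_example/" + outFilePath);

    ParquetWriter writer = new ParquetWriter(outDirPath, new DataWritableWriteSupport() {
      @Override
      public WriteContext init(Configuration configuration) {
        // The writer passes in its own Configuration, so set the schema here if it is missing.
        if (configuration.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null) {
          configuration.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
        }
        return super.init(configuration);
      }
    }, CompressionCodecName.SNAPPY, 256 * 1024 * 1024, 100 * 1024); // codec, block size, page size

    int numBaskets = 1000000;
    Random numProdsRandom = new Random();
    Random quantityRandom = new Random();
    Random priceRandom = new Random();
    Random prodRandom = new Random();

    for (int i = 0; i < numBaskets; i++) {
      // Each basket holds between 7 and 29 products.
      int numProdsInBasket = numProdsRandom.nextInt(30);
      numProdsInBasket = Math.max(7, numProdsInBasket);
      float totalPrice = priceRandom.nextFloat();
      totalPrice = (float) Math.max(0.1, totalPrice) * 100;

      // One row per product in the basket, in the column order of the schema.
      for (int j = 0; j < numProdsInBasket; j++) {
        Writable[] values = new Writable[5];
        values[0] = new LongWritable(i);
        values[1] = new LongWritable(prodRandom.nextInt(200000));
        values[2] = new IntWritable(quantityRandom.nextInt(10));
        values[3] = new FloatWritable(priceRandom.nextFloat());
        values[4] = new FloatWritable(totalPrice);
        ArrayWritable value = new ArrayWritable(Writable.class, values);
        writer.write(value);
      }
    }
    writer.close();
  }
}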
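A note on the timestamp used for the output file name: in `SimpleDateFormat`, lowercase `yyyy` is the calendar year, while uppercase `YYYY` is the week year, and the two differ for some dates around New Year. The sketch below shows the difference for one fixed date. `TimestampPatternDemo` is a made-up class name, and `Locale.US` is pinned only so that the week rules are the same on every machine.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Illustration only: compares calendar-year and week-year patterns.
final class TimestampPatternDemo {

    // Formats a date with the given pattern, using US week rules.
    static String format(String pattern, Date date) {
        return new SimpleDateFormat(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        // 29 December 2014, 12:00:00 in the default time zone.
        Date date = new GregorianCalendar(2014, Calendar.DECEMBER, 29, 12, 0, 0).getTime();

        // Calendar year: prints part_20141229120000
        System.out.println("part_" + format("yyyyMMddHHmmss", date));

        // Week year: this date already falls in the first week of 2015,
        // so this prints part_20151229120000
        System.out.println("part_" + format("YYYYMMddHHmmss", date));
    }
}
```

With the week-year pattern, a file written in the last days of December can end up with next year's timestamp in its name, so `yyyy` is the safer choice for file names.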
7. Below we can see the data that was written
8. The written data can now be queried from either Hive or Impala
The source code is available at:
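One way to sanity-check a query result is to compare a row count (for example from `select count(*) from parquet_example`) with what the generator should produce. The writer draws `nextInt(30)` products per basket and raises anything below 7 to 7, so the expected number of rows per basket can be computed directly. This is a rough sketch under the assumption that the writer ran exactly once; `ExpectedRowCount` and `expectedBasketSize` are hypothetical names.

```java
import java.util.Locale;

// Illustration only: expected output size of one run of the generator.
final class ExpectedRowCount {

    // Mean of max(floor, k) for k drawn uniformly from 0 .. bound-1.
    static double expectedBasketSize(int bound, int floor) {
        long sum = 0;
        for (int k = 0; k < bound; k++) {
            sum += Math.max(floor, k);
        }
        return (double) sum / bound;
    }

    public static void main(String[] args) {
        double rowsPerBasket = expectedBasketSize(30, 7); // 463 / 30
        long expectedRows = Math.round(rowsPerBasket * 1_000_000);
        // Prints: 15.4333 rows per basket, about 15433333 rows in total
        System.out.printf(Locale.ROOT, "%.4f rows per basket, about %d rows in total%n",
                rowsPerBasket, expectedRows);
    }
}
```

The actual count varies from run to run because the basket sizes are random, and every additional run of the writer adds another file with roughly the same number of rows.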
https://github.com/wangxuehui/writeparquet/