Exploring How to Generate Parquet Files (Queryable from Both Impala and Hive), Part 1

https://my.oschina.net/skyim/blog/479159

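1. I won't go over Parquet's advantages here (columnar storage and good compression); for background on columnar storage, see the following link.

2. The focus here is the storage format as used in our project.

3. Step one: create a table in Hive with the following statement: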

create external table parquet_example (
    basketid bigint,
    productid bigint,
    quantity int,
    price float,
    totalbasketvalue float
    ) stored as parquet location '/user/hive/warehouse/parquet_example';

  The statement above is what was run in Hive.

4. Looking at the tables in the web UI, parquet_example already appears at the bottom left of the Hive view.
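Files placed in this table's location need a Parquet schema whose fields line up with the columns above. The sketch below illustrates that correspondence using only the JDK. The `HiveToParquetSchema` class and its type mapping are hypothetical helpers, not part of Hive or Parquet, and they cover only the three column types used here. Marking every field `required` is an assumption carried over from the writer shown later in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HiveToParquetSchema {
    // Hypothetical mapping, limited to the column types of parquet_example.
    static String parquetType(String hiveType) {
        switch (hiveType) {
            case "bigint": return "int64";
            case "int":    return "int32";
            case "float":  return "float";
            default:
                throw new IllegalArgumentException("unmapped Hive type: " + hiveType);
        }
    }

    // Builds a Parquet message-type string with one required field per column,
    // preserving the column order of the map.
    static String toMessageType(String name, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("message " + name + " { ");
        for (Map.Entry<String, String> col : columns.entrySet()) {
            sb.append("required ").append(parquetType(col.getValue()))
              .append(' ').append(col.getKey()).append("; ");
        }
        return sb.append('}').toString();
    }

    // Column name -> Hive type, in the order of the CREATE TABLE statement.
    static Map<String, String> exampleColumns() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("basketid", "bigint");
        columns.put("productid", "bigint");
        columns.put("quantity", "int");
        columns.put("price", "float");
        columns.put("totalbasketvalue", "float");
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(toMessageType("basket", exampleColumns()));
    }
}
```

The printed string has the same shape as the schema passed to `MessageTypeParser.parseMessageType` in the writer, which makes it easy to spot a column whose type or order has drifted from the table definition.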

 

 

5. To see the table in Impala, first run the following statement in Impala:

INVALIDATE METADATA;

6. The main task now is to write Parquet files into the table's location.

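// The Parquet imports assume an older Hive release whose Parquet classes still
// live in the pre-Apache parquet.* packages (and whose DataWritableWriteSupport
// accepts ArrayWritable); newer builds use org.apache.parquet.* instead.
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;

import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class BasketWriter {
	public static void main(String[] args) throws IOException {
		// Lowercase yyyy is the calendar year; uppercase YYYY would be the week-based year.
		DateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmmss");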
		new BasketWriter().generateBasketData("part_"+dateFormat.format(new Date()));
	}

	private void generateBasketData(String outFilePath) throws IOException {
		final MessageType schema = MessageTypeParser.parseMessageType("message basket { required int64 basketid; required int64 productid; required int32 quantity; required float price; required float totalbasketvalue; }");
		Configuration config = new Configuration();
		DataWritableWriteSupport.setSchema(schema, config);
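		// HDFS path of the output file, inside the table's location
		Path outDirPath = new Path("hdfs://192.168.0.80/user/hive/warehouse/parquet_example/"+outFilePath);

		// Arguments: Snappy compression, 256 MB block (row group) size, 100 KB page size
		ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(outDirPath, new DataWritableWriteSupport () {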
			@Override
			public WriteContext init(Configuration configuration) {
				if (configuration.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null) {
					configuration.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
				}
				return super.init(configuration);
			}
		}, CompressionCodecName.SNAPPY, 256*1024*1024, 100*1024);
		int numBaskets = 1000000;
		Random numProdsRandom = new Random();
		Random quantityRandom = new Random();
		Random priceRandom = new Random();
		Random prodRandom = new Random();
		for (int i = 0; i < numBaskets; i++) {
			int numProdsInBasket = numProdsRandom.nextInt(30);
			numProdsInBasket = Math.max(7, numProdsInBasket);
			float totalPrice = priceRandom.nextFloat();
			totalPrice = (float)Math.max(0.1, totalPrice) * 100;
			for (int j = 0; j < numProdsInBasket; j++) {
				Writable[] values = new Writable[5];
				values[0] = new LongWritable(i);
				values[1] = new LongWritable(prodRandom.nextInt(200000));
				values[2] = new IntWritable(quantityRandom.nextInt(10));
				values[3] = new FloatWritable(priceRandom.nextFloat());
				values[4] = new FloatWritable(totalPrice);
				ArrayWritable value = new ArrayWritable(Writable.class, values);
				writer.write(value);
			}
		}
		writer.close();
	}
}
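The output file name is built from a timestamp pattern, and the case of the year letters matters: in `SimpleDateFormat`, `yyyy` is the calendar year while `YYYY` is the week-based year, and the two can differ around New Year. The following is a minimal JDK-only sketch of the difference. The `YearPatternDemo` class name and the sample date are chosen purely for illustration, and the locale and time zone are pinned to `Locale.US` and UTC so that the week rules are fixed.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class YearPatternDemo {
    // Formats a date with the given pattern under fixed locale and time zone.
    static String format(String pattern, Date date) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(date);
    }

    // Monday 30 December 2019, 12:00:00 UTC: a date in the last days of the year.
    static Date sampleDate() {
        Calendar c = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.US);
        c.clear();
        c.set(2019, Calendar.DECEMBER, 30, 12, 0, 0);
        return c.getTime();
    }

    public static void main(String[] args) {
        Date d = sampleDate();
        // Calendar year: part_20191230120000
        System.out.println("part_" + format("yyyyMMddHHmmss", d));
        // Week-based year: part_20201230120000 (the week already belongs to 2020)
        System.out.println("part_" + format("YYYYMMddHHmmss", d));
    }
}
```

For file names that should sort chronologically, the calendar-year pattern is the safer choice.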

  7. The data we wrote can now be viewed.

8. The written data can now be queried from Hive or Impala.
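To get a feel for what those rows contain without a cluster, the generation loop can be replayed with plain JDK classes. This is a sketch only: the `BasketPreview` class, the `Row` holder, and the fixed seed are assumptions made for illustration. It uses a single seeded `Random` instead of the four in the writer, and it keeps the values in memory rather than writing Parquet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BasketPreview {
    // Hypothetical holder mirroring the five table columns.
    static final class Row {
        final long basketId;
        final long productId;
        final int quantity;
        final float price;
        final float totalBasketValue;

        Row(long basketId, long productId, int quantity, float price, float totalBasketValue) {
            this.basketId = basketId;
            this.productId = productId;
            this.quantity = quantity;
            this.price = price;
            this.totalBasketValue = totalBasketValue;
        }

        @Override
        public String toString() {
            return basketId + "\t" + productId + "\t" + quantity + "\t" + price + "\t" + totalBasketValue;
        }
    }

    // Replays the writer's nested loop for the given number of baskets.
    static List<Row> generate(int numBaskets, long seed) {
        Random random = new Random(seed);
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < numBaskets; i++) {
            // Between 7 and 29 line items per basket, as in the writer.
            int numProdsInBasket = Math.max(7, random.nextInt(30));
            // Basket total between 10 and 100, repeated on every row of the basket.
            float totalPrice = (float) Math.max(0.1, random.nextFloat()) * 100;
            for (int j = 0; j < numProdsInBasket; j++) {
                rows.add(new Row(i, random.nextInt(200000), random.nextInt(10),
                        random.nextFloat(), totalPrice));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Print the rows of three baskets, one tab-separated row per line.
        for (Row row : generate(3, 42L)) {
            System.out.println(row);
        }
    }
}
```

Each basket contributes one row per product, so a query such as a count grouped by `basketid` should return between 7 and 29 rows per basket.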

 

 

 

The source code is available at:
https://github.com/wangxuehui/writeparquet/

 

posted @ 2020-12-16 17:11  一叶知秋。