CarbonData编译与安装
原文连接 http://xiguada.org/carbondata_compile/
CarbonData是啥?
CarbonData is a fully indexed columnar and Hadoop native data-store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).
编译安装
本想迅速试用一下,不过官网居然没有现成编译好的工程,没办法,只能自己编译一个。
安装需要三步(当然还需要jdk7或jdk8,,maven 3.3以上)
- 下载 Spark 1.5.0 或更新的版本。
- 下载并安装 Apache Thrift 0.9.3,并确认加到系统路径。
- 下载 Apache CarbonData code 并编译。
1 Spark可以直接下载,解压后设置PATH可执行spark-submit。
2 安装thrift前需要安装依赖,我的虚拟机啊ubuntu下安装依赖的命令如下。
sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
然后到thrift下编译安装
./configure
sudo make
sudo make install
3 编译CarbonData
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
4 进入bin目录,修改carbon-spark-sql 文件中的 /bin/spark-submit,改为spark-submit
5 生成sample.csv文件
cd carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
6 执行
./carbon-spark-sql
spark-sql> create table if not exists test_table (id string, name string, city string, age Int) STORED BY 'carbondata'
spark-sql> load data inpath '../sample.csv' into table test_table
spark-sql> select city, avg(age), sum(age) from test_table group by city
执行结果
shenzhen 29.0 58
wuhan 35.0 35
看起来和执行SparkSQL一样,CarbonData这中间做了啥,有啥效果呢?后面继续分析。