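Building a Modern Cloud Data Warehouse with Databend and Tencent Cloud COS
This article demonstrates how to use Databend on top of Tencent Cloud COS to build a modern data warehouse and its compute layer. If you are looking for a low-cost, high-performance, elastic data warehouse, Databend offers a cloud-native solution built on object storage. Databend currently supports writing data via stream load, copy into from stage, and insert, and it can be deployed in standalone or cluster mode. For further support, add us on WeChat: 82565387. This article is long; we suggest bookmarking it and reading it on a PC.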
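About Databend
Databend is an open-source modern data warehouse written in Rust and designed entirely around object storage. It offers fast elastic scaling and aims to deliver an on-demand, pay-as-you-go Data Cloud experience. Its main features are:
•Vectorized Execution and a Pull&Push-Based Processor Model
•True separation of storage and compute: high performance, low cost, on-demand and pay-as-you-go usage
•Full database support, compatible with the MySQL and ClickHouse protocols and SQL over HTTP
•Robust transaction support, with features such as Data Time Travel and Database Zero Clone
•Multi-tenant reads, writes, and sharing on a single copy of the data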
github repo: https://github.com/datafuselabs/databend
Docs: https://databend.rs
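For the Databend architecture diagram, see: https://databend.rs/doc/
Tencent Cloud COS
Cloud Object Storage (COS) is a distributed storage service from Tencent Cloud with no directory hierarchy and no data format restrictions. It can hold massive amounts of data and is accessed over HTTP/HTTPS. COS buckets have no capacity limit and need no partition management, which suits scenarios such as CDN data distribution, Cloud Infinite (数据万象) data processing, and data lakes for big data computing and analytics. Website: https://cloud.tencent.com/product/cos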
Test environment
Beijing region: CVM SA2.8XLARGE64 & COS (ap-beijing)
Operating system: ubuntu-20
Databend: binary release v0.6.99-nightly
Installation and deployment for this test follow: https://databend.rs/doc/deploy/cos
For cluster deployment, see: https://databend.rs/doc/deploy/cluster_minio
Test data
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
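The wget command relies on shell brace expansion to produce one URL per year and month. As a rough illustration of what that expansion generates, here is a minimal Python sketch. The function name ontime_urls and its parameters are hypothetical; the base URL and file-name pattern are taken from the command above.

```python
# Base URL and file-name pattern copied from the wget command.
BASE_URL = "https://transtats.bts.gov/PREZIP/"
FILE_PATTERN = (
    "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{year}_{month}.zip"
)


def ontime_urls(first_year=1987, last_year=2021):
    """Return one URL per (year, month), years in the outer loop like the shell expansion."""
    return [
        BASE_URL + FILE_PATTERN.format(year=year, month=month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


if __name__ == "__main__":
    urls = ontime_urls()
    print(len(urls))   # number of archives the command asks for
    print(urls[0])     # first URL in expansion order
```

The list only mirrors what the shell would request; whether every archive actually exists on the server is not something this sketch checks.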
Table schema: cat create_ontime.sql
CREATE TABLE ontime
(
Year UInt16 NOT NULL,
Quarter UInt8 NOT NULL,
Month UInt8 NOT NULL,
DayofMonth UInt8 NOT NULL,
DayOfWeek UInt8 NOT NULL,
FlightDate Date NOT NULL,
Reporting_Airline String NOT NULL,
DOT_ID_Reporting_Airline Int32 NOT NULL,
IATA_CODE_Reporting_Airline String NOT NULL,
Tail_Number String NOT NULL,
Flight_Number_Reporting_Airline String NOT NULL,
OriginAirportID Int32 NOT NULL,
OriginAirportSeqID Int32 NOT NULL,
OriginCityMarketID Int32 NOT NULL,
Origin String NOT NULL,
OriginCityName String NOT NULL,
OriginState String NOT NULL,
OriginStateFips String NOT NULL,
OriginStateName String NOT NULL,
OriginWac Int32 NOT NULL,
DestAirportID Int32 NOT NULL,
DestAirportSeqID Int32 NOT NULL,
DestCityMarketID Int32 NOT NULL,
Dest String NOT NULL,
DestCityName String NOT NULL,
DestState String NOT NULL,
DestStateFips String NOT NULL,
DestStateName String NOT NULL,
DestWac Int32 NOT NULL,
CRSDepTime Int32 NOT NULL,
DepTime Int32 NOT NULL,
DepDelay Int32 NOT NULL,
DepDelayMinutes Int32 NOT NULL,
DepDel15 Int32 NOT NULL,
DepartureDelayGroups String NOT NULL,
DepTimeBlk String NOT NULL,
TaxiOut Int32 NOT NULL,
WheelsOff Int32 NOT NULL,
WheelsOn Int32 NOT NULL,
TaxiIn Int32 NOT NULL,
CRSArrTime Int32 NOT NULL,
ArrTime Int32 NOT NULL,
ArrDelay Int32 NOT NULL,
ArrDelayMinutes Int32 NOT NULL,
ArrDel15 Int32 NOT NULL,
ArrivalDelayGroups Int32 NOT NULL,
ArrTimeBlk String NOT NULL,
Cancelled UInt8 NOT NULL,
CancellationCode String NOT NULL,
Diverted UInt8 NOT NULL,
CRSElapsedTime Int32 NOT NULL,
ActualElapsedTime Int32 NOT NULL,
AirTime Int32 NOT NULL,
Flights Int32 NOT NULL,
Distance Int32 NOT NULL,
DistanceGroup UInt8 NOT NULL,
CarrierDelay Int32 NOT NULL,
WeatherDelay Int32 NOT NULL,
NASDelay Int32 NOT NULL,
SecurityDelay Int32 NOT NULL,
LateAircraftDelay Int32 NOT NULL,
FirstDepTime String NOT NULL,
TotalAddGTime String NOT NULL,
LongestAddGTime String NOT NULL,
DivAirportLandings String NOT NULL,
DivReachedDest String NOT NULL,
DivActualElapsedTime String NOT NULL,
DivArrDelay String NOT NULL,
DivDistance String NOT NULL,
Div1Airport String NOT NULL,
Div1AirportID Int32 NOT NULL,
Div1AirportSeqID Int32 NOT NULL,
Div1WheelsOn String NOT NULL,
Div1TotalGTime String NOT NULL,
Div1LongestGTime String NOT NULL,
Div1WheelsOff String NOT NULL,
Div1TailNum String NOT NULL,
Div2Airport String NOT NULL,
Div2AirportID Int32 NOT NULL,
Div2AirportSeqID Int32 NOT NULL,
Div2WheelsOn String NOT NULL,
Div2TotalGTime String NOT NULL,
Div2LongestGTime String NOT NULL,
Div2WheelsOff String NOT NULL,
Div2TailNum String NOT NULL,
Div3Airport String NOT NULL,
Div3AirportID Int32 NOT NULL,
Div3AirportSeqID Int32 NOT NULL,
Div3WheelsOn String NOT NULL,
Div3TotalGTime String NOT NULL,
Div3LongestGTime String NOT NULL,
Div3WheelsOff String NOT NULL,
Div3TailNum String NOT NULL,
Div4Airport String NOT NULL,
Div4AirportID Int32 NOT NULL,
Div4AirportSeqID Int32 NOT NULL,
Div4WheelsOn String NOT NULL,
Div4TotalGTime String NOT NULL,
Div4LongestGTime String NOT NULL,
Div4WheelsOff String NOT NULL,
Div4TailNum String NOT NULL,
Div5Airport String NOT NULL,
Div5AirportID Int32 NOT NULL,
Div5AirportSeqID Int32 NOT NULL,
Div5WheelsOn String NOT NULL,
Div5TotalGTime String NOT NULL,
Div5LongestGTime String NOT NULL,
Div5WheelsOff String NOT NULL,
Div5TailNum String NOT NULL
);
Create the table:
cat create_ontime.sql | mysql -h127.0.0.1 -P3307 -uroot
Loading the data
cat load_ontime.sh
#!/bin/bash
# Usage: ./load_ontime.sh zip_dir
if [ -z "$1" ]; then
echo "usage: ./load_ontime.sh zip_dir"
exit 1
fi
echo "unzipping ontime archives from $1"
ls "$1"/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset"
if [ $? -eq 0 ];
then
echo "unzip success"
else
echo "unzip was wrong!!!"
exit 1
fi
cat create_ontime.sql |mysql -h127.0.0.1 -P3307 -uroot
if [ $? -eq 0 ];
then
echo "Ontime table create success"
else
echo "Ontime table create was wrong!!!"
exit 1
fi
time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_load
Usage
./load_ontime.sh <directory containing the ZIP files>
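The last step of the script sends each CSV file to the streaming load endpoint with curl, eight uploads at a time. The following Python sketch shows the same idea under stated assumptions: build_curl_args and load_all are hypothetical helper names, while the headers, form field, and endpoint are copied from the curl command in the script. It shells out to curl rather than calling any Databend client library.

```python
import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Endpoint copied from the curl command in load_ontime.sh.
STREAMING_LOAD_URL = "http://localhost:8081/v1/streaming_load"


def build_curl_args(csv_path, url=STREAMING_LOAD_URL):
    """Build the curl argument list used by the script for one CSV file."""
    return [
        "curl",
        "-H", "insert_sql:insert into ontime format CSV",
        "-H", "skip_header:1",
        "-F", f"upload=@{csv_path}",
        "-XPUT", url,
    ]


def load_all(dataset_dir="./dataset", workers=8):
    """Run one curl upload per CSV file, at most `workers` at a time.

    Returns a dict mapping each file path to curl's exit code.
    """
    csv_files = sorted(glob.glob(os.path.join(dataset_dir, "*.csv")))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(
            lambda path: subprocess.run(build_curl_args(path)).returncode,
            csv_files,
        )
        return dict(zip(csv_files, codes))
```

Calling load_all("./dataset") would mirror the xargs -P 8 line; collecting the exit codes makes it easier to see which files need to be retried.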
Ontime test queries and results
Q1: Total number of flights per day of the week, 2000 through 2008
(0.494 sec., 143.75 million rows/sec., 431.25 MB/sec)
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-----------+---------+\
| DayOfWeek | c |\
+-----------+---------+\
| 5 | 8732422 |\
| 1 | 8730614 |\
| 4 | 8710843 |\
| 3 | 8685626 |\
| 2 | 8639632 |\
| 7 | 8274367 |\
| 6 | 7514194 |\
+-----------+---------+\
7 rows in set (0.50 sec)\
Read 71000000 rows, 213 MB in 0.494 sec., 143.75 million rows/sec., 431.25 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8], statistics: [read_rows: 71000000, read_bytes: 213000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)
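The Q1 statistics can be cross-checked with a little arithmetic. The sketch below uses only figures printed in the Q1 result and its EXPLAIN output; the constant and function names are hypothetical. It compares the rows that matched the filter with the rows read, the partitions scanned with the partitions in the table, and recomputes the throughput from the elapsed time.

```python
# Figures copied from the Q1 result and its EXPLAIN output.
Q1_COUNTS = [8732422, 8730614, 8710843, 8685626, 8639632, 8274367, 7514194]
READ_ROWS = 71_000_000
READ_BYTES = 213_000_000
ELAPSED_SEC = 0.494
PARTITIONS_SCANNED = 71
PARTITIONS_TOTAL = 207


def q1_summary():
    """Derive row, partition, and throughput ratios from the printed figures."""
    matched = sum(Q1_COUNTS)
    return {
        "matched_rows": matched,
        "matched_share_of_read": matched / READ_ROWS,
        "partition_share": PARTITIONS_SCANNED / PARTITIONS_TOTAL,
        "million_rows_per_sec": READ_ROWS / ELAPSED_SEC / 1e6,
        "mb_per_sec": READ_BYTES / ELAPSED_SEC / 1e6,
    }


if __name__ == "__main__":
    for name, value in q1_summary().items():
        print(name, value)
```

The recomputed throughput lands slightly below the reported 143.75 million rows/sec, presumably because the printed elapsed time is rounded. The partition ratio shows that roughly a third of the table's partitions were read for this year range.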
Q2: Number of flights delayed by more than 10 minutes per day of the week, 2000 through 2008
(0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.)
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-----------+---------+\
| DayOfWeek | c |\
+-----------+---------+\
| 5 | 2175733 |\
| 4 | 2012848 |\
| 1 | 1898879 |\
| 7 | 1880896 |\
| 3 | 1757508 |\
| 2 | 1665303 |\
| 6 | 1510894 |\
+-----------+---------+\
7 rows in set (0.54 sec)\
Read 71000000 rows, 497 MB in 0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 497000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
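The projections list in push_downs appears to hold zero-based column positions from the CREATE TABLE statement: [0, 4, 31] lines up with Year, DayOfWeek, and DepDelay in the scan schema. That reading is inferred from the output in this article, not from Databend documentation. A small sketch to check it, with the first 34 column names copied from create_ontime.sql and a hypothetical helper name:

```python
# First 34 columns of the ontime table, in CREATE TABLE order.
COLUMNS = [
    "Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate",
    "Reporting_Airline", "DOT_ID_Reporting_Airline",
    "IATA_CODE_Reporting_Airline", "Tail_Number",
    "Flight_Number_Reporting_Airline", "OriginAirportID",
    "OriginAirportSeqID", "OriginCityMarketID", "Origin", "OriginCityName",
    "OriginState", "OriginStateFips", "OriginStateName", "OriginWac",
    "DestAirportID", "DestAirportSeqID", "DestCityMarketID", "Dest",
    "DestCityName", "DestState", "DestStateFips", "DestStateName", "DestWac",
    "CRSDepTime", "DepTime", "DepDelay", "DepDelayMinutes", "DepDel15",
]


def projected_columns(indices):
    """Map zero-based projection indices to column names."""
    return [COLUMNS[i] for i in indices]


if __name__ == "__main__":
    print(projected_columns([0, 4, 31]))
```

Only the listed columns are read, which is why the read_bytes figure grows from Q1 to Q2 when DepDelay joins the scan schema.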
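Q3: Number of delays per airport, 2000 through 2008, top 10
(0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.)
mysql> SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;\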
+--------+--------+\
| Origin | c |\
+--------+--------+\
| ORD | 860911 |\
| ATL | 831822 |\
| DFW | 614403 |\
| LAX | 402671 |\
| PHX | 400475 |\
| LAS | 362026 |\
| DEN | 352893 |\
| EWR | 302267 |\
| DTW | 296832 |\
| IAH | 290729 |\
+--------+--------+\
10 rows in set (0.69 sec)\
Read 71000000 rows, 1.21 GB in 0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.
mysql> explain SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Limit: 10 |\
| Projection: Origin:String, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[Origin]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Origin]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, Origin:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1271665856, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 14, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
7 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q4: Number of delays per airline in 2007
(0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;\
+---------+---------+\
| Carrier | count() |\
+---------+---------+\
| WN | 296451 |\
| AA | 179769 |\
| MQ | 152293 |\
| OO | 147019 |\
| US | 140199 |\
| UA | 135061 |\
| XE | 108571 |\
| EV | 104055 |\
| NW | 102206 |\
| DL | 98427 |\
| CO | 81039 |\
| YV | 79553 |\
| FL | 64583 |\
| OH | 60532 |\
| AS | 54326 |\
| B6 | 53716 |\
| 9E | 48578 |\
| F9 | 24100 |\
| AQ | 6764 |\
| HA | 4059 |\
+---------+---------+\
20 rows in set (0.19 sec)\
Read 15000000 rows, 240 MB in 0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, count():UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| Filter: ((DepDelay > 10) and (Year = 2007)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((DepDelay > 10) AND (Year = 2007))]] |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q5: Delay rate per airline in 2007, in per mille
(0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;\
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| EV | 363.53123668047823 |\
| AS | 339.1453631738303 |\
| US | 288.8039271022377 |\
| AA | 283.6112877194699 |\
| MQ | 281.7663100792978 |\
| B6 | 280.5745625489684 |\
| UA | 275.63356884257615 |\
| YV | 270.25567158804466 |\
| OH | 256.4567516268981 |\
| WN | 253.62165713752844 |\
| CO | 250.77750030171651 |\
| XE | 249.71881878589517 |\
| NW | 246.56113247419944 |\
| F9 | 246.52209492635023 |\
| OO | 245.90051515354253 |\
| FL | 245.4143692596491 |\
| DL | 206.82764258051773 |\
| 9E | 187.66780889391967 |\
| AQ | 145.9016393442623 |\
| HA | 72.25634178905207 |\
+---------+--------------------+\
20 rows in set (0.27 sec)\
Read 15000000 rows, 240 MB in 0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |\
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |\
| Filter: (Year = 2007) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [(Year = 2007)]] |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
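The expression avg(cast(DepDelay>10 as Int8))*1000 turns each row into 1 (delayed) or 0 (not delayed), averages those values, and scales the result to per mille. A minimal Python sketch of the same calculation on made-up sample values; the function name and the sample list are hypothetical and not taken from the dataset:

```python
def delay_permille(dep_delays, threshold=10):
    """Share of flights with a departure delay above `threshold`, in per mille."""
    flags = [1 if delay > threshold else 0 for delay in dep_delays]
    return sum(flags) / len(flags) * 1000


if __name__ == "__main__":
    sample = [5, 20, 0, 15]        # illustrative delays in minutes
    print(delay_permille(sample))  # two of four flights exceed 10 minutes
```

Read this way, the value of about 363.5 for carrier EV means that roughly 36% of its 2007 flights left more than 10 minutes late.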
Q6: Delay rate per airline, 2000 through 2008, in per mille
(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;\
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| AS | 293.05649076611434 |\
| EV | 282.0709981074399 |\
| YV | 270.3897636688929 |\
| B6 | 257.40594891667007 |\
| FL | 249.28742951361826 |\
| XE | 246.59005902424192 |\
| MQ | 245.3695989400477 |\
| WN | 233.38127235928863 |\
| DH | 227.11013827345042 |\
| F9 | 226.08455653226812 |\
| UA | 224.42824657703645 |\
| OH | 215.52882835147614 |\
| AA | 211.97122176454556 |\
| US | 206.60330294168244 |\
| HP | 205.31690167066455 |\
| OO | 202.4243177198239 |\
| NW | 191.7393936377831 |\
| TW | 188.6912623180138 |\
| DL | 187.84162871590732 |\
| CO | 187.71301306878976 |\
| 9E | 181.6396991511518 |\
| RU | 181.46244295416398 |\
| TZ | 176.8928125899626 |\
| AQ | 145.65911608293766 |\
| HA | 79.38672451825789 |\
+---------+--------------------+\
25 rows in set (0.94 sec)\
Read 71000000 rows, 1.14 GB in 0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |\
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.003 sec., 0 rows/sec., 0 B/sec.
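Q7: Average delay per airline, 2000 through 2008
(0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;\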
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| B6 | 16789.739456036365 |\
| NW | 11717.623092632819 |\
| F9 | 11232.889558936127 |\
| XE | 17092.548853057146 |\
| YV | 17971.53933699898 |\
| US | 11868.7097884053 |\
| RU | 12556.249210602802 |\
| AS | 14735.545887755581 |\
| HA | 6851.555976883671 |\
| OH | 12655.103820799075 |\
| UA | 14594.243159716054 |\
| TZ | 12618.760195758565 |\
| EV | 16374.703330010156 |\
| HP | 11625.682112859839 |\
| DH | 15311.949983190174 |\
| DL | 10943.456441165357 |\
| 9E | 13091.087573576122 |\
| FL | 15192.451732538268 |\
| MQ | 14125.201554023559 |\
| AQ | 7323.278123603293 |\
| OO | 11600.594852741107 |\
| AA | 13508.78515494305 |\
| TW | 10842.722114986364 |\
| WN | 10484.932610056378 |\
| CO | 12671.595978518368 |\
+---------+--------------------+\
25 rows in set (0.74 sec)\
Read 71000000 rows, 1.14 GB in 0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(DepDelay) * 1000) as c3:Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(DepDelay) * 1000):Float64 (Before Projection) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q8: Average flight delay per year
(1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.)
mysql> SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;\
+------+--------------------+\
| Year | avg(DepDelay) |\
+------+--------------------+\
| 1987 | 12.380385692195556 |\
| 1988 | 7.345867511864449 |\
| 1989 | 8.81845473300008 |\
| 1990 | 7.966702606180775 |\
| 1991 | 6.940411174086677 |\
| 1992 | 6.687364706154975 |\
| 1993 | 7.207721091071671 |\
| 1994 | 7.758752042452116 |\
| 1995 | 9.328649903752932 |\
| 1996 | 11.14468468976826 |\
| 1997 | 9.919225483813925 |\
| 1998 | 10.884314711941435 |\
| 1999 | 11.567390524113748 |\
| 2000 | 13.456897681824556 |\
| 2001 | 10.895474364001354 |\
| 2002 | 9.97856700710386 |\
| 2003 | 9.778465263372038 |\
| 2004 | 11.936799840656898 |\
| 2005 | 12.60167890747495 |\
| 2006 | 14.237297887039372 |\
| 2007 | 15.431738868356579 |\
| 2008 | 14.654588068064287 |\
| 2009 | 13.168984006133062 |\
| 2010 | 13.202976628175891 |\
| 2011 | 13.496191548097778 |\
| 2012 | 13.155971481255131 |\
| 2013 | 14.901210490900201 |\
| 2014 | 15.513697266113969 |\
| 2015 | 14.638336410280733 |\
| 2016 | 14.643883269504837 |\
| 2017 | 15.70225324299191 |\
| 2018 | 16.16188254545747 |\
| 2019 | 16.983263489524507 |\
| 2020 | 10.624498278073712 |\
| 2021 | 15.289615417399649 |\
+------+--------------------+\
35 rows in set (1.04 sec)\
Read 201816232 rows, 1.21 GB in 1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.
mysql> explain SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: Year:UInt16, avg(DepDelay):Float64 |\
| AggregatorFinal: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |\
| AggregatorPartial: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |\
| ReadDataSource: scan schema: [Year:UInt16, DepDelay:Int32], statistics: [read_rows: 201816232, read_bytes: 1210897392, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 31]] |\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
4 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q9: Number of flights per year
(0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.)
mysql> SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;\
+------+---------+\
| Year | c1 |\
+------+---------+\
| 1987 | 440403 |\
| 1988 | 5202096 |\
| 1989 | 5041200 |\
| 1990 | 5270893 |\
| 1991 | 5076925 |\
| 1992 | 5092157 |\
| 1993 | 5070501 |\
| 1994 | 5180048 |\
| 1995 | 5327435 |\
| 1996 | 5351983 |\
| 1997 | 5411843 |\
| 1998 | 5384721 |\
| 1999 | 5527884 |\
| 2000 | 5683047 |\
| 2001 | 5967780 |\
| 2002 | 5271359 |\
| 2003 | 6488540 |\
| 2004 | 7129270 |\
| 2005 | 7140596 |\
| 2006 | 7141922 |\
| 2007 | 7455458 |\
| 2008 | 7009726 |\
| 2009 | 6450285 |\
| 2010 | 6450117 |\
| 2011 | 6085281 |\
| 2012 | 6096762 |\
| 2013 | 6369482 |\
| 2014 | 5819811 |\
| 2015 | 5819079 |\
| 2016 | 5617658 |\
| 2017 | 5674621 |\
| 2018 | 7213446 |\
| 2019 | 7422037 |\
| 2020 | 4688354 |\
| 2021 | 5443512 |\
+------+---------+\
35 rows in set (0.52 sec)\
Read 201816232 rows, 403.63 MB in 0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.
mysql> explain SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: Year:UInt16, count() as c1:UInt64 |\
| AggregatorFinal: groupBy=[[Year]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Year]], aggr=[[count()]] |\
| ReadDataSource: scan schema: [Year:UInt16], statistics: [read_rows: 201816232, read_bytes: 403632464, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
4 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q10: Average monthly number of flights delayed by 15 minutes or more (DepDel15=1)
(0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.)
mysql> SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;\
+-------------------+\
| avg(cnt) |\
+-------------------+\
| 81474.99019607843 |\
+-------------------+\
1 row in set (0.90 sec)\
Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.
mysql> explain SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: avg(cnt):Float64 |\
| AggregatorFinal: groupBy=[[]], aggr=[[avg(cnt)]] |\
| AggregatorPartial: groupBy=[[]], aggr=[[avg(cnt)]] |\
| Projection: Year:UInt16, Month:UInt8, count() as cnt:UInt64 |\
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |\
| Filter: (DepDel15 = 1) |\
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8, DepDel15:Int32], statistics: [read_rows: 201816232, read_bytes: 1412713624, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2, 33], filters: [(DepDel15 = 1)]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q11: Average number of flights per month
(0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.)
mysql> SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;\
+-------------------+\
| avg(c1) |\
+-------------------+\
| 494647.6274509804 |\
+-------------------+\
1 row in set (0.57 sec)\
Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.
mysql> explain SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(c1):Float64 |
| AggregatorFinal: groupBy=[[]], aggr=[[avg(c1)]] |
| AggregatorPartial: groupBy=[[]], aggr=[[avg(c1)]] |
| Projection: Year:UInt16, Month:UInt8, count() as c1:UInt64 |
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8], statistics: [read_rows: 201816232, read_bytes: 605448696, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.02 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
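As a quick sanity check on Q11, the average over the (Year, Month) groups should equal the total row count divided by the number of groups. The short Python sketch below uses only the two numbers printed in this article. The group count is not shown in any output, so it is inferred here by dividing the total by the reported average; treat it as a derived value, not something Databend reported.

```python
# Sanity check for Q11: avg(count per month) == total rows / number of months.
total_rows = 201_816_232          # rows in ontime, as reported by the queries
reported_avg = 494647.6274509804  # avg(c1) returned by Q11

# Inferred, not printed by Databend: how many (Year, Month) groups exist.
month_groups = round(total_rows / reported_avg)

# Recompute the average from the inferred group count and compare.
recomputed_avg = total_rows / month_groups
print(month_groups)                               # 408
print(abs(recomputed_avg - reported_avg) < 1e-6)  # True
```

The inferred 408 groups correspond to 34 years' worth of months, and dividing the total by it reproduces the reported average.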
Q12 Top 10 city pairs by number of direct flights
(2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.)
mysql> SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+-------------------+-------------------+--------+
| OriginCityName    | DestCityName      | c      |
+-------------------+-------------------+--------+
| San Francisco, CA | Los Angeles, CA   | 514878 |
| Los Angeles, CA   | San Francisco, CA | 512147 |
| New York, NY      | Chicago, IL       | 456042 |
| Chicago, IL       | New York, NY      | 448756 |
| Chicago, IL       | Minneapolis, MN   | 437913 |
| Minneapolis, MN   | Chicago, IL       | 433688 |
| Los Angeles, CA   | Las Vegas, NV     | 428942 |
| Las Vegas, NV     | Los Angeles, CA   | 422825 |
| New York, NY      | Boston, MA        | 419405 |
| Boston, MA        | New York, NY      | 416324 |
+-------------------+-------------------+--------+
10 rows in set (2.94 sec)
Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.
mysql> explain SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| Projection: OriginCityName:String, DestCityName:String, count() as c:UInt64 |
| Sort: count():UInt64 |
| AggregatorFinal: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |
| ReadDataSource: scan schema: [OriginCityName:String, DestCityName:String], statistics: [read_rows: 201816232, read_bytes: 9829664815, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15, 24]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
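The plan above aggregates in two stages, AggregatorPartial and then AggregatorFinal, before applying Sort and Limit. The Python sketch below illustrates the general idea of partial/final aggregation on a toy dataset. It is not Databend's implementation: the function names, the partition layout, and the sample rows are all made up for illustration.

```python
from collections import Counter
from typing import Iterable

Row = tuple[str, str]  # (OriginCityName, DestCityName)

def aggregate_partial(partition: Iterable[Row]) -> Counter:
    """Count rows per group key within a single partition."""
    return Counter(partition)

def aggregate_final(partials: Iterable[Counter]) -> Counter:
    """Merge the per-partition counts into one global count per key."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for keys seen in several partitions
    return merged

def top_n(counts: Counter, n: int) -> list[tuple[Row, int]]:
    """Equivalent of ORDER BY c DESC LIMIT n."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy data split into two hypothetical partitions.
partitions = [
    [("San Francisco, CA", "Los Angeles, CA"),
     ("New York, NY", "Chicago, IL"),
     ("San Francisco, CA", "Los Angeles, CA")],
    [("San Francisco, CA", "Los Angeles, CA"),
     ("New York, NY", "Chicago, IL"),
     ("Boston, MA", "New York, NY")],
]

partials = [aggregate_partial(p) for p in partitions]
result = top_n(aggregate_final(partials), 2)
for (origin, dest), c in result:
    print(origin, "->", dest, c)
```

Because each partition is reduced to a small table of counts before merging, the partial stage can run independently per partition, and only the compact partial results need to be combined in the final stage.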
Q13 Top 10 origin cities by number of departing flights
(1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.)
mysql> SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-----------------------+----------+
| OriginCityName        | c        |
+-----------------------+----------+
| Chicago, IL           | 12545243 |
| Atlanta, GA           | 10900284 |
| Dallas/Fort Worth, TX |  9011081 |
| Houston, TX           |  6844476 |
| Los Angeles, CA       |  6695628 |
| New York, NY          |  6309911 |
| Denver, CO            |  6283055 |
| Phoenix, AZ           |  5658884 |
| Washington, DC        |  4998047 |
| San Francisco, CA     |  4673365 |
+-----------------------+----------+
10 rows in set (1.23 sec)
Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.
mysql> explain SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| Projection: OriginCityName:String, count() as c:UInt64 |
| Sort: count():UInt64 |
| AggregatorFinal: groupBy=[[OriginCityName]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[OriginCityName]], aggr=[[count()]] |
| ReadDataSource: scan schema: [OriginCityName:String], statistics: [read_rows: 201816232, read_bytes: 4914707403, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
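Each query in this article ends with a summary line of the form "Read N rows, ... in T sec., ... million rows/sec.". The reported rate is roughly N divided by T. The Python sketch below parses the summary lines of Q11, Q12 and Q13 and recomputes the rate. The recomputed values differ slightly from the reported ones, presumably because the printed elapsed time is rounded; that explanation is an assumption, as the output itself does not say. The regular expression and the function name are illustrative only.

```python
import re

# Summary lines copied from the Q11, Q12 and Q13 results in this article.
SUMMARIES = {
    "Q11": "Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.",
    "Q12": "Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.",
    "Q13": "Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.",
}

# Captures: row count, elapsed seconds, reported rate in million rows/sec.
PATTERN = re.compile(
    r"Read (\d+) rows, .* in ([\d.]+) sec\., ([\d.]+) million rows/sec\."
)

def recompute(line: str) -> tuple[float, float]:
    """Return (recomputed, reported) throughput in million rows/sec."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    rows, seconds, reported = int(match[1]), float(match[2]), float(match[3])
    return rows / seconds / 1e6, reported

for name, line in SUMMARIES.items():
    recomputed, reported = recompute(line)
    print(f"{name}: recomputed {recomputed:.2f}M rows/sec, reported {reported:.2f}M rows/sec")
```

The same approach can be used to compare runs across environments, as long as the summary line keeps this format.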
Q14 Total number of rows in the ontime table
(0.002 sec., 443.51 rows/sec., 443.51 B/sec.)
mysql> SELECT count(*) FROM ontime;
+-----------+
| count()   |
+-----------+
| 201816232 |
+-----------+
1 row in set (0.01 sec)
Read 1 rows, 1 B in 0.002 sec., 443.51 rows/sec., 443.51 B/sec.
mysql> explain SELECT count(*) FROM ontime;
+-----------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Projection: count():UInt64 |
| Projection: 201816232 as count():UInt64 |
| Expression: 201816232:UInt64 (Exact Statistics) |
| ReadDataSource: scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1, partitions_scanned: 1, partitions_total: 1] |
+-----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
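The Q14 plan replaces the table scan with the constant 201816232, marked "Exact Statistics", and the summary line shows only a single dummy row being read. A plausible reading is that the row count comes from table metadata rather than from scanning the data. The Python sketch below illustrates that idea in general terms only. The PartitionMeta structure and the row counts are hypothetical and do not reflect Databend's actual metadata format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical per-partition metadata kept alongside the data files."""
    path: str
    row_count: int

def count_rows(partitions: list[PartitionMeta]) -> int:
    """Answer SELECT count(*) from metadata only, without opening data files."""
    return sum(p.row_count for p in partitions)

# Made-up metadata for three partitions.
metas = [
    PartitionMeta("part-0", 1_000_000),
    PartitionMeta("part-1", 1_000_000),
    PartitionMeta("part-2", 250_000),
]
print(count_rows(metas))  # 2250000
```

With this kind of bookkeeping, the cost of count(*) depends on the number of partitions rather than on the number of rows, which is consistent with the 0.002 sec. reported above.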
More performance tests
Databend On Amazon S3 Performance
https://databend.rs/doc/performance/ec2-s3-performance
Databend On Alibaba Cloud ECS OSS Performance
https://databend.rs/doc/performance/ecs-oss-performance
Databend On Wasabi Performance
https://databend.rs/doc/performance/ec2-wasabi-performance
If you need support, add us on WeChat: 82565387.
WeChat: Databend
Docs: https://databend.rs