Understanding the TPCx-BB (Big Data Benchmark) Source Code in One Article
TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and executes 30 queries to measure the performance, hardware and software alike, of a Hadoop-based big data system. Some scenarios also use machine learning algorithms (clustering, linear regression, etc.). To properly assess the performance of the system under test, you need a thorough understanding of the whole TPCx-BB workflow. This article analyzes the TPCx-BB source code in detail; I hope it helps you understand TPCx-BB.
Code Structure
The main directory ($BENCH_MARK_HOME) contains these subdirectories:
- bin
- conf
- data-generator
- engines
- tools
bin contains several modules, the scripts used during execution: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, etc.
conf contains two configuration files: bigBench.properties and userSettings.conf.
bigBench.properties mainly sets the workload (the benchmark phases to execute) and power_test_0 (the SQL queries to run in the POWER_TEST phase).
The default workload:
workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1
The default power_test_0: 1-30
userSettings.conf holds the basic settings: the Java environment, defaults for the benchmark (database, engine, map_tasks, scale_factor, ...), the Hadoop environment, HDFS config and paths, and Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).
data-generator contains the scripts and configuration files for data generation, described in detail below.
engines contains the four engines TPCx-BB supports: biginsights, hive, impala and spark_sql. The default engine is hive. In fact only the hive directory is non-empty; the other three are empty, presumably not yet implemented.
tools contains two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is central to running the TPCx-BB test; most of the driver logic lives in this jar.
Data Generation
The data generation program and its configuration live under the data-generator directory, which contains pdgf.jar and three subdirectories: config, dicts and extlib.
pdgf.jar is the Java data generator, a large codebase. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.
bigbench-generation.xml mainly defines the raw data to generate (not the database tables): which tables, each table's name and size, the output directory, file suffix, delimiter, character encoding, and so on.
<schema name="default">
<tables>
<!-- not refreshed tables -->
<!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
<table name="income_band"></table>
<table name="reason"></table>
<table name="ship_mode"></table>
<table name="web_site"></table>
<!-- /tables not used in benchmark -->
<!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
<table name="date_dim" static="true"></table>
<table name="time_dim" static="true"></table>
<table name="customer_demographics" static="true"></table>
<table name="household_demographics" static="true"></table>
<!-- /static tables -->
<!-- "normal" tables. split over all nodes. not generated during refresh -->
<table name="store"></table>
<table name="warehouse"></table>
<table name="promotion"></table>
<table name="web_page"></table>
<!-- /"normal" tables.-->
<!-- /not refreshed tables -->
<!--
refreshed tables. Generated on all nodes.
Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]
.Has effect only if ${REFRESH_SYSTEM_ENABLED}==1.
-->
<table name="customer">
<scheduler name="DefaultScheduler">
<partitioner
name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
<prePartition><![CDATA[
if(${REFRESH_SYSTEM_ENABLED}>0){
int tableID = table.getTableID();
int timeID = 0;
long lastTableRow=table.getSize()-1;
long rowStart;
long rowStop;
boolean exclude=false;
long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
if(${REFRESH_PHASE}>0){
//Refresh part
rowStart = lastTableRow - refreshRows +1;
rowStop = lastTableRow;
if(refreshRows<=0){
exclude=true;
}
}else{
//"normal" part
rowStart = 0;
rowStop = lastTableRow - refreshRows;
}
return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
}else{
//DEFAULT
return getParentPartitioner().getDefaultPrePartition(project, table);
}
]]></prePartition>
</partitioner>
</scheduler>
</table>
<output name="SplitFileOutputWrapper">
<!-- DEFAULT output for all Tables, if no table specific output is specified-->
<output name="CSVRowOutput">
<fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.dat</fileEnding>
<delimiter>|</delimiter>
<charset>UTF-8</charset>
<sortByRowID>true</sortByRowID>
</output>
<output name="StatisticsOutput" active="1">
<size>${item_size}</size><!-- a counter per item .. initialize later-->
<fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
<outputDir>output/</outputDir>
<fileEnding>.csv</fileEnding>
<delimiter>,</delimiter>
<header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
</header>
<footer></footer>
bigbench-schema.xml defines a large number of parameters. Some control table scale, such as each table's size (row count); most concern individual fields, such as start and end dates, gender ratio, marriage ratio, and upper/lower bounds of metrics. It also defines exactly how each field is generated, with its constraints. Examples:
The amount of data generated is determined by SCALE_FACTOR (-f). With -f 1, roughly 1 GB of data is generated in total; with -f 100, roughly 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?
Because SCALE_FACTOR (-f) determines each table's row count. As shown below, the customer table has 100000.0d * ${SF_sqrt} rows: with -f 1 the customer table gets 100000 * sqrt(1) = 100,000 rows; with -f 100 it gets 100000 * sqrt(100) = 1,000,000 rows.
<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
<field name="d_date" size="13" type="CHAR" primary="false">
<gen_DateTime>
<disableRng>true</disableRng>
<useFixedStepSize>true</useFixedStepSize>
<startDate>${date_dim_begin_date}</startDate>
<endDate>${date_dim_end_date}</endDate>
<outputFormat>yyyy-MM-dd</outputFormat>
</gen_DateTime>
</field>
<field name="t_time_id" size="16" type="CHAR" primary="false">
<gen_ConvertNumberToString>
<gen_Id/>
<size>16.0</size>
<characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
</gen_ConvertNumberToString>
</field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
<gen_Null probability="${NULL_CHANCE}">
<gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
</gen_Null>
</field>
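To make the square-root scaling concrete, here is a small standalone sketch (an illustration only, not part of the kit; PDGF evaluates these expressions internally) that reproduces the ${customer_size} = 100000.0d * ${SF_sqrt} formula above:

```java
// Sketch: how ${customer_size} = 100000.0d * ${SF_sqrt} scales with -f (SCALE_FACTOR).
public class ScaleFactorDemo {
    public static void main(String[] args) {
        long[] scaleFactors = {1, 100, 1000};
        for (long sf : scaleFactors) {
            long customerRows = (long) (100000.0d * Math.sqrt(sf));
            // SF=1 -> 100,000 rows; SF=100 -> 1,000,000 rows
            System.out.printf("SF=%-5d customer rows = %,d%n", sf, customerRows);
        }
    }
}
```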
dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict and mail_provider.dict; the field values of each generated record are drawn from these dictionaries.
extlib contains the external jar dependencies, such as lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar and log4j-1.2.15.jar.
In summary: pdgf.jar takes the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, field names, row counts, per-field generation rules), draws the value of every field of every record from the corresponding .dict files under dicts, and writes out the raw data.
A sample record from the customer table:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
Running the TPCx-BB test with -f 1 (SCALE_FACTOR = 1) produces roughly 1 GB of raw data in total (977 MB + 8.6 MB):
[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M 38.0 M /user/root/benchmarks/bigbench/data/customer
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/customer_address
74.2 M 222.5 M /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M 44.0 M /user/root/benchmarks/bigbench/data/date_dim
151.5 K 454.4 K /user/root/benchmarks/bigbench/data/household_demographics
327 981 /user/root/benchmarks/bigbench/data/income_band
405.3 M 1.2 G /user/root/benchmarks/bigbench/data/inventory
6.5 M 19.5 M /user/root/benchmarks/bigbench/data/item
4.0 M 12.0 M /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M 161.2 M /user/root/benchmarks/bigbench/data/product_reviews
45.3 K 135.9 K /user/root/benchmarks/bigbench/data/promotion
3.0 K 9.1 K /user/root/benchmarks/bigbench/data/reason
1.2 K 3.6 K /user/root/benchmarks/bigbench/data/ship_mode
3.3 K 9.9 K /user/root/benchmarks/bigbench/data/store
4.1 M 12.4 M /user/root/benchmarks/bigbench/data/store_returns
88.5 M 265.4 M /user/root/benchmarks/bigbench/data/store_sales
4.9 M 14.6 M /user/root/benchmarks/bigbench/data/time_dim
584 1.7 K /user/root/benchmarks/bigbench/data/warehouse
170.4 M 511.3 M /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K 23.6 K /user/root/benchmarks/bigbench/data/web_page
5.1 M 15.4 M /user/root/benchmarks/bigbench/data/web_returns
127.6 M 382.8 M /user/root/benchmarks/bigbench/data/web_sales
8.6 K 25.9 K /user/root/benchmarks/bigbench/data/web_site
Execution Flow
To run a TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and execute:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
Here -f, -m, -s and -j are parameters that users can set according to the cluster's capability and their needs. If not specified, defaults apply; they are defined in the userSettings.conf file under the conf directory:
export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"
The default MAP_TASKS is 80 (-m 80), the default SCALE_FACTOR is 1000 (-f 1000), and the default NUMBER_OF_PARALLEL_STREAMS is 2 (-s 2).
All available options and their meanings:
General options:
-d database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f scale factor of the dataset (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h show help
-m number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)
Driver specific options:
-a run in pretend mode
-b print the bash scripts invoked during execution to stdout
-i specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties)
-j specify the queries to run (default: all 30 queries, 1-30)
-U unlock expert mode
If -U is given, expert mode is unlocked and the script prints:
echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution"
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]"
-D specify the query part to debug; most queries have only a single part
-p benchmark phase to run (default: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)
-q specify a single query to run
-t specify which stream to use when running that query
-v SQL script for metastore population (default: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})
-w SQL script for metastore refresh (default: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})
-y file with additional user-defined query parameters (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)
-z file with additional user-defined engine settings (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)
List of available modules:
$BIG_BENCH_ENGINE_BIN_DIR
Back to the command that starts the TPCx-BB test:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
bigBench
bigBench is the main script; runBenchmark is a module.
bigBench sets many environment variables (paths, engine, number of streams, and so on) because RunBigBench.jar later reads them inside the Java program.
The first part of bigBench does groundwork: setting environment variables, parsing the user's options, granting file permissions, setting paths, and so on. As its final step it calls runBenchmark's runModule() function:
- Set the basic paths

```bash
export BIG_BENCH_VERSION="1.0"
export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
```

- Point to core-site.xml and hdfs-site.xml; data generation uses the Hadoop cluster and writes to HDFS

```bash
export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"
```
- Make every executable file in the package executable (.sh/.jar/.py)

```bash
find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +
```
- Locate userSettings.conf and source it

USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
if [ ! -f "$USER_SETTINGS" ]
then
  echo "User settings file $USER_SETTINGS not found"
  exit 1
else
  source "$USER_SETTINGS"
fi
- Parse the input arguments and options, and configure accordingly

The first argument must be a module_name. If there are no arguments, or the first argument starts with "-", the user did not name a module to run.

if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
then
  export MODULE_NAME=""
  SHOW_HELP="1"
else
  export MODULE_NAME="$1"
  shift
fi
export LIST_OF_USER_OPTIONS="$@"

Then it parses the user's options and sets environment variables from them:
```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
case "$OPT" in
# script options
d)
#echo "-d was triggered, Parameter: $OPTARG" >&2
USER_DATABASE="$OPTARG"
;;
D)
#echo "-D was triggered, Parameter: $OPTARG" >&2
DEBUG_QUERY_PART="$OPTARG"
;;
e)
#echo "-e was triggered, Parameter: $OPTARG" >&2
USER_ENGINE="$OPTARG"
;;
f)
#echo "-f was triggered, Parameter: $OPTARG" >&2
USER_SCALE_FACTOR="$OPTARG"
;;
h)
#echo "-h was triggered, Parameter: $OPTARG" >&2
SHOW_HELP="1"
;;
m)
#echo "-m was triggered, Parameter: $OPTARG" >&2
USER_MAP_TASKS="$OPTARG"
;;
p)
#echo "-p was triggered, Parameter: $OPTARG" >&2
USER_BENCHMARK_PHASE="$OPTARG"
;;
q)
#echo "-q was triggered, Parameter: $OPTARG" >&2
QUERY_NUMBER="$OPTARG"
;;
s)
#echo "-t was triggered, Parameter: $OPTARG" >&2
USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
;;
t)
#echo "-s was triggered, Parameter: $OPTARG" >&2
USER_STREAM_NUMBER="$OPTARG"
;;
U)
#echo "-U was triggered, Parameter: $OPTARG" >&2
USER_EXPERT_MODE="1"
;;
v)
#echo "-v was triggered, Parameter: $OPTARG" >&2
USER_POPULATE_FILE="$OPTARG"
;;
w)
#echo "-w was triggered, Parameter: $OPTARG" >&2
USER_REFRESH_FILE="$OPTARG"
;;
y)
#echo "-y was triggered, Parameter: $OPTARG" >&2
USER_QUERY_PARAMS_FILE="$OPTARG"
;;
z)
#echo "-z was triggered, Parameter: $OPTARG" >&2
USER_ENGINE_SETTINGS_FILE="$OPTARG"
;;
# driver options
a)
#echo "-a was triggered, Parameter: $OPTARG" >&2
export USER_PRETEND_MODE="1"
;;
b)
#echo "-b was triggered, Parameter: $OPTARG" >&2
export USER_PRINT_STD_OUT="1"
;;
i)
#echo "-i was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_WORKLOAD="$OPTARG"
;;
j)
#echo "-j was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
;;
?)
echo "Invalid option: -$OPTARG" >&2
exit 1
;;
:)
echo "Option -$OPTARG requires an argument." >&2
exit 1
;;
esac
done
```
Set the global variables. If the user specified a value for a parameter, use it; otherwise fall back to the default.
```bash
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"
export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
```
- Check that the -m, -f, -s and -t options are numeric

if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_MAP_TASKS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_SCALE_FACTOR is not a number"
fi
if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"
fi
if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]
then
  echo "$BIG_BENCH_STREAM_NUMBER is not a number"
fi
- Check that the engine exists

if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]
then
  echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."
  exit 1
fi
if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]
then
  echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."
  exit 1
fi
- Locate engineSettings.conf and source it

ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
if [ ! -f "$ENGINE_SETTINGS" ]
then
  echo "Engine settings file $ENGINE_SETTINGS not found"
  exit 1
else
  source "$ENGINE_SETTINGS"
fi
- Check that the module exists

Given a module name, the script first looks for it under the $BIG_BENCH_ENGINE_BIN_DIR/ directory; if found, it runs source "$MODULE". If not found there, it looks under the $BIG_BENCH_BIN_DIR/ directory; if found, it runs source "$MODULE"; otherwise it prints "Module $MODULE not found, aborting script."

export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
if [ -f "$MODULE" ]
then
  source "$MODULE"
else
  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
  if [ -f "$MODULE" ]
  then
    source "$MODULE"
  else
    echo "Module $MODULE not found, aborting script."
    exit 1
  fi
fi
- Check that the module defines the runModule(), helpModule() and runEngineCmd() functions

MODULE_RUN_METHOD="runModule"
if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
then
  echo "$MODULE_RUN_METHOD was not implemented, aborting script"
  exit 1
fi
- Run the module

If the module is runBenchmark, this executes

runCmdWithErrorCheck "$MODULE_RUN_METHOD"

which is

runCmdWithErrorCheck runModule()

As you can see, the bigBench script mainly does groundwork such as setting environment variables, granting permissions, and checking and parsing the input arguments, and finally calls runBenchmark's runModule() function to carry on.
runBenchmark
Next, the runBenchmark script. runBenchmark defines two functions: helpModule() and runModule(). helpModule() simply prints help. runModule() is the function actually invoked when the runBenchmark module runs. It does four things:
- clean the logs generated by previous runs
- invoke RunBigBench.jar to execute the benchmark
- logEnvInformation
- zip the log directory
Source:
runModule () {
#check input parameters
if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
then
echo "The number of parallel streams -s must be greater than 0"
return 1
fi
"${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
"$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
"${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
"${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS
return $?
}
In effect, running the runBenchmark module invokes the cleanLogs, logEnvInformation and zipLogs modules plus RunBigBench.jar. RunBigBench.jar, written in Java, is the core of TPCx-BB test execution. Next we analyze its source.
runModule()
The runModule() function executes a module. As we know, executing a module means switching into the bin directory under the main directory and running:
./bigBench module_name arguments
In runModule(), cmdLine builds exactly this command.
ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);
Here this.runScript is:
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
benchmarkPhase.getRunModule() returns the module to execute, and arguments holds the user-supplied arguments. The assembled cmdLine is therefore:
bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
So how does the system execute this bash command? By calling the runCmd() method.
boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));
Next, the runCmd() method.
runCmd()
runCmd() creates an operating-system process via ProcessBuilder and uses it to run the bash command above. ProcessBuilder can also set the working directory and the environment.
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
---
p = pb.start();
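The snippet above is abridged. Below is a minimal self-contained sketch of the same ProcessBuilder pattern (an illustration only: it assumes we just want the merged output and the exit code, while the real runCmd() also injects environment variables and handles logging):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

// Sketch of the runCmd() pattern: spawn a bash command in a working
// directory, echo its combined output, and report success via exit code.
public class RunCmdSketch {
    static boolean runCmd(String workingDirectory, String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(new File(workingDirectory));
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
        return p.waitFor() == 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
        runCmd("/tmp", "bash", "-c", "echo hello from bash");
    }
}
```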
getQueryList()
getQueryList() obtains the list of queries to execute. It reads the $BIG_BENCH_LOGS_DIR/bigBench.properties file, whose content matches $BIG_BENCH_HOME/conf/bigBench.properties. In bigBench.properties, power_test_0=1-30 defines which queries the power_test_0 phase runs, and in what order.
Entries may be ranges such as 5-12 or single numbers such as 21, separated by commas.
power_test_0=28-25,2-5,10,22,30
means the power_test_0 phase runs the queries in this order: 28,27,26,25,2,3,4,5,10,22,30.
To run all 30 queries in order:
power_test_0=1-30
The source that builds the query list (a cleaned-up sketch of the same rule follows the listing):
private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
return new ArrayList((Collection)this.queryListCache.get(propertyKey));
} else {
Object queryList;
String basicPhaseNamePattern;
if(!this.properties.containsKey(propertyKey)) {
if(benchmarkPhase.isQueryOrderRandom()) {
if(!this.queryListCache.containsKey("shuffledQueryList")) {
basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
if(!this.properties.containsKey(basicPhaseNamePattern)) {
throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
}
this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
}
queryList = (List)this.queryListCache.get("shuffledQueryList");
this.shuffleList((List)queryList, this.rnd);
} else {
queryList = this.getQueryList(queryOrderBasicPhase, 0);
}
} else {
queryList = new ArrayList();
String[] var11;
int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;
label65:
for(int var9 = 0; var9 < var10; ++var9) {
basicPhaseNamePattern = var11[var9];
String[] queryRange = basicPhaseNamePattern.trim().split("-");
switch(queryRange.length) {
case 1:
((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
break;
case 2:
int startQuery = Integer.parseInt(queryRange[0]);
int endQuery = Integer.parseInt(queryRange[1]);
int i;
if(startQuery > endQuery) {
i = startQuery;
while(true) {
if(i < endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
--i;
}
} else {
i = startQuery;
while(true) {
if(i > endQuery) {
continue label65;
}
((List)queryList).add(Integer.valueOf(i));
++i;
}
}
default:
throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
}
}
}
if(queryOrderCached) {
this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
}
return new ArrayList((Collection)queryList);
}
}
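The decompiled loop (label65, var9, ...) is hard to follow. Here is a cleaner standalone sketch of the same expansion rule, where ranges may run backwards; this is an illustration, not the driver's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: expand a query-list spec such as "28-25,2-5,10,22,30"
// the way getQueryList() does (single numbers, ascending and
// descending ranges, comma separated).
public class QueryListDemo {
    static List<Integer> expand(String spec) {
        List<Integer> queries = new ArrayList<>();
        for (String part : spec.split(",")) {
            String[] range = part.trim().split("-");
            switch (range.length) {
                case 1: // single query number
                    queries.add(Integer.parseInt(range[0].trim()));
                    break;
                case 2: // range, possibly descending
                    int start = Integer.parseInt(range[0]);
                    int end = Integer.parseInt(range[1]);
                    int step = start <= end ? 1 : -1;
                    for (int i = start; i != end + step; i += step) {
                        queries.add(i);
                    }
                    break;
                default:
                    throw new IllegalArgumentException(
                        "Query numbers must be in the form X or X-Y, comma separated.");
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        // Prints [28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30]
        System.out.println(expand("28-25,2-5,10,22,30"));
    }
}
```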
parseEnvironment()
parseEnvironment() reads and parses the process environment variables.
Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");
It also automatically appends -U (unlock expert mode) to the user-supplied arguments:
this.userArguments.add("-U");
If the user specified PRETEND_MODE, PRINT_STD_OUT, WORKLOAD or QUERIES_TO_RUN, those values take precedence; otherwise the defaults are used.
if(env.containsKey("USER_PRETEND_MODE")) {
this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));
}
if(env.containsKey("USER_PRINT_STD_OUT")) {
this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));
}
if(env.containsKey("USER_DRIVER_WORKLOAD")) {
this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));
}
if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {
this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));
}
It then reads workload and fills benchmarkPhases. If the workload does not contain BENCHMARK_START and BENCHMARK_STOP, they are automatically added at the beginning and the end of benchmarkPhases, respectively. (A string-based sketch of this guard follows the listing.)
this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();
while(var7.hasNext()) {
String benchmarkPhase = (String)var7.next();
this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}
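The effect of this guard is easy to verify with plain strings; a minimal sketch, using String instead of the driver's enum type:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: BENCHMARK_START / BENCHMARK_STOP are added automatically
// if the configured workload does not already contain them.
public class WorkloadGuardDemo {
    public static void main(String[] args) {
        String workload = "LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1";
        List<String> phases = new ArrayList<>();
        for (String phase : workload.split(",")) {
            phases.add(phase.trim());
        }
        if (!phases.contains("BENCHMARK_START")) {
            phases.add(0, "BENCHMARK_START"); // prepend
        }
        if (!phases.contains("BENCHMARK_STOP")) {
            phases.add("BENCHMARK_STOP");     // append
        }
        // [BENCHMARK_START, LOAD_TEST, POWER_TEST, THROUGHPUT_TEST_1, BENCHMARK_STOP]
        System.out.println(phases);
    }
}
```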
run()
run() is the core method of RunBigBench.jar: everything executes through it. It calls runQueries(), runModule(), generateData(), and so on; these in turn call runCmd() to create operating-system processes that run bash commands and invoke the bash scripts.
run() walks through every benchmarkPhase in the workload with a while loop. Each benchmarkPhase dispatches to a different one of runQueries(), runModule(), generateData(), ...
try {
long e = 0L;
this.log.finer("Benchmark phases: " + this.benchmarkPhases);
Iterator startCheckpoint = this.benchmarkPhases.iterator();
long throughputStart;
while(startCheckpoint.hasNext()) {
BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();
if(children.isPhaseDone()) {
this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");
} else {
try {
switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {
case 1:
case 20:
throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");
case 2:
this.log.info(children.getConsoleMessage());
e = System.currentTimeMillis();
break;
case 3:
if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {
throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");
}
throughputStart = System.currentTimeMillis();
this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));
this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));
break;
case 4:
case 15:
case 18:
case 22:
case 27:
case 28:
case 29:
this.runModule(children, this.userArguments);
break;
case 5:
case 10:
case 11:
this.runQueries(children, 1, validationArguments);
break;
case 6:
case 9:
this.runModule(children, validationArguments);
break;
case 7:
this.generateData(children, false, validationArguments);
break;
case 8:
this.generateData(children, true, validationArguments);
break;
case 12:
case 19:
case 24:
this.runQueries(children, 1, this.userArguments);
break;
case 13:
case 14:
case 21:
case 23:
case 25:
case 26:
this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
break;
case 16:
this.generateData(children, false, this.userArguments);
break;
case 17:
this.generateData(children, true, this.userArguments);
}
children.setPhaseDone(true);
} catch (IOException var21) {
this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");
var21.printStackTrace();
if(this.stopAfterFailure || children.mustSucceed()) {
break;
}
}
}
}
The case 1-29 labels here are not queries 1-29; they are the 29 benchmarkPhases of the enum type, shown below:
private static enum BenchmarkPhase {
BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),
BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),
BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),
CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),
ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),
ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),
ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),
ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),
ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),
ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),
ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),
CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),
CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),
CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),
CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),
CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),
DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),
LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),
POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),
THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),
THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),
THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),
THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),
VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),
VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),
VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),
SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),
SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),
SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");
private String runModule;
private String namePattern;
private boolean queryOrderRandom;
private boolean queryOrderCached;
private boolean printStdOut;
private boolean mustSucceed;
private String consoleMessage;
private boolean phaseDone;
private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {
this.runModule = runModule;
this.namePattern = namePattern;
this.queryOrderRandom = queryOrderRandom;
this.queryOrderCached = queryOrderCached;
this.printStdOut = printStdOut;
this.mustSucceed = mustSucceed;
this.consoleMessage = consoleMessage;
this.phaseDone = false;
}
3 corresponds to BENCHMARK_STOP, 4 to CLEAN_ALL, 29 to SHOW_VALIDATION, and so on.
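Since these case labels are just the enum's ordinal positions plus one, the full mapping can be printed directly. A quick sketch, assuming the declaration order shown above:

```java
// Sketch: print the case-label-to-phase mapping of the decompiled switch.
public class PhaseOrdinalDemo {
    enum BenchmarkPhase {
        BENCHMARK, BENCHMARK_START, BENCHMARK_STOP, CLEAN_ALL,
        ENGINE_VALIDATION_CLEAN_POWER_TEST, ENGINE_VALIDATION_CLEAN_LOAD_TEST,
        ENGINE_VALIDATION_CLEAN_DATA, ENGINE_VALIDATION_DATA_GENERATION,
        ENGINE_VALIDATION_LOAD_TEST, ENGINE_VALIDATION_POWER_TEST,
        ENGINE_VALIDATION_RESULT_VALIDATION, CLEAN_POWER_TEST,
        CLEAN_THROUGHPUT_TEST_1, CLEAN_THROUGHPUT_TEST_2, CLEAN_LOAD_TEST,
        CLEAN_DATA, DATA_GENERATION, LOAD_TEST, POWER_TEST, THROUGHPUT_TEST,
        THROUGHPUT_TEST_1, THROUGHPUT_TEST_REFRESH, THROUGHPUT_TEST_2,
        VALIDATE_POWER_TEST, VALIDATE_THROUGHPUT_TEST_1, VALIDATE_THROUGHPUT_TEST_2,
        SHOW_TIMES, SHOW_ERRORS, SHOW_VALIDATION
    }

    public static void main(String[] args) {
        for (BenchmarkPhase phase : BenchmarkPhase.values()) {
            // e.g. "case  3 -> BENCHMARK_STOP", "case  4 -> CLEAN_ALL"
            System.out.printf("case %2d -> %s%n", phase.ordinal() + 1, phase.name());
        }
    }
}
```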
To summarize the dispatch:

CLEAN_ALL, CLEAN_LOAD_TEST, LOAD_TEST, THROUGHPUT_TEST_REFRESH, SHOW_TIMES, SHOW_ERRORS and SHOW_VALIDATION call

this.runModule(children, this.userArguments);

i.e. runModule with this.userArguments.

ENGINE_VALIDATION_CLEAN_POWER_TEST, ENGINE_VALIDATION_POWER_TEST and ENGINE_VALIDATION_RESULT_VALIDATION call

this.runQueries(children, 1, validationArguments);

i.e. runQueries with stream number 1 and validationArguments.

ENGINE_VALIDATION_CLEAN_LOAD_TEST and ENGINE_VALIDATION_LOAD_TEST call

this.runModule(children, validationArguments);

ENGINE_VALIDATION_CLEAN_DATA calls

this.generateData(children, false, validationArguments);

ENGINE_VALIDATION_DATA_GENERATION calls

this.generateData(children, true, validationArguments);

CLEAN_POWER_TEST, POWER_TEST and VALIDATE_POWER_TEST call

this.runQueries(children, 1, this.userArguments);

CLEAN_THROUGHPUT_TEST_1, CLEAN_THROUGHPUT_TEST_2, THROUGHPUT_TEST_1, THROUGHPUT_TEST_2, VALIDATE_THROUGHPUT_TEST_1 and VALIDATE_THROUGHPUT_TEST_2 call

this.runQueries(children, this.numberOfParallelStreams, this.userArguments);

CLEAN_DATA calls

this.generateData(children, false, this.userArguments);

DATA_GENERATION calls

this.generateData(children, true, this.userArguments);
Summarizing these calls:
- benchmarkPhases related to ENGINE_VALIDATION use validationArguments; all others use userArguments (the only difference between validationArguments and userArguments is that validationArguments pins SCALE_FACTOR to 1)
- phases related to POWER_TEST call runQueries(), since POWER_TEST means executing the SQL queries
- phases related to CLEAN_DATA and DATA_GENERATION call generateData()
- phases related to LOAD_TEST and SHOW call runModule()
Mapping between benchmarkPhase and module
The exact mapping from each benchmarkPhase to its module (the script executed) is:
CLEAN_ALL -> "cleanAll"
ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery"
ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore"
ENGINE_VALIDATION_CLEAN_DATA -> "cleanData"
ENGINE_VALIDATION_DATA_GENERATION -> "dataGen"
ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore"
ENGINE_VALIDATION_POWER_TEST -> "runQuery"
ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery"
CLEAN_POWER_TEST -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery"
CLEAN_LOAD_TEST -> "cleanMetastore"
CLEAN_DATA -> "cleanData"
DATA_GENERATION -> "dataGen"
LOAD_TEST -> "populateMetastore"
POWER_TEST -> "runQuery"
THROUGHPUT_TEST -> (String)null
THROUGHPUT_TEST_1 -> "runQuery"
THROUGHPUT_TEST_REFRESH -> "refreshMetastore"
THROUGHPUT_TEST_2 -> "runQuery"
VALIDATE_POWER_TEST -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery"
SHOW_TIMES -> "showTimes"
SHOW_ERRORS -> "showErrors"
SHOW_VALIDATION -> "showValidation"
When a benchmarkPhase executes, it invokes the corresponding module above (the scripts live under the $BENCH_MARK_HOME/engines/hive/bin directory):
cmdLine.add(benchmarkPhase.getRunModule());
Program call flow
Next, the function of each module.
module
cleanAll
1. DROP DATABASE
2. Delete the source data on HDFS
echo "dropping database (with all tables)"
runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
cleanQuery
1. Drop the temporary tables created by the query
2. Drop the result table created by the query
runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;"
return $?
cleanMetastore
1. Run dropTables.sql to drop the 23 tables one by one
echo "cleaning metastore tables"
runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"
export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"
dropTables.sql drops the 23 tables one by one:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};
DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:dateTableName};
DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:incomeTableName};
DROP TABLE IF EXISTS ${hiveconf:itemTableName};
DROP TABLE IF EXISTS ${hiveconf:promotionTableName};
DROP TABLE IF EXISTS ${hiveconf:reasonTableName};
DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};
DROP TABLE IF EXISTS ${hiveconf:storeTableName};
DROP TABLE IF EXISTS ${hiveconf:timeTableName};
DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};
DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};
DROP TABLE IF EXISTS ${hiveconf:webPageTableName};
DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};
DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};
DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};
DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};
cleanData
1. Delete the /user/root/benchmarks/bigbench/data directory on HDFS
2. Delete the /user/root/benchmarks/bigbench/data_refresh directory on HDFS
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
dataGen
1. Create /user/root/benchmarks/bigbench/data and set its permissions
2. Create /user/root/benchmarks/bigbench/data_refresh and set its permissions
3. Run HadoopClusterExec.jar and pdgf.jar to generate the base data under /user/root/benchmarks/bigbench/data
4. Run HadoopClusterExec.jar and pdgf.jar to generate the refresh data under /user/root/benchmarks/bigbench/data_refresh
Create /user/root/benchmarks/bigbench/data and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
Create /user/root/benchmarks/bigbench/data_refresh and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
Run HadoopClusterExec.jar and pdgf.jar to generate the base data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
Run HadoopClusterExec.jar and pdgf.jar to generate the refresh data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
populateMetastore
This phase actually creates the database tables, using hiveCreateLoad.sql under the $BENCH_MARK_HOME/engines/hive/population/ directory.
- Read the .dat raw data from /user/root/benchmarks/bigbench/data into TEXTFILE-format external temporary tables
- Create the final ORC-format tables with select * from the temporary tables
- Drop the external temporary tables
Read the .dat raw data from /user/root/benchmarks/bigbench/data into a TEXTFILE-format external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
Create the final ORC-format table with select * from the temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
CREATE TABLE ${hiveconf:customerTableName}
STORED AS ${hiveconf:tableFormat}
AS
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Drop the external temporary table:
DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
runQuery
1. runQuery calls the query_run_main_method() function in each query's run.sh
2. query_run_main_method() calls runEngineCmd to execute the query script (qxx.sql)

runQuery calls the query_run_main_method() function in each query's run.sh:
QUERY_MAIN_METHOD="query_run_main_method"
-----------------------------------------
"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1
query_run_main_method() calls runEngineCmd to execute the query script (qxx.sql):
query_run_main_method () {
QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"
if [ ! -r "$QUERY_SCRIPT" ]
then
echo "SQL file $QUERY_SCRIPT can not be read."
exit 1
fi
runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"
return $?
}
Normally query_run_main_method() just runs the corresponding query script, but queries such as q05 and q20 use machine learning algorithms: after the query script runs, the result table is used as input to a jar that runs the ML algorithm (e.g. clustering or logistic regression), producing the final result.
runEngineCmd () {
if addInitScriptsToParams
then
"$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"
else
return 1
fi
}
--------------------------
BINARY="/usr/bin/hive"
BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)
INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")
INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")
if [ -n "$USER_QUERY_PARAMS_FILE" ]
then
if [ -r "$USER_QUERY_PARAMS_FILE" ]
then
echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")
else
echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."
return 1
fi
fi
if [ -n "$USER_ENGINE_SETTINGS_FILE" ]
then
if [ -r "$USER_ENGINE_SETTINGS_FILE" ]
then
echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."
INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")
else
echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."
return 1
fi
fi
return 0
validateQuery
1. Calls the query_run_validate_method() function in each query's run.sh
2. query_run_validate_method() compares $BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result with /user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result on HDFS; if they are identical, validation passes, otherwise it fails.
if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")
then
echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"
else
echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"
VALIDATION_PASSED="0"
fi
The comparison above is used when SF is 1 (-f 1). When SF is greater than 1, validation passes as long as the result table on HDFS contains at least one row:
if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]
then
echo "Validation passed: Query returned results"
else
echo "Validation failed: Query did not return results"
return 1
fi
refreshMetastore
1. Runs the hiveRefreshCreateLoad.sql script under the $BENCH_MARK_HOME/engines/hive/refresh/ directory
2. hiveRefreshCreateLoad.sql loads each table's data from /user/root/benchmarks/bigbench/data_refresh/ on HDFS into external temporary tables
3. The external temporary tables' data is then inserted into the corresponding tables of the Hive database
hiveRefreshCreateLoad.sql loads each table's data from /user/root/benchmarks/bigbench/data_refresh/ on HDFS into an external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
( c_customer_sk bigint --not null
, c_customer_id string --not null
, c_current_cdemo_sk bigint
, c_current_hdemo_sk bigint
, c_current_addr_sk bigint
, c_first_shipto_date_sk bigint
, c_first_sales_date_sk bigint
, c_salutation string
, c_first_name string
, c_last_name string
, c_preferred_cust_flag string
, c_birth_day int
, c_birth_month int
, c_birth_year int
, c_birth_country string
, c_login string
, c_email_address string
, c_last_review_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
-----------------
set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};
then inserts each external temporary table's data into the corresponding Hive table:
INSERT INTO TABLE ${hiveconf:customerTableName}
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Appendix
Before the test, PDGF (the Parallel Data Generation Framework) generates the data. With Scale Factor 1 (1 GB), the following 23 tables are generated. Below are each table's CREATE TABLE statement and a sample row.
The 23 tables
hive> show tables;
OK
customer
customer_address
customer_demographics
date_dim
household_demographics
income_band
inventory
item
item_marketprices
product_reviews
promotion
reason
ship_mode
store
store_returns
store_sales
time_dim
warehouse
web_clickstreams
web_page
web_returns
web_sales
web_site
Time taken: 0.017 seconds, Fetched: 23 row(s)
customer (99,000 rows) 4.03 MB
CREATE TABLE statement:
hive> show create table customer;
OK
CREATE TABLE `customer`(
`c_customer_sk` bigint,
`c_customer_id` string,
`c_current_cdemo_sk` bigint,
`c_current_hdemo_sk` bigint,
`c_current_addr_sk` bigint,
`c_first_shipto_date_sk` bigint,
`c_first_sales_date_sk` bigint,
`c_salutation` string,
`c_first_name` string,
`c_last_name` string,
`c_preferred_cust_flag` string,
`c_birth_day` int,
`c_birth_month` int,
`c_birth_year` int,
`c_birth_country` string,
`c_login` string,
`c_email_address` string,
`c_last_review_date` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'99000\',
\'rawDataSize\'=\'88084062\',
\'totalSize\'=\'4221267\',
\'transient_lastDdlTime\'=\'1473167154\')
Time taken: 0.116 seconds, Fetched: 34 row(s)
Sample row:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
customer_address (49,500 rows) 0.92 MB
CREATE TABLE statement:
hive> show create table customer_address;
OK
CREATE TABLE `customer_address`(
`ca_address_sk` bigint,
`ca_address_id` string,
`ca_street_number` string,
`ca_street_name` string,
`ca_street_type` string,
`ca_suite_number` string,
`ca_city` string,
`ca_county` string,
`ca_state` string,
`ca_zip` string,
`ca_country` string,
`ca_gmt_offset` decimal(5,2),
`ca_location_type` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_address\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'49500\',
\'rawDataSize\'=\'55836000\',
\'totalSize\'=\'966548\',
\'transient_lastDdlTime\'=\'1473167160\')
Time taken: 0.043 seconds, Fetched: 29 row(s)
Sample row:
6187 AAAAAAAAAAAAAJDZ 536 6th Lake Drive UKL8bE5C Lowell Brule County SD 18464 United States -6 apartment
customer_demographics (1,920,800 rows) 6.84 MB
CREATE TABLE statement:
hive> show create table customer_demographics;
OK
CREATE TABLE `customer_demographics`(
`cd_demo_sk` bigint,
`cd_gender` string,
`cd_marital_status` string,
`cd_education_status` string,
`cd_purchase_estimate` int,
`cd_credit_rating` string,
`cd_dep_count` int,
`cd_dep_employed_count` int,
`cd_dep_college_count` int)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_demographics\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'1\',
\'numRows\'=\'1920800\',
\'rawDataSize\'=\'718371044\',
\'totalSize\'=\'7167503\',
\'transient_lastDdlTime\'=\'1473167169\')
Sample row:
0 F U Primary 6000 Good 0 5 0
date_dim (109,573 rows) 1.67 MB
CREATE TABLE statement:
hive> show create table date_dim;
OK
CREATE TABLE `date_dim`(
`d_date_sk` bigint,
`d_date_id` string,
`d_date` string,
`d_month_seq` int,
`d_week_seq` int,
`d_quarter_seq` int,
`d_year` int,
`d_dow` int,
`d_moy` int,
`d_dom` int,
`d_qoy` int,
`d_fy_year` int,
`d_fy_quarter_seq` int,
`d_fy_week_seq` int,
`d_day_name` string,
`d_quarter_name` string,
`d_holiday` string,
`d_weekend` string,
`d_following_holiday` string,
`d_first_dom` int,
`d_last_dom` int,
`d_same_day_ly` int,
`d_same_day_lq` int,
`d_current_day` string,
`d_current_week` string,
`d_current_month` string,
`d_current_quarter` string,
`d_current_year` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/date_dim\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'1\',
\'numRows\'=\'109573\',
\'rawDataSize\'=\'123050479\',
\'totalSize\'=\'1748707\',
\'transient_lastDdlTime\'=\'1473167172\')
Sample row:
0 AAAAAAAAAAAAAAAA 1900-01-01 0 0 0 1900 1 1 1 1 1900 0 0 Monday 1900Q1 Y N N 2448812 2458802 2472542 2420941 N N NN N
household_demographics (7,200 rows) 14.31 KB
CREATE TABLE statement:
hive> show create table household_demographics;
OK
CREATE TABLE `household_demographics`(
`hd_demo_sk` bigint,
`hd_income_band_sk` bigint,
`hd_buy_potential` string,
`hd_dep_count` int,
`hd_vehicle_count` int)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/household_demographics\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'1\',
\'numRows\'=\'7200\',
\'rawDataSize\'=\'835168\',
\'totalSize\'=\'14655\',
\'transient_lastDdlTime\'=\'1473167173\')
Sample row:
0 3 1001-5000 0 0
income_band (20 rows) 754 B
CREATE TABLE statement:
hive> show create table income_band;
OK
CREATE TABLE `income_band`(
`ib_income_band_sk` bigint,
`ib_lower_bound` int,
`ib_upper_bound` int)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/income_band\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'20\',
\'rawDataSize\'=\'320\',
\'totalSize\'=\'754\',
\'transient_lastDdlTime\'=\'1473167179\')
Time taken: 0.032 seconds, Fetched: 19 row(s)
Sample row:
0 1 10000
inventory (23,255,100 rows) 34.55 MB
CREATE TABLE statement:
hive> show create table inventory;
OK
CREATE TABLE `inventory`(
`inv_date_sk` bigint,
`inv_item_sk` bigint,
`inv_warehouse_sk` bigint,
`inv_quantity_on_hand` int)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/inventory\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'3\',
\'numRows\'=\'23255100\',
\'rawDataSize\'=\'651118804\',
\'totalSize\'=\'36234106\',
\'transient_lastDdlTime\'=\'1473167235\')
Time taken: 0.031 seconds, Fetched: 20 row(s)
Sample row:
36890 0 0 503
item (17,820 rows) 2.36 MB
CREATE TABLE statement:
hive> show create table item;
OK
CREATE TABLE `item`(
`i_item_sk` bigint,
`i_item_id` string,
`i_rec_start_date` string,
`i_rec_end_date` string,
`i_item_desc` string,
`i_current_price` decimal(7,2),
`i_wholesale_cost` decimal(7,2),
`i_brand_id` int,
`i_brand` string,
`i_class_id` int,
`i_class` string,
`i_category_id` int,
`i_category` string,
`i_manufact_id` int,
`i_manufact` string,
`i_size` string,
`i_formulation` string,
`i_color` string,
`i_units` string,
`i_container` string,
`i_manager_id` int,
`i_product_name` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'1\',
\'numRows\'=\'17820\',
\'rawDataSize\'=\'31238428\',
\'totalSize\'=\'2472973\',
\'transient_lastDdlTime\'=\'1473167181\')
Sample row:
0 AAAAAAAAAAAAAAAA 2000-01-14 quickly even dinos beneath the frays must have to boost boldly careful bold escapades: stealthily even forges over the dependencies integrate always past the quiet sly decoys-- notornis sol 72.29 64.96 3898712 71TS7NSbvH1YbdiQMG6ttBHKAljiNoIRB 1 Fan Shop 9 Sports & Outdoors 995 2VOxvrIWwlJQTSk6 small 99Ee1r6uFbZSSClAX3 dodger Oz Unknown 18 8m9n5Q7T33DNWidoA6nWlg6ydmpA1SKOoOJLXiLVb
item_marketprices (89,100 rows) 0.63 MB
CREATE TABLE statement:
hive> show create table item_marketprices;
OK
CREATE TABLE `item_marketprices`(
`imp_sk` bigint,
`imp_item_sk` bigint,
`imp_competitor` string,
`imp_competitor_price` decimal(7,2),
`imp_start_date` bigint,
`imp_end_date` bigint)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item_marketprices\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'89100\',
\'rawDataSize\'=\'21736912\',
\'totalSize\'=\'657024\',
\'transient_lastDdlTime\'=\'1473167275\')
Sample row:
5 4737 AAAAAAAAAAAAAAIN 66.4 36890 36958
product_reviews (89,991 rows) 24.36 MB
CREATE TABLE statement:
hive> show create table product_reviews;
OK
CREATE TABLE `product_reviews`(
`pr_review_sk` bigint,
`pr_review_date` string,
`pr_review_time` string,
`pr_review_rating` int,
`pr_item_sk` bigint,
`pr_user_sk` bigint,
`pr_order_sk` bigint,
`pr_review_content` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/product_reviews\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'89991\',
\'rawDataSize\'=\'79597043\',
\'totalSize\'=\'25546821\',
\'transient_lastDdlTime\'=\'1473167305\')
Sample row:
4 2004-02-25 21:43:07 1 3683 88783 41241 Had to offset the praise handed down by so many years is probably the result of some sin committed (obviously so grievous don\'t "get" it. Why all the original artwork by Stephen Gammel.
promotion (300 rows) 15.83 KB
CREATE TABLE statement:
hive> show create table promotion;
OK
CREATE TABLE `promotion`(
`p_promo_sk` bigint,
`p_promo_id` string,
`p_start_date_sk` bigint,
`p_end_date_sk` bigint,
`p_item_sk` bigint,
`p_cost` decimal(15,2),
`p_response_target` int,
`p_promo_name` string,
`p_channel_dmail` string,
`p_channel_email` string,
`p_channel_catalog` string,
`p_channel_tv` string,
`p_channel_radio` string,
`p_channel_press` string,
`p_channel_event` string,
`p_channel_demo` string,
`p_channel_details` string,
`p_purpose` string,
`p_discount_active` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/promotion\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'300\',
\'rawDataSize\'=\'404926\',
\'totalSize\'=\'16204\',
\'transient_lastDdlTime\'=\'1473167186\')
Sample row:
7 AAAAAAAAAAAAAAAH 94455 108529 13511 427.76 1 bar YN Y N N N N N blithe grouches past the blithe quick epitaphs print rut Unknown N
reason (35 rows) 3.17 KB
CREATE TABLE statement:
hive> show create table reason;
OK
CREATE TABLE `reason`(
`r_reason_sk` bigint,
`r_reason_id` string,
`r_reason_desc` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/reason\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'35\',
\'rawDataSize\'=\'9027\',
\'totalSize\'=\'3240\',
\'transient_lastDdlTime\'=\'1473167190\')
Sample row:
5 uy busily sly excuses hang: slow braids to the daring somas was toward the epitaphs-- gifts betw
ship_mode (20 rows) 2.93 KB
CREATE TABLE statement:
hive> show create table ship_mode;
OK
CREATE TABLE `ship_mode`(
`sm_ship_mode_sk` bigint,
`sm_ship_mode_id` string,
`sm_type` string,
`sm_code` string,
`sm_carrier` string,
`sm_contract` string)
ROW FORMAT SERDE
\'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
\'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
\'hdfs://mycluster/usr/hive/warehouse/bigbench.db/ship_mode\'
TBLPROPERTIES (
\'COLUMN_STATS_ACCURATE\'=\'true\',
\'numFiles\'=\'2\',
\'numRows\'=\'20\',
\'rawDataSize\'=\'9576\',
\'totalSize\'=\'3001\',
\'transient_lastDdlTime\'=\'1473167196\')
Sample row:
12 wGyozLRZ3uL iCTZsMlNzsQ cBRc FlWM9v tm0ehuQ2
store (12 rows) 8.03 KB
CREATE TABLE statement:
hive> show create table store;
OK
CREATE TABLE `store`(
`s_store_sk` bigint,
`s_store_id` string,
`s_rec_start_date` string,
`s_rec_end_date` string,
`s_closed_date_sk` bigint,
`s_store_name` string,
`s_number_employees` int,
`s_floor_space` int,
`s_hours` string,
`s_manager` string,
`s_market_id` int,
`s_geography_class` string,
`s_market_desc` string,
`s_market_manager` string,
`s_division_id` int,
`s_division_name` string,
`s_company_id` int,
`s_company_name` string,
`s_street_number` string,
`s_street_name` string,
`s_street_type` string,
`s_suite_number` string,
`s_city` string,
`s_county` string,
`s_state` string,
`s_zip` string,
`s_country` string,
`s_gmt_offset` decimal(5,2),
`s_tax_precentage` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='12',
'rawDataSize'='25962',
'totalSize'='8214',
'transient_lastDdlTime'='1473167201')
Sample row:
10 AAAAAAAAAAAAAAAK 2000-11-01 6235 Avalon 254 6468537 8AM-12AM Michael Barlow 6 Unknown final sly gifts by the even final dependencies x-ray under the fluffy Barry Shaw 1 Unknown 1 Unknown 993 Mill Pkwy aQV Cold Springs Cumberland County TN 95692 United States -5 0.03
store_returns (37902 rows), 1.19 MB
CREATE TABLE statement:
hive> show create table store_returns;
OK
CREATE TABLE `store_returns`(
`sr_returned_date_sk` bigint,
`sr_return_time_sk` bigint,
`sr_item_sk` bigint,
`sr_customer_sk` bigint,
`sr_cdemo_sk` bigint,
`sr_hdemo_sk` bigint,
`sr_addr_sk` bigint,
`sr_store_sk` bigint,
`sr_reason_sk` bigint,
`sr_ticket_number` bigint,
`sr_return_quantity` int,
`sr_return_amt` decimal(7,2),
`sr_return_tax` decimal(7,2),
`sr_return_amt_inc_tax` decimal(7,2),
`sr_fee` decimal(7,2),
`sr_return_ship_cost` decimal(7,2),
`sr_refunded_cash` decimal(7,2),
`sr_reversed_charge` decimal(7,2),
`sr_store_credit` decimal(7,2),
`sr_net_loss` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_returns'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='37902',
'rawDataSize'='41388272',
'totalSize'='1250563',
'transient_lastDdlTime'='1473167251')
Sample row:
37375 38182 11520 7640 242073 6754 25731 2 22 20293 33 1990.89 119.45 2110.34 34.14 2433.41 477.81 559.84 953.24 2672.31
store_sales (667579 rows), 23.49 MB
CREATE TABLE statement:
hive> show create table store_sales;
OK
CREATE TABLE `store_sales`(
`ss_sold_date_sk` bigint,
`ss_sold_time_sk` bigint,
`ss_item_sk` bigint,
`ss_customer_sk` bigint,
`ss_cdemo_sk` bigint,
`ss_hdemo_sk` bigint,
`ss_addr_sk` bigint,
`ss_store_sk` bigint,
`ss_promo_sk` bigint,
`ss_ticket_number` bigint,
`ss_quantity` int,
`ss_wholesale_cost` decimal(7,2),
`ss_list_price` decimal(7,2),
`ss_sales_price` decimal(7,2),
`ss_ext_discount_amt` decimal(7,2),
`ss_ext_sales_price` decimal(7,2),
`ss_ext_wholesale_cost` decimal(7,2),
`ss_ext_list_price` decimal(7,2),
`ss_ext_tax` decimal(7,2),
`ss_coupon_amt` decimal(7,2),
`ss_net_paid` decimal(7,2),
`ss_net_paid_inc_tax` decimal(7,2),
`ss_net_profit` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_sales'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='667579',
'rawDataSize'='953293700',
'totalSize'='24629162',
'transient_lastDdlTime'='1473167245')
Sample row:
37115 20244 16481 98676 1211207 5239 37107 9 38 11138 93 56.47 82.45 36.28 4293.81 3374.04 5251.71 7667.85 101.22 0 3374.04 3475.26 -1877.67
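store_sales is the largest fact table at this scale factor (667579 rows). As a quick illustration of how its surrogate keys tie into the dimension tables, the sketch below aggregates net profit per store; it is just an ad-hoc example, not one of the 30 benchmark queries:

SELECT ss_store_sk, SUM(ss_net_profit) AS total_profit
FROM store_sales
GROUP BY ss_store_sk
ORDER BY total_profit DESC;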
time_dim (86400 rows), 219.49 KB
CREATE TABLE statement:
hive> show create table time_dim;
OK
CREATE TABLE `time_dim`(
`t_time_sk` bigint,
`t_time_id` string,
`t_time` int,
`t_hour` int,
`t_minute` int,
`t_second` int,
`t_am_pm` string,
`t_shift` string,
`t_sub_shift` string,
`t_meal_time` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/time_dim'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='86400',
'rawDataSize'='41040000',
'totalSize'='224757',
'transient_lastDdlTime'='1473167202')
Sample row:
2 AAAAAAAAAAAAAAAC 2 0 0 2 AM third night
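time_dim has exactly 86400 rows, one per second of a 24-hour day (24 × 3600 = 86400); judging from the sample row (t_time_sk = 2, hour 0, minute 0, second 2), t_time_sk appears to be the second-of-day offset. A quick check under that assumption:

SELECT MIN(t_time_sk) AS min_sk,
       MAX(t_time_sk) AS max_sk,
       COUNT(*)       AS row_cnt
FROM time_dim;
-- Expected under the second-of-day assumption: 0, 86399, 86400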
warehouse (5 rows), 1.94 KB
CREATE TABLE statement:
hive> show create table warehouse;
OK
CREATE TABLE `warehouse`(
`w_warehouse_sk` bigint,
`w_warehouse_id` string,
`w_warehouse_name` string,
`w_warehouse_sq_ft` int,
`w_street_number` string,
`w_street_name` string,
`w_street_type` string,
`w_suite_number` string,
`w_city` string,
`w_county` string,
`w_state` string,
`w_zip` string,
`w_country` string,
`w_gmt_offset` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/warehouse'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='5',
'rawDataSize'='5695',
'totalSize'='1980',
'transient_lastDdlTime'='1473167204')
Sample row:
1 AAAAAAAAAAAAAAAB frets would sleep 845707 181 9th 13th Court kQ Sherwood Forest Teton County WY 87891 United States -7
web_clickstreams (6770550 rows), 34.11 MB
CREATE TABLE statement:
hive> show create table web_clickstreams;
OK
CREATE TABLE `web_clickstreams`(
`wcs_click_date_sk` bigint,
`wcs_click_time_sk` bigint,
`wcs_sales_sk` bigint,
`wcs_item_sk` bigint,
`wcs_web_page_sk` bigint,
`wcs_user_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_clickstreams'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='6770550',
'rawDataSize'='253337416',
'totalSize'='35768476',
'transient_lastDdlTime'='1473167298')
Sample row:
36890 27089 NULL 15465 47 NULL
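Note the NULLs in the sample row: wcs_sales_sk and wcs_user_sk are nullable, which presumably distinguishes clicks that led to a sale from pure browsing, and logged-in users from anonymous visitors. An illustrative split under that interpretation:

SELECT SUM(CASE WHEN wcs_sales_sk IS NULL     THEN 1 ELSE 0 END) AS browse_clicks,
       SUM(CASE WHEN wcs_sales_sk IS NOT NULL THEN 1 ELSE 0 END) AS purchase_clicks
FROM web_clickstreams;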
web_page (60 rows), 6.81 KB
CREATE TABLE statement:
hive> show create table web_page;
OK
CREATE TABLE `web_page`(
`wp_web_page_sk` bigint,
`wp_web_page_id` string,
`wp_rec_start_date` string,
`wp_rec_end_date` string,
`wp_creation_date_sk` bigint,
`wp_access_date_sk` bigint,
`wp_autogen_flag` string,
`wp_customer_sk` bigint,
`wp_url` string,
`wp_type` string,
`wp_char_count` int,
`wp_link_count` int,
`wp_image_count` int,
`wp_max_ad_count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_page'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='60',
'rawDataSize'='38763',
'totalSize'='6969',
'transient_lastDdlTime'='1473167215')
Sample row:
24 AAAAAAAAAAAAAAAY 2000-12-19 38819 69008 0 44446 http://www.C8jdri37RmtbDNeFnXjmYbyBPzeO4WWK9pVYP6xtBJbaQ5yIj4s.com feedback 4340 23 5 2
web_returns (38487 rows), 1.40 MB
CREATE TABLE statement:
hive> show create table web_returns;
OK
CREATE TABLE `web_returns`(
`wr_returned_date_sk` bigint,
`wr_returned_time_sk` bigint,
`wr_item_sk` bigint,
`wr_refunded_customer_sk` bigint,
`wr_refunded_cdemo_sk` bigint,
`wr_refunded_hdemo_sk` bigint,
`wr_refunded_addr_sk` bigint,
`wr_returning_customer_sk` bigint,
`wr_returning_cdemo_sk` bigint,
`wr_returning_hdemo_sk` bigint,
`wr_returning_addr_sk` bigint,
`wr_web_page_sk` bigint,
`wr_reason_sk` bigint,
`wr_order_number` bigint,
`wr_return_quantity` int,
`wr_return_amt` decimal(7,2),
`wr_return_tax` decimal(7,2),
`wr_return_amt_inc_tax` decimal(7,2),
`wr_fee` decimal(7,2),
`wr_return_ship_cost` decimal(7,2),
`wr_refunded_cash` decimal(7,2),
`wr_reversed_charge` decimal(7,2),
`wr_account_credit` decimal(7,2),
`wr_net_loss` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_returns'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='38487',
'rawDataSize'='43257220',
'totalSize'='1471571',
'transient_lastDdlTime'='1473167269')
Sample row:
38952 68126 9590 52948 893223 2020 6942 52948 893223 2020 6942 32 13 63444 33 626.01 25.04 651.05 18.74 958.4 244.14 286.4 95.47 1008.48
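web_returns ties back to web_sales. In the TPC-DS schema that BigBench inherits, the join key is the (item, order) pair, i.e. (wr_item_sk, wr_order_number) against (ws_item_sk, ws_order_number); store_returns joins store_sales analogously via sr_item_sk/sr_ticket_number. A sketch computing the overall return rate under that assumption:

SELECT COUNT(r.wr_order_number) / COUNT(*) AS return_rate
FROM web_sales s
LEFT JOIN web_returns r
  ON  s.ws_item_sk      = r.wr_item_sk
  AND s.ws_order_number = r.wr_order_number;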
web_sales (668052 rows), 32.99 MB
CREATE TABLE statement:
hive> show create table web_sales;
OK
CREATE TABLE `web_sales`(
`ws_sold_date_sk` bigint,
`ws_sold_time_sk` bigint,
`ws_ship_date_sk` bigint,
`ws_item_sk` bigint,
`ws_bill_customer_sk` bigint,
`ws_bill_cdemo_sk` bigint,
`ws_bill_hdemo_sk` bigint,
`ws_bill_addr_sk` bigint,
`ws_ship_customer_sk` bigint,
`ws_ship_cdemo_sk` bigint,
`ws_ship_hdemo_sk` bigint,
`ws_ship_addr_sk` bigint,
`ws_web_page_sk` bigint,
`ws_web_site_sk` bigint,
`ws_ship_mode_sk` bigint,
`ws_warehouse_sk` bigint,
`ws_promo_sk` bigint,
`ws_order_number` bigint,
`ws_quantity` int,
`ws_wholesale_cost` decimal(7,2),
`ws_list_price` decimal(7,2),
`ws_sales_price` decimal(7,2),
`ws_ext_discount_amt` decimal(7,2),
`ws_ext_sales_price` decimal(7,2),
`ws_ext_wholesale_cost` decimal(7,2),
`ws_ext_list_price` decimal(7,2),
`ws_ext_tax` decimal(7,2),
`ws_coupon_amt` decimal(7,2),
`ws_ext_ship_cost` decimal(7,2),
`ws_net_paid` decimal(7,2),
`ws_net_paid_inc_tax` decimal(7,2),
`ws_net_paid_inc_ship` decimal(7,2),
`ws_net_paid_inc_ship_tax` decimal(7,2),
`ws_net_profit` decimal(7,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_sales'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='668052',
'rawDataSize'='1221174888',
'totalSize'='34585166',
'transient_lastDdlTime'='1473167263')
Sample row:
36890 26789 36993 794 85457 380790 2436 42649 29934 1583844 5251 525 2 14 1 3 120 0 93 32.62 81.22 59.29 2039.49 5513.97 3033.66 7553.46 330.84 0 1661.76 5513.97 5844.81 7175.73 7506.57 2480.31
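The four ws_net_paid* columns form an additive chain, and the sample row bears this out: 5513.97 (ws_net_paid) + 330.84 (ws_ext_tax) = 5844.81 (ws_net_paid_inc_tax); 5513.97 + 1661.76 (ws_ext_ship_cost) = 7175.73 (ws_net_paid_inc_ship); 7175.73 + 330.84 = 7506.57 (ws_net_paid_inc_ship_tax). A sanity-check sketch under that assumption:

SELECT COUNT(*) AS inconsistent_rows
FROM web_sales
WHERE ws_net_paid + ws_ext_tax <> ws_net_paid_inc_tax
   OR ws_net_paid + ws_ext_ship_cost <> ws_net_paid_inc_ship
   OR ws_net_paid_inc_ship + ws_ext_tax <> ws_net_paid_inc_ship_tax;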
web_site (30 rows), 10.31 KB
CREATE TABLE statement:
hive> show create table web_site;
OK
CREATE TABLE `web_site`(
`web_site_sk` bigint,
`web_site_id` string,
`web_rec_start_date` string,
`web_rec_end_date` string,
`web_name` string,
`web_open_date_sk` bigint,
`web_close_date_sk` bigint,
`web_class` string,
`web_manager` string,
`web_mkt_id` int,
`web_mkt_class` string,
`web_mkt_desc` string,
`web_market_manager` string,
`web_company_id` int,
`web_company_name` string,
`web_street_number` string,
`web_street_name` string,
`web_street_type` string,
`web_suite_number` string,
`web_city` string,
`web_county` string,
`web_state` string,
`web_zip` string,
`web_country` string,
`web_gmt_offset` decimal(5,2),
`web_tax_percentage` decimal(5,2))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_site'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='30',
'rawDataSize'='62832',
'totalSize'='10554',
'transient_lastDdlTime'='1473167210')
Sample row:
2 AAAAAAAAAAAAAAAC 2002-07-17 site_0 16450 91500 Unknown Gregory George 1 sheaves despite the quietly sly asymp thin enticing frets except the sometimes final courts might promise blithe dino Frank Hernandez 1 ese 17 5th Ave EbDxJVL Georgetown Guadalupe County TX 75435 United States -6 0.01
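Comparing rawDataSize with totalSize across these listings shows how unevenly ORC compresses the schema: store_sales shrinks about 38× (953293700 → 24629162 bytes) and time_dim about 180×, while the free-text product_reviews only manages about 3×. The same numbers can be pulled per table without parsing DESCRIBE output, using Hive's SHOW TBLPROPERTIES:

SHOW TBLPROPERTIES store_sales;
-- or a single property:
SHOW TBLPROPERTIES store_sales('rawDataSize');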