Understanding the TPCx-BB (Big Data Benchmark) Source Code in One Article

TPCx-BB is a big data benchmark. It models 30 retail application scenarios and executes 30 queries to measure the performance, both hardware and software, of Hadoop-based big data systems. Some of the scenarios also use machine learning algorithms (clustering, linear regression, and so on). To understand the performance of the system under test, you need a thorough understanding of the whole TPCx-BB test flow. This article walks through the source code of the TPCx-BB tool in detail, in the hope that it helps you understand TPCx-BB.

Code Structure

The main directory ($BENCH_MARK_HOME) contains the following subdirectories:

  • bin
  • conf
  • data-generator
  • engines
  • tools


bin contains several modules, i.e. the scripts used at run time: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, and so on.

conf contains two configuration files: bigBench.properties and userSettings.conf.

bigBench.properties mainly configures the workload (the benchmarkPhases to execute) and power_test_0 (the SQL queries to run in the POWER_TEST phase).

The default workload is:

workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1

The default power_test_0 is 1-30.

userSettings.conf contains basic settings, including the JAVA environment, default benchmark settings (database, engine, map_tasks, scale_factor, ...), the HADOOP environment, HDFS config and paths, and Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).

data-generator holds the scripts and configuration files related to data generation; the details are covered below.

engines holds the four engines supported by TPCx-BB: biginsights, hive, impala, and spark_sql. The default engine is hive. In fact only the hive directory is non-empty; the other three are empty, presumably not yet implemented.

tools holds two jar files: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is the key file for running the TPCx-BB test; most of the driver code lives in this jar.

Data Generation

Everything related to data generation lives in the data-generator directory, which contains pdgf.jar and three subdirectories: config, dicts, and extlib.

pdgf.jar is the Java data generator and contains a large amount of code. config holds two configuration files: bigbench-generation.xml and bigbench-schema.xml.

bigbench-generation.xml mainly defines the raw data to be generated (not the database tables): which tables are generated, each table's name and size, the output directory, the file suffix, the delimiter, the character encoding, and so on.

<schema name="default">
		<tables>	
		<!-- not refreshed tables -->	
			
			<!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons  -->
			<table name="income_band"></table>
			<table name="reason"></table>
			<table name="ship_mode"></table>
			<table name="web_site"></table>		
			<!-- /tables not used in benchmark  -->
			
			<!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
			<table name="date_dim" static="true"></table>
			<table name="time_dim" static="true"></table>
			<table name="customer_demographics" static="true"></table>
			<table name="household_demographics" static="true"></table>
			<!-- /static tables -->
			
			<!-- "normal" tables. split over all nodes. not generated during refresh -->
			<table name="store"></table>
			<table name="warehouse"></table>
			<table name="promotion"></table>
			<table name="web_page"></table>
			<!-- /"normal" tables.-->
			
		<!-- /not refreshed tables -->	
			
			<!-- 
			refreshed tables. Generated on all nodes. 
			Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
			In "normal"-Phase  generate table rows:  [0,REFRESH_PERCENTAGE*Table.Size]; 
			In "refresh"-Phase generate table rows:  [REFRESH_PERCENTAGE*Table.Size+1, Table.Size] 
			.Has effect only if  ${REFRESH_SYSTEM_ENABLED}==1. 
			-->
			<table name="customer">
				<scheduler name="DefaultScheduler">
					<partitioner
						name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
						<prePartition><![CDATA[
					if(${REFRESH_SYSTEM_ENABLED}>0){
						int tableID = table.getTableID();
						int timeID = 0;
						long lastTableRow=table.getSize()-1;
						long rowStart;
						long rowStop;
						boolean exclude=false;
						long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
						if(${REFRESH_PHASE}>0){
							//Refresh part
							rowStart = lastTableRow - refreshRows +1;
							rowStop  = lastTableRow;
							if(refreshRows<=0){
								exclude=true;
							}
							
						}else{
							//"normal" part
							rowStart = 0;
							rowStop = lastTableRow - refreshRows;
						}
						return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
					}else{
						//DEFAULT
						return getParentPartitioner().getDefaultPrePartition(project, table);				
					}
					
					]]></prePartition>
					</partitioner>
				</scheduler>
			</table>
<output name="SplitFileOutputWrapper">
  <!-- DEFAULT output for all Tables, if no table specific output is specified-->
    <output name="CSVRowOutput">
      <fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
      <outputDir>output/</outputDir>
      <fileEnding>.dat</fileEnding>
      <delimiter>|</delimiter>
      <charset>UTF-8</charset>
      <sortByRowID>true</sortByRowID>
    </output>

    <output name="StatisticsOutput" active="1">
      <size>${item_size}</size><!-- a counter per item .. initialize later-->

      <fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
      <outputDir>output/</outputDir>
      <fileEnding>.csv</fileEnding>
      <delimiter>,</delimiter>
      <header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->
      </header>
      <footer></footer>

bigbench-schema.xml defines many parameters. Some concern table scale, such as each table's size (row count); most concern table columns, such as start and end dates, the gender ratio, the married ratio, and the upper and lower bounds of various metrics. It also defines exactly how each field is generated and the constraints on it; examples are shown below.

The size of the generated data is controlled by SCALE_FACTOR (-f). With -f 1 the total generated data is roughly 1 GB; with -f 100, roughly 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?

The reason is that SCALE_FACTOR (-f) determines the row count of every table. As shown below, the customer table has 100000.0d * ${SF_sqrt} rows: with -f 1 that is 100000 * sqrt(1) = 100,000 rows; with -f 100 it is 100000 * sqrt(100) = 1,000,000 rows (see the short sketch after the XML excerpt below).

<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>
<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property> 
<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>
<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>
  <field name="d_date" size="13" type="CHAR" primary="false">
   <gen_DateTime>
     <disableRng>true</disableRng>
     <useFixedStepSize>true</useFixedStepSize>
     <startDate>${date_dim_begin_date}</startDate>
     <endDate>${date_dim_end_date}</endDate>
     <outputFormat>yyyy-MM-dd</outputFormat>
    </gen_DateTime>
  </field>
  <field name="t_time_id" size="16" type="CHAR" primary="false">
   <gen_ConvertNumberToString>
    <gen_Id/>
    <size>16.0</size>
    <characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
   </gen_ConvertNumberToString>
  </field>
<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
   <gen_Null probability="${NULL_CHANCE}">
   <gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
   </gen_Null>
  </field>
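As a quick sanity check of the ${customer_size} formula above, here is a small sketch (bash, written for this article and not part of the toolkit; awk's sqrt() stands in for PDGF's ${SF_sqrt} property):

```bash
# Sketch only: reproduce the ${customer_size} = 100000.0d * ${SF_sqrt} row-count rule
# for a few scale factors.
for sf in 1 100 1000; do
  awk -v sf="$sf" 'BEGIN { printf "SF=%-4d customer rows = %d\n", sf, 100000 * sqrt(sf) }'
done
# SF=1    customer rows = 100000
# SF=100  customer rows = 1000000
# SF=1000 customer rows = 3162277
```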

dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict, and mail_provider.dict; the field values of each generated record are drawn from these dictionaries.

extlib contains the external jar dependencies: lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar, and log4j-1.2.15.jar.

Summary

pdgf.jar reads the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, field names, row counts, and per-field generation rules), pulls the value of every field of every record from the corresponding .dict files under dicts, and writes out the raw data.

A record from the customer table looks like this:

0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com   gdMmGdU9

If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the generated raw data totals roughly 1 GB (977M + 8.6M):

[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M   38.0 M   /user/root/benchmarks/bigbench/data/customer
5.1 M    15.4 M   /user/root/benchmarks/bigbench/data/customer_address
74.2 M   222.5 M  /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M   44.0 M   /user/root/benchmarks/bigbench/data/date_dim
151.5 K  454.4 K  /user/root/benchmarks/bigbench/data/household_demographics
327      981      /user/root/benchmarks/bigbench/data/income_band
405.3 M  1.2 G    /user/root/benchmarks/bigbench/data/inventory
6.5 M    19.5 M   /user/root/benchmarks/bigbench/data/item
4.0 M    12.0 M   /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M   161.2 M  /user/root/benchmarks/bigbench/data/product_reviews
45.3 K   135.9 K  /user/root/benchmarks/bigbench/data/promotion
3.0 K    9.1 K    /user/root/benchmarks/bigbench/data/reason
1.2 K    3.6 K    /user/root/benchmarks/bigbench/data/ship_mode
3.3 K    9.9 K    /user/root/benchmarks/bigbench/data/store
4.1 M    12.4 M   /user/root/benchmarks/bigbench/data/store_returns
88.5 M   265.4 M  /user/root/benchmarks/bigbench/data/store_sales
4.9 M    14.6 M   /user/root/benchmarks/bigbench/data/time_dim
584      1.7 K    /user/root/benchmarks/bigbench/data/warehouse
170.4 M  511.3 M  /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K    23.6 K   /user/root/benchmarks/bigbench/data/web_page
5.1 M    15.4 M   /user/root/benchmarks/bigbench/data/web_returns
127.6 M  382.8 M  /user/root/benchmarks/bigbench/data/web_sales
8.6 K    25.9 K   /user/root/benchmarks/bigbench/data/web_site

Execution Flow

To run the TPCx-BB test, change into the TPCx-BB source directory, enter the bin directory, and execute:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5 

-f, -m, -s, and -j are parameters that the user can set according to cluster capacity and requirements. If they are not specified, the defaults are used; the defaults are defined in userSettings.conf under the conf directory, as follows:

export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"

The defaults are MAP_TASKS = 80 (-m 80), SCALE_FACTOR = 1000 (-f 1000), and NUMBER_OF_PARALLEL_STREAMS = 2 (-s 2).

All available options and their meanings:

General options:
-d  database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e  engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f  scale factor of the dataset (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h  show help
-m  number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s  number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)

Driver specific options:
-a  run in pretend mode (dry run)
-b  print the bash commands invoked during execution to stdout
-i  specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j  specify the queries to run (default: all 30 queries, 1-30)
-U  unlock expert mode
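For example, a run that uses hive at scale factor 100 with 40 map tasks, 4 parallel streams, and only a subset of queries could look like this (an illustrative sketch; the flag values are arbitrary):

```bash
# Illustrative invocation: engine hive, SF 100, 40 map tasks, 4 streams,
# run only queries 1-10 and 25 (-j uses the same X / X-Y syntax as power_test_0).
./bigBench runBenchmark -e hive -f 100 -m 40 -s 4 -j "1-10,25"
```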

If -U is given, i.e. expert mode is unlocked, the script prints:

echo "EXPERT MODE ACTIVE"
echo "WARNING - INTERNAL USE ONLY:"
echo "Only set manually if you know what you are doing!"
echo "Ignoring them is probably the best solution" 
echo "Running individual modules:"
echo "Usage: `basename $0` module [options]"

-D  specify the query part to debug. Most queries have only a single part
-p  benchmark phase to run (default: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)
-q  specify a single query to run
-t  specify the stream number used when running that query
-v  sql script for metastore population (default: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})
-w  sql script for metastore refresh (default: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})
-y  file with additional user-defined query parameters (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)
-z  file with additional user-defined engine settings (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)

List of available modules:
    $BIG_BENCH_ENGINE_BIN_DIR

Back to the command that launches the TPCx-BB test:

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5 

bigBench

bigBench is the main script; runBenchmark is a module.

bigBench sets many environment variables (paths, engine, number of streams, and so on), because runBigBench.jar later reads these environment variables from the Java side.

The first part of bigBench is groundwork: setting environment variables, parsing user options, granting file permissions, setting paths, and so on. The final step calls runBenchmark's runModule() method:

  1. Set the basic paths

    export BIG_BENCH_VERSION="1.0"
    export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
    export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
    export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
    export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
    export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
    
  2. Specify the paths of core-site.xml and hdfs-site.xml

    Data generation uses the Hadoop cluster; the data is generated on HDFS.

    export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
    export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"

  3. Grant execute permission to all executable files in the package (.sh/.jar/.py)

    find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +
    find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +

  4. Set the path of userSettings.conf and source it

    USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
    if [ ! -f "$USER_SETTINGS" ]
    then
      echo "User settings file $USER_SETTINGS not found"
      exit 1
    else
      source "$USER_SETTINGS"
    fi
    
  5. Parse the input arguments and options and configure accordingly

    The first argument must be the module name.

    If there are no arguments, or the first argument starts with "-", the user did not specify a module to run.

    if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
    then
      export MODULE_NAME=""
      SHOW_HELP="1"
    else
      export MODULE_NAME="$1"
      shift
    fi
    export LIST_OF_USER_OPTIONS="$@"
    

Parse the user-supplied options.

Set environment variables according to the user-supplied options.

```bash
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do

case "$OPT" in
# script options
d)
#echo "-d was triggered, Parameter: $OPTARG" >&2
USER_DATABASE="$OPTARG"
;;
D)
#echo "-D was triggered, Parameter: $OPTARG" >&2
DEBUG_QUERY_PART="$OPTARG"
;;
e)
#echo "-e was triggered, Parameter: $OPTARG" >&2
USER_ENGINE="$OPTARG"
;;
f)
#echo "-f was triggered, Parameter: $OPTARG" >&2
USER_SCALE_FACTOR="$OPTARG"
;;
h)
#echo "-h was triggered, Parameter: $OPTARG" >&2
SHOW_HELP="1"
;;
m)
#echo "-m was triggered, Parameter: $OPTARG" >&2
USER_MAP_TASKS="$OPTARG"
;;
p)
#echo "-p was triggered, Parameter: $OPTARG" >&2
USER_BENCHMARK_PHASE="$OPTARG"
;;
q)
#echo "-q was triggered, Parameter: $OPTARG" >&2
QUERY_NUMBER="$OPTARG"
;;
s)
#echo "-t was triggered, Parameter: $OPTARG" >&2
USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
;;
t)
#echo "-s was triggered, Parameter: $OPTARG" >&2
USER_STREAM_NUMBER="$OPTARG"
;;
U)
#echo "-U was triggered, Parameter: $OPTARG" >&2
USER_EXPERT_MODE="1"
;;
v)
#echo "-v was triggered, Parameter: $OPTARG" >&2
USER_POPULATE_FILE="$OPTARG"
;;
w)
#echo "-w was triggered, Parameter: $OPTARG" >&2
USER_REFRESH_FILE="$OPTARG"
;;
y)
#echo "-y was triggered, Parameter: $OPTARG" >&2
USER_QUERY_PARAMS_FILE="$OPTARG"
;;
z)
#echo "-z was triggered, Parameter: $OPTARG" >&2
USER_ENGINE_SETTINGS_FILE="$OPTARG"
;;
# driver options
a)
#echo "-a was triggered, Parameter: $OPTARG" >&2
export USER_PRETEND_MODE="1"
;;
b)
#echo "-b was triggered, Parameter: $OPTARG" >&2
export USER_PRINT_STD_OUT="1"
;;
i)
#echo "-i was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_WORKLOAD="$OPTARG"
;;
j)
#echo "-j was triggered, Parameter: $OPTARG" >&2
export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
;;
?)
echo "Invalid option: -$OPTARG" >&2
exit 1
;;
:)
echo "Option -$OPTARG requires an argument." >&2
exit 1
;;
esac
done
```

Set the global variables. If the user specified a value for a parameter, that value is used; otherwise the default is used.

```bash
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"

export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
```

  6. Check that the values of -s, -m, -f, and -t are numbers

    if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]
    then
      echo "$BIG_BENCH_MAP_TASKS is not a number"
    fi
    if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]
    then
      echo "$BIG_BENCH_SCALE_FACTOR is not a number"
    fi
    if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]
    then
      echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"
    fi
    if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]
    then
      echo "$BIG_BENCH_STREAM_NUMBER is not a number"
    fi
    
  7. Check that the engine exists

    if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]
    then
      echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."
      exit 1
    fi
    if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]
    then
      echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."
      exit 1
    fi
    
  8. Set the path of engineSettings.conf and source it

    ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
    if [ ! -f "$ENGINE_SETTINGS" ]
    then
      echo "Engine settings file $ENGINE_SETTINGS not found"
      exit 1
    else
      source "$ENGINE_SETTINGS"
    fi
    
  9. Check that the module exists

    When a module name is given, the script first looks for the module under $BIG_BENCH_ENGINE_BIN_DIR/; if it exists, it runs source "$MODULE". If it is not there, it looks under $BIG_BENCH_BIN_DIR/; if found there, it runs source "$MODULE"; otherwise it prints "Module $MODULE not found, aborting script." and exits.

    export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
    if [ -f "$MODULE" ]
    then
      source "$MODULE"
    else
      export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
      if [ -f "$MODULE" ]
      then
        source "$MODULE"
      else
        echo "Module $MODULE not found, aborting script."
        exit 1
      fi
    fi
    
  10. Check that the module defines the runModule(), helpModule(), and runEngineCmd() functions

    MODULE_RUN_METHOD="runModule"
    if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
    then
      echo "$MODULE_RUN_METHOD was not implemented, aborting script"
      exit 1
    fi
    
  11. Run the module

    If the module is runBenchmark, it executes
    runCmdWithErrorCheck "$MODULE_RUN_METHOD"
    which amounts to
    runCmdWithErrorCheck runModule()

As shown above, the bigBench script mainly does groundwork such as setting environment variables, granting permissions, and checking and parsing the input options, and finally calls runBenchmark's runModule() to continue execution.

runBenchmark

Next, let's look at the runBenchmark script.

runBenchmark defines two functions: helpModule() and runModule().

helpModule() simply prints the help text.

runModule() is the function actually invoked when the runBenchmark module runs. It does four things:

  1. Clean the logs from previous runs
  2. Invoke RunBigBench.jar to run the benchmark
  3. Log environment information (logEnvInformation)
  4. Zip the log directory

The source is:

runModule () {
  #check input parameters
  if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
  then
    echo "The number of parallel streams -s must be greater than 0"
    return 1
  fi

  "${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
  "$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
  "${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
  "${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS
  return $?
}

So running the runBenchmark module in turn invokes the cleanLogs, logEnvInformation, and zipLogs modules as well as RunBigBench.jar. RunBigBench.jar is the core driver of the TPCx-BB test and is written in Java. Next we analyze the RunBigBench.jar source.

runModule()

The runModule() function executes a given module. As we have seen, executing a module means changing into the bin directory under the main directory and running:

./bigBench module_name arguments

Inside runModule(), cmdLine is used to build that command.

ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);

where this.runScript is:

this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";

benchmarkPhase.getRunModule() returns the module to execute.

arguments holds the user-supplied options.

So cmdLine ends up as:

bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments

How is this bash command actually executed? By calling the runCmd() method.

boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));

Next, the runCmd() method.

runCmd()

runCmd() creates an operating system process via ProcessBuilder and uses it to run the bash command above.

ProcessBuilder can also set the working directory and the environment.

ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
// ...
p = pb.start();

getQueryList()

getQueryList() obtains the list of queries to execute. It reads it from $BIG_BENCH_LOGS_DIR/bigBench.properties, whose content matches $BIG_BENCH_HOME/conf/bigBench.properties.

power_test_0=1-30 in bigBench.properties defines which queries the power_test_0 phase runs and in what order.

Entries can be ranges such as 5-12 or single numbers such as 21, separated by commas.

power_test_0=28-25,2-5,10,22,30

means the power_test_0 phase runs the queries in this order: 28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30 (see the small expansion sketch below).

To run all 30 queries in order:

power_test_0=1-30
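Before looking at the Java source, here is a small sketch (bash, written for this article, not part of the toolkit) that expands such a spec the same way getQueryList() does, including descending ranges:

```bash
# Sketch only: expand a power_test_0-style query spec (e.g. "28-25,2-5,10,22,30")
# into the execution order, mirroring the range handling in getQueryList().
expand_query_list () {
  local spec="$1" part start end
  for part in ${spec//,/ }; do
    if [[ "$part" == *-* ]]; then
      start=${part%-*}; end=${part#*-}
      if (( start > end )); then
        seq "$start" -1 "$end"   # descending range, e.g. 28-25
      else
        seq "$start" "$end"      # ascending range, e.g. 2-5
      fi
    else
      echo "$part"               # single query number
    fi
  done
}

expand_query_list "28-25,2-5,10,22,30" | paste -sd, -
# prints: 28,27,26,25,2,3,4,5,10,22,30
```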

The source that builds the query list:

    private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
        String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
        BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
        String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
        boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
        if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
            return new ArrayList((Collection)this.queryListCache.get(propertyKey));
        } else {
            Object queryList;
            String basicPhaseNamePattern;
            if(!this.properties.containsKey(propertyKey)) {
                if(benchmarkPhase.isQueryOrderRandom()) {
                    if(!this.queryListCache.containsKey("shuffledQueryList")) {
                        basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
                        if(!this.properties.containsKey(basicPhaseNamePattern)) {
                            throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
                        }

                        this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
                    }

                    queryList = (List)this.queryListCache.get("shuffledQueryList");
                    this.shuffleList((List)queryList, this.rnd);
                } else {
                    queryList = this.getQueryList(queryOrderBasicPhase, 0);
                }
            } else {
                queryList = new ArrayList();
                String[] var11;
                int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;

                label65:
                for(int var9 = 0; var9 < var10; ++var9) {
                    basicPhaseNamePattern = var11[var9];
                    String[] queryRange = basicPhaseNamePattern.trim().split("-");
                    switch(queryRange.length) {
                    case 1:
                        ((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
                        break;
                    case 2:
                        int startQuery = Integer.parseInt(queryRange[0]);
                        int endQuery = Integer.parseInt(queryRange[1]);
                        int i;
                        if(startQuery > endQuery) {
                            i = startQuery;

                            while(true) {
                                if(i < endQuery) {
                                    continue label65;
                                }

                                ((List)queryList).add(Integer.valueOf(i));
                                --i;
                            }
                        } else {
                            i = startQuery;

                            while(true) {
                                if(i > endQuery) {
                                    continue label65;
                                }

                                ((List)queryList).add(Integer.valueOf(i));
                                ++i;
                            }
                        }
                    default:
                        throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
                    }
                }
            }

            if(queryOrderCached) {
                this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
            }

            return new ArrayList((Collection)queryList);
        }
    }

parseEnvironment()

parseEnvironment() reads and parses the system environment variables.

Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");

It also automatically appends -U (unlock expert mode) to the user-supplied arguments:

this.userArguments.add("-U");

If the user specified PRETEND_MODE, PRINT_STD_OUT, WORKLOAD, or QUERIES_TO_RUN, the user-specified values take precedence; otherwise the defaults are used.

if(env.containsKey("USER_PRETEND_MODE")) {
	this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));
}

if(env.containsKey("USER_PRINT_STD_OUT")) {
   this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));
}

if(env.containsKey("USER_DRIVER_WORKLOAD")) {
   this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));
}

if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {
	this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));
}

The workload is then read and assigned to benchmarkPhases. If the workload does not contain BENCHMARK_START and BENCHMARK_STOP, they are automatically added as the first and last entries of benchmarkPhases (an example of overriding the workload from the command line follows the code below).

this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();
	
while(var7.hasNext()) {
    String benchmarkPhase = (String)var7.next();
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}
	
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
    this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
	
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}
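Since the workload ultimately comes from the workload property, which -i overrides via USER_DRIVER_WORKLOAD, a run can be limited to a few phases from the command line. A hedged example (the phase names must match the BenchmarkPhase enum; the other flag values are illustrative):

```bash
# Sketch: run only data generation and the load test by overriding the workload with -i.
# BENCHMARK_START and BENCHMARK_STOP are added automatically if missing.
./bigBench runBenchmark -f 1 -m 8 -i "DATA_GENERATION,LOAD_TEST"
```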

run()

run() is the core method of RunBigBench.jar. Everything is driven from run(), which calls runQueries(), runModule(), generateData(), and so on. Those methods in turn call runCmd() to create operating system processes that run bash commands and invoke the bash scripts.

Inside run(), a while loop walks through every benchmarkPhase of the workload in turn. Different benchmarkPhases call different methods among runQueries(), runModule(), generateData(), ...

try {
long e = 0L;
this.log.finer("Benchmark phases: " + this.benchmarkPhases);
Iterator startCheckpoint = this.benchmarkPhases.iterator();

long throughputStart;
while(startCheckpoint.hasNext()) {
    BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();
    if(children.isPhaseDone()) {
        this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");
    } else {
        try {
            switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {
            case 1:
            case 20:
                throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");
            case 2:
                this.log.info(children.getConsoleMessage());
                e = System.currentTimeMillis();
                break;
            case 3:
                if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {
                    throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");
                }

                throughputStart = System.currentTimeMillis();
                this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));
                this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));
                break;
            case 4:
            case 15:
            case 18:
            case 22:
            case 27:
            case 28:
            case 29:
                this.runModule(children, this.userArguments);
                break;
            case 5:
            case 10:
            case 11:
                this.runQueries(children, 1, validationArguments);
                break;
            case 6:
            case 9:
                this.runModule(children, validationArguments);
                break;
            case 7:
                this.generateData(children, false, validationArguments);
                break;
            case 8:
                this.generateData(children, true, validationArguments);
                break;
            case 12:
            case 19:
            case 24:
                this.runQueries(children, 1, this.userArguments);
                break;
            case 13:
            case 14:
            case 21:
            case 23:
            case 25:
            case 26:
                this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
                break;
            case 16:
                this.generateData(children, false, this.userArguments);
                break;
            case 17:
                this.generateData(children, true, this.userArguments);
            }

            children.setPhaseDone(true);
        } catch (IOException var21) {
            this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");
            var21.printStackTrace();
            if(this.stopAfterFailure || children.mustSucceed()) {
                break;
            }
        }
    }
}

The case labels 1-29 here are not queries 1-29 but the 1st to 29th benchmarkPhase of the enum, shown below:

private static enum BenchmarkPhase {
BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),
BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),
BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),
CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),
ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),
ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),
ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),
ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),
ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),
ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),
ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),
CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),
CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),
CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),
CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),
CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),
DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),
LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),
POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),
THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),
THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),
THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),
THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),
VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),
VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),
VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),
SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),
SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),
SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");

private String runModule;
private String namePattern;
private boolean queryOrderRandom;
private boolean queryOrderCached;
private boolean printStdOut;
private boolean mustSucceed;
private String consoleMessage;
private boolean phaseDone;

private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {
    this.runModule = runModule;
    this.namePattern = namePattern;
    this.queryOrderRandom = queryOrderRandom;
    this.queryOrderCached = queryOrderCached;
    this.printStdOut = printStdOut;
    this.mustSucceed = mustSucceed;
    this.consoleMessage = consoleMessage;
    this.phaseDone = false;
}		

3 corresponds to BENCHMARK_STOP, 4 to CLEAN_ALL, 29 to SHOW_VALIDATION, and so on.

We can see that:

benchmarkPhases such as CLEAN_ALL, CLEAN_LOAD_TEST, LOAD_TEST, THROUGHPUT_TEST_REFRESH, SHOW_TIMES, SHOW_ERRORS, and SHOW_VALIDATION call

this.runModule(children, this.userArguments);

i.e. the runModule method with this.userArguments as arguments.

ENGINE_VALIDATION_CLEAN_POWER_TEST, ENGINE_VALIDATION_POWER_TEST, and ENGINE_VALIDATION_RESULT_VALIDATION call

this.runQueries(children, 1, validationArguments);

i.e. the runQueries method with 1 (the stream number) and validationArguments.

ENGINE_VALIDATION_CLEAN_LOAD_TEST and ENGINE_VALIDATION_LOAD_TEST call

this.runModule(children, validationArguments);

ENGINE_VALIDATION_CLEAN_DATA calls

this.generateData(children, false, validationArguments);

ENGINE_VALIDATION_DATA_GENERATION calls

this.generateData(children, true, validationArguments);

CLEAN_POWER_TEST, POWER_TEST, and VALIDATE_POWER_TEST call

this.runQueries(children, 1, this.userArguments);

CLEAN_THROUGHPUT_TEST_1, CLEAN_THROUGHPUT_TEST_2, THROUGHPUT_TEST_1, THROUGHPUT_TEST_2, VALIDATE_THROUGHPUT_TEST_1, and VALIDATE_THROUGHPUT_TEST_2 call

this.runQueries(children, this.numberOfParallelStreams, this.userArguments);

CLEAN_DATA calls

this.generateData(children, false, this.userArguments);

DATA_GENERATION calls

this.generateData(children, true, this.userArguments);

Summarizing the method calls above:

  • ENGINE_VALIDATION-related benchmarkPhases use validationArguments; the rest use userArguments (the only difference is that validationArguments pins SCALE_FACTOR to 1)
  • POWER_TEST-related phases call runQueries(), since POWER_TEST means executing the SQL queries
  • CLEAN_DATA and DATA_GENERATION-related phases call generateData()
  • LOAD_TEST and SHOW-related phases call runModule()

Mapping between benchmarkPhase and module

The mapping between each benchmarkPhase and its module (the script it runs) is as follows:

CLEAN_ALL -> "cleanAll"
ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery"
ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore",
ENGINE_VALIDATION_CLEAN_DATA -> "cleanData"
ENGINE_VALIDATION_DATA_GENERATION -> "dataGen"
ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore"
ENGINE_VALIDATION_POWER_TEST -> "runQuery"
ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery"
CLEAN_POWER_TEST -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery"
CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery"
CLEAN_LOAD_TEST -> "cleanMetastore"
CLEAN_DATA -> "cleanData"
DATA_GENERATION -> "dataGen"
LOAD_TEST -> "populateMetastore"
POWER_TEST -> "runQuery"
THROUGHPUT_TEST -> (String)null
THROUGHPUT_TEST_1 -> "runQuery"
THROUGHPUT_TEST_REFRESH -> "refreshMetastore"
THROUGHPUT_TEST_2 -> "runQuery"
VALIDATE_POWER_TEST -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery"
VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery"
SHOW_TIMES -> "showTimes"
SHOW_ERRORS -> "showErrors"
SHOW_VALIDATION -> "showValidation"

When a benchmarkPhase is executed, the driver invokes the module listed above for that phase (the scripts live under $BENCH_MARK_HOME/engines/hive/bin); a sketch of the resulting command follows the snippet below:

cmdLine.add(benchmarkPhase.getRunModule());
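For instance, for the CLEAN_ALL phase (whose runModule is "cleanAll") with illustrative user options -f 1 -m 8, the driver effectively spawns something like the following (a sketch; the exact option list is whatever the driver passes through, with -U appended as shown in parseEnvironment()):

```bash
# Sketch: shape of the command the driver builds for the CLEAN_ALL phase.
bash "$BIG_BENCH_BIN_DIR/bigBench" cleanAll -f 1 -m 8 -U
```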

Program Call Flow

(Figure: program call-flow diagram)

The following sections describe what each module does.

module

cleanAll

1. DROP DATABASE
2. Delete the source data on HDFS
echo "dropping database (with all tables)"
runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"

cleanQuery

1. Drop the temporary tables created by the corresponding query
2. Drop the result table created by the corresponding query
runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;"
return $?

cleanMetastore

1. Call `dropTables.sql` to DROP the 23 tables one by one
echo "cleaning metastore tables"
runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"
export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"

dropTables.sql drops the 23 tables one by one; its source:

DROP TABLE IF EXISTS ${hiveconf:customerTableName};
DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};
DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:dateTableName};
DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:incomeTableName};
DROP TABLE IF EXISTS ${hiveconf:itemTableName};
DROP TABLE IF EXISTS ${hiveconf:promotionTableName};
DROP TABLE IF EXISTS ${hiveconf:reasonTableName};
DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};
DROP TABLE IF EXISTS ${hiveconf:storeTableName};
DROP TABLE IF EXISTS ${hiveconf:timeTableName};
DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};
DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};
DROP TABLE IF EXISTS ${hiveconf:webPageTableName};
DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};
DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};
DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};
DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};

cleanData

1. Delete the /user/root/benchmarks/bigbench/data directory on HDFS
2. Delete the /user/root/benchmarks/bigbench/data_refresh directory on HDFS
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

dataGen

1. Create the directory /user/root/benchmarks/bigbench/data and set its permissions
2. Create the directory /user/root/benchmarks/bigbench/data_refresh and set its permissions
3. Call HadoopClusterExec.jar and pdgf.jar to generate the base data into /user/root/benchmarks/bigbench/data
4. Call HadoopClusterExec.jar and pdgf.jar to generate the refresh data into /user/root/benchmarks/bigbench/data_refresh

Create the directory /user/root/benchmarks/bigbench/data and set its permissions:

runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"

Create the directory /user/root/benchmarks/bigbench/data_refresh and set its permissions:

runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

Call HadoopClusterExec.jar and pdgf.jar to generate the base data:

runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1

Call HadoopClusterExec.jar and pdgf.jar to generate the refresh data:

runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1 

populateMetastore

This phase is where the database tables are actually created. Table creation uses hiveCreateLoad.sql under $BENCH_MARK_HOME/engines/hive/population/.

  1. Read the .dat raw data from /user/root/benchmarks/bigbench/data and create TEXTFILE-format external temporary tables
  2. Create the final ORC-format tables with select * from the temporary tables
  3. Drop the external temporary tables

Read the .dat raw data from /user/root/benchmarks/bigbench/data and create a TEXTFILE-format external temporary table:

DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
  ( c_customer_sk             bigint              --not null
  , c_customer_id             string              --not null
  , c_current_cdemo_sk        bigint
  , c_current_hdemo_sk        bigint
  , c_current_addr_sk         bigint
  , c_first_shipto_date_sk    bigint
  , c_first_sales_date_sk     bigint
  , c_salutation              string
  , c_first_name              string
  , c_last_name               string
  , c_preferred_cust_flag     string
  , c_birth_day               int
  , c_birth_month             int
  , c_birth_year              int
  , c_birth_country           string
  , c_login                   string
  , c_email_address           string
  , c_last_review_date        string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;

Create the final ORC-format table with select * from the temporary table:

DROP TABLE IF EXISTS ${hiveconf:customerTableName};
CREATE TABLE ${hiveconf:customerTableName}
STORED AS ${hiveconf:tableFormat}
AS
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;

Drop the external temporary table:

DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};

runQuery

1. runQuery calls the `query_run_main_method()` function in each query's run.sh
2. `query_run_main_method()` calls `runEngineCmd` to run the query script (qxx.sql)

runQuery calls the query_run_main_method() function in each query's run.sh:

QUERY_MAIN_METHOD="query_run_main_method"
-----------------------------------------
"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1

query_run_main_method() calls runEngineCmd to run the query script (qxx.sql):

query_run_main_method () {
	QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"
	if [ ! -r "$QUERY_SCRIPT" ]
	then
		echo "SQL file $QUERY_SCRIPT can not be read."
		exit 1
	fi

	runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"
	return $?
}

In most cases query_run_main_method() just runs the corresponding query script, but queries such as q05, q20, ... use machine learning algorithms: after the query script runs, the resulting table is fed as input to the jar that runs the machine learning algorithm (e.g. clustering, logistic regression), which produces the final result.

runEngineCmd () {
  if addInitScriptsToParams
  then
    "$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"
  else
    return 1
  fi
}
--------------------------
BINARY="/usr/bin/hive"
BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)
INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")
INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")

if [ -n "$USER_QUERY_PARAMS_FILE" ]
then
if [ -r "$USER_QUERY_PARAMS_FILE" ]
then
  echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."
  INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")
else
  echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."
  return 1
fi
fi

if [ -n "$USER_ENGINE_SETTINGS_FILE" ]
then
if [ -r "$USER_ENGINE_SETTINGS_FILE" ]
then
  echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."
  INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")
else
  echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."
  return 1
fi
fi
return 0

validateQuery

1. Call the `query_run_validate_method()` function in each query's run.sh
2. `query_run_validate_method()` compares `$BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result` with `/user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result` on HDFS; if they match, validation passes, otherwise it fails.
if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")
then
	echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"
else
	echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"
	VALIDATION_PASSED="0"
fi

The comparison above is used when SF is 1 (-f 1). When SF is greater than 1, validation passes as long as the result table on HDFS contains at least one row:

if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]
then
	echo "Validation passed: Query returned results"
else
	echo "Validation failed: Query did not return results"
	return 1
fi

refreshMetastore

1. Call the `hiveRefreshCreateLoad.sql` script under `$BENCH_MARK_HOME/engines/hive/refresh/`
2. `hiveRefreshCreateLoad.sql` loads the data of each table under `/user/root/benchmarks/bigbench/data_refresh/` on HDFS into external temporary tables
3. The external temporary tables then insert each table's data into the corresponding tables of the Hive database

hiveRefreshCreateLoad.sql loads each table's data from /user/root/benchmarks/bigbench/data_refresh/ on HDFS into an external temporary table:

DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
  ( c_customer_sk             bigint              --not null
  , c_customer_id             string              --not null
  , c_current_cdemo_sk        bigint
  , c_current_hdemo_sk        bigint
  , c_current_addr_sk         bigint
  , c_first_shipto_date_sk    bigint
  , c_first_sales_date_sk     bigint
  , c_salutation              string
  , c_first_name              string
  , c_last_name               string
  , c_preferred_cust_flag     string
  , c_birth_day               int
  , c_birth_month             int
  , c_birth_year              int
  , c_birth_country           string
  , c_login                   string
  , c_email_address           string
  , c_last_review_date        string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
-----------------
set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};

The external temporary table then inserts the data into the corresponding table in the Hive database:

INSERT INTO TABLE ${hiveconf:customerTableName}
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;

Appendix

Before the test, PDGF (the Parallel Data Generation Framework) generates the data. With a scale factor of 1 (1 GB), the following 23 tables are generated. Below are the CREATE TABLE statement and one sample record for each table.

The 23 tables

hive> show tables;
OK
customer
customer_address
customer_demographics
date_dim
household_demographics
income_band
inventory
item
item_marketprices
product_reviews
promotion
reason
ship_mode
store
store_returns
store_sales
time_dim
warehouse
web_clickstreams
web_page
web_returns
web_sales
web_site
Time taken: 0.017 seconds, Fetched: 23 row(s)

customer (99,000 rows), 4.03 MB

CREATE TABLE statement:

hive> show create table customer;
OK
CREATE TABLE `customer`(
 `c_customer_sk` bigint,
 `c_customer_id` string,
 `c_current_cdemo_sk` bigint,
 `c_current_hdemo_sk` bigint,
 `c_current_addr_sk` bigint,
 `c_first_shipto_date_sk` bigint,
 `c_first_sales_date_sk` bigint,
 `c_salutation` string,
 `c_first_name` string,
 `c_last_name` string,
 `c_preferred_cust_flag` string,
 `c_birth_day` int,
 `c_birth_month` int,
 `c_birth_year` int,
 `c_birth_country` string,
 `c_login` string,
 `c_email_address` string,
 `c_last_review_date` string)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'2\',
 \'numRows\'=\'99000\',
 \'rawDataSize\'=\'88084062\',
 \'totalSize\'=\'4221267\',
 \'transient_lastDdlTime\'=\'1473167154\')
Time taken: 0.116 seconds, Fetched: 34 row(s)

A sample row:

0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com 	gdMmGdU9

customer_address (49,500 rows), 0.92 MB

CREATE TABLE statement:

hive> show create table customer_address;
OK
CREATE TABLE `customer_address`(
 `ca_address_sk` bigint,
 `ca_address_id` string,
 `ca_street_number` string,
 `ca_street_name` string,
 `ca_street_type` string,
 `ca_suite_number` string,
 `ca_city` string,
 `ca_county` string,
 `ca_state` string,
 `ca_zip` string,
 `ca_country` string,
 `ca_gmt_offset` decimal(5,2),
 `ca_location_type` string)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_address\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'2\',
 \'numRows\'=\'49500\',
 \'rawDataSize\'=\'55836000\',
 \'totalSize\'=\'966548\',
 \'transient_lastDdlTime\'=\'1473167160\')
Time taken: 0.043 seconds, Fetched: 29 row(s)

A sample row:

6187 AAAAAAAAAAAAAJDZ 536 6th Lake Drive UKL8bE5C Lowell Brule County SD 18464 United States -6 apartment

customer_demographics (1,920,800 rows), 6.84 MB

CREATE TABLE statement:

hive> show create table customer_demographics;
OK
CREATE TABLE `customer_demographics`(
 `cd_demo_sk` bigint,
 `cd_gender` string,
 `cd_marital_status` string,
 `cd_education_status` string,
 `cd_purchase_estimate` int,
 `cd_credit_rating` string,
 `cd_dep_count` int,
 `cd_dep_employed_count` int,
 `cd_dep_college_count` int)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/customer_demographics\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'1\',
 \'numRows\'=\'1920800\',
 \'rawDataSize\'=\'718371044\',
 \'totalSize\'=\'7167503\',
 \'transient_lastDdlTime\'=\'1473167169\')

A sample row:

0 F 	U Primary 6000 Good 0 5 0

date_dim (109,573 rows), 1.67 MB

CREATE TABLE statement:

hive> show create table date_dim;
OK
CREATE TABLE `date_dim`(
 `d_date_sk` bigint,
 `d_date_id` string,
 `d_date` string,
 `d_month_seq` int,
 `d_week_seq` int,
 `d_quarter_seq` int,
 `d_year` int,
 `d_dow` int,
 `d_moy` int,
 `d_dom` int,
 `d_qoy` int,
 `d_fy_year` int,
 `d_fy_quarter_seq` int,
 `d_fy_week_seq` int,
 `d_day_name` string,
 `d_quarter_name` string,
 `d_holiday` string,
 `d_weekend` string,
 `d_following_holiday` string,
 `d_first_dom` int,
 `d_last_dom` int,
 `d_same_day_ly` int,
 `d_same_day_lq` int,
 `d_current_day` string,
 `d_current_week` string,
 `d_current_month` string,
 `d_current_quarter` string,
 `d_current_year` string)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/date_dim\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'1\',
 \'numRows\'=\'109573\',
 \'rawDataSize\'=\'123050479\',
 \'totalSize\'=\'1748707\',
 \'transient_lastDdlTime\'=\'1473167172\')

A sample row:

0 AAAAAAAAAAAAAAAA 1900-01-01 0 0 0 1900 1 1 1 1 1900 0 0 Monday 1900Q1 Y 	N 	N 2448812 2458802 2472542 2420941 N 	N 	NN 	N

household_demographics (7,200 rows), 14.31 KB

CREATE TABLE statement:

hive> show create table household_demographics;
OK
CREATE TABLE `household_demographics`(
 `hd_demo_sk` bigint,
 `hd_income_band_sk` bigint,
 `hd_buy_potential` string,
 `hd_dep_count` int,
 `hd_vehicle_count` int)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/household_demographics\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'1\',
 \'numRows\'=\'7200\',
 \'rawDataSize\'=\'835168\',
 \'totalSize\'=\'14655\',
 \'transient_lastDdlTime\'=\'1473167173\')

A sample row:

0 3 1001-5000 0 0

income_band (20 rows), 754 B

CREATE TABLE statement:

hive> show create table income_band;
OK
CREATE TABLE `income_band`(
 `ib_income_band_sk` bigint,
 `ib_lower_bound` int,
 `ib_upper_bound` int)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/income_band\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'2\',
 \'numRows\'=\'20\',
 \'rawDataSize\'=\'320\',
 \'totalSize\'=\'754\',
 \'transient_lastDdlTime\'=\'1473167179\')
Time taken: 0.032 seconds, Fetched: 19 row(s)

A sample row:

0 1 10000

inventory (23,255,100 rows), 34.55 MB

CREATE TABLE statement:

hive> show create table inventory;
OK
CREATE TABLE `inventory`(
 `inv_date_sk` bigint,
 `inv_item_sk` bigint,
 `inv_warehouse_sk` bigint,
 `inv_quantity_on_hand` int)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/inventory\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'3\',
 \'numRows\'=\'23255100\',
 \'rawDataSize\'=\'651118804\',
 \'totalSize\'=\'36234106\',
 \'transient_lastDdlTime\'=\'1473167235\')
Time taken: 0.031 seconds, Fetched: 20 row(s)

A sample row:

36890 0 0 503

item (17,820 rows), 2.36 MB

CREATE TABLE statement:

hive> show create table item;
OK
CREATE TABLE `item`(
 `i_item_sk` bigint,
 `i_item_id` string,
 `i_rec_start_date` string,
 `i_rec_end_date` string,
 `i_item_desc` string,
 `i_current_price` decimal(7,2),
 `i_wholesale_cost` decimal(7,2),
 `i_brand_id` int,
 `i_brand` string,
 `i_class_id` int,
 `i_class` string,
 `i_category_id` int,
 `i_category` string,
 `i_manufact_id` int,
 `i_manufact` string,
 `i_size` string,
 `i_formulation` string,
 `i_color` string,
 `i_units` string,
 `i_container` string,
 `i_manager_id` int,
 `i_product_name` string)
ROW FORMAT SERDE
 \'org.apache.hadoop.hive.ql.io.orc.OrcSerde\'
STORED AS INPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat\'
OUTPUTFORMAT
 \'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat\'
LOCATION
 \'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item\'
TBLPROPERTIES (
 \'COLUMN_STATS_ACCURATE\'=\'true\',
 \'numFiles\'=\'1\',
 \'numRows\'=\'17820\',
 \'rawDataSize\'=\'31238428\',
 \'totalSize\'=\'2472973\',
 \'transient_lastDdlTime\'=\'1473167181\')

A sample row:

0 AAAAAAAAAAAAAAAA 2000-01-14 quickly even dinos beneath the frays must have to boost boldly careful bold escapades: stealthily even forges over the dependencies integrate always past the quiet sly decoys-- notornis sol 72.29 64.96 3898712 71TS7NSbvH1YbdiQMG6ttBHKAljiNoIRB 1 Fan Shop 9 Sports & Outdoors 995 2VOxvrIWwlJQTSk6 small 99Ee1r6uFbZSSClAX3 dodger Oz Unknown 18 8m9n5Q7T33DNWidoA6nWlg6ydmpA1SKOoOJLXiLVb

item_marketprices (89,100 rows), 0.63 MB

CREATE TABLE statement:

hive> show create table item_marketprices;
OK
CREATE TABLE `item_marketprices`(
 `imp_sk` bigint,
 `imp_item_sk` bigint,
 `imp_competitor` string,
 `imp_competitor_price` decimal(7,2),
 `imp_start_date` bigint,
 `imp_end_date` bigint)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/item_marketprices'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='89100',
 'rawDataSize'='21736912',
 'totalSize'='657024',
 'transient_lastDdlTime'='1473167275')

Sample row:

5 4737 AAAAAAAAAAAAAAIN 66.4 36890 36958
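
imp_item_sk appears to reference item.i_item_sk (the relationship is inferred from the naming convention only, since the DDL declares no foreign keys), so the two tables can be joined directly; a minimal sketch:

hive> SELECT i.i_item_sk, i.i_current_price, imp.imp_competitor_price
    > FROM item_marketprices imp
    > JOIN item i ON i.i_item_sk = imp.imp_item_sk
    > LIMIT 10;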

product_reviews (89991 rows) 24.36MB

CREATE TABLE statement:

hive> show create table product_reviews;
OK
CREATE TABLE `product_reviews`(
 `pr_review_sk` bigint,
 `pr_review_date` string,
 `pr_review_time` string,
 `pr_review_rating` int,
 `pr_item_sk` bigint,
 `pr_user_sk` bigint,
 `pr_order_sk` bigint,
 `pr_review_content` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/product_reviews'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='89991',
 'rawDataSize'='79597043',
 'totalSize'='25546821',
 'transient_lastDdlTime'='1473167305')

Sample row:

4 2004-02-25 21:43:07 1 3683 88783 41241 Had to offset the praise handed down by so many years is probably the result of some sin committed (obviously so grievous don't "get" it. Why all the original artwork by Stephen Gammel.
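
pr_review_content holds free-text review bodies (generated text, hence the somewhat incoherent sample), while pr_review_rating and pr_item_sk make the table easy to aggregate. A minimal sketch computing per-item rating statistics:

hive> SELECT pr_item_sk, COUNT(*) AS num_reviews, AVG(pr_review_rating) AS avg_rating
    > FROM product_reviews
    > GROUP BY pr_item_sk
    > ORDER BY num_reviews DESC
    > LIMIT 10;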

promotion (300 rows) 15.83KB

CREATE TABLE statement:

hive> show create table promotion;
OK
CREATE TABLE `promotion`(
 `p_promo_sk` bigint,
 `p_promo_id` string,
 `p_start_date_sk` bigint,
 `p_end_date_sk` bigint,
 `p_item_sk` bigint,
 `p_cost` decimal(15,2),
 `p_response_target` int,
 `p_promo_name` string,
 `p_channel_dmail` string,
 `p_channel_email` string,
 `p_channel_catalog` string,
 `p_channel_tv` string,
 `p_channel_radio` string,
 `p_channel_press` string,
 `p_channel_event` string,
 `p_channel_demo` string,
 `p_channel_details` string,
 `p_purpose` string,
 `p_discount_active` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/promotion'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='300',
 'rawDataSize'='404926',
 'totalSize'='16204',
 'transient_lastDdlTime'='1473167186')

Sample row:

7 AAAAAAAAAAAAAAAH 94455 108529 13511 427.76 1 bar Y N Y N N N N N blithe grouches past the blithe quick epitaphs print rut Unknown N

reason (35 rows) 3.17KB

CREATE TABLE statement:

hive> show create table reason;
OK
CREATE TABLE `reason`(
 `r_reason_sk` bigint,
 `r_reason_id` string,
 `r_reason_desc` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/reason'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='35',
 'rawDataSize'='9027',
 'totalSize'='3240',
 'transient_lastDdlTime'='1473167190')

Sample row:

5 uy 	busily sly excuses hang: slow braids to the daring somas was toward the epitaphs-- gifts betw

ship_mode (20 rows) 2.93KB

CREATE TABLE statement:

hive> show create table ship_mode;
OK
CREATE TABLE `ship_mode`(
 `sm_ship_mode_sk` bigint,
 `sm_ship_mode_id` string,
 `sm_type` string,
 `sm_code` string,
 `sm_carrier` string,
 `sm_contract` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/ship_mode'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='20',
 'rawDataSize'='9576',
 'totalSize'='3001',
 'transient_lastDdlTime'='1473167196')

Sample row:

12 wGyozLRZ3uL 	iCTZsMlNzsQ 	cBRc FlWM9v tm0ehuQ2

store (12 rows) 8.03KB

CREATE TABLE statement:

hive> show create table store;
OK
CREATE TABLE `store`(
 `s_store_sk` bigint,
 `s_store_id` string,
 `s_rec_start_date` string,
 `s_rec_end_date` string,
 `s_closed_date_sk` bigint,
 `s_store_name` string,
 `s_number_employees` int,
 `s_floor_space` int,
 `s_hours` string,
 `s_manager` string,
 `s_market_id` int,
 `s_geography_class` string,
 `s_market_desc` string,
 `s_market_manager` string,
 `s_division_id` int,
 `s_division_name` string,
 `s_company_id` int,
 `s_company_name` string,
 `s_street_number` string,
 `s_street_name` string,
 `s_street_type` string,
 `s_suite_number` string,
 `s_city` string,
 `s_county` string,
 `s_state` string,
 `s_zip` string,
 `s_country` string,
 `s_gmt_offset` decimal(5,2),
 `s_tax_precentage` decimal(5,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='12',
 'rawDataSize'='25962',
 'totalSize'='8214',
 'transient_lastDdlTime'='1473167201')

Sample row:

10 AAAAAAAAAAAAAAAK 2000-11-01 6235 Avalon 254 6468537 8AM-12AM Michael Barlow 6 Unknown final sly gifts by the even final dependencies x-ray under the fluffy Barry Shaw 1 Unknown1Unknown 993 Mill Pkwy aQV Cold Springs Cumberland County TN 95692 United States -5 0.03

store_returns (37902 rows) 1.19MB

CREATE TABLE statement:

hive> show create table store_returns;
OK
CREATE TABLE `store_returns`(
 `sr_returned_date_sk` bigint,
 `sr_return_time_sk` bigint,
 `sr_item_sk` bigint,
 `sr_customer_sk` bigint,
 `sr_cdemo_sk` bigint,
 `sr_hdemo_sk` bigint,
 `sr_addr_sk` bigint,
 `sr_store_sk` bigint,
 `sr_reason_sk` bigint,
 `sr_ticket_number` bigint,
 `sr_return_quantity` int,
 `sr_return_amt` decimal(7,2),
 `sr_return_tax` decimal(7,2),
 `sr_return_amt_inc_tax` decimal(7,2),
 `sr_fee` decimal(7,2),
 `sr_return_ship_cost` decimal(7,2),
 `sr_refunded_cash` decimal(7,2),
 `sr_reversed_charge` decimal(7,2),
 `sr_store_credit` decimal(7,2),
 `sr_net_loss` decimal(7,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_returns'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='37902',
 'rawDataSize'='41388272',
 'totalSize'='1250563',
 'transient_lastDdlTime'='1473167251')

Sample row:

37375 38182 11520 7640 242073 6754 25731 2 22 20293 33 1990.89 119.45 2110.34 34.14 2433.41 477.81 559.84 953.24 2672.31

store_sales (667579 rows) 23.49MB

CREATE TABLE statement:

hive> show create table store_sales;
OK
CREATE TABLE `store_sales`(
 `ss_sold_date_sk` bigint,
 `ss_sold_time_sk` bigint,
 `ss_item_sk` bigint,
 `ss_customer_sk` bigint,
 `ss_cdemo_sk` bigint,
 `ss_hdemo_sk` bigint,
 `ss_addr_sk` bigint,
 `ss_store_sk` bigint,
 `ss_promo_sk` bigint,
 `ss_ticket_number` bigint,
 `ss_quantity` int,
 `ss_wholesale_cost` decimal(7,2),
 `ss_list_price` decimal(7,2),
 `ss_sales_price` decimal(7,2),
 `ss_ext_discount_amt` decimal(7,2),
 `ss_ext_sales_price` decimal(7,2),
 `ss_ext_wholesale_cost` decimal(7,2),
 `ss_ext_list_price` decimal(7,2),
 `ss_ext_tax` decimal(7,2),
 `ss_coupon_amt` decimal(7,2),
 `ss_net_paid` decimal(7,2),
 `ss_net_paid_inc_tax` decimal(7,2),
 `ss_net_profit` decimal(7,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/store_sales'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='667579',
 'rawDataSize'='953293700',
 'totalSize'='24629162',
 'transient_lastDdlTime'='1473167245')

Sample row:

37115 20244 16481 98676 1211207 5239 37107 9 38 11138 93 56.47 82.45 36.28 4293.81 3374.04 5251.71 7667.85 101.22 0 3374.04 3475.26 -1877.67
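
The sample row is internally consistent: ss_net_paid (3374.04) plus ss_ext_tax (101.22) equals ss_net_paid_inc_tax (3475.26). A sketch of a query that checks this relationship across the whole table follows; treating it as an invariant of the generated data is an assumption, not something the DDL enforces:

hive> SELECT COUNT(*) AS violations
    > FROM store_sales
    > WHERE ss_net_paid + ss_ext_tax <> ss_net_paid_inc_tax;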

time_dim (86400 rows) 219.49KB

CREATE TABLE statement:

hive> show create table time_dim;
OK
CREATE TABLE `time_dim`(
 `t_time_sk` bigint,
 `t_time_id` string,
 `t_time` int,
 `t_hour` int,
 `t_minute` int,
 `t_second` int,
 `t_am_pm` string,
 `t_shift` string,
 `t_sub_shift` string,
 `t_meal_time` string)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/time_dim'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='1',
 'numRows'='86400',
 'rawDataSize'='41040000',
 'totalSize'='224757',
 'transient_lastDdlTime'='1473167202')

Sample row:

2 AAAAAAAAAAAAAAAC 2 0 0 2 AM 	third 	night

warehouse (5 rows) 1.94KB

CREATE TABLE statement:

hive> show create table warehouse;
OK
CREATE TABLE `warehouse`(
 `w_warehouse_sk` bigint,
 `w_warehouse_id` string,
 `w_warehouse_name` string,
 `w_warehouse_sq_ft` int,
 `w_street_number` string,
 `w_street_name` string,
 `w_street_type` string,
 `w_suite_number` string,
 `w_city` string,
 `w_county` string,
 `w_state` string,
 `w_zip` string,
 `w_country` string,
 `w_gmt_offset` decimal(5,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/warehouse'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='5',
 'rawDataSize'='5695',
 'totalSize'='1980',
 'transient_lastDdlTime'='1473167204')

Sample row:

1 AAAAAAAAAAAAAAAB 	frets would sleep 845707 181 9th 13thCourt kQ Sherwood Forest Teton County WY 87891 United States -7

web_clickstreams (6770550 rows) 34.11MB

CREATE TABLE statement:

hive> show create table web_clickstreams;
OK
CREATE TABLE `web_clickstreams`(
 `wcs_click_date_sk` bigint,
 `wcs_click_time_sk` bigint,
 `wcs_sales_sk` bigint,
 `wcs_item_sk` bigint,
 `wcs_web_page_sk` bigint,
 `wcs_user_sk` bigint)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_clickstreams'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='6770550',
 'rawDataSize'='253337416',
 'totalSize'='35768476',
 'transient_lastDdlTime'='1473167298')

Sample row:

36890 27089 NULL 15465 47 NULL
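
Note the NULLs in the sample row: wcs_sales_sk and wcs_user_sk are empty, presumably marking a click that led to no sale and came from an anonymous user. A minimal sketch counting how common each case is:

hive> SELECT COUNT(*) AS total_clicks,
    >        SUM(CASE WHEN wcs_sales_sk IS NULL THEN 1 ELSE 0 END) AS clicks_without_sale,
    >        SUM(CASE WHEN wcs_user_sk IS NULL THEN 1 ELSE 0 END) AS anonymous_clicks
    > FROM web_clickstreams;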

web_page (60 rows) 6.81KB

CREATE TABLE statement:

hive> show create table web_page;
OK
CREATE TABLE `web_page`(
 `wp_web_page_sk` bigint,
 `wp_web_page_id` string,
 `wp_rec_start_date` string,
 `wp_rec_end_date` string,
 `wp_creation_date_sk` bigint,
 `wp_access_date_sk` bigint,
 `wp_autogen_flag` string,
 `wp_customer_sk` bigint,
 `wp_url` string,
 `wp_type` string,
 `wp_char_count` int,
 `wp_link_count` int,
 `wp_image_count` int,
 `wp_max_ad_count` int)
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_page'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='60',
 'rawDataSize'='38763',
 'totalSize'='6969',
 'transient_lastDdlTime'='1473167215')

Sample row:

24 AAAAAAAAAAAAAAAY 2000-12-19 38819 69008 0 44446 http://www.C8jdri37RmtbDNeFnXjmYbyBPzeO4WWK9pVYP6xtBJbaQ5yIj4s.com 	feedback	4340 	23 	5 	2

web_returns (38487 rows) 1.40MB

CREATE TABLE statement:

hive> show create table web_returns;
OK
CREATE TABLE `web_returns`(
 `wr_returned_date_sk` bigint,
 `wr_returned_time_sk` bigint,
 `wr_item_sk` bigint,
 `wr_refunded_customer_sk` bigint,
 `wr_refunded_cdemo_sk` bigint,
 `wr_refunded_hdemo_sk` bigint,
 `wr_refunded_addr_sk` bigint,
 `wr_returning_customer_sk` bigint,
 `wr_returning_cdemo_sk` bigint,
 `wr_returning_hdemo_sk` bigint,
 `wr_returning_addr_sk` bigint,
 `wr_web_page_sk` bigint,
 `wr_reason_sk` bigint,
 `wr_order_number` bigint,
 `wr_return_quantity` int,
 `wr_return_amt` decimal(7,2),
 `wr_return_tax` decimal(7,2),
 `wr_return_amt_inc_tax` decimal(7,2),
 `wr_fee` decimal(7,2),
 `wr_return_ship_cost` decimal(7,2),
 `wr_refunded_cash` decimal(7,2),
 `wr_reversed_charge` decimal(7,2),
 `wr_account_credit` decimal(7,2),
 `wr_net_loss` decimal(7,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_returns'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='38487',
 'rawDataSize'='43257220',
 'totalSize'='1471571',
 'transient_lastDdlTime'='1473167269')

Sample row:

38952 68126 9590 52948 893223 2020 6942 52948 893223 2020 6942 32 13 63444 33 626.01 25.04 651.05 18.74 958.4 244.14 286.4 95.47 1008.48

web_sales (668052 rows) 32.99MB

CREATE TABLE statement:

hive> show create table web_sales;
OK
CREATE TABLE `web_sales`(
 `ws_sold_date_sk` bigint,
 `ws_sold_time_sk` bigint,
 `ws_ship_date_sk` bigint,
 `ws_item_sk` bigint,
 `ws_bill_customer_sk` bigint,
 `ws_bill_cdemo_sk` bigint,
 `ws_bill_hdemo_sk` bigint,
 `ws_bill_addr_sk` bigint,
 `ws_ship_customer_sk` bigint,
 `ws_ship_cdemo_sk` bigint,
 `ws_ship_hdemo_sk` bigint,
 `ws_ship_addr_sk` bigint,
 `ws_web_page_sk` bigint,
 `ws_web_site_sk` bigint,
 `ws_ship_mode_sk` bigint,
 `ws_warehouse_sk` bigint,
 `ws_promo_sk` bigint,
 `ws_order_number` bigint,
 `ws_quantity` int,
 `ws_wholesale_cost` decimal(7,2),
 `ws_list_price` decimal(7,2),
 `ws_sales_price` decimal(7,2),
 `ws_ext_discount_amt` decimal(7,2),
 `ws_ext_sales_price` decimal(7,2),
 `ws_ext_wholesale_cost` decimal(7,2),
 `ws_ext_list_price` decimal(7,2),
 `ws_ext_tax` decimal(7,2),
 `ws_coupon_amt` decimal(7,2),
 `ws_ext_ship_cost` decimal(7,2),
 `ws_net_paid` decimal(7,2),
 `ws_net_paid_inc_tax` decimal(7,2),
 `ws_net_paid_inc_ship` decimal(7,2),
 `ws_net_paid_inc_ship_tax` decimal(7,2),
 `ws_net_profit` decimal(7,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_sales'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='668052',
 'rawDataSize'='1221174888',
 'totalSize'='34585166',
 'transient_lastDdlTime'='1473167263')

Sample row:

36890 26789 36993 794 85457 380790 2436 42649 29934 1583844 5251 525 2 14 1 3 120 0 93 32.62 81.22 59.29 2039.49 5513.97 3033.66 7553.46 330.84 0 1661.76 5513.97 5844.81 7175.73 7506.57 2480.31

web_site (30 rows) 10.31KB

CREATE TABLE statement:

hive> show create table web_site;
OK
CREATE TABLE `web_site`(
 `web_site_sk` bigint,
 `web_site_id` string,
 `web_rec_start_date` string,
 `web_rec_end_date` string,
 `web_name` string,
 `web_open_date_sk` bigint,
 `web_close_date_sk` bigint,
 `web_class` string,
 `web_manager` string,
 `web_mkt_id` int,
 `web_mkt_class` string,
 `web_mkt_desc` string,
 `web_market_manager` string,
 `web_company_id` int,
 `web_company_name` string,
 `web_street_number` string,
 `web_street_name` string,
 `web_street_type` string,
 `web_suite_number` string,
 `web_city` string,
 `web_county` string,
 `web_state` string,
 `web_zip` string,
 `web_country` string,
 `web_gmt_offset` decimal(5,2),
 `web_tax_percentage` decimal(5,2))
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://mycluster/usr/hive/warehouse/bigbench.db/web_site'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true',
 'numFiles'='2',
 'numRows'='30',
 'rawDataSize'='62832',
 'totalSize'='10554',
 'transient_lastDdlTime'='1473167210')

Sample row:

2 AAAAAAAAAAAAAAAC 2002-07-17 site_0 16450 91500 Unknown Gregory George 1 sheaves despite the quietly sly asymp thin enticing frets except the sometimes final courts might promise blithe dino Frank Hernandez 1 ese 17 5th Ave EbDxJVL Georgetown Guadalupe County TX 75435 United States -6 0.01
posted @ 2016-09-21 09:16  lzslbd  阅读(3166)  评论(0编辑  收藏  举报