TPCx-BB源码分析

Copy from: 一篇文章看懂TPCx-BB(大数据基准测试工具）源码

　　TPCx-BB是大数据基准测试工具,它通过模拟零售商的30个应用场景，执行30个查询来衡量基于Hadoop的大数据系统的包括硬件和软件的性能。其中一些场景还用到了机器学习算法（聚类、线性回归等）。为了更好地了解被测试的系统的性能，需要对TPCx-BB整个测试流程深入了解。本文详细分析了整个TPCx-BB测试工具的源码，希望能够对大家理解TPCx-BB有所帮助。

代码结构

主目录（$BENCH_MARK_HOME）下有：

bin
conf
data-generator
engines
tools

几个子目录。

bin下有几个 module ,是执行时需要用到的脚本：bigBench、cleanLogs、logEnvInformation、runBenchmark、zipLogs等

conf下有两个配置文件：bigBench.properties 和 userSettings.conf

bigBench.properties 主要设置 workload（执行的benchmarkPhases）和 power_test_0（POWER_TEST 阶段需要执行的SQL查询）

默认 workload ：

workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1

默认 power_test_0 ：1-30

userSetting.conf 是一些基本设置，包括JAVA environment 、default settings for benchmark（database、engine、map_tasks、scale_factor ...）、HADOOP environment、
HDFS config and paths、Hadoop data generation options(DFS_REPLICATION、HADOOP_JVM_ENV...)

data-generator下是跟数据生成相关的脚本及配置文件。详细内容在下面介绍。

engines下是TPCx-BB支持的4种引擎：biginsights、hive、impala、spark_sql。默认引擎为hive。实际上，只有hive目录下不为空，其他三个目录下均为空，估计是现在还未完善。

tools下有两个jar包：HadoopClusterExec.jar 和 RunBigBench.jar 。其中 RunBigBench.jar 是执行TPCx-BB测试的一个非常重要的文件，大部分程序都在该jar包内。

数据生成

数据生成相关程序和配置都在 data-generator 目录下。该目录下有一个 pdgf.jar 包和 config、dicts、extlib 三个子目录。

pdgf.jar是数据生成的Java程序。config下有两个配置文件：bigbench-generation.xml 和 bigbench-schema.xml 。

bigbench-generation.xml 主要设置生成的原始数据（不是数据库表）包含哪几张表、每张表的表名、表的大小以及表输出的目录、表文件的后缀、分隔符、字符编码等。

<schema name="default">

        <tables>

        <!-- not refreshed tables -->    

            <!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons  -->

            <table name="income_band"></table>

            <table name="reason"></table>

            <table name="ship_mode"></table>

            <table name="web_site"></table>

            <!-- /tables not used in benchmark  -->

            <!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->

            <table name="date_dim" static="true"></table>

            <table name="time_dim" static="true"></table>

            <table name="customer_demographics" static="true"></table>

            <table name="household_demographics" static="true"></table>

            <!-- /static tables -->

            <!-- "normal" tables. split over all nodes. not generated during refresh -->

            <table name="store"></table>

            <table name="warehouse"></table>

            <table name="promotion"></table>

            <table name="web_page"></table>

            <!-- /"normal" tables.-->

        <!-- /not refreshed tables -->    

            <!--

            refreshed tables. Generated on all nodes.

            Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)

            In "normal"-Phase  generate table rows:  [0,REFRESH_PERCENTAGE*Table.Size];

            In "refresh"-Phase generate table rows:  [REFRESH_PERCENTAGE*Table.Size+1, Table.Size]

            .Has effect only if  ${REFRESH_SYSTEM_ENABLED}==1.

            -->

            <table name="customer">

                <scheduler name="DefaultScheduler">

                    <partitioner

                        name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">

                        <prePartition><![CDATA[

                    if(${REFRESH_SYSTEM_ENABLED}>0){

                        int tableID = table.getTableID();

                        int timeID = 0;

                        long lastTableRow=table.getSize()-1;

                        long rowStart;

                        long rowStop;

                        boolean exclude=false;

                        long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});

                        if(${REFRESH_PHASE}>0){

                            //Refresh part

                            rowStart = lastTableRow - refreshRows +1;

                            rowStop  = lastTableRow;

                            if(refreshRows<=0){

                                exclude=true;

                            }

                        }else{

                            //"normal" part

                            rowStart = 0;

                            rowStop = lastTableRow - refreshRows;

                        }

                        return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);

                    }else{

                        //DEFAULT

                        return getParentPartitioner().getDefaultPrePartition(project, table);

                    }

                    ]]></prePartition>

                    </partitioner>

                </scheduler>

            </table>

<output name="SplitFileOutputWrapper">

  <!-- DEFAULT output for all Tables, if no table specific output is specified-->

    <output name="CSVRowOutput">

      <fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>

      <outputDir>output/</outputDir>

      <fileEnding>.dat</fileEnding>

      <delimiter>|</delimiter>

      <charset>UTF-8</charset>

      <sortByRowID>true</sortByRowID>

    </output>

    <output name="StatisticsOutput" active="1">

      <size>${item_size}</size><!-- a counter per item .. initialize later-->

      <fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>

      <outputDir>output/</outputDir>

      <fileEnding>.csv</fileEnding>

      <delimiter>,</delimiter>

      <header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR-->

      </header>

      <footer></footer>

bigbench-schema.xml 设置了很多参数，有跟表的规模有关的，比如每张表的大小（记录的条数）;绝大多数是跟表的字段有关的，比如时间的起始、结束、性别比例、结婚比例、指标的上下界等。还具体定义了每个字段是怎么生成的，以及限制条件。示例如下：

生成的数据大小由 SCALE_FACTOR（-f） 决定。如 -f 1，则生成的数据总大小约为1G；-f 100，则生成的数据总大小约为100G。那么SCALE_FACTOR（-f） 是怎么精确控制生成的数据的大小呢？

原因是 SCALE_FACTOR（-f） 决定了每张表的记录数。如下，customer 表的记录数为 100000.0d * ${SF_sqrt}，即如果 -f 1 则 customer 表的记录数为 100000*sqrt(1)= 10万条 ;如果 -f 100 则 customer 表的记录数为 100000*sqrt(100)= 100万条

<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>

<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>

<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>

<property name="${gender_likelihood}" type="double">0.5</property>

<property name="${married_likelihood}" type="double">0.3</property>

<property name="${WP_LINK_MIN}" type="double">2</property>

<property name="${WP_LINK_MAX}" type="double">25</property>

  <field name="d_date" size="13" type="CHAR" primary="false">

   <gen_DateTime>

     <disableRng>true</disableRng>

     <useFixedStepSize>true</useFixedStepSize>

     <startDate>${date_dim_begin_date}</startDate>

     <endDate>${date_dim_end_date}</endDate>

     <outputFormat>yyyy-MM-dd</outputFormat>

    </gen_DateTime>

  </field>

  <field name="t_time_id" size="16" type="CHAR" primary="false">

   <gen_ConvertNumberToString>

    <gen_Id/>

    <size>16.0</size>

    <characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>

   </gen_ConvertNumberToString>

  </field>

<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">

   <gen_Null probability="${NULL_CHANCE}">

   <gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />

   </gen_Null>

  </field>

dicts下有city.dict、country.dict、male.dict、female.dict、state.dict、mail_provider.dict等字典文件，表里每一条记录的各个字段应该是从这些字典里生成的。

extlib下是引用的外部程序jar包。有 lucene-core-4.9.0.jar、commons-net-3.3.jar、xml-apis.jar和log4j-1.2.15.jar等

总结：

pdgf.jar根据bigbench-generation.xml 和 bigbench-schema.xml两个文件里的配置（表名、字段名、表的记录条数、每个字段生成的规则），从 dicts 目录下对应的 .dict
文件获取表中每一条记录、每个字段的值，生成原始数据。

customer 表里的某条记录如下：

0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com   gdMmGdU9

如果执行 TPCx-BB 测试时指定 -f 1（SCALE_FACTOR = 1） 则最终生成的原始数据总大小约为 1G(977M+8.6M)

[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data

12.7 M   38.0 M   /user/root/benchmarks/bigbench/data/customer

5.1 M    15.4 M   /user/root/benchmarks/bigbench/data/customer_address

74.2 M   222.5 M  /user/root/benchmarks/bigbench/data/customer_demographics

14.7 M   44.0 M   /user/root/benchmarks/bigbench/data/date_dim

151.5 K  454.4 K  /user/root/benchmarks/bigbench/data/household_demographics

327      981      /user/root/benchmarks/bigbench/data/income_band

405.3 M  1.2 G    /user/root/benchmarks/bigbench/data/inventory

6.5 M    19.5 M   /user/root/benchmarks/bigbench/data/item

4.0 M    12.0 M   /user/root/benchmarks/bigbench/data/item_marketprices

53.7 M   161.2 M  /user/root/benchmarks/bigbench/data/product_reviews

45.3 K   135.9 K  /user/root/benchmarks/bigbench/data/promotion

3.0 K    9.1 K    /user/root/benchmarks/bigbench/data/reason

1.2 K    3.6 K    /user/root/benchmarks/bigbench/data/ship_mode

3.3 K    9.9 K    /user/root/benchmarks/bigbench/data/store

4.1 M    12.4 M   /user/root/benchmarks/bigbench/data/store_returns

88.5 M   265.4 M  /user/root/benchmarks/bigbench/data/store_sales

4.9 M    14.6 M   /user/root/benchmarks/bigbench/data/time_dim

584      1.7 K    /user/root/benchmarks/bigbench/data/warehouse

170.4 M  511.3 M  /user/root/benchmarks/bigbench/data/web_clickstreams

7.9 K    23.6 K   /user/root/benchmarks/bigbench/data/web_page

5.1 M    15.4 M   /user/root/benchmarks/bigbench/data/web_returns

127.6 M  382.8 M  /user/root/benchmarks/bigbench/data/web_sales

8.6 K    25.9 K   /user/root/benchmarks/bigbench/data/web_site

执行流程

要执行TPCx-BB测试，首先需要切换到TPCx-BB源程序的目录下，然后进入bin目录，执行以下语句：

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

其中，-f、-m、-s、-j都是参数，用户可根据集群的性能以及自己的需求来设置。如果不指定，则使用默认值，默认值在 conf 目录下的 userSetting.conf 文件指定，如下：

export BIG_BENCH_DEFAULT_DATABASE="bigbench"

export BIG_BENCH_DEFAULT_ENGINE="hive"

export BIG_BENCH_DEFAULT_MAP_TASKS="80"

export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"

export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"

export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"

默认 MAP_TASKS 为 80（-m 80）、SCALE_FACTOR 为 1000（-f 1000）、NUMBER_OF_PARALLEL_STREAMS 为 2（-s 2）。

所有可选参数及其意义如下：

General options:

-d  使用的数据库 (默认: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)

-e  使用的引擎 (默认: $BIG_BENCH_DEFAULT_ENGINE -> hive)

-f  数据集的规模因子（scale factor） (默认: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)

-h  显示帮助

-m  数据生成的`map tasks`数 (default: $BIG_BENCH_DEFAULT_MAP_TASKS)"

-s  并行的`stream`数 (默认: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)

Driver specific options:

-a  伪装模式执行

-b  执行期间将调用的bash脚本在标准输出中打印出来

-i  指定需要执行的阶段 (详情见$BIG_BENCH_CONF_DIR/bigBench.properties)

-j  指定需要执行的查询 (默认：1-30共30个查询均执行)"

-U  解锁专家模式

若指定了-U,即解锁了专家模式，则：

echo "EXPERT MODE ACTIVE"

echo "WARNING - INTERNAL USE ONLY:"

echo "Only set manually if you know what you are doing!"

echo "Ignoring them is probably the best solution"

echo "Running individual modules:"

echo "Usage: `basename $0` module [options]"

-D  指定需要debug的查询部分. 大部分查询都只有一个单独的部分

-p  需要执行的benchmark phase (默认: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)"

-q  指定需要执行哪个查询（只能指定一个）

-t  指定执行该查询时用第哪个stream

-v  metastore population的sql脚本 (默认: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})"

-w  metastore refresh的sql脚本 (默认: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})"

-y  含额外的用户自定义查询参数的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)"

-z  含额外的用户自定义引擎设置的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)"

List of available modules:

    $BIG_BENCH_ENGINE_BIN_DIR

回到刚刚执行TPCx-BB测试的语句：

./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5

bigBench

bigBench是主脚本，runBenchmark是module。

bigBench 里设置了很多环境变量（包括路径、引擎、STREAM数等等），因为后面调用 runBigBench.jar 的时候需要在Java程序里读取这些环境变量。

bigBench 前面都是在做一些基本工作，如设置环境变量、解析用户输入参数、赋予文件权限、设置路径等等。到最后一步调用 runBenchmark 的 runModule() 方法：

设置基本路径

 export BIG_BENCH_VERSION="1.0"

 export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"

 export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"

 export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"

 export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"

 export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"

指定 core-site.xml 和 hdfs-site.xml 的路径

数据生成时要用到Hadoop集群，生成在hdfs上

 export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"

export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"

赋予整个包下所有可执行文件权限（.sh/.jar/.py）

 find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} +

find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} +

find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +

设置 userSetting.conf 的路径并 source

 USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"

 if [ ! -f "$USER_SETTINGS" ]

 then

   echo "User settings file $USER_SETTINGS not found"

   exit 1

 else

   source "$USER_SETTINGS"

 fi

解析输入参数和选项并根据选项的内容作设置

第一个参数必须是module_name

如果没有输入参数或者第一个参数以"-"开头，说明用户没有输入需要运行的module。

 if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]

 then

   export MODULE_NAME=""

   SHOW_HELP="1"

 else

   export MODULE_NAME="$1"

   shift

 fi

 export LIST_OF_USER_OPTIONS="$@"

解析用户输入的参数

根据用户输入的参数来设置环境变量

 while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do

case "$OPT" in

 # script options

 d)

   #echo "-d was triggered, Parameter: $OPTARG" >&2

   USER_DATABASE="$OPTARG"

 ;;

 D)

   #echo "-D was triggered, Parameter: $OPTARG" >&2

   DEBUG_QUERY_PART="$OPTARG"

 ;;

 e)

   #echo "-e was triggered, Parameter: $OPTARG" >&2

   USER_ENGINE="$OPTARG"

 ;;

 f)

   #echo "-f was triggered, Parameter: $OPTARG" >&2

   USER_SCALE_FACTOR="$OPTARG"

 ;;

 h)

   #echo "-h was triggered, Parameter: $OPTARG" >&2

   SHOW_HELP="1"

 ;;

 m)

   #echo "-m was triggered, Parameter: $OPTARG" >&2

   USER_MAP_TASKS="$OPTARG"

 ;;

 p)

   #echo "-p was triggered, Parameter: $OPTARG" >&2

   USER_BENCHMARK_PHASE="$OPTARG"

 ;;

 q)

   #echo "-q was triggered, Parameter: $OPTARG" >&2

   QUERY_NUMBER="$OPTARG"

 ;;

 s)

   #echo "-t was triggered, Parameter: $OPTARG" >&2

   USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"

 ;;

 t)

   #echo "-s was triggered, Parameter: $OPTARG" >&2

   USER_STREAM_NUMBER="$OPTARG"

 ;;

 U)

   #echo "-U was triggered, Parameter: $OPTARG" >&2

   USER_EXPERT_MODE="1"

 ;;

 v)

   #echo "-v was triggered, Parameter: $OPTARG" >&2

   USER_POPULATE_FILE="$OPTARG"

 ;;

 w)

   #echo "-w was triggered, Parameter: $OPTARG" >&2

   USER_REFRESH_FILE="$OPTARG"

 ;;

 y)

   #echo "-y was triggered, Parameter: $OPTARG" >&2

   USER_QUERY_PARAMS_FILE="$OPTARG"

 ;;

 z)

   #echo "-z was triggered, Parameter: $OPTARG" >&2

   USER_ENGINE_SETTINGS_FILE="$OPTARG"

 ;;

 # driver options

 a)

   #echo "-a was triggered, Parameter: $OPTARG" >&2

   export USER_PRETEND_MODE="1"

 ;;

 b)

   #echo "-b was triggered, Parameter: $OPTARG" >&2

   export USER_PRINT_STD_OUT="1"

 ;;

 i)

   #echo "-i was triggered, Parameter: $OPTARG" >&2

   export USER_DRIVER_WORKLOAD="$OPTARG"

 ;;

 j)

   #echo "-j was triggered, Parameter: $OPTARG" >&2

   export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"

 ;;

 \?)

   echo "Invalid option: -$OPTARG" >&2

   exit 1

 ;;

 :)

   echo "Option -$OPTARG requires an argument." >&2

   exit 1

;;

esac

done

设置全局变量。如果用户指定了某个参数的值，则采用该值，否则使用默认值。

 export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"

export SHOW_HELP="${SHOW_HELP:-"0"}"

export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"

export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"

export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"

export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"

export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"

export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"

export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"

export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"

export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"

检测 -s -m -f -j的选项是否为数字

 if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ]

 then

   echo "$BIG_BENCH_MAP_TASKS is not a number"

 fi

 if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ]

 then

   echo "$BIG_BENCH_SCALE_FACTOR is not a number"

 fi

 if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ]

 then

   echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number"

 fi

 if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ]

 then

   echo "$BIG_BENCH_STREAM_NUMBER is not a number"

 fi

检查引擎是否存在

 if [ ! -d "$BIG_BENCH_ENGINE_DIR" ]

 then

   echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..."

   exit 1

 fi

 if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ]

 then

   echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..."

   exit 1

 fi

设置 engineSetting.conf 路径并 source

 ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"

 if [ ! -f "$ENGINE_SETTINGS" ]

 then

   echo "Engine settings file $ENGINE_SETTINGS not found"

   exit 1

 else

   source "$ENGINE_SETTINGS"

 fi

检查module是否存在

当输入某个module时，系统会先到$BIG_BENCH_ENGINE_BIN_DIR/目录下去找该module是否存在，如果存在，就source "$MODULE"；如果该目录下不存在指定的module，再到export MODULE="$BIG_BENCH_BIN_DIR/"目录下找该module，如果存在，就source "$MODULE"；否则，输出Module $MODULE not found, aborting script.
```
export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"

if [ -f "$MODULE" ]

then

  source "$MODULE"

else

  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"

  if [ -f "$MODULE" ]

  then

    source "$MODULE"

  else

    echo "Module $MODULE not found, aborting script."

    exit 1

  fi

fi
```

检查module里的runModule（）、helpModule ( )、runEngineCmd()方法是否有定义

MODULE_RUN_METHOD="runModule"

if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1

then

  echo "$MODULE_RUN_METHOD was not implemented, aborting script"

  exit 1

fi

运行module

如果module是runBenchmark，执行
runCmdWithErrorCheck "$MODULE_RUN_METHOD"
也就是
runCmdWithErrorCheck runModule（）

由上可以看出，bigBench脚本主要执行一些如设置环境变量、赋予权限、检查并解析输入参数等基础工作，最终调用runBenchmark的runModule()方法继续往下执行。

runBenchmark

接下来看看runBenchmark脚本。

runBenchmark里有两个函数：helpModule ()和runModule ()。

helpModule ()就是显示帮助。

runModule ()是运行runBenchmark模块时真正调用的函数。该函数主要做四件事：

清除之前生成的日志
调用RunBigBench.jar来执行
logEnvInformation
将日志文件夹打包成zip

源码如下：

runModule () {

  #check input parameters

  if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]

  then

    echo "The number of parallel streams -s must be greater than 0"

    return 1

  fi

  "${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS

  "$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"

  "${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS

  "${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS

  return $?

}

相当于运行runBenchmark模块时又调用了cleanLogs、logEnvInformation、zipLogs三个模块以及RunBigBench.jar。其中RunBigBench.jar是TCPx-BB测试执行的核心代码，用Java语言编写。接下来分析RunBigBench.jar源码。

runModule()

runModule()函数用来执行某个module。我们已知，执行某个module需要切换到主目录下的bin目录，然后执行：

./bigBench module_name arguments

在runModule()函数里，cmdLine用来生成如上命令。

ArrayList cmdLine = new ArrayList();

cmdLine.add("bash");

cmdLine.add(this.runScript);

cmdLine.add(benchmarkPhase.getRunModule());

cmdLine.addAll(arguments);

其中，this.runScript为：

this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";

benchmarkPhase.getRunModule()用来获得需要执行的module。

arguments为用户输入的参数。

至此，cmdLine为：

bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments

那么，怎么让系统执行该bash命令呢？答案是调用runCmd()方法。

boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));

接下来介绍rumCmd()方法

runCmd()

runCmd()方法通过ProcessBuilder来创建一个操作系统进程，并用该进程执行以上的bash命令。

ProcessBuilder还可以设置工作目录和环境。

ProcessBuilder pb = new ProcessBuilder(command);

pb.directory(new File(workingDirectory));

Process p = null;

---

p = pb.start();

getQueryList()

getQueryList()用来获得需要执行的查询列表。从$BIG_BENCH_LOGS_DIR/bigBench.properties文件中读取。与$BIG_BENCH_HOME/conf/bigBench.properties内容一致。

bigBench.properties里power_test_0=1-30规定了powter_test_0阶段需要执行的查询及其顺序。

可以用区间如 5-12 或者单个数字如 21 表示，中间用 , 隔开。

power_test_0=28-25,2-5,10,22,30

表示powter_test_0阶段需要执行的查询及其顺序为：28,27,26,25,2,3,4,5,10,22,30

如果想让30个查询按顺序执行，则：

power_test_0=1-30

获得查询列表的源码如下：

    private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {

        String SHUFFLED_NAME_PATTERN = "shuffledQueryList";

        BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;

        String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);

        boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();

        if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {

            return new ArrayList((Collection)this.queryListCache.get(propertyKey));

        } else {

            Object queryList;

            String basicPhaseNamePattern;

            if(!this.properties.containsKey(propertyKey)) {

                if(benchmarkPhase.isQueryOrderRandom()) {

                    if(!this.queryListCache.containsKey("shuffledQueryList")) {

                        basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);

                        if(!this.properties.containsKey(basicPhaseNamePattern)) {

                            throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");

                        }

                        this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));

                    }

                    queryList = (List)this.queryListCache.get("shuffledQueryList");

                    this.shuffleList((List)queryList, this.rnd);

                } else {

                    queryList = this.getQueryList(queryOrderBasicPhase, 0);

                }

            } else {

                queryList = new ArrayList();

                String[] var11;

                int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;

                label65:

                for(int var9 = 0; var9 < var10; ++var9) {

                    basicPhaseNamePattern = var11[var9];

                    String[] queryRange = basicPhaseNamePattern.trim().split("-");

                    switch(queryRange.length) {

                    case 1:

                        ((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));

                        break;

                    case 2:

                        int startQuery = Integer.parseInt(queryRange[0]);

                        int endQuery = Integer.parseInt(queryRange[1]);

                        int i;

                        if(startQuery > endQuery) {

                            i = startQuery;

                            while(true) {

                                if(i < endQuery) {

                                    continue label65;

                                }

                                ((List)queryList).add(Integer.valueOf(i));

                                --i;

                            }

                        } else {

                            i = startQuery;

                            while(true) {

                                if(i > endQuery) {

                                    continue label65;

                                }

                                ((List)queryList).add(Integer.valueOf(i));

                                ++i;

                            }

                        }

                    default:

                        throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");

                    }

                }

            }

            if(queryOrderCached) {

                this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));

            }

            return new ArrayList((Collection)queryList);

        }

    }

parseEnvironment()

parseEnvironment()读取系统的环境变量并解析。

Map env = System.getenv();

this.version = (String)env.get("BIG_BENCH_VERSION");

this.homeDir = (String)env.get("BIG_BENCH_HOME");

this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");

this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";

this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");

this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");

this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");

this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");

this.engine = (String)env.get("BIG_BENCH_ENGINE");

this.database = (String)env.get("BIG_BENCH_DATABASE");

this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");

this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));

this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));

this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");

并自动在用户指定的参数后面加上 -U (解锁专家模式)

this.userArguments.add("-U");

如果用户指定了 PRETEND_MODE、PRINT_STD_OUT、WORKLOAD、QUERIES_TO_RUN，则以用户指定的参数为准，否则使用默认值。

if(env.containsKey("USER_PRETEND_MODE")) {

    this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE"));

}

if(env.containsKey("USER_PRINT_STD_OUT")) {

   this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT"));

}

if(env.containsKey("USER_DRIVER_WORKLOAD")) {

   this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD"));

}

if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) {

    this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN"));

}

读取 workload 并赋值 benchmarkPhases。如果 workload 里不包含 BENCHMARK_START 和 BENCHMARK_STOP，自动在 benchmarkPhases 的首位和末位分别加上 BENCHMARK_START 和 BENCHMARK_STOP。

this.benchmarkPhases = new ArrayList();

Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();

while(var7.hasNext()) {

    String benchmarkPhase = (String)var7.next();

    this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));

}

if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {

    this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);

}

if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {

    this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);

}

run()

run() 方法是 RunBigBench.jar 里核心的方法。所有的执行都是通过 run() 方法调用的。比如 runQueries()、runModule()、generateData()等。runQueries()、runModule()、generateData() 又通过调用 runCmd() 方法来创建操作系统进程，执行bash命令，调用bash脚本。

run() 方法里通过一个 while 循环来逐一执行 workload 里的每一个 benchmarkPhase。不同的 benchmarkPhase 会调用 runQueries()、runModule()、generateData()...中的不同方法。

try {

long e = 0L;

this.log.finer("Benchmark phases: " + this.benchmarkPhases);

Iterator startCheckpoint = this.benchmarkPhases.iterator();

long throughputStart;

while(startCheckpoint.hasNext()) {

    BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();

    if(children.isPhaseDone()) {

        this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");

    } else {

        try {

            switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {

            case 1:

            case 20:

                throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");

            case 2:

                this.log.info(children.getConsoleMessage());

                e = System.currentTimeMillis();

                break;

            case 3:

                if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {

                    throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");

                }

                throughputStart = System.currentTimeMillis();

                this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));

                this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));

                break;

            case 4:

            case 15:

            case 18:

            case 22:

            case 27:

            case 28:

            case 29:

                this.runModule(children, this.userArguments);

                break;

            case 5:

            case 10:

            case 11:

                this.runQueries(children, 1, validationArguments);

                break;

            case 6:

            case 9:

                this.runModule(children, validationArguments);

                break;

            case 7:

                this.generateData(children, false, validationArguments);

                break;

            case 8:

                this.generateData(children, true, validationArguments);

                break;

            case 12:

            case 19:

            case 24:

                this.runQueries(children, 1, this.userArguments);

                break;

            case 13:

            case 14:

            case 21:

            case 23:

            case 25:

            case 26:

                this.runQueries(children, this.numberOfParallelStreams, this.userArguments);

                break;

            case 16:

                this.generateData(children, false, this.userArguments);

                break;

            case 17:

                this.generateData(children, true, this.userArguments);

            }

            children.setPhaseDone(true);

        } catch (IOException var21) {

            this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");

            var21.printStackTrace();

            if(this.stopAfterFailure || children.mustSucceed()) {

                break;

            }

        }

    }

}

这里的 case 1-29 并不是 1-29 条查询，而是枚举类型里的 1-29 个 benmarkPhase 。如下所示：

private static enum BenchmarkPhase {

BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),

BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),

BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),

CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),

ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),

ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),

ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),

ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),

ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),

ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),

ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),

CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),

CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),

CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),

CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),

CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),

DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),

LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),

POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),

THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),

THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),

THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),

THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),

VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),

VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),

VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),

SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),

SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),

SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");

private String runModule;

private String namePattern;

private boolean queryOrderRandom;

private boolean queryOrderCached;

private boolean printStdOut;

private boolean mustSucceed;

private String consoleMessage;

private boolean phaseDone;

private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {

    this.runModule = runModule;

    this.namePattern = namePattern;

    this.queryOrderRandom = queryOrderRandom;

    this.queryOrderCached = queryOrderCached;

    this.printStdOut = printStdOut;

    this.mustSucceed = mustSucceed;

    this.consoleMessage = consoleMessage;

    this.phaseDone = false;

}

3对应 BENCHMARK_STOP，4对应 CLEAN_ALL,29对应 SHOW_VALIDATION，依此类推...

可以看出：

CLEAN_ALL、CLEAN_LOAD_TEST、LOAD_TEST、THROUGHPUT_TEST_REFRESH、SHOW_TIMES、SHOW_ERRORS、SHOW_VALIDATION等benchmarkPhases调用的是

this.runModule(children, this.userArguments);

方法是 runModule ，参数是 this.userArguments。

ENGINE_VALIDATION_CLEAN_POWER_TEST、ENGINE_VALIDATION_POWER_TEST、ENGINE_VALIDATION_RESULT_VALIDATION 调用的是

this.runQueries(children, 1, validationArguments);

方法是 runQueries ，参数是 1(stream number) 和 validationArguments。

ENGINE_VALIDATION_CLEAN_LOAD_TEST 和 ENGINE_VALIDATION_LOAD_TEST 调用的是

this.runModule(children, validationArguments);

ENGINE_VALIDATION_CLEAN_DATA 调用的是

this.generateData(children, false, validationArguments);

ENGINE_VALIDATION_DATA_GENERATION 调用的是

this.generateData(children, true, validationArguments);

CLEAN_POWER_TEST、POWER_TEST、VALIDATE_POWER_TEST 调用的是

this.runQueries(children, 1, this.userArguments);

CLEAN_THROUGHPUT_TEST_1``CLEAN_THROUGHPUT_TEST_2``THROUGHPUT_TEST_1``THROUGHPUT_TEST_2``VALIDATE_THROUGHPUT_TEST_1 VALIDATE_THROUGHPUT_TEST_2 调用的是

this.runQueries(children, this.numberOfParallelStreams, this.userArguments);

CLEAN_DATA 调用的是

this.generateData(children, false, this.userArguments);

DATA_GENERATION 调用的是

this.generateData(children, true, this.userArguments);

总结一下以上的方法调用可以发现：

跟 ENGINE_VALIDATION 相关的benchmarkPhase用的参数都是 validationArguments。其余用的是 userArguments（ validationArguments 和 userArguments 唯一的区别是 validationArguments 的 SCALE_FACTOR 恒为1）
跟 POWER_TEST 相关的都是调用 runQueries() 方法，因为 POWER_TEST 就是执行SQL查询
跟 CLEAN_DATA DATA_GENERATION 相关的都是调用 generateData() 方法
跟 LOAD_TEST SHOW 相关的都是调用 runModule() 方法

benchmarkPhase 和 module 对应关系

具体每个 benchmarkPhase 跟 module（执行的脚本）的对应关系如下：

CLEAN_ALL -> "cleanAll"

ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery"

ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore",

ENGINE_VALIDATION_CLEAN_DATA -> "cleanData"

ENGINE_VALIDATION_DATA_GENERATION -> "dataGen"

ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore"

ENGINE_VALIDATION_POWER_TEST -> "runQuery"

ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery"

CLEAN_POWER_TEST -> "cleanQuery"

CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery"

CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery"

CLEAN_LOAD_TEST -> "cleanMetastore"

CLEAN_DATA -> "cleanData"

DATA_GENERATION -> "dataGen"

LOAD_TEST -> "populateMetastore"

POWER_TEST -> "runQuery"

THROUGHPUT_TEST -> (String)null

THROUGHPUT_TEST_1 -> "runQuery"

THROUGHPUT_TEST_REFRESH -> "refreshMetastore"

THROUGHPUT_TEST_2 -> "runQuery"

VALIDATE_POWER_TEST -> "validateQuery"

VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery"

VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery"

SHOW_TIMES -> "showTimes"

SHOW_ERRORS -> "showErrors"

SHOW_VALIDATION -> "showValidation"

当执行某个 benchmarkPhase 时会去调用如上该 benchmarkPhase 对应的 module （脚本位于 $BENCH_MARK_HOME/engines/hive/bin 目录下）

cmdLine.add(benchmarkPhase.getRunModule());

程序调用流程

bigBench.png

接下来介绍每个module的功能

module

cleanAll

1. DROP DATABASE

2. 删除hdfs上的源数据

echo "dropping database (with all tables)"

runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;"

echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}"

hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"

cleanQuery

1. 删除对应的 Query 生成的临时表

2. 删除对应的 Query 生成的结果表

runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;"

return $?

cleanMetastore

1. 调用 `dropTables.sql` 将23张表依次DROP

echo "cleaning metastore tables"

runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"

export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"

dropTables.sql 将23张表依次DROP,源码如下：

DROP TABLE IF EXISTS ${hiveconf:customerTableName};

DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};

DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};

DROP TABLE IF EXISTS ${hiveconf:dateTableName};

DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};

DROP TABLE IF EXISTS ${hiveconf:incomeTableName};

DROP TABLE IF EXISTS ${hiveconf:itemTableName};

DROP TABLE IF EXISTS ${hiveconf:promotionTableName};

DROP TABLE IF EXISTS ${hiveconf:reasonTableName};

DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};

DROP TABLE IF EXISTS ${hiveconf:storeTableName};

DROP TABLE IF EXISTS ${hiveconf:timeTableName};

DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};

DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};

DROP TABLE IF EXISTS ${hiveconf:webPageTableName};

DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};

DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};

DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};

DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};

DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};

DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};

DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};

DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};

cleanData

1. 删除hdfs上 /user/root/benchmarks/bigbench/data 目录

2. 删除hdfs上 /user/root/benchmarks/bigbench/data_refresh 目录

echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"

hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"

echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

dataGen

1. 创建目录 /user/root/benchmarks/bigbench/data 并赋予权限

2. 创建目录 /user/root/benchmarks/bigbench/data_refresh 并赋予权限

3. 调用 HadoopClusterExec.jar 和 pdgf.jar 生成 base data 到 /user/root/benchmarks/bigbench/data 目录下

4. 调用 HadoopClusterExec.jar 和 pdgf.jar 生成 refresh data 到 /user/root/benchmarks/bigbench/data_refresh 目录下

创建目录 /user/root/benchmarks/bigbench/data 并赋予权限

runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"

runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"

创建目录 /user/root/benchmarks/bigbench/data_refresh 并赋予权限

runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"

调用 HadoopClusterExec.jar 和 pdgf.jar 生成 base data

runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1

调用 HadoopClusterExec.jar 和 pdgf.jar 生成 refresh data

runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1

populateMetastore

该过程是真正的创建数据库表的过程。建表的过程调用的是 $BENCH_MARK_HOME/engines/hive/population/ 目录下的 hiveCreateLoad.sql ,通过该sql文件来建数据库表。

从 /user/root/benchmarks/bigbench/data 路径下读取 .dat 的原始数据，生成 TEXTFILE 格式的外部临时表
用 select * from 临时表 来创建最终的 ORC 格式的数据库表
删除外部临时表。

从 /user/root/benchmarks/bigbench/data 路径下读取 .dat 的原始数据，生成 TEXTFILE 格式的外部临时表

DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};

CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}

  ( c_customer_sk             bigint              --not null

  , c_customer_id             string              --not null

  , c_current_cdemo_sk        bigint

  , c_current_hdemo_sk        bigint

  , c_current_addr_sk         bigint

  , c_first_shipto_date_sk    bigint

  , c_first_sales_date_sk     bigint

  , c_salutation              string

  , c_first_name              string

  , c_last_name               string

  , c_preferred_cust_flag     string

  , c_birth_day               int

  , c_birth_month             int

  , c_birth_year              int

  , c_birth_country           string

  , c_login                   string

  , c_email_address           string

  , c_last_review_date        string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'

  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'

;

用 select * from 临时表 来创建最终的 ORC 格式的数据库表

DROP TABLE IF EXISTS ${hiveconf:customerTableName};

CREATE TABLE ${hiveconf:customerTableName}

STORED AS ${hiveconf:tableFormat}

AS

SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}

;

删除外部临时表

DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};

runQuery

1. runQuery 调用每个query下的 run.sh 里的 `query_run_main_method()` 方法

2. `query_run_main_method()` 调用 `runEngineCmd` 来执行query脚本（qxx.sql）

runQuery 调用每个query下的 run.sh 里的 query_run_main_method() 方法

QUERY_MAIN_METHOD="query_run_main_method"

-----------------------------------------

"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1

query_run_main_method() 调用 runEngineCmd 来执行query脚本（qxx.sql）

query_run_main_method () {

    QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"

    if [ ! -r "$QUERY_SCRIPT" ]

    then

        echo "SQL file $QUERY_SCRIPT can not be read."

        exit 1

    fi

    runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"

    return $?

}

一般情况下 query_run_main_method () 方法只是执行对应的query脚本，但是像 q05、q20... 这些查询，用到了机器学习算法，所以在执行对应的query脚本后会把生成的结果表作为输入，然后调用执行机器学习算法（如聚类、逻辑回归）的jar包继续执行，得到最终的结果。

runEngineCmd () {

  if addInitScriptsToParams

  then

    "$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"

  else

    return 1

  fi

}

--------------------------

BINARY="/usr/bin/hive"

BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)

INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")

INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")

if [ -n "$USER_QUERY_PARAMS_FILE" ]

then

if [ -r "$USER_QUERY_PARAMS_FILE" ]

then

  echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."

  INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")

else

  echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."

  return 1

fi

fi

if [ -n "$USER_ENGINE_SETTINGS_FILE" ]

then

if [ -r "$USER_ENGINE_SETTINGS_FILE" ]

then

  echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."

  INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")

else

  echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."

  return 1

fi

fi

return 0

validateQuery

1. 调用每个query下的 run.sh 里的 `query_run_validate_method()` 方法

2. `query_run_validate_method()` 比较 `$BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result` 和hdfs上 `/user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result` 两个文件，如果一样，则验证通过，否则验证失败。

if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")

then

    echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"

else

    echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"

    VALIDATION_PASSED="0"

fi

SF为1时(-f 1)，用上面的方法比较，SF不为1（>1）时,只要hdfs上的结果表中行数大于等于1即验证通过

if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]

then

    echo "Validation passed: Query returned results"

else

    echo "Validation failed: Query did not return results"

    return 1

fi

refreshMetastore

1. 调用 `$BENCH_MARK_HOME/engines/hive/refresh/` 目录下的 `hiveRefreshCreateLoad.sql` 脚本

2. `hiveRefreshCreateLoad.sql` 将hdfs上 `/user/root/benchmarks/bigbench/data_refresh/` 目录下每个表数据插入外部临时表

3. 外部临时表再将每个表的数据插入Hive数据库对应的表中

hiveRefreshCreateLoad.sql 将hdfs上 /user/root/benchmarks/bigbench/data_refresh/ 目录下每个表数据插入外部临时表

DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};

CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}

  ( c_customer_sk             bigint              --not null

  , c_customer_id             string              --not null

  , c_current_cdemo_sk        bigint

  , c_current_hdemo_sk        bigint

  , c_current_addr_sk         bigint

  , c_first_shipto_date_sk    bigint

  , c_first_sales_date_sk     bigint

  , c_salutation              string

  , c_first_name              string

  , c_last_name               string

  , c_preferred_cust_flag     string

  , c_birth_day               int

  , c_birth_month             int

  , c_birth_year              int

  , c_birth_country           string

  , c_login                   string

  , c_email_address           string

  , c_last_review_date        string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'

  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'

;

-----------------

set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};

外部临时表再将每个表的数据插入Hive数据库对应的表中

INSERT INTO TABLE ${hiveconf:customerTableName}

SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}

;

附录

23张数据库表