Unzip user.zip

[root@hadoop1 test]# unzip user.zip -d /test/bigdatacase/dataset
Archive: user.zip
inflating: /test/bigdatacase/dataset/raw_user.csv
inflating: /test/bigdatacase/dataset/small_user.csv
 

List the two extracted files and preview the first few lines of raw_user.csv:

[root@hadoop1 dataset]# ll
总用量 1025444
-rw-r--r--. 1 root root 1036519392 3月 3 2016 raw_user.csv
-rw-r--r--. 1 root root 13530111 11月 26 2016 small_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
user_id,item_id,behavior_type,user_geohash,item_category,time
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
[root@hadoop1 dataset]#

Each record contains six fields; the fields in the dataset and their meanings are as follows:

user_id (user ID)
item_id (item ID)
behavior_type (browse, favorite, add to cart, purchase; encoded as 1, 2, 3, 4 respectively)
user_geohash (geohash of the user's location; many records have no value here, so the preprocessing script later drops this field entirely)
item_category (item category)
time (timestamp of the record)

I. Preprocessing the dataset

1. Delete the first line of each file (the header row)

The first line of both raw_user.csv and small_user.csv is the header (field names). When the data is later loaded into the Hive data warehouse, the header line is not needed, so we delete it as part of preprocessing.

[root@hadoop1 dataset]# sed -i '1d' raw_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#

Do the same for small_user.csv:

[root@hadoop1 dataset]# sed -i '1d' small_user.csv
[root@hadoop1 dataset]# head -5 small_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#

From here on we use the small dataset small_user.csv to save time. Once the whole workflow runs end to end, the full raw_user.csv dataset can be put through it as a final test.

2. Preprocess the fields

Next, preprocess the dataset: add an id field to each record (so that every record is unique), add a province field (used later for visualization), and drop the user_geohash field (it is not needed in the subsequent analysis).

Create a script file pre_deal.sh and place it in the dataset directory, alongside small_user.csv:

[root@hadoop1 dataset]# cd /test/bigdatacase//dataset
[root@hadoop1 dataset]# vim pre_deal.sh
#!/bin/bash
# The first argument passed to pre_deal.sh is used as the input file name
infile=$1
# The second argument passed to pre_deal.sh is used as the output file name
outfile=$2   # awk -F "," '<processing logic>' $infile > $outfile
awk -F "," 'BEGIN{
srand();
id=0;
Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
Province[14]="浙江";Province[15]="贵州";Province[16]="新疆";Province[17]="西藏";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁"; Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
Province[28]="安徽"; Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="台湾";
}
{
id=id+1;
value=int(rand()*34);
print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
}' $infile > $outfile

In this script, the -F option tells awk that the fields in each record are separated by commas.

srand() seeds the random number generator. id is a new field added to the dataset; it is incremented by 1 for each record, so every record gets a unique value.

We also add a province field to each record to support the later visualization analysis. The Province[] array holds the names of China's provinces and regions; as awk walks through the dataset, for each record the statement value=int(rand()*34) generates a random integer from 0 to 33, which is used as an index into Province[] to look up a province name that is appended to that record.

substr($6,1,10) truncates the time field to just the year-month-day part, so it can later be stored as a DATE.

Each record awk processes has six fields, the sixth being the time field. substr($6,1,10) takes the first 10 characters of the sixth field. The field looks like "2014-12-08 18" (i.e. 6 p.m. on December 8, 2014), so after truncation the hour is dropped and only the date remains.

In the statement print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value], the fourth field of each record is dropped; as noted earlier, the user geohash field is removed during preprocessing. The fields in the output file are separated by "\t" (tabs).

Run the script to preprocess small_user.csv:

[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
-bash: ./pre_deal.sh: 权限不够 # permission denied: add execute permission
[root@hadoop1 dataset]# chmod +x pre_deal.sh
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
[root@hadoop1 dataset]# head -10 user_table.txt
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
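
As an optional sanity check (not part of the original steps), you can confirm that every line of the generated file has exactly 7 tab-separated fields and that the row count matches small_user.csv:

awk -F '\t' '{print NF}' user_table.txt | sort -u   # should print a single value: 7
wc -l user_table.txt                                # should equal the line count of small_user.csv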

3. Load the data into Hive

To load the data in user_table.txt into the Hive data warehouse, first upload user_table.txt to the distributed file system HDFS, then create an external table in Hive to complete the import.

Upload user_table.txt from the local Linux file system to HDFS, into the "/bigdatacase/dataset" directory: create a new directory bigdatacase under the HDFS root and a dataset subdirectory inside it.

[root@hadoop1 dataset]# hdfs dfs -mkdir -p /bigdatacase/dataset
[root@hadoop1 dataset]# hdfs dfs -put /test/bigdatacase/dataset/user_table.txt /bigdatacase/dataset

Check the result:

[root@hadoop1 dataset]# hdfs dfs -cat /bigdatacase/dataset/user_table.txt | head -10
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
cat: Unable to write to output stream.
[root@hadoop1 dataset]#

Create the external table

First create a database, then create an external table bigdata_user in it with the columns (id, uid, item_id, behavior_type, item_category, visit_date, province). Enter the following at the hive prompt:

hive (default)> CREATE DATABASE dblab;
OK
hive (default)> use dblab;
OK
Time taken: 0.692 seconds
CREATE EXTERNAL TABLE bigdata_user(
id INT,
uid STRING,
item_id STRING,
behavior_type INT,
item_category STRING,
visit_data DATE,
province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/bigdatacase/dataset';

The data under "/bigdatacase/dataset" in HDFS has now been loaded into the Hive data warehouse. Let's take a look.

Note: visit_date DATE was accidentally typed as visit_data DATE above, so rename the column:

hive (dblab)> ALTER TABLE bigdata_user CHANGE visit_data visit_date date;
OK
Time taken: 9.862 seconds
hive (dblab)> select * from bigdata_user limit 10;
OK
bigdata_user.id bigdata_user.uid bigdata_user.item_id bigdata_user.behavior_type bigdata_user.item_category bigdata_user.visit_date bigdata_user.province
1 10001082 285259775 1 4076 2014-12-08 香港
2 10001082 4368907 1 5503 2014-12-12 山西
3 10001082 4368907 1 5503 2014-12-12 湖北
4 10001082 53616768 1 9762 2014-12-02 河北
5 10001082 151466952 1 5232 2014-12-12 江西
6 10001082 53616768 4 9762 2014-12-02 广东
7 10001082 290088061 1 5503 2014-12-12 西藏
8 10001082 298397524 1 10894 2014-12-12 湖北
9 10001082 32104252 1 6513 2014-12-12 甘肃
10 10001082 323339743 1 10894 2014-12-12 江西
Time taken: 12.896 seconds, Fetched: 10 row(s)
hive (dblab)> select behavior_type from bigdata_user limit 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 6.927 seconds, Fetched: 10 row(s)

Querying the table through beeline produces more neatly formatted output.
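
For reference, beeline can be started with something like the command below; the JDBC URL matches the prompt shown in the transcript, while the user name is only an assumption and depends on your HiveServer2 configuration:

beeline -u jdbc:hive2://hadoop001:10000 -n root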

0: jdbc:hive2://hadoop001:10000> select * from bigdata_user limit 10;
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| bigdata_user.id | bigdata_user.uid | bigdata_user.item_id | bigdata_user.behavior_type | bigdata_user.item_category | bigdata_user.visit_date | bigdata_user.province |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| 1 | 10001082 | 285259775 | 1 | 4076 | 2014-12-08 | 香港 |
| 2 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 山西 |
| 3 | 10001082 | 4368907 | 1 | 5503 | 2014-12-12 | 湖北 |
| 4 | 10001082 | 53616768 | 1 | 9762 | 2014-12-02 | 河北 |
| 5 | 10001082 | 151466952 | 1 | 5232 | 2014-12-12 | 江西 |
| 6 | 10001082 | 53616768 | 4 | 9762 | 2014-12-02 | 广东 |
| 7 | 10001082 | 290088061 | 1 | 5503 | 2014-12-12 | 西藏 |
| 8 | 10001082 | 298397524 | 1 | 10894 | 2014-12-12 | 湖北 |
| 9 | 10001082 | 32104252 | 1 | 6513 | 2014-12-12 | 甘肃 |
| 10 | 10001082 | 323339743 | 1 | 10894 | 2014-12-12 | 江西 |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
10 rows selected (1.055 seconds)
0: jdbc:hive2://hadoop001:10000> select behavior_type from bigdata_user limit 10;
+----------------+
| behavior_type |
+----------------+
| 1 |
| 1 |
| 1 |
| 1 |
| 1 |
| 4 |
| 1 |
| 1 |
| 1 |
| 1 |
+----------------+
10 rows selected (6.337 seconds)
0: jdbc:hive2://hadoop001:10000>

View the full table definition with show create table bigdata_user;

View the basic table structure with desc bigdata_user;

hive (dblab)> show create table bigdata_user;
OK
createtab_stmt
CREATE EXTERNAL TABLE `bigdata_user`(
`id` int,
`uid` string,
`item_id` string,
`behavior_type` int,
`item_category` string,
`visit_date` date,
`province` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://192.168.1.11:8020/bigdatacase/dataset'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1605450738')
Time taken: 35.032 seconds, Fetched: 22 row(s)
hive (dblab)>
hive (dblab)> desc bigdata_user;
OK
col_name data_type comment
id int
uid string
item_id string
behavior_type int
item_category string
visit_date date
province string
Time taken: 14.897 seconds, Fetched: 7 row(s)
hive (dblab)>

II. Simple query analysis

First try a few simple queries:

View the item behavior type of the first 10 records:

SELECT behavior_type FROM bigdata_user LIMIT 10;

hive (dblab)> SELECT behavior_type FROM bigdata_user LIMIT 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 10.666 seconds, Fetched: 10 row(s)

Query the visit date and item category of the first 20 records:

SELECT visit_date,item_category FROM bigdata_user LIMIT 20;

hive (dblab)> SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
OK
visit_date item_category
2014-12-08 4076
2014-12-12 5503
2014-12-12 5503
2014-12-02 9762
2014-12-12 5232
2014-12-02 9762
2014-12-12 5503
2014-12-12 10894
2014-12-12 6513
2014-12-12 10894
2014-12-12 2825
2014-11-28 2825
2014-12-15 3200
2014-12-03 10576
2014-11-20 10576
2014-12-13 10576
2014-12-08 10576
2014-12-14 7079
2014-12-02 6669
2014-12-12 5232
Time taken: 7.74 seconds, Fetched: 20 row(s)

Queries against the table can also use nested subqueries:

hive (dblab)> SELECT e.bh, e.it FROM (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e LIMIT 20;
OK
e.bh e.it
1 4076
1 5503
1 5503
1 9762
1 5232
4 9762
1 5503
1 10894
1 6513
1 10894
1 2825
1 2825
1 3200
1 10576
1 10576
1 10576
1 10576
1 7079
1 6669
1 5232
Time taken: 5.872 seconds, Fetched: 20 row(s)

III. Row-count statistics

(1) Use the aggregate function count() to find how many rows the table contains

SELECT COUNT(*) FROM bigdata_user;

hive (dblab)> SELECT COUNT(*) FROM bigdata_user;
Query ID = root_20201115231503_8cce31a9-81cb-4ef3-92ca-94bafb2086b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-15 23:17:28,850 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0001, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0001/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:20:28,863 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:21:29,392 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:22:27,725 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.91 sec
2020-11-15 23:23:10,606 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 970 msec
Ended Job = job_1605437253399_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.97 sec HDFS Read: 15603831 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 970 msec
OK
_c0
300000
Time taken: 501.726 seconds, Fetched: 1 row(s)

(2) Add DISTINCT inside the function to count how many distinct uid values there are

SELECT COUNT(DISTINCT uid) FROM bigdata_user;

hive (dblab)> SELECT COUNT(DISTINCT uid) FROM bigdata_user;
Query ID = root_20201115232536_31d143aa-c5ed-4b53-b0a6-8c64f14f8634
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-15 23:25:58,752 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0002, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0002/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:26:38,676 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:27:38,820 Stage-1 map = 0%, reduce = 0%
2020-11-15 23:28:20,002 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.33 sec
2020-11-15 23:29:05,715 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 9.77 sec
2020-11-15 23:29:11,177 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.77 sec
MapReduce Total cumulative CPU time: 10 seconds 770 msec
Ended Job = job_1605437253399_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.77 sec HDFS Read: 15599879 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 770 msec
OK
_c0
270
Time taken: 253.202 seconds, Fetched: 1 row(s)
(3) Count how many records are unique (to filter out order-brushing / fake orders)
SELECT COUNT(*) FROM (SELECT
uid,item_id,behavior_type,item_category,visit_date,province
FROM
bigdata_user
GROUP BY
uid,item_id,behavior_type,item_category,visit_date,province
HAVING COUNT(*)=1)a;
hive (dblab)> SELECT COUNT(*) FROM (SELECT
> uid,item_id,behavior_type,item_category,visit_date,province
> FROM
> bigdata_user
> GROUP BY
> uid,item_id,behavior_type,item_category,visit_date,province
> HAVING COUNT(*)=1)a;
Query ID = root_20201116003743_ffd71e21-1df6-4876-b80e-1672d4c3f19d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 00:38:51,966 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0004, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0004/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 00:40:19,298 Stage-1 map = 0%, reduce = 0%
2020-11-16 00:41:12,139 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 11.81 sec
2020-11-16 00:41:13,194 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 13.35 sec
2020-11-16 00:42:13,270 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 13.35 sec
2020-11-16 00:42:43,105 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 17.76 sec
2020-11-16 00:42:49,495 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 22.34 sec
2020-11-16 00:42:55,171 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 25.5 sec
MapReduce Total cumulative CPU time: 25 seconds 680 msec
Ended Job = job_1605437253399_0004
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 00:43:09,951 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0005, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0005/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0005
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-11-16 00:43:51,371 Stage-2 map = 0%, reduce = 0%
2020-11-16 00:44:11,773 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.71 sec
2020-11-16 00:45:12,075 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.71 sec
2020-11-16 00:45:41,184 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 7.7 sec
2020-11-16 00:45:47,993 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 8.67 sec
MapReduce Total cumulative CPU time: 8 seconds 670 msec
Ended Job = job_1605437253399_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 25.68 sec HDFS Read: 15606542 HDFS Write: 117 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 8.67 sec HDFS Read: 6997 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
_c0
284368
Time taken: 494.126 seconds, Fetched: 1 row(s)

IV. Keyword-conditioned query analysis

1. Queries conditioned on a value range. Using WHERE narrows the scope of a query and makes the analysis more precise; the examples below illustrate this.

(1) Count the item-browsing records between December 10, 2014 and December 13, 2014

hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
> behavior_type='1'
> AND visit_date<'2014-12-13'
> AND visit_date>'2014-12-10';
Query ID = root_20201116090806_6b2a5b9c-f996-4caf-baa7-4bea6812cfda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 09:11:29,381 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0008, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0008/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 09:16:53,597 Stage-1 map = 0%, reduce = 0%
2020-11-16 09:17:08,350 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.95 sec
2020-11-16 09:17:34,025 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.15 sec
MapReduce Total cumulative CPU time: 16 seconds 350 msec
Ended Job = job_1605437253399_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.35 sec HDFS Read: 15606581 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 350 msec
OK
_c0
26329
Time taken: 629.506 seconds, Fetched: 1 row(s)
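
Note that COUNT(*) here counts browse records rather than distinct visitors, and the two comparisons are exclusive, so only 2014-12-11 and 2014-12-12 fall in the range. If the number of distinct users who browsed in that window is wanted instead, a small variant of the same query (a sketch over the same table and columns) would be:

SELECT COUNT(DISTINCT uid) FROM bigdata_user WHERE
behavior_type='1'
AND visit_date<'2014-12-13'
AND visit_date>'2014-12-10';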

(2) Using the day of the month as the unit, show for each day how many distinct users made a purchase on the site

SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
behavior_type='4' GROUP BY day(visit_date);
hive (dblab)> SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
> behavior_type='4' GROUP BY day(visit_date);
Query ID = root_20201116115335_d26e8b96-2962-4e18-8e18-adbc7324d127
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 11:55:06,219 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0009, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0009/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 11:56:13,054 Stage-1 map = 0%, reduce = 0%
2020-11-16 11:56:35,088 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.08 sec
2020-11-16 11:57:03,002 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.01 sec
MapReduce Total cumulative CPU time: 14 seconds 10 msec
Ended Job = job_1605437253399_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.01 sec HDFS Read: 15601205 HDFS Write: 618 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 10 msec
OK
_c0 _c1
37 1
48 2
42 3
38 4
42 5
33 6
42 7
36 8
34 9
40 10
43 11
98 12
39 13
43 14
42 15
44 16
42 17
66 18
38 19
50 20
33 21
34 22
32 23
47 24
34 25
31 26
30 27
34 28
39 29
38 30
Time taken: 212.641 seconds, Fetched: 30 row(s)
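
The query above gives, for each day of the month, the number of distinct users who purchased. If the number of purchase records per day (closer to "items sold") is wanted instead, dropping DISTINCT gives a sketch like:

SELECT COUNT(*), day(visit_date) FROM bigdata_user WHERE
behavior_type='4' GROUP BY day(visit_date);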

2. Queries conditioned on specific key values, analyzing the remaining data

Given a date and a location, count the goods ordered for shipment to that location on that day:

SELECT COUNT(*) FROM bigdata_user WHERE
province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
hive (dblab)> SELEct COUNT(*) FROM bigdata_user WHERE
> province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
Query ID = root_20201116120451_c13ae714-6cf1-400c-b9c0-d6afdd25ac7a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 12:05:01,440 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0010, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0010/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:06:43,726 Stage-1 map = 0%, reduce = 0%
2020-11-16 12:07:13,387 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.12 sec
2020-11-16 12:07:31,429 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.12 sec
MapReduce Total cumulative CPU time: 14 seconds 140 msec
Ended Job = job_1605437253399_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.14 sec HDFS Read: 15606342 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 140 msec
OK
_c0
12
Time taken: 166.15 seconds, Fetched: 1 row(s)

V. Analysis based on user behavior

1. Query the purchase rate or browse rate on a given day

For example:

SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
> AND behavior_type='4';
Query ID = root_20201116122136_fbb9be2d-7c88-409d-ada8-0e24ca4e912d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 12:22:11,860 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0011, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0011/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:24:31,024 Stage-1 map = 0%, reduce = 0%
2020-11-16 12:24:40,319 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.37 sec
2020-11-16 12:25:19,783 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.99 sec
MapReduce Total cumulative CPU time: 16 seconds 990 msec
Ended Job = job_1605437253399_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.99 sec HDFS Read: 15605958 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 990 msec
OK
_c0
69
Time taken: 242.638 seconds, Fetched: 1 row(s)

Count the clicks on the site on 2014-12-11:

SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-11';

The statements above give the purchase count and the click count; dividing the two gives the purchase rate for that day.
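
The same rate can also be computed in a single statement; the following is a sketch (not part of the original steps) using conditional aggregation over the same table:

SELECT SUM(CASE WHEN behavior_type='4' THEN 1 ELSE 0 END) / COUNT(*)
FROM bigdata_user WHERE visit_date='2014-12-11';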

2. Query what share of all click actions on a given day came from a particular user (click actions include browsing, adding to cart, favoriting, and purchasing)

//Count how many times user 10001082 clicked on the site on 2014-12-12

SELECT COUNT(*) FROM bigdata_user WHERE
uid=10001082 AND visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
> uid=10001082 AND visit_date='2014-12-12';
Query ID = root_20201116124039_d60d642d-b253-49a4-a5d6-b938e307157b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 12:40:52,056 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0013, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0013/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:41:17,591 Stage-1 map = 0%, reduce = 0%
2020-11-16 12:41:44,422 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.98 sec
2020-11-16 12:42:10,097 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.17 sec
MapReduce Total cumulative CPU time: 15 seconds 170 msec
Ended Job = job_1605437253399_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 15.17 sec HDFS Read: 15606459 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
_c0
69
Time taken: 95.548 seconds, Fetched: 1 row(s)
hive (dblab)>

//Count the total number of clicks on the site by all users on that day

SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
> visit_date='2014-12-12';
Query ID = root_20201116124543_a447db6c-52cd-4be9-adb6-1a57f5163eb3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 12:45:48,156 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0014, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0014/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:48:59,555 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.29 sec
2020-11-16 12:50:06,680 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.29 sec
2020-11-16 12:51:44,664 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.4 sec
2020-11-16 12:51:59,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.82 sec
MapReduce Total cumulative CPU time: 17 seconds 280 msec
Ended Job = job_1605437253399_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.28 sec HDFS Read: 15605296 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 280 msec
OK
_c0
17494
Time taken: 454.833 seconds, Fetched: 1 row(s)
Dividing the results of the two statements above gives the desired proportion.
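
Here too the proportion can be obtained in one statement; a sketch over the same table:

SELECT SUM(CASE WHEN uid=10001082 THEN 1 ELSE 0 END) / COUNT(*)
FROM bigdata_user WHERE visit_date='2014-12-12';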

3. Given a threshold on purchase count, find the ids of users who bought more than that many items on the site on a given day

//Find the ids of users who made more than 5 purchases on the site on a given day

SELECT uid FROM bigdata_user WHERE
behavior_type='4' AND visit_date='2014-12-12'
GROUP BY uid having count(behavior_type='4')>5;
hive (dblab)> SELECT uid FROM bigdata_user WHERE
> behavior_type='4' AND visit_date='2014-12-12'
> GROUP BY uid having count(behavior_type='4')>5;
Query ID = root_20201116130147_72230e9f-b748-4851-974c-1ca5dcceb37c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 13:07:39,493 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0015, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0015/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 13:09:16,947 Stage-1 map = 0%, reduce = 0%
2020-11-16 13:10:02,481 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.51 sec
2020-11-16 13:10:33,388 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 17.71 sec
MapReduce Total cumulative CPU time: 17 seconds 710 msec
Ended Job = job_1605437253399_0015
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.71 sec HDFS Read: 15607239 HDFS Write: 478 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 710 msec
OK
uid
100226515
100300684
100555417
100605
10095384
10142625
101490976
101982646
102011320
102030700
102079825
102349447
102612580
102650143
103082347
103139791
103794013
103995979
Time taken: 533.86 seconds, Fetched: 18 row(s)

VI. Real-time user query analysis

Number of times users from each region browsed the site that day

//Create a new table to store the results

CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive (dblab)> CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 47.899 seconds
//Insert the data
INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
FROM bigdata_user WHERE behavior_type='1' GROUP BY province;

//Check the result
SELECT * FROM scan;
 
hive (dblab)> INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
> FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
Query ID = root_20201116133805_d502f968-f966-4de9-ac2b-075942cd23dc
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 13:38:36,530 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0016, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0016/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 13:39:13,851 Stage-1 map = 0%, reduce = 0%
2020-11-16 13:39:36,587 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.79 sec
2020-11-16 13:39:54,796 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.42 sec
MapReduce Total cumulative CPU time: 13 seconds 420 msec
Ended Job = job_1605437253399_0016
Loading data to table dblab.scan
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2020-11-16 13:40:02,593 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0017, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0017/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2020-11-16 13:40:34,772 Stage-3 map = 0%, reduce = 0%
2020-11-16 13:41:35,425 Stage-3 map = 0%, reduce = 0%
2020-11-16 13:42:27,059 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 5.0 sec
2020-11-16 13:42:57,783 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 9.3 sec
MapReduce Total cumulative CPU time: 9 seconds 300 msec
Ended Job = job_1605437253399_0017
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 13.42 sec HDFS Read: 15608313 HDFS Write: 913 SUCCESS
Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 9.3 sec HDFS Read: 11053 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 720 msec
OK
_col0 _col1
Time taken: 336.34 seconds
hive (dblab)> SELECT * FROM scan;
OK
scan.province scan.scan
上海市 8394
云南 8357
内蒙古 8146
北京市 8328
台湾 8186
吉林 8397
四川 8412
天津市 8557
宁夏 8414
安徽 8222
山东 8404
山西 8342
广东 8321
广西 8265
新疆 8311
江苏 8356
江西 8344
河北 8356
河南 8475
浙江 8321
海南 8417
湖北 8231
湖南 8390
澳门 8342
甘肃 8367
福建 8426
西藏 8355
贵州 8236
辽宁 8316
重庆市 8197
陕西 8364
青海 8384
香港 8203
黑龙江 8380
Time taken: 8.221 seconds, Fetched: 34 row(s)
hive (dblab)>
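
If the per-province counts in scan are needed outside Hive (for example, for the visualization mentioned earlier), one possible way to export them to a local tab-delimited file is shown below; the output path is only an example:

INSERT OVERWRITE LOCAL DIRECTORY '/test/bigdatacase/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM scan;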
