Hive Basic Syntax Practice (1): Table Operations
Hive's table-operation statements are similar to MySQL's, so if you already know MySQL, learning Hive's table operations is easy. The following walks through Hive table operations in depth.

**(1) Create an internal table named student**

```
hive> create table if not exists student (sno INT, sname STRING, age INT, sex STRING) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.985 seconds
```

The general CREATE TABLE syntax is:

```
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
```

- CREATE TABLE creates a table with the given name. If a table with that name already exists, an exception is thrown; the IF NOT EXISTS option suppresses it.
- EXTERNAL creates an external table, pointing it at existing data via a LOCATION path at creation time.
- LIKE copies an existing table's schema without copying its data.
- COMMENT attaches a description to the table or to individual columns.
- ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)] — you can define a custom SerDe at table-creation time or use a built-in one. If no ROW FORMAT or ROW FORMAT DELIMITED clause is given, the built-in SerDe is used. Hive uses the SerDe, together with the column definitions, to map the file data onto the table's columns.
- STORED AS SEQUENCEFILE | TEXTFILE | RCFILE | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname — use STORED AS TEXTFILE for plain-text data; use STORED AS SEQUENCEFILE if the data should be stored compressed.
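To see several of these clauses together, here is a minimal sketch (the table name and comments are invented for illustration, not part of the original session) combining COMMENT, ROW FORMAT DELIMITED, and STORED AS:

```
hive> create table if not exists employee (
        eid INT COMMENT 'employee id',
        ename STRING COMMENT 'employee name')
      comment 'demo table for the clauses above'
      row format delimited fields terminated by '\t'
      stored as textfile;
```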
**(2) Create an external table**

```
hive> create external table if not exists student2 (sno INT, sname STRING, age INT, sex STRING) row format delimited fields terminated by '\t' stored as textfile location '/user/external';
OK
Time taken: 0.089 seconds
hive> show tables;
OK
student1
student2
Time taken: 0.06 seconds, Fetched: 12 row(s)
```
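One behavior worth knowing about external tables (an added note with an illustrative table name, not part of the original session): dropping an external table removes only the metastore entry, while the files under its LOCATION remain on HDFS.

```
hive> create external table ext_demo (id INT) location '/user/external_demo';
hive> drop table ext_demo;
-- The table is gone from the metastore, but /user/external_demo and any
-- files in it are still on HDFS; `hadoop fs -ls /user/external_demo` lists them.
```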
**(3) Drop a table**

First, create a table named test1:

```
hive> create table if not exists test1(id INT, name STRING);
OK
Time taken: 0.064 seconds
```
Then check that the test1 table exists:
```
hive> show tables;
OK
student
student2
test1
Time taken: 0.22 seconds, Fetched: 3 row(s)
```
Drop the test1 table:
```
hive> drop table test1;
OK
Time taken: 0.838 seconds
```
Verify that test1 was dropped:
```
hive> show tables;
OK
student
student2
Time taken: 0.14 seconds, Fetched: 2 row(s)
```
**(4) Modify a table's structure, e.g. add columns**

First, look at the student table's schema:
```
hive> desc student;
OK
sno int
sname string
age int
sex string
Time taken: 0.142 seconds, Fetched: 4 row(s)
```
Add two columns to the student table:
```
hive> alter table student add columns (address STRING, grade STRING);
OK
Time taken: 0.138 seconds
```
Check the schema again to confirm the columns were added:
```
hive> desc student;
OK
sno int
sname string
age int
sex string
address string
grade string
Time taken: 0.145 seconds, Fetched: 6 row(s)
```
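ADD COLUMNS appends columns to the end of the schema. ALTER TABLE can also rename or retype an existing column with CHANGE — a hedged sketch (not run in the original session; grade_level is an illustrative new name):

```
hive> alter table student change column grade grade_level STRING;
```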
**(5) Rename table student to student1**
```
hive> alter table student rename to student1;
OK
Time taken: 0.15 seconds
```
Check:
```
hive> show tables;
OK
student1
student2
Time taken: 0.028 seconds, Fetched: 2 row(s)
```

**(6) Create a table with the same schema as an existing table**
```
hive> create table copy_student1 like student1;
OK
Time taken: 0.092 seconds
```
Check:
```
hive> show tables;
OK
copy_student1
student1
student2
Time taken: 0.03 seconds, Fetched: 3 row(s)
```
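CREATE TABLE ... LIKE copies only the schema. By contrast, CREATE TABLE ... AS SELECT (CTAS) populates the new table from a query's result, copying data as well. A sketch (the table name copy_with_data is illustrative, not part of the original session):

```
hive> create table copy_with_data as select * from student1;
```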
***2. Ways to load data (the data may contain duplicate records); only after data is loaded can the queries below use it***

**(1) Load local data with LOAD**

First, look at the table's schema:
```
hive> desc student1;
OK
sno int
sname string
age int
sex string
address string
grade string
Time taken: 0.118 seconds, Fetched: 6 row(s)
```
Create the /home/hadoop/data directory, create a student1.txt file inside it, and add the following content:
```
[hadoop@master ~]$ cd /home
[hadoop@master home]$ ll
total 4
drwx------. 28 hadoop hadoop 4096 May 17 18:42 hadoop
[hadoop@master home]$ cd hadoop/
[hadoop@master ~]$ ll
total 32
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Desktop
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Documents
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Downloads
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Music
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Pictures
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Public
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Templates
drwxr-xr-x. 2 hadoop hadoop 4096 Apr 3 18:12 Videos
[hadoop@master ~]$ sudo mkdir data/
[hadoop@master ~]$ cd data/
[hadoop@master data]$ sudo vim student1.txt
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
```
Load the data into the student1 table:

```
hive> load data local inpath '/home/hadoop/data/student1.txt' into table student1;
Loading data to table default.student1
Table default.student1 stats: [numFiles=1, numRows=0, totalSize=300, rawDataSize=0]
OK
Time taken: 1.271 seconds
```

Check whether the load succeeded:

```
hive> select * from student1;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.052 seconds, Fetched: 15 row(s)
```
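A brief aside on LOAD semantics (an added note, not from the original session): LOAD DATA LOCAL copies the file into the table's directory, and without OVERWRITE it appends alongside whatever files are already there, so repeated loads produce duplicate rows — likely why the query above reports 15 rows while only 5 distinct rows are shown. Adding OVERWRITE replaces the table's current contents:

```
hive> load data local inpath '/home/hadoop/data/student1.txt' overwrite into table student1;
```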
**(2) Load a file from HDFS**

First, upload student1.txt to the corresponding directory on HDFS:

```
[hadoop@master hadoop-2.6.0]$ hadoop fs -put /home/hadoop/data/student1.txt /user/hive
[hadoop@master hadoop-2.6.0]$ hadoop fs -ls /user/hive
Found 2 items
-rw-r--r-- 3 hadoop supergroup 193 2018-05-17 23:54 /user/hive/student1.txt
drwxr-xr-x - hadoop supergroup 0 2018-05-17 23:10 /user/hive/warehouse
```

Load the file data from HDFS into the copy_student1 table:

```
hive> LOAD DATA INPATH '/user/hive/student1.txt' INTO TABLE copy_student1;
Loading data to table default.copy_student1
Table default.copy_student1 stats: [numFiles=1, totalSize=191]
OK
Time taken: 1.354 seconds
```
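Note (an added observation): when the source path in LOAD DATA INPATH is already on HDFS, Hive moves the file into the table's directory rather than copying it, so the source file disappears from its original location:

```
[hadoop@master hadoop-2.6.0]$ hadoop fs -ls /user/hive
# after the LOAD, student1.txt should no longer be listed under /user/hive
```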
Check whether the load succeeded:

```
hive> SELECT * FROM copy_student1;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.44 seconds, Fetched: 5 row(s)
```

**(3) Insert data into tables (single-table and multi-table inserts)**

1) Single-table insert

First, create a table copy_student2 with the same schema as student1:

```
hive> create table copy_student2 like student1;
OK
Time taken: 0.691 seconds
```

Check that it was created:

```
hive> show tables;
OK
copy_student1
copy_student2
student1
student2
Time taken: 0.065 seconds, Fetched: 4 row(s)
```

Look at copy_student2's schema:

```
hive> DESC copy_student2;
OK
sno int
sname string
age int
sex string
address string
grade string
Time taken: 0.121 seconds, Fetched: 6 row(s)
```

Insert the rows from copy_student1 into copy_student2 (note that the source table in the statement below is copy_student1, which holds the same data):

```
hive> insert overwrite table copy_student2 select * from copy_student1;
Query ID = hadoop_20180518000101_af36da39-e88b-4c1b-b89c-c000bf5f59dd
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1526553207632_0001, Tracking URL = http://master:8088/proxy/application_1526553207632_0001/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-05-18 00:01:16,715 Stage-1 map = 0%, reduce = 0%
2018-05-18 00:01:29,632 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_1526553207632_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ns/tmp/hive/hadoop/d6cb41c0-cc18-471e-861f-f08553caea48/hive_2018-05-18_00-01-00_086_4552315865937351442-1/-ext-10000
Loading data to table default.copy_student2
Table default.copy_student2 stats: [numFiles=1, numRows=5, totalSize=190, rawDataSize=185]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.36 sec HDFS Read: 403 HDFS Write: 268 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
Time taken: 35.015 seconds
```

Check that the data was inserted:

```
hive> select * from copy_student2;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.073 seconds, Fetched: 5 row(s)
```

2) Multi-table insert

First create two tables:

```
hive> CREATE TABLE copy_student3 LIKE student1;
OK
Time taken: 0.21 seconds
hive> CREATE TABLE copy_student4 LIKE student1;
OK
Time taken: 0.099 seconds
```

Insert into multiple tables with a single statement:

```
hive> FROM student1 INSERT OVERWRITE TABLE copy_student3 SELECT * INSERT OVERWRITE TABLE copy_student4 SELECT *;
(MapReduce output omitted)
```
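The multi-insert form scans the source table only once, which is its main advantage over running two separate INSERT statements. Each INSERT branch may also carry its own WHERE filter, as in this added sketch (not part of the original session):

```
hive> FROM student1
      INSERT OVERWRITE TABLE copy_student3 SELECT * WHERE sex = '男'
      INSERT OVERWRITE TABLE copy_student4 SELECT * WHERE sex = '女';
```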
Check the results:

```
hive> select * from copy_student3;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.049 seconds, Fetched: 5 row(s)
```

```
hive> select * from copy_student4;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.049 seconds, Fetched: 5 row(s)
```

**3. Querying table contents**

**(1) Query the entire table**

```
hive> select * from student1;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
201501011 李红 23 女 北京 大三
Time taken: 0.041 seconds, Fetched: 5 row(s)
```

**(2) Query a single column**

```
hive> select sname from student1;
OK
张三
李四
王娟
周王
李红
Time taken: 0.056 seconds, Fetched: 5 row(s)
```

**(3) WHERE-condition queries**

```
hive> SELECT * FROM student1 WHERE sno>201501004 AND address="北京";
OK
201501011 李红 23 女 北京 大三
Time taken: 0.203 seconds, Fetched: 1 row(s)
```

**(4) The difference between ALL and DISTINCT** (this requires duplicate records in the table, or duplicate values in some column). ALL, the default, returns every row; DISTINCT removes duplicates and therefore launches a MapReduce job:

```
hive> select all age,grade from student1;
OK
22 大三
23 大二
22 大三
24 大四
23 大三
Time taken: 0.054 seconds, Fetched: 5 row(s)
```

```
hive> select age,grade from student1;
OK
22 大三
23 大二
22 大三
24 大四
23 大三
Time taken: 0.053 seconds, Fetched: 5 row(s)
```

```
hive> select distinct age,grade from student1;
Query ID = hadoop_20180518001414_fe7461b7-7edd-4661-abc4-14859e3dba91
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0004, Tracking URL = http://master:8088/proxy/application_1526553207632_0004/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 00:14:10,913 Stage-1 map = 0%, reduce = 0%
2018-05-18 00:14:22,260 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.27 sec
2018-05-18 00:14:36,734 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.51 sec
MapReduce Total cumulative CPU time: 2 seconds 510 msec
Ended Job = job_1526553207632_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.51 sec HDFS Read: 391 HDFS Write: 40 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 510 msec
OK
22 大三
23 大三
23 大二
24 大四
Time taken: 34.358 seconds, Fetched: 4 row(s)
```

```
hive> select distinct age from student1;
Query ID = hadoop_20180518001414_69278499-54b5-42b7-867c-4ebe8113a2f9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0005, Tracking URL = http://master:8088/proxy/application_1526553207632_0005/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 00:14:56,548 Stage-1 map = 0%, reduce = 0%
2018-05-18 00:15:03,047 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.85 sec
2018-05-18 00:15:10,390 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.98 sec
MapReduce Total cumulative CPU time: 1 seconds 980 msec
Ended Job = job_1526553207632_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.98 sec HDFS Read: 391 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 980 msec
OK
22
23
24
Time taken: 23.181 seconds, Fetched: 3 row(s)
```

**(5) LIMIT queries**

```
hive> SELECT * FROM student1 LIMIT 4;
OK
201501001 张三 22 男 北京 大三
201501003 李四 23 男 上海 大二
201501004 王娟 22 女 广州 大三
201501010 周王 24 男 深圳 大四
Time taken: 0.253 seconds, Fetched: 4 row(s)
```

**(6) GROUP BY queries**

GROUP BY is commonly used when computing statistics over data; the following walks through its use.

1) Create a table group_test with the contents shown below.

```
hive> create table group_test(uid STRING, gender STRING, ip STRING) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.449 seconds
```

```
[hadoop@master test]$ sudo vim user.txt
08 female 192.168.1.42
01 male 192.168.1.22
02 female 192.168.1.3
01 male 192.168.1.26
03 male 192.168.1.5
08 female 192.168.1.62
04 male 192.168.1.9
06 female 192.168.1.52
06 female 192.168.1.7
08 female 192.168.1.21
05 male 192.168.1.8
01 male 192.168.1.2
01 male 192.168.1.32
05 male 192.168.1.29
03 male 192.168.1.23
06 female 192.168.1.201
07 female 192.168.1.11
08 female 192.168.1.88
```

Load the data into the group_test table:

```
hive> load data local inpath '/home/hadoop/test/user.txt' into table group_test;
Loading data to table default.group_test
Table default.group_test stats: [numFiles=1, totalSize=193]
OK
Time taken: 0.865 seconds
```

2) Count the table's rows:

```
hive> select count(*) from group_test;
Query ID = hadoop_20180518040808_a73617a5-dd9a-48c4-b2a9-0ce1dd4bf4cd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0013, Tracking URL = http://master:8088/proxy/application_1526553207632_0013/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 04:08:45,431 Stage-1 map = 0%, reduce = 0%
2018-05-18 04:08:58,184 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.59 sec
2018-05-18 04:09:09,818 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1526553207632_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec HDFS Read: 624 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 780 msec
OK
18
Time taken: 34.896 seconds, Fetched: 1 row(s)
```

3) Compute the number of distinct users per gender.

First create a table group_gender_sum:

```
hive> create table group_gender_sum(gender STRING,sum INT);
OK
Time taken: 0.142 seconds
```

Insert the deduplicated counts from group_test into group_gender_sum:

```
hive> insert overwrite table group_gender_sum select group_test.gender,count(distinct group_test.uid) from group_test group by group_test.gender;
Query ID = hadoop_20180518041010_e51ae2fb-0b9e-4b5d-9a0c-87946496282f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0014, Tracking URL = http://master:8088/proxy/application_1526553207632_0014/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 04:10:44,336 Stage-1 map = 0%, reduce = 0%
2018-05-18 04:10:50,573 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2018-05-18 04:10:58,903 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.12 sec
MapReduce Total cumulative CPU time: 3 seconds 120 msec
Ended Job = job_1526553207632_0014
Loading data to table default.group_gender_sum
Table default.group_gender_sum stats: [numFiles=1, numRows=17, totalSize=371, rawDataSize=354]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.12 sec HDFS Read: 624 HDFS Write: 452 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 120 msec
OK
Time taken: 29.357 seconds
```
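To see what landed in the result table (not shown in the original log), a quick check:

```
hive> select * from group_gender_sum;
```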
Several aggregations can be performed in the same query, but no two of them may apply DISTINCT to different columns. The following is a legal combination of aggregations. First create a table group_gender_agg:

```
hive> create table group_gender_agg(gender STRING, sum1 INT, sum2 INT, sum3 INT);
OK
Time taken: 0.092 seconds
```

Insert the aggregated data from group_test into group_gender_agg:

```
hive> insert overwrite table group_gender_agg select group_test.gender,count(distinct group_test.uid),count(*),sum(distinct group_test.uid) from group_test group by group_test.gender;
Query ID = hadoop_20180518041212_0cf81102-2c8f-4370-8cda-3b7d61c51877
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0015, Tracking URL = http://master:8088/proxy/application_1526553207632_0015/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 04:12:45,953 Stage-1 map = 0%, reduce = 0%
2018-05-18 04:12:52,218 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.81 sec
2018-05-18 04:12:59,519 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.42 sec
MapReduce Total cumulative CPU time: 2 seconds 420 msec
Ended Job = job_1526553207632_0015
Loading data to table default.group_gender_agg
Table default.group_gender_agg stats: [numFiles=1, numRows=17, totalSize=439, rawDataSize=422]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.42 sec HDFS Read: 624 HDFS Write: 520 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 420 msec
OK
Time taken: 21.103 seconds
```

However, multiple DISTINCT expressions over different columns are not allowed in a single query. The following query is rejected:

```
hive> insert overwrite table group_gender_agg select group_test.gender,count(distinct group_test.uid),count(distinct group_test.ip) from group_test group by group_test.gender;
```

This statement is illegal because distinct group_test.uid and distinct group_test.ip apply DISTINCT to two different columns, uid and ip.
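A related aside on GROUP BY (an addition, not in the original post): Hive has supported HAVING since 0.7, so filters over aggregate values can be written directly rather than via a subquery. A minimal sketch:

```
hive> select gender, count(distinct uid) as cnt
      from group_test
      group by gender
      having count(distinct uid) > 3;
```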
**(7) ORDER BY queries**

ORDER BY performs a global sort over its input, so only one reducer is used (multiple reducers cannot guarantee a global order), which means long run times when the input is large. When querying with ORDER BY, the hive.mapred.mode property affects behavior:

```
hive.mapred.mode = nonstrict; (default value)
hive.mapred.mode=strict;
```

The difference from ORDER BY in a conventional database is that under hive.mapred.mode=strict a LIMIT clause must be specified, otherwise execution fails:

```
hive> set hive.mapred.mode=strict;
hive> select * from group_test order by uid limit 5;
Query ID = hadoop_20180518041414_f4daefe3-60ec-43d3-ab5c-d7fa7518fc5c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0016, Tracking URL = http://master:8088/proxy/application_1526553207632_0016/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 04:14:18,047 Stage-1 map = 0%, reduce = 0%
2018-05-18 04:14:25,572 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.98 sec
2018-05-18 04:14:31,896 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
MapReduce Total cumulative CPU time: 2 seconds 70 msec
Ended Job = job_1526553207632_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.07 sec HDFS Read: 624 HDFS Write: 121 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 70 msec
OK
01 male 192.168.1.32
01 male 192.168.1.2
01 male 192.168.1.22
01 male 192.168.1.26
02 female 192.168.1.3
Time taken: 22.228 seconds, Fetched: 5 row(s)
```

**(8) SORT BY queries**

SORT BY is unaffected by whether hive.mapred.mode is strict or nonstrict. It only guarantees that rows within each reducer are sorted by the specified fields. With SORT BY you can set the number of reducers (set mapred.reduce.tasks=<number>) to produce more output files; performing a merge sort over those outputs then yields the complete sorted result.

```
hive> set hive.mapred.mode=strict;
hive> select * from group_test sort by uid ;
Query ID = hadoop_20180518041616_68543eaf-2bac-4c35-bad6-dd286052ded6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1526553207632_0017, Tracking URL = http://master:8088/proxy/application_1526553207632_0017/
Kill Command = /opt/modules/hadoop-2.6.0/bin/hadoop job -kill job_1526553207632_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-18 04:16:11,201 Stage-1 map = 0%, reduce = 0%
2018-05-18 04:16:19,537 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
2018-05-18 04:16:26,844 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.88 sec
MapReduce Total cumulative CPU time: 1 seconds 880 msec
Ended Job = job_1526553207632_0017
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.88 sec HDFS Read: 624 HDFS Write: 469 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
01 male 192.168.1.32
01 male 192.168.1.2
01 male 192.168.1.22
01 male 192.168.1.26
02 female 192.168.1.3
03 male 192.168.1.5
03 male 192.168.1.23
04 male 192.168.1.9
05 male 192.168.1.29
05 male 192.168.1.8
06 female 192.168.1.7
06 female 192.168.1.52
06 female 192.168.1.201
07 female 192.168.1.11
08 female 192.168.1.88
08 female 192.168.1.21
08 female 192.168.1.62
08 female 192.168.1.42
Time taken: 26.065 seconds, Fetched: 18 row(s)
```

**(9) DISTRIBUTE BY queries**

DISTRIBUTE BY partitions rows across reducer output files by the specified expression, as follows:

```
hive> insert overwrite local directory '/home/hadoop/djt/test' select * from group_test distribute by length(gender);
```

This partitions rows to reducers by the length of gender, so they end up in different output files. length is a built-in function; other functions, or user-defined functions, can be used instead.
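A common companion pattern (an added sketch, not in the original session) is DISTRIBUTE BY combined with SORT BY: rows are grouped onto reducers by one expression and sorted within each reducer by another:

```
hive> select * from group_test distribute by gender sort by uid;
```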
By contrast, the following is not allowed:

```
hive> insert overwrite local directory '/home/hadoop/djt/test' select * from group_test order by gender distribute by length(gender);
```

order by gender cannot be combined with distribute by length(gender); the statement above fails.

**(10) CLUSTER BY queries**

CLUSTER BY combines the functionality of DISTRIBUTE BY with that of SORT BY, distributing and sorting on the same columns; a short sketch follows.
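A minimal illustration (added; not run in the original session):

```
hive> set mapred.reduce.tasks=2;
hive> select * from group_test cluster by uid;
-- equivalent to:
hive> select * from group_test distribute by uid sort by uid;
```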
That wraps up the main content of this installment. These notes record my own learning process; I hope they offer some guidance. If you found them useful, please leave a like; if not, please bear with me, and do point out any mistakes. Follow the blog to get updates as soon as they are posted. Thanks!

Copyright notice: this is the author's original post and may not be reproduced without permission.