Hive official documentation: Home - UserDocumentation

Hive DML official documentation: LanguageManual DML

Reference article: Hive 用户指南 (Hive User Guide)

1. Loading files into tables

When a LOAD operation is performed, Hive does not transform the data in any way. It is a pure copy/move operation that moves the data file into the location corresponding to the Hive table.

Syntax

 LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Example

 # Load local data into a table
# See Hive-1.2.1_03_DDL操作 for the table definitions
load data local inpath '/app/software/hive/t_sz05_buck.dat' into table t_sz05; # load the data
load data local inpath '/app/software/hive/t_sz03_part.dat' into table t_sz03_part partition (dt='', country='CN');
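
The statement also supports loading from HDFS and replacing existing data. A hedged sketch (the HDFS path below is an assumption for illustration):

 # Without LOCAL, filepath is resolved against HDFS and the file is moved
# (not copied) into the table's directory.
load data inpath '/user/yun/t_sz05_buck.dat' into table t_sz05;
# OVERWRITE first deletes the existing contents of the table (or partition).
load data local inpath '/app/software/hive/t_sz05_buck.dat' overwrite into table t_sz05;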

2. Inserting data into Hive Tables from queries

Query results can be inserted into a table using the INSERT clause.

Syntax

 # Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

# Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

# Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
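
To make the two extensions concrete, a hedged sketch reusing tables from this article (t_sz11 and t_sz03_part_staging are hypothetical names; dynamic partition inserts also require nonstrict mode when no partition column is given a static value):

 # Multiple inserts: scan t_sz02_ext once and write two tables.
FROM t_sz02_ext
INSERT INTO TABLE t_sz10 SELECT id, name WHERE id < 5
INSERT INTO TABLE t_sz11 SELECT id, name WHERE id >= 5;

# Dynamic partition insert: the partition values come from the trailing
# select columns instead of being hard-coded.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE t_sz03_part PARTITION (dt, country)
SELECT id, name, dt, country FROM t_sz03_part_staging;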

Example

 # Create the table
create table t_sz10 (id int, name string)
row format delimited fields terminated by ',';

# Steps
0: jdbc:hive2://mini01:10000> select * from t_sz02_ext; # the table to query from
+----------------+------------------+--+
| t_sz02_ext.id  | t_sz02_ext.name  |
+----------------+------------------+--+
| 1              | 刘晨             |
| 2              | 王敏             |
| 3              | 张立             |
| 4              | 刘刚             |
| 5              | 孙庆             |
| 6              | 易思玲           |
| 7              | 李娜             |
| 8              | 梦圆圆           |
| NULL           | NULL             |
+----------------+------------------+--+
9 rows selected (0.099 seconds)
0: jdbc:hive2://mini01:10000> insert into table t_sz10 select id, name from t_sz02_ext where id < 5;
………………  # MapReduce job output omitted
No rows affected (16.029 seconds)
0: jdbc:hive2://mini01:10000> select * from t_sz10; # the data has been inserted
+------------+--------------+--+
| t_sz10.id  | t_sz10.name  |
+------------+--------------+--+
| 1          | 刘晨         |
| 2          | 王敏         |
| 3          | 张立         |
| 4          | 刘刚         |
+------------+--------------+--+
4 rows selected (0.092 seconds)

3. Writing data into the filesystem from queries

Write query results out as files. With LOCAL the export goes to the local filesystem; without LOCAL it goes to HDFS.

 Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format]   (Note: Only available starting with Hive 0.11.)
  SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
    [NULL DEFINED AS char]   (Note: Only available starting with Hive 0.13)

Example 1

 ### This is a partitioned table
0: jdbc:hive2://mini01:10000> select * from t_sz03_part;
+-----------------+-------------------+-----------------+----------------------+--+
| t_sz03_part.id  | t_sz03_part.name  | t_sz03_part.dt  | t_sz03_part.country  |
+-----------------+-------------------+-----------------+----------------------+--+
| 1               | 张三_20180711     | 20180711        | CN                   |
| 2               | lisi_20180711     | 20180711        | CN                   |
| 3               | Wangwu_20180711   | 20180711        | CN                   |
| 11              | Tom_20180711      | 20180711        | US                   |
| 12              | Dvid_20180711     | 20180711        | US                   |
| 13              | cherry_20180711   | 20180711        | US                   |
| 1               | 张三_20180712     | 20180712        | CN                   |
| 2               | lisi_20180712     | 20180712        | CN                   |
| 3               | Wangwu_20180712   | 20180712        | CN                   |
| 11              | Tom_20180712      | 20180712        | US                   |
| 12              | Dvid_20180712     | 20180712        | US                   |
| 13              | cherry_20180712   | 20180712        | US                   |
+-----------------+-------------------+-----------------+----------------------+--+
12 rows selected (0.543 seconds)

Export 1

 ### Export 1: if the target directory does not exist, it is created
0: jdbc:hive2://mini01:10000> insert overwrite local directory '/app/software/hive/export/t_sz03_part_exp.dat'
0: jdbc:hive2://mini01:10000> select a.* from t_sz03_part a;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:2
INFO : Submitting tokens for job: job_1531701073794_0001
INFO : The url to track the job: http://mini02:8088/proxy/application_1531701073794_0001/
INFO : Starting Job = job_1531701073794_0001, Tracking URL = http://mini02:8088/proxy/application_1531701073794_0001/
INFO : Kill Command = /app/hadoop/bin/hadoop job -kill job_1531701073794_0001
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
INFO : Stage-1 map = 0%, reduce = 0%
INFO : Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.87 sec
INFO : Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.58 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 580 msec
INFO : Ended Job = job_1531701073794_0001
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-42-16_386_323439967845595583-1/-mr-10000
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-42-16_386_323439967845595583-1/-mr-10000
No rows affected (29.35 seconds)

# The exported data on the local filesystem; fields have no visible delimiter
# (the default separator is the non-printing ^A / \001 character)
[yun@mini01 t_sz03_part_exp.dat]$ pwd
/app/software/hive/export/t_sz03_part_exp.dat
[yun@mini01 t_sz03_part_exp.dat]$ ll
total
-rw-r--r-- yun yun Jul : 000000_0
-rw-r--r-- yun yun Jul : 000001_0
[yun@mini01 t_sz03_part_exp.dat]$ cat 000000_0
1张三_2018071120180711CN
2lisi_2018071120180711CN
3Wangwu_2018071120180711CN
11Tom_2018071220180712US
12Dvid_2018071220180712US
13cherry_2018071220180712US
[yun@mini01 t_sz03_part_exp.dat]$ cat 000001_0
11Tom_2018071120180711US
12Dvid_2018071120180711US
13cherry_2018071120180711US
1张三_2018071220180712CN
2lisi_2018071220180712CN
3Wangwu_2018071220180712CN

Export 2

 # Export 2: with a field delimiter
0: jdbc:hive2://mini01:10000> insert overwrite local directory '/app/software/hive/export/t_sz03_part_exp2.dat'
0: jdbc:hive2://mini01:10000> row format delimited fields terminated by ','
0: jdbc:hive2://mini01:10000> select a.* from t_sz03_part a;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:2
INFO : Submitting tokens for job: job_1531701073794_0002
INFO : The url to track the job: http://mini02:8088/proxy/application_1531701073794_0002/
INFO : Starting Job = job_1531701073794_0002, Tracking URL = http://mini02:8088/proxy/application_1531701073794_0002/
INFO : Kill Command = /app/hadoop/bin/hadoop job -kill job_1531701073794_0002
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
INFO : Stage-1 map = 0%, reduce = 0%
INFO : Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.2 sec
INFO : Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.49 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 490 msec
INFO : Ended Job = job_1531701073794_0002
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp2.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-49-09_419_2948346934380749234-1/-mr-10000
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp2.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-49-09_419_2948346934380749234-1/-mr-10000
No rows affected (27.983 seconds)

# The exported local data, with fields separated by commas (,)
[yun@mini01 t_sz03_part_exp2.dat]$ pwd
/app/software/hive/export/t_sz03_part_exp2.dat
[yun@mini01 t_sz03_part_exp2.dat]$ ll
total
-rw-r--r-- yun yun Jul : 000000_0
-rw-r--r-- yun yun Jul : 000001_0
[yun@mini01 t_sz03_part_exp2.dat]$ cat 000000_0
1,张三_20180711,20180711,CN
2,lisi_20180711,20180711,CN
3,Wangwu_20180711,20180711,CN
11,Tom_20180712,20180712,US
12,Dvid_20180712,20180712,US
13,cherry_20180712,20180712,US
[yun@mini01 t_sz03_part_exp2.dat]$ cat 000001_0
11,Tom_20180711,20180711,US
12,Dvid_20180711,20180711,US
13,cherry_20180711,20180711,US
1,张三_20180712,20180712,CN
2,lisi_20180712,20180712,CN
3,Wangwu_20180712,20180712,CN
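
Without LOCAL, the same statement writes the result to an HDFS directory instead; a hedged sketch (the HDFS path is an assumption):

 # Export 3 (sketch): omit LOCAL to write to HDFS rather than the local disk.
insert overwrite directory '/user/yun/export/t_sz03_part_exp3'
row format delimited fields terminated by ','
select a.* from t_sz03_part a;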

4. Insert

Syntax

 INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

 Where values_row is:
( value [, value ...] )

In other words, a regular row-wise INSERT statement.

Example 1

 # CREATE TABLE statement
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;

# The INSERT statement; it runs a MapReduce job
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

# Query the result
0: jdbc:hive2://mini01:10000> select * from students;
+------------------+---------------+---------------+--+
|  students.name   | students.age  | students.gpa  |
+------------------+---------------+---------------+--+
| fred flintstone  | 35            | 1.28          |
| barney rubble    | 32            | 2.32          |
+------------------+---------------+---------------+--+
2 rows selected (0.241 seconds)

Example 2

 CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;

INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);

# Query the result
0: jdbc:hive2://mini01:10000> select * from pageviews;
+-------------------+-----------------+----------------------+----------------------+--+
| pageviews.userid  | pageviews.link  | pageviews.came_from  | pageviews.datestamp  |
+-------------------+-----------------+----------------------+----------------------+--+
| jsmith            | mail.com        | sports.com           | 2014-09-23           |
| jdoe              | mail.com        | NULL                 | 2014-09-23           |
+-------------------+-----------------+----------------------+----------------------+--+
2 rows selected (0.123 seconds)

5. Select

 SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]

Notes:

1. ORDER BY performs a global sort over the entire input, so there is only one reducer; when the input is large, this can require a long computation time.

2. SORT BY is not a global sort; it sorts data before it enters each reducer. So if you sort with SORT BY and set mapred.reduce.tasks > 1, SORT BY only guarantees that each reducer's output is sorted, not that the output is globally sorted.

3. DISTRIBUTE BY (column) distributes rows to different reducers by the specified column, using a hash of the value.

4. CLUSTER BY (column) has the functionality of DISTRIBUTE BY and additionally sorts by that column.

       So when the distribute column and the sort column are the same: cluster by = distribute by + sort by, as the sketch below shows.
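
A hedged illustration of note 4 (the table t_sz05 and its columns are assumptions; with more than one reducer, the two queries produce the same per-reducer ordering):

 # A sketch: on the same column, the two queries below are equivalent.
set mapred.reduce.tasks=2;
select id, name from t_sz05 cluster by id;
select id, name from t_sz05 distribute by id sort by id;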

The biggest benefit of bucketed tables is that they improve the efficiency of JOIN operations; a hedged sketch of that optimization follows the question below.

(Think about this problem:

select a.id,a.name,b.addr from a join b on a.id = b.id;

If table a and table b are already bucketed tables, and the bucketing column is the id column,

does this join still need a full Cartesian product across both tables?)
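
It does not: when both sides are bucketed on the join key, matching buckets can be joined pairwise. A hedged sketch of how this optimization is typically switched on (table names follow the question above; both tables must be bucketed on id, with compatible bucket counts):

 # Bucket map join: each mapper joins one bucket of a with the matching bucket of b.
set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr from a join b on a.id = b.id;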

5.1. Join

Two tables

 SELECT a.* FROM a JOIN b ON (a.id = b.id);
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department);
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id);

# Example:
select * from t_sz01 a join t_sz05 b on a.id = b.id;

Three tables

 SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

 # Example:
select * from t_sz01 a join t_sz05 b on a.id = b.id join t_sz03_part c on a.id = c.id;

6. Update

Syntax

 UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

7. Delete

Syntax

 DELETE FROM tablename [WHERE expression]
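
Note that UPDATE and DELETE only work on tables that support ACID transactions (available since Hive 0.14): the table must be bucketed, stored as ORC, and created with the transactional property, and the session needs concurrency support plus a transaction manager. A minimal sketch under those assumptions (the table t_acid is hypothetical):

 # Usually configured in hive-site.xml; shown per session here (an assumption):
set hive.support.concurrency = true;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

# A hypothetical transactional table:
CREATE TABLE t_acid (id int, name string)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional' = 'true');

UPDATE t_acid SET name = 'tom' WHERE id = 1;
DELETE FROM t_acid WHERE id = 2;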

8. User-Defined Functions (UDFs)

Official documentation: LanguageManual UDF

8.1. Usage examples

 hive (test_db)> select current_database();
OK
test_db
Time taken: 0.324 seconds, Fetched: 1 row(s)
hive (test_db)> create table dual (id string); # create the table
OK
Time taken: 0.082 seconds

# Upload a local file
[yun@mini01 hive]$ ll /app/software/hive/dual.dat
-rw-rw-r-- yun yun Jul : /app/software/hive/dual.dat
[yun@mini01 hive]$ cat /app/software/hive/dual.dat
  # a single space only (the file must contain at least one character; it cannot be empty)

# Load the data
hive (test_db)> load data local inpath '/app/software/hive/dual.dat' overwrite into table dual;
Loading data to table test_db.dual
Table test_db.dual stats: [numFiles=, numRows=, totalSize=, rawDataSize=]
OK
Time taken: 0.421 seconds

# Function tests
hive (test_db)> select substr('zhangtest', 2, 3) from dual; # test substr
OK
han
Time taken: 0.081 seconds, Fetched: 1 row(s)
hive (test_db)> select concat('zha', '---', 'kkk') from dual; # test concat
OK
zha---kkk
Time taken: 0.118 seconds, Fetched: 1 row(s)
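
Any built-in function can be smoke-tested against dual the same way; a hedged sketch with a few more built-ins (output shown as expected, not captured from a live session):

 hive (test_db)> select upper('abc'), split('a,b,c', ','), size(split('a,b,c', ',')) from dual;
OK
ABC	["a","b","c"]	3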

8.2. Transform

Hive's TRANSFORM keyword provides the ability to call your own script from within SQL.

It suits cases where you need functionality Hive does not provide and do not want to write a UDF.

Usage example 1: the SQL below uses weekday_mapper.py to process the data.

 CREATE TABLE u_data_new (
  movieid INT,
  rating INT,
  weekday INT,
  userid INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (movieid, rate, timestring, uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rating, weekday, userid)
FROM t_rating;

The content of weekday_mapper.py is as follows:

 #!/bin/python
import sys
import datetime

# Read tab-separated rows from Hive on stdin, convert the unix timestamp
# to a weekday (1-7), and write tab-separated rows back to stdout.
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
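
Before writing into u_data_new, the pipeline can be previewed from Hive itself; a hedged sketch (assuming t_rating exists with the columns used above):

 add FILE weekday_mapper.py;
SELECT TRANSFORM (movieid, rate, timestring, uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rating, weekday, userid)
FROM t_rating LIMIT 5;  -- preview a few transformed rows without inserting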
