All data in this walkthrough comes from the emp and dept sample tables that ship with Oracle.

Creating tables

Syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] (Note: Only available starting with Hive 0.6.0)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] (Note: Only available starting with Hive 0.6.0)
[AS select_statement] (Note: Only available starting with Hive 0.5.0, and not supported when creating external tables.)

Example:

create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile;

create table dept(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile;

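Note: if no row format is specified when a table is created, the default field delimiter is \001 (Ctrl-A) and the default line delimiter is \n.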
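
To make the delimiter note concrete, here is a minimal Python sketch of how one line of a text-format data file breaks into column values. It is only an illustration of the file layout, not Hive's own parsing code; the helper name `split_fields` is made up for this example.

```python
# Sketch: split one line of a delimited text data file into column values.
# "\x01" (Ctrl-A) is the default field delimiter mentioned in the note;
# "\t" matches the emp and dept tables created above.

def split_fields(line, field_sep="\x01"):
    """Return the column values of one record, without the trailing newline."""
    return line.rstrip("\n").split(field_sep)

default_row = split_fields("10\x01ACCOUNTING\x01NEW YORK\n")
tab_row = split_fields("10\tACCOUNTING\tNEW YORK\n", field_sep="\t")
print(default_row)
print(tab_row)
```

Both calls yield the same three values; only the byte that separates them differs, which is why the data file and the table's row format have to agree.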

Loading data into Hive tables

Data sources Hive can work with: files, other tables, and other databases.

1) load: load a local or HDFS file into a Hive table

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

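By default, table data is stored on HDFS under /user/hive/warehouse; this directory can be changed in hive-site.xml (property hive.metastore.warehouse.dir).

load data inpath loads a file that is already on HDFS into the Hive table; the file is moved into the table's directory.

load data local inpath loads a file from the local file system into the Hive table; the file is copied.

overwrite controls whether the data already in the table is replaced; without it, the new data is added to what is already there.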

load data local inpath '/home/spark/software/data/emp.txt' overwrite into table emp;
load data local inpath '/home/spark/software/data/dept.txt' overwrite into table dept;
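
The article does not show the contents of the input files. As a rough sketch, a dept.txt that is consistent with the dept table definition (tab between fields, one record per line) and with the `select * from dept` output shown in the select section could be produced as follows. The output path is a temporary directory chosen for this example, and `write_tsv` is a hypothetical helper.

```python
import os
import tempfile

# Rows taken from the `select * from dept` output in this article.
DEPT_ROWS = [
    (10, "ACCOUNTING", "NEW YORK"),
    (20, "RESEARCH", "DALLAS"),
    (30, "SALES", "CHICAGO"),
    (40, "OPERATIONS", "BOSTON"),
]

def write_tsv(rows, path):
    """Write one record per line with a tab between fields."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write("\t".join(str(value) for value in row) + "\n")

# Assumed location for the example; the article loads from /home/spark/software/data/.
out_path = os.path.join(tempfile.mkdtemp(), "dept.txt")
write_tsv(DEPT_ROWS, out_path)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written, end="")
```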

2) insert: write query results into a table, or export them from a table to an HDFS or local directory

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

3) sqoop: import/export between relational databases and HDFS files

See the sqoop chapter for details.

select operations

select * from emp;
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10

select * from dept;
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON

Using where

select * from emp where deptno =10;
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10

select * from emp where deptno <>10;
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20

select * from emp where ename ='SCOTT';
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20

select ename,sal from emp where sal between 800 and 1500;
SMITH 800.0
WARD 1250.0
MARTIN 1250.0
TURNER 1500.0
ADAMS 1100.0
JAMES 950.0
MILLER 1300.0

Using limit

select * from emp limit 4;
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20

Using (not) in

select ename,sal,comm from emp where ename in ('SMITH','KING');
SMITH 800.0 NULL
KING 5000.0 NULL

select ename,sal,comm from emp where ename not in ('SMITH','KING');
ALLEN 1600.0 300.0
WARD 1250.0 500.0
JONES 2975.0 NULL
MARTIN 1250.0 1400.0
BLAKE 2850.0 NULL
CLARK 2450.0 NULL
SCOTT 3000.0 NULL
TURNER 1500.0 0.0
ADAMS 1100.0 NULL
JAMES 950.0 NULL
FORD 3000.0 NULL
MILLER 1300.0 NULL

Using is (not) null

select ename,sal,comm from emp where comm is null;
SMITH 800.0 NULL
JONES 2975.0 NULL
BLAKE 2850.0 NULL
CLARK 2450.0 NULL
SCOTT 3000.0 NULL
KING 5000.0 NULL
ADAMS 1100.0 NULL
JAMES 950.0 NULL
FORD 3000.0 NULL
MILLER 1300.0 NULL

select ename,sal,comm from emp where comm is not null;
ALLEN 1600.0 300.0
WARD 1250.0 500.0
MARTIN 1250.0 1400.0
TURNER 1500.0 0.0

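Using order by

As in a relational database, order by sorts the output by one or more columns.

The difference from a relational database: when hive.mapred.mode=strict, a limit clause must also be given, otherwise the query fails.

The default value of hive.mapred.mode is nonstrict in the Hive version used here.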

select * from dept;
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON

select * from dept order by deptno desc;
40 OPERATIONS BOSTON
30 SALES CHICAGO
20 RESEARCH DALLAS
10 ACCOUNTING NEW YORK

select ename,sal,deptno from emp order by deptno asc,ename desc;
MILLER 1300.0 10
KING 5000.0 10
CLARK 2450.0 10
SMITH 800.0 20
SCOTT 3000.0 20
JONES 2975.0 20
FORD 3000.0 20
ADAMS 1100.0 20
WARD 1250.0 30
TURNER 1500.0 30
MARTIN 1250.0 30
JAMES 950.0 30
BLAKE 2850.0 30
ALLEN 1600.0 30

set hive.mapred.mode=strict;
select * from emp order by empno desc;

Error: FAILED: SemanticException 1:27 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'empno'

Correct form:

select * from emp order by empno desc limit 4;
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20

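Why does this fail?

With order by, all data is sent to a single node and sorted by a single reduce task, so on a large data set the query may never produce a result. With limit n, each map task passes on at most n records, so the reducer receives at most n * (number of map tasks) records, which a single reducer can handle.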
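
The reasoning above can be sketched in plain Python. This is a conceptual model of "each map task keeps its top n, one reducer merges them", not Hive's actual execution code; the split of the emp rows into three partitions and the function names are assumptions made for the example.

```python
import heapq

def map_side_top_n(partition, n, key):
    """Each map task keeps only its n largest records."""
    return heapq.nlargest(n, partition, key=key)

def single_reducer(map_outputs, n, key):
    """The lone reducer sees at most n * (number of map tasks) records."""
    candidates = [rec for output in map_outputs for rec in output]
    return heapq.nlargest(n, candidates, key=key), len(candidates)

# Hypothetical split of the emp rows (empno, ename) across three map tasks.
partitions = [
    [(7369, "SMITH"), (7499, "ALLEN"), (7521, "WARD"), (7566, "JONES"), (7654, "MARTIN")],
    [(7698, "BLAKE"), (7782, "CLARK"), (7788, "SCOTT"), (7839, "KING"), (7844, "TURNER")],
    [(7876, "ADAMS"), (7900, "JAMES"), (7902, "FORD"), (7934, "MILLER")],
]

n = 4

def by_empno(rec):
    return rec[0]

map_outputs = [map_side_top_n(p, n, by_empno) for p in partitions]
top_rows, reducer_input_size = single_reducer(map_outputs, n, by_empno)
print(top_rows)            # rows for: order by empno desc limit 4
print(reducer_input_size)  # records the single reducer had to look at
```

In this toy run the reducer handles 12 candidate records (4 per map task) regardless of how large each partition is, which is the point of requiring limit in strict mode.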

Nested select queries and aliases

from(select ename, sal from emp) e
select e.ename, e.sal
where e.sal>1000;

is equivalent to

select ename, sal from emp where sal>1000;
ALLEN 1600.0
WARD 1250.0
JONES 2975.0
MARTIN 1250.0
BLAKE 2850.0
CLARK 2450.0
SCOTT 3000.0
KING 5000.0
TURNER 1500.0
ADAMS 1100.0
FORD 3000.0
MILLER 1300.0

Aggregate functions: max(), min(), avg(), sum(), count(), etc.

select count(*) from emp where deptno=10;
3

select count(ename) from emp where deptno=10; -- count on a column counts only the rows where that column is not NULL
3

select count(distinct deptno) from emp;
3

select sum(sal) from emp;
29025.0
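
The difference between count(*) and count(column) only shows up on a column that actually contains NULLs, such as comm. The sketch below uses Python's built-in sqlite3 as a stand-in for Hive (an assumption made so the example runs anywhere); the counting rules shown follow standard SQL NULL handling, and only the three columns needed are loaded.

```python
import sqlite3

# (ename, comm, deptno) taken from the emp data in this article; None is NULL.
EMP = [
    ("SMITH", None, 20), ("ALLEN", 300.0, 30), ("WARD", 500.0, 30),
    ("JONES", None, 20), ("MARTIN", 1400.0, 30), ("BLAKE", None, 30),
    ("CLARK", None, 10), ("SCOTT", None, 20), ("KING", None, 10),
    ("TURNER", 0.0, 30), ("ADAMS", None, 20), ("JAMES", None, 30),
    ("FORD", None, 20), ("MILLER", None, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, comm real, deptno integer)")
conn.executemany("insert into emp values (?, ?, ?)", EMP)

count_all = conn.execute("select count(*) from emp").fetchone()[0]
count_comm = conn.execute("select count(comm) from emp").fetchone()[0]
count_depts = conn.execute("select count(distinct deptno) from emp").fetchone()[0]
conn.close()

print(count_all)    # every row
print(count_comm)   # only rows where comm is not NULL (0.0 still counts)
print(count_depts)  # distinct department numbers
```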

Using group by

Every column in the select list that is not inside an aggregate function must appear in the group by clause.

Average salary of each department:

select deptno, avg(sal) from emp group by deptno;
10 2916.6666666666665
20 2175.0
30 1566.6666666666667

Highest salary for each job within each department:

select deptno,job,max(sal) from emp group by deptno,job;
10 CLERK 1300.0
10 MANAGER 2450.0
10 PRESIDENT 5000.0
20 ANALYST 3000.0
20 CLERK 1100.0
20 MANAGER 2975.0
30 CLERK 950.0
30 MANAGER 2850.0
30 SALESMAN 1600.0

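Using having

having filters grouped results and is followed by a condition on an aggregate function; Hive supports it from version 0.7.0 onward. where filters individual rows, while having filters groups.

Departments whose average salary is greater than 2000: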

select avg(sal),deptno from emp group by deptno having avg(sal)>2000;
2916.6666666666665 10
2175.0 20

On a Hive version without having support, how can the same result be obtained? Filter the result of a subquery instead:

select deptno, e.avg_sal from (select deptno, avg(sal) as avg_sal from emp group by deptno) e where e.avg_sal > 2000;
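
As a quick check that the two forms return the same rows, the sketch below runs both against the article's salary data. It uses Python's built-in sqlite3 as a stand-in for Hive (an assumption for the sake of a runnable example), and adds order by deptno only to make the comparison deterministic.

```python
import sqlite3

# (sal, deptno) taken from the emp data in this article.
EMP = [
    (800.0, 20), (1600.0, 30), (1250.0, 30), (2975.0, 20), (1250.0, 30),
    (2850.0, 30), (2450.0, 10), (3000.0, 20), (5000.0, 10), (1500.0, 30),
    (1100.0, 20), (950.0, 30), (3000.0, 20), (1300.0, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (sal real, deptno integer)")
conn.executemany("insert into emp values (?, ?)", EMP)

# Form 1: filter the groups with having.
having_rows = conn.execute(
    "select deptno, avg(sal) from emp "
    "group by deptno having avg(sal) > 2000 order by deptno"
).fetchall()

# Form 2: aggregate in a subquery, then filter its rows with where.
subquery_rows = conn.execute(
    "select deptno, e.avg_sal from "
    "(select deptno, avg(sal) as avg_sal from emp group by deptno) e "
    "where e.avg_sal > 2000 order by deptno"
).fetchall()
conn.close()

print(having_rows)
print(subquery_rows)
```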

Using CASE ... WHEN ... THEN

select ename, sal,
case
when sal > 1 and sal <=1000 then 'LOWER'
when sal >1000 and sal <=2000 then 'MIDDLE'
when sal >2000 and sal <=4000 then 'HIGH'
ELSE 'HIGHEST' end
from emp;
SMITH 800.0 LOWER
ALLEN 1600.0 MIDDLE
WARD 1250.0 MIDDLE
JONES 2975.0 HIGH
MARTIN 1250.0 MIDDLE
BLAKE 2850.0 HIGH
CLARK 2450.0 HIGH
SCOTT 3000.0 HIGH
KING 5000.0 HIGHEST
TURNER 1500.0 MIDDLE
ADAMS 1100.0 MIDDLE
JAMES 950.0 LOWER
FORD 3000.0 HIGH
MILLER 1300.0 MIDDLE
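
The branch logic of this CASE expression can be mirrored in a small Python function to make the boundaries explicit. This is a sketch of the logic only; `salary_band` is a name made up for the example, and None plays the role of SQL NULL.

```python
def salary_band(sal):
    """Mirror the CASE expression: the first matching WHEN wins, otherwise ELSE."""
    if sal is not None and 1 < sal <= 1000:
        return "LOWER"
    if sal is not None and 1000 < sal <= 2000:
        return "MIDDLE"
    if sal is not None and 2000 < sal <= 4000:
        return "HIGH"
    # ELSE catches everything left over: sal > 4000, but also sal <= 1 and NULL.
    return "HIGHEST"

for name, sal in [("SMITH", 800.0), ("ALLEN", 1600.0), ("JONES", 2975.0), ("KING", 5000.0)]:
    print(name, sal, salary_band(sal))
```

Because the ELSE branch also receives a NULL salary or one at or below 1, an explicit WHEN for sal > 4000 would be needed if such rows should not be labelled HIGHEST.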
