Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables and provides an SQL-like query language. This article walks through some basic Hive operations; if you spot any mistakes, corrections are welcome.

Common commands

#Show information
show tables;
show databases;
show partitions table_name;
show functions;
desc extended table_name;
desc formatted table_name;
#Create a database
create database test_db;
#Drop a database
drop database database_name;
#Drop a table
drop table table_name;
#Rename a table
ALTER TABLE table_name RENAME TO new_table_name;
#Truncate a table (remove all its data)
truncate table table_name;

CREATE TABLE syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

Creating an internal (managed) table

create table if not exists my_tb(id int,name string)
row format delimited fields terminated by ',';

Creating an external table

#An external table must use the EXTERNAL keyword and specify the path where its data lives
create external table if not exists my_ex_tb(id int,name string)
row format delimited fields terminated by ','
location 'hdfs://192.168.38.3:9000/externdb/my_ex_tb/';

When a table is dropped, an internal (managed) table loses both its metadata and its data, while an external table loses only its metadata; the data files are left untouched.
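A minimal sketch of how to verify this from the Hive shell, assuming the external table above was created at the location shown earlier:

#Table Type in the output shows MANAGED_TABLE or EXTERNAL_TABLE
desc formatted my_ex_tb;
#After dropping the external table, its files are still on HDFS
drop table my_ex_tb;
dfs -ls /externdb/my_ex_tb/;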

Loading data into a table directory

#Append to the table (into) or overwrite all of its data (overwrite)
load data local inpath '/root/1.txt' into table my_ex_tb;
load data local inpath '/root/1.txt' overwrite into table my_ex_tb;
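Without the local keyword the path is read from HDFS instead of the local filesystem, and the file is moved (not copied) into the table directory. A small sketch, with /data/1.txt as a made-up HDFS path:

load data inpath '/data/1.txt' into table my_ex_tb;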

Creating a partitioned table

create table if not exists my_par_tb(id int,name string)
partitioned by(country string)
row format delimited fields terminated by ',';

load data local inpath '/root/1.txt' into table my_par_tb partition(country='China');
load data local inpath '/root/1.txt.us' into table my_par_tb partition(country='US');

#Contents of 1.txt
1,张三
2,李四
3,王五

#Contents of 1.txt.us
1,张三
2,李四
3,王五

#Output of select * from my_par_tb
1 张三 China
2 李四 China
3 王五 China
1 张三 US
2 李四 US
3 王五 US

#Query the data in one partition
select * from my_par_tb where country='China';
1 张三 China
2 李四 China
3 王五 China

Adding and dropping partitions

#Add partitions
alter table my_par_tb add partition(country='Eng') partition(country='Ame');
#Drop partitions (note the comma between partition specs)
alter table my_par_tb drop partition(country='Eng'), partition(country='Ame');

#Show the table's partitions
show partitions my_par_tb;
country=China
country=US
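Partitions can also be filled from a query with dynamic partitioning. A sketch, assuming a hypothetical staging table stage_tb(id int, name string, country string) that already holds the raw rows:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
#The partition value is taken from the last column of the select
insert into table my_par_tb partition(country)
select id,name,country from stage_tb;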

Creating a bucketed table

create table if not exists my_buck_tb(id int,name string)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by ',';

#Enable bucket enforcement
set hive.enforce.bucketing=true;
#Use as many reducers as buckets, so the rows selected from the source table are split into one output file per bucket
set mapreduce.job.reduces=4;

#Insert the rows queried from my_tb into the bucketed table.
#cluster by(id) partitions and sorts the map output by id (equivalent to distribute by + sort by)
insert into table my_buck_tb
select id,name from my_tb cluster by(id);
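One common payoff of bucketing is cheap sampling. A minimal sketch against the table above:

#Read only the first of the 4 buckets
select * from my_buck_tb tablesample(bucket 1 out of 4 on id);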

Saving query results

By default query results are only displayed on the screen; they can also be saved into a table or written out to a directory.

#Save the query result into a newly created table
create table tmp_tb as select * from my_tb;

#Save the query result into an existing table
insert into table tmp_tb select * from my_tb;

#Save the query result to a directory (local or on HDFS)
#Local
insert overwrite local directory '/root/out_tb/'
select * from my_tb;
#HDFS
insert overwrite directory '/out_tb/'
select * from my_tb;
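The files written by insert overwrite directory use ^A as the default field separator; on Hive 0.11 and later a delimiter can be specified. A sketch:

insert overwrite local directory '/root/out_tb/'
row format delimited fields terminated by ','
select * from my_tb;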

Join operations

Data in table a:

1,张三
2,李四
3,c
4,a
5,e
6,r

Data in table b:

1,绿间
3,青峰
4,黑子
9,红发

Create the tables:

create table a(id int,name string)
row format delimited fields terminated by ',';
create table b(id int,name string)
row format delimited fields terminated by ',';

Load the data:

load data local inpath '/root/a.txt' into table a;
load data local inpath '/root/b.txt' into table b;

#Inner join (intersection)
select * from a inner join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | 张三 | 1 | 绿间 |
| 3 | c | 3 | 青峰 |
| 4 | a | 4 | 黑子 |
+-------+---------+-------+---------+--+

#Left join
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | 张三 | 1 | 绿间 |
| 2 | 李四 | NULL | NULL |
| 3 | c | 3 | 青峰 |
| 4 | a | 4 | 黑子 |
| 5 | e | NULL | NULL |
| 6 | r | NULL | NULL |
+-------+---------+-------+---------+--+

#Right join
select * from a right join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | 张三 | 1 | 绿间 |
| 3 | c | 3 | 青峰 |
| 4 | a | 4 | 黑子 |
| NULL | NULL | 9 | 红发 |
+-------+---------+-------+---------+--+

#Full outer join
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | 张三 | 1 | 绿间 |
| 2 | 李四 | NULL | NULL |
| 3 | c | 3 | 青峰 |
| 4 | a | 4 | 黑子 |
| 5 | e | NULL | NULL |
| 6 | r | NULL | NULL |
| NULL | NULL | 9 | 红发 |
+-------+---------+-------+---------+--+

#Left semi join (returns only the left table's columns from the inner join result)
select * from a left semi join b on a.id = b.id;
+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| 1 | 张三 |
| 3 | c |
| 4 | a |
+-------+---------+--+
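For reference, a left semi join is roughly equivalent to an IN subquery (supported since Hive 0.13). A sketch:

select * from a where a.id in (select id from b);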

SELECT query syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_name[..join..on(a.id=b.id)]
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]

#cluster by(id) partitions and sorts the map output by id (equivalent to distribute by + sort by)
select id,name from my_tb cluster by(id);
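To illustrate the sort-related clauses, a small sketch using my_tb: order by produces one globally sorted result (a single reducer), while distribute by plus sort by only sorts within each reducer.

#Global total order (single reducer)
select id,name from my_tb order by id;
#Rows with the same id go to the same reducer; each reducer's output is sorted by id
select id,name from my_tb distribute by id sort by id;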

User-defined functions

Hive built-in functions

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
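A few of the built-in functions from that list in action, as a quick sketch against my_tb:

select upper(name), length(name), concat(id, '_', name) from my_tb limit 3;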

The pom file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.xiaojie.mm</groupId>
<artifactId>my_hive</artifactId>
<version>0.0.1-SNAPSHOT</version>

<properties>
<hadoop.version>2.6.5</hadoop.version>
<hive.version>1.2.1</hive.version>
</properties>

<dependencies>
<!-- Hadoop -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.5</version>
</dependency>
<!-- Hive -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-pdk</artifactId>
<version>0.10.0</version>
</dependency>
<dependency>
<groupId>javax.jdo</groupId>
<artifactId>jdo2-api</artifactId>
<version>2.3-eb</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.7</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.7</version>
<scope>system</scope>
<systemPath>/home/miao/apps/install/jdk1.7.0_45/lib/tools.jar</systemPath>
</dependency>
</dependencies>
</project>

A custom function that converts uppercase to lowercase

package com.xiaojie.mm;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLower extends UDF{
    // Define the evaluate method; Hive calls it once per row
    public String evaluate(String field) {
        // Return null for null input instead of throwing a NullPointerException
        if (field == null) {
            return null;
        }
        return field.toLowerCase();
    }
}

Export the jar and copy it to the machine where Hive runs

scp tolower.jar mini1:/root/apps/

Register the custom function in the Hive client

#Step 1
add JAR /root/apps/tolower.jar;
#Step 2: the quoted string is the fully qualified class name (this is a temporary function, valid only for the current session)
create temporary function tolower as 'com.xiaojie.mm.ToLower';
#Step 3: use it
select * from a;
+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| 7 | AAAAA |
| 1 | 张三 |
| 2 | 李四 |
| 3 | c |
| 4 | a |
| 5 | e |
| 6 | r |
+-------+---------+--+

select id,tolower(name) from a;
+-----+--------+--+
| id | _c1 |
+-----+--------+--+
| 7 | aaaaa |
| 1 | 张三 |
| 2 | 李四 |
| 3 | c |
| 4 | a |
| 5 | e |
| 6 | r |
+-----+--------+--+
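A temporary function disappears when the session ends. On Hive 0.13 and later the function can also be registered permanently, with the jar kept on HDFS. A sketch, assuming the jar has been uploaded to a hypothetical /udf/tolower.jar path:

create function tolower as 'com.xiaojie.mm.ToLower' using jar 'hdfs:///udf/tolower.jar';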

A custom function that returns a phone number's home province

package com.xiaojie.mm;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class GetProvince extends UDF{

    public static HashMap<String,String> provinceMap = new HashMap<String,String>();

    static {
        provinceMap.put("183", "hangzhou");
        provinceMap.put("186", "nanjing");
        provinceMap.put("187", "suzhou");
        provinceMap.put("188", "ningbo");
    }

    public String evaluate(int phonenumber) {
        String phone_num = String.valueOf(phonenumber);
        // Take the first three digits of the phone number
        String phone = phone_num.substring(0, 3);
        return provinceMap.get(phone)==null?"未知":provinceMap.get(phone);
    }
}

Original data:
+----------------------+---------------------+--+
| flow_province.phone | flow_province.flow |
+----------------------+---------------------+--+
| 1837878 | 12m |
| 1868989 | 13m |
| 1878989 | 14m |
| 1889898 | 15m |
| 1897867 | 16m |
| 1832323 | 78m |
| 1858767 | 88m |
| 1862343 | 99m |
| 1893454 | 77m |
+----------------------+---------------------+--+

After calling the custom function:

select phone,getpro(phone),flow from flow_province;
+----------+-----------+-------+--+
| phone | _c1 | flow |
+----------+-----------+-------+--+
| 1837878 | hangzhou | 12m |
| 1868989 | nanjing | 13m |
| 1878989 | suzhou | 14m |
| 1889898 | ningbo | 15m |
| 1897867 | 未知 | 16m |
| 1832323 | hangzhou | 78m |
| 1858767 | 未知 | 88m |
| 1862343 | nanjing | 99m |
| 1893454 | 未知 | 77m |
+----------+-----------+-------+--+

A custom function for parsing JSON data

#Create the table
create table json_tb(line string);
#Load the data
load data local inpath '/root/test_data/a.json' into table json_tb;
#Show the raw data
select line from json_tb limit 10;
+----------------------------------------------------------------+--+
| json_tb.line |
+----------------------------------------------------------------+--+
| {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} |
| {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"} |
| {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"} |
| {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"} |
| {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"} |
| {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"} |
| {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"} |
| {"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"} |
| {"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"} |
| {"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"} |
+----------------------------------------------------------------+--+

#The custom function

package com.xiaojie.mm;

import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper;

public class JsonParse extends UDF{
    public String evaluate(String jsonLine) {
        ObjectMapper objectMapper = new ObjectMapper();
        try {
            MovieBean bean = objectMapper.readValue(jsonLine, MovieBean.class);
            return bean.toString();
        } catch (Exception e) {
        }
        return "";
    }
}

package com.xiaojie.mm;

public class MovieBean {
    // Movie id
    private String movie;
    // Movie rating
    private String rate;
    // Rating timestamp
    private String timeStamp;
    // User id
    private String uid;

    public String getMovie() {
        return movie;
    }
    public void setMovie(String movie) {
        this.movie = movie;
    }
    public String getRate() {
        return rate;
    }
    public void setRate(String rate) {
        this.rate = rate;
    }
    public String getTimeStamp() {
        return timeStamp;
    }
    public void setTimeStamp(String timeStamp) {
        this.timeStamp = timeStamp;
    }
    public String getUid() {
        return uid;
    }
    public void setUid(String uid) {
        this.uid = uid;
    }

    @Override
    public String toString() {
        return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
    }
}

#Build the jar, upload it to the machine running Hive, and create the function
add JAR /root/test_data/json_parse.jar;
create temporary function json_parse as 'com.xiaojie.mm.JsonParse';

#Use the custom JSON parsing function
select json_parse(line) from json_tb limit 10;
+---------------------+--+
| _c0 |
+---------------------+--+
| 1193 5 978300760 1 |
| 661 3 978302109 1 |
| 914 3 978301968 1 |
| 3408 4 978300275 1 |
| 2355 5 978824291 1 |
| 1197 3 978302268 1 |
| 1287 5 978302039 1 |
| 2804 5 978300719 1 |
| 594 4 978302268 1 |
| 919 4 978301368 1 |
+---------------------+--+

#Save the parsed data into a newly created table
create table json_parse_tb as
select split(json_parse(line),'\t')[0] as movieid,
split(json_parse(line),'\t')[1] as rate,
split(json_parse(line),'\t')[2] as time,
split(json_parse(line),'\t')[3] as userid
from json_tb limit 100;

#The built-in JSON function
select get_json_object(line,'$.movie') as movieid,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as time,
get_json_object(line,'$.uid') as userid
from json_tb limit 10;
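The built-in json_tuple function can also pull several fields out in a single pass. A sketch over the same table:

select json_tuple(line,'movie','rate','timeStamp','uid') as (movieid,rate,time,userid)
from json_tb limit 10;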

Transform (calling a custom script)

Hive's TRANSFORM keyword lets you call your own script from a SQL statement, which is handy when you need functionality Hive doesn't provide but don't want to write a UDF.

A custom Python script (vim time_parse.py)

#!/bin/python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rate, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rate, str(weekday), userid])

Add the .py file to Hive's working resources

add file time_parse.py;

Call the custom Python script with transform

#The columns listed inside TRANSFORM(...) are read from json_parse_tb and fed to the script
create TABLE json_parse_time_tb as
SELECT
TRANSFORM (movieid, rate, time, userid)
USING 'python time_parse.py'
AS (movieid, rate, weekday,userid)
FROM json_parse_tb;

Check the new table's data

select * from json_parse_time_tb;
+-----------------------------+--------------------------+-----------------------------+----------------------------+--+
| json_parse_time_tb.movieid | json_parse_time_tb.rate | json_parse_time_tb.weekday | json_parse_time_tb.userid |
+-----------------------------+--------------------------+-----------------------------+----------------------------+--+
| 1690 | 3 | 1 | 2 |
| 589 | 4 | 1 | 2 |
| 3471 | 5 | 1 | 2 |
| 1834 | 4 | 1 | 2 |
| 2490 | 3 | 1 | 2 |
| 2278 | 3 | 1 | 2 |
| 110 | 5 | 1 | 2 |
| 3257 | 3 | 1 | 2 |
| 3256 | 2 | 1 | 2 |
| 3255 | 4 | 1 | 2 |
+-----------------------------+--------------------------+-----------------------------+----------------------------+--+

Case study

Raw data (username, month, clicks)

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

Compute each user's clicks per month, plus the cumulative total of clicks

Step 1: create the table and load the data

#Create the table
create table click_tb(username string,month string,click int)
row format delimited fields terminated by ',';
#Load the data
load data local inpath '/root/test_data/click.txt' into table click_tb;

Step 2: compute each user's clicks per month

select username,month,sum(click) as click_count from click_tb group by username,month;

+-----------+----------+--------------+--+
| username | month | click_count |
+-----------+----------+--------------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+--------------+--+

Step 3: inner join the aggregated result with itself

select * from
(select username,month,sum(click) as click_count from click_tb group by username,month) A
inner join
(select username,month,sum(click) as click_count from click_tb group by username,month) B
on
A.username=B.username;

+-------------+----------+----------------+-------------+----------+----------------+--+
| a.username | a.month | a.click_count | b.username | b.month | b.click_count |
+-------------+----------+----------------+-------------+----------+----------------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+----------------+-------------+----------+----------------+--+

Step 4: compute the final result

select a.username,a.month,min(a.click_count) as click_count,sum(b.click_count) as sum_count from
(select username,month,sum(click) as click_count from click_tb group by username,month) a
inner join
(select username,month,sum(click) as click_count from click_tb group by username,month) b
on
A.username=B.username
where b.month<=a.month
group by a.username,a.month
order by a.username,a.month;

+-------------+----------+--------------+------------+--+
| a.username | a.month | click_count | sum_count |
+-------------+----------+--------------+------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
+-------------+----------+--------------+------------+--+
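On Hive 0.11 and later the same cumulative total can be computed with a window function, avoiding the self-join. A sketch:

select username,month,click_count,
       sum(click_count) over (partition by username order by month) as sum_count
from (select username,month,sum(click) as click_count from click_tb group by username,month) t;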
