Hive, Part 1
Hive is a data warehouse built on top of Hadoop. It queries data stored on HDFS using HQL, a SQL-like language that is ultimately compiled into MapReduce jobs. Hive does not support record-level inserts, updates, or deletes, but users can generate new tables from queries or save query results to files. Because Hadoop is a batch-oriented system and starting a MapReduce job takes considerable time, Hive queries suffer from high latency: there can be a long delay between submitting a query and getting results back, so interactive querying is not practical. A query that finishes in seconds in a traditional database often takes much longer in Hive, even on a relatively small dataset. Hive also does not support transactions and therefore does not support OLTP; for OLTP-style workloads on large datasets, choose a NoSQL database such as HBase.
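For instance, although row-level inserts are unsupported, a query's result can be materialized as a new table with CREATE TABLE ... AS SELECT. A sketch (the table name top_cust is invented; gamedw.cust with its nianling column is the sample table used later in these notes):

```sql
-- hypothetical CTAS example: persist a query result as a new table
create table gamedw.top_cust as
select custname, nianling
from gamedw.cust
where nianling >= 30;
```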
The best-known alternative to Hive is Pig. Pig is a data-flow language rather than a query language, and it is commonly used as part of an ETL process: loading external data into a Hadoop cluster and transforming it into the desired format.
Hadoop teams often use Hive and Pig together.
If you need features Hive cannot provide (row-level updates, fast query response, and transaction support), consider HBase.
HBase is a distributed, scalable data store that supports row-level updates, fast queries, and row-level transactions (but not multi-row transactions).
An important feature of HBase is column-oriented storage: columns can be organized into column families, and each column family is physically stored together in the distributed store. When a query touches only a subset of all columns, reads and writes are much faster, because HBase reads just the columns needed instead of reading entire rows and discarding most of the columns.
HBase can also be used like a key-value store: each row has a unique key that provides very fast reads and writes of that row's columns or column families. HBase can keep multiple timestamped versions of each column's value, and the number of versions is configurable.
HBase uses HDFS for persistent storage.
HBase provides no SQL-like query language of its own, but it can be used together with Hive.
Variables and properties
hivevar   user-defined variables           read/write
hiveconf  Hive configuration properties    read/write
system    Java system properties           read/write
env       environment variables from the shell   read-only
In the CLI, use set to display or modify variable values.
Inspect one env variable:
hive> set env:HOME;
env:HOME=/root
Running set with no arguments prints all variables in the hivevar, hiveconf, system, and env namespaces.
hive> set hivevar:foo;
hivevar:foo is undefined as a hive variable
Query returned non-zero code: 1, cause: null
hive> set foo='aa';
hive> set foo;
foo='aa'
hive> set hivevar:foo='bbb';
hive> set foo;
foo='aa'
hive> set foo='ccc';
hive> set foo;
foo='ccc'
hive> set hivevar:num=20;
hive> set num;
num=20
hive> select * from gamedw.cust where nianling = ${num};
OK
zhangsan 1 20
zhangsan 1 20
Time taken: 0.337 seconds, Fetched: 2 row(s)
hive> set hivevar:name='nihao'
> ;
hive> select * from gamedw.cust where custname= ${name};
OK
nihao 1 5
nihao 1 5
Time taken: 0.34 seconds, Fetched: 2 row(s)
hive> select * from gamedw.cust where custname= ${hivevar:name};
OK
nihao 1 5
nihao 1 5
Time taken: 0.39 seconds, Fetched: 2 row(s)
hive> set name1='hello';
hive> select * from gamedw.cust where custname= ${hiveconf:name1};
OK
hello 0 100
hello 0 100
Time taken: 0.362 seconds, Fetched: 2 row(s)
As the examples above show, the hivevar: prefix is optional when substituting a variable with ${...}.
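Putting the four namespaces together, a compact sketch (the values are illustrative only; gamedw.cust is the sample table from above):

```sql
set hivevar:minage=30;                     -- user-defined variable
set hiveconf:hive.cli.print.header=true;   -- Hive configuration property
set system:user.name;                      -- display a Java system property
set env:HOME;                              -- shell environment variable (read-only)
-- both forms below substitute the same hivevar value
select * from gamedw.cust where nianling >= ${minage};
select * from gamedw.cust where nianling >= ${hivevar:minage};
```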
Show the CREATE DATABASE statement:
hive> show create database gamedw;
OK
CREATE DATABASE `gamedw`
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/gamedw.db'
WITH DBPROPERTIES (
'creator'='tianyongtaao')
Time taken: 0.01 seconds, Fetched: 5 row(s)
Create a table:
hive> create table tb_test(id int,str string);
OK
Time taken: 0.379 seconds
hive> show create table tb_test;
OK
CREATE TABLE `tb_test`(
`id` int,
`str` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1521107014')
Time taken: 0.049 seconds, Fetched: 18 row(s)
Referencing a variable passed in from outside Hive: use ${...} in the HQL:
[root@host bin]# hive -d name=username;
hive> set name;
name=username
hive> use gamedw;
OK
Time taken: 0.016 seconds
hive> create table taa(id int,${name} string);
OK
Time taken: 0.516 seconds
hive> desc taa;
OK
id int
username string
Time taken: 0.061 seconds, Fetched: 2 row(s)
Run a one-off query from the command line:
[root@host bin]# hive -e "select * from gamedw.gamerole;"
Run a one-off query and save the result (to a local file rather than HDFS):
[root@host bin]# hive -e "select * from gamedw.gamerole">/root/tmpdata/gamerole.txt
View the warehouse path:
[root@host bin]# hive -S -e "set" |grep ware
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=true
Execute one or more queries from a file:
[root@host ~]# cat /root/tmpdata/getgamerole.hql
select * from gamedw.gamerole;
use gamedw;
show tables;
[root@host bin]# hive -f /root/tmpdata/getgamerole.hql
.......................
OK
Time taken: 2.104 seconds
OK
Time taken: 0.035 seconds
OK
gamerole
taa
tb_test
tb_test2
Time taken: 0.027 seconds, Fetched: 4 row(s)
Run a script file from inside the hive shell:
hive> source /root/tmpdata/getgamerole.hql;
OK
Time taken: 0.144 seconds
OK
Time taken: 0.009 seconds
OK
gamerole
taa
tb_test
tb_test2
Time taken: 0.011 seconds, Fetched: 4 row(s)
Load data from a local file:
hive> create table src(ss string);
OK
Time taken: 0.21 seconds
[root@host ~]# cat /root/tmpdata/tian.txt
name tianyongtao
sex man
game wzry
age 100
score 120
home linying
hive> load data local inpath '/root/tmpdata/tian.txt' into table src;
Loading data to table gamedw.src
OK
Time taken: 0.999 seconds
hive> select * from src;
OK
name tianyongtao
sex man
game wzry
age 100
score 120
home linying
Time taken: 0.13 seconds, Fetched: 6 row(s)
More hive CLI features
Tab key: auto-completion
Running shell commands: prefix the command with ! and end it with ; (pipes are not supported)
hive> !ls /root/app/;
apache-maven-3.5.2
eclipse
hive> !pwd;
/root
Using Hadoop's dfs commands from within Hive (drop the leading hdfs):
[root@host ~]# hdfs dfs -ls /tmp;
Found 5 items
drwxr-xr-x - root supergroup 0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x - root supergroup 0 2017-12-07 16:35 /tmp/20171024
drwxr-xr-x - root supergroup 0 2017-12-07 16:04 /tmp/20171206
drwxr-xr-x - root supergroup 0 2017-12-07 16:20 /tmp/20171207
drwx-wx-wx - root supergroup 0 2017-11-02 16:42 /tmp/hive
hive> dfs -ls /tmp;
Found 5 items
drwxr-xr-x - root supergroup 0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x - root supergroup 0 2017-12-07 16:35 /tmp/20171024
drwxr-xr-x - root supergroup 0 2017-12-07 16:04 /tmp/20171206
drwxr-xr-x - root supergroup 0 2017-12-07 16:20 /tmp/20171207
drwx-wx-wx - root supergroup 0 2017-11-02 16:42 /tmp/hive
Hive uses -- for single-line comments.
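A minimal sketch of comment usage in an HQL script:

```sql
-- this whole line is a comment
select custname        -- trailing comments work too
from gamedw.cust;
```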
Display column names in query output:
hive> set hive.cli.print.header=true;
hive> select * from gamerole;
OK
gamerole.id gamerole.rolename gamerole.createtime gamerole.zoneid
Time taken: 0.123 seconds
Display the current database name in the prompt:
hive> set hive.cli.print.current.db=true;
hive (gamedw)>
Primitive data types: int, float, double, string, timestamp, etc.
Collection data types: struct, map, array, etc.
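As a sketch, the collection types can appear together in one table definition (the table and column names are invented for illustration):

```sql
create table demo_types(
  name   string,
  scores array<int>,                    -- ordered list, indexed from 0
  props  map<string,string>,            -- key/value pairs
  addr   struct<city:string,zip:int>    -- fixed set of named fields
);
-- element access: scores[0], props['key'], addr.city
```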
HQL
List the databases whose names start with g:
hive> show databases like 'g.*';
OK
gamedw
Time taken: 0.008 seconds, Fetched: 1 row(s)
Create a database:
hive> create database testdb with DBPROPERTIES('creater'='tyt')
> ;
OK
Time taken: 0.27 seconds
hive> desc database testdb;
OK
testdb hdfs://localhost:9000/user/hive/warehouse/testdb.db root USER
Time taken: 0.009 seconds, Fetched: 1 row(s)
hive> desc database extended testdb;
OK
testdb hdfs://localhost:9000/user/hive/warehouse/testdb.db root USER {creater=tyt}
Time taken: 0.009 seconds, Fetched: 1 row(s)
You can modify a database's DBPROPERTIES; no other database metadata can be modified:
hive> alter database testdb set DBPROPERTIES('creater'='tyt','createtime'='2018-03-19');
OK
Time taken: 0.065 seconds
hive> desc database extended testdb;
OK
testdb hdfs://localhost:9000/user/hive/warehouse/testdb.db root USER {creater=tyt, createtime=2018-03-19}
Time taken: 0.009 seconds, Fetched: 1 row(s)
Modify table properties:
hive> alter table employees set tblproperties('creator'='tianyongtao');
OK
Time taken: 0.132 seconds
Copy a table's schema (without copying its data):
hive> create table if not exists role like gamerole;
OK
Time taken: 0.355 seconds
hive> show tables like 'r.*';
OK
role
Time taken: 0.011 seconds, Fetched: 1 row(s)
List the tables in a given database:
hive> show tables in userdb;
OK
name
Time taken: 0.011 seconds, Fetched: 1 row(s)
We can view the detailed table structure with desc extended tbname, but the output is not very readable:
hive> desc extended employees;
OK
name string
salary float
subdinates array<string>
deducation map<string,float>
address struct<street:string,city:string,state:string,zip:int>
Detailed Table Information Table(tableName:employees, dbName:gamedw, owner:root, createTime:1521184118, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:salary, type:float, comment:null), FieldSchema(name:subdinates, type:array<string>, comment:null), FieldSchema(name:deducation, type:map<string,float>, comment:null), FieldSchema(name:address, type:struct<street:string,city:string,state:string,zip:int>, comment:null)], location:hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{creator=tianyongtao, last_modified_time=1521447397, totalSize=0, numRows=0, rawDataSize=0, numFiles=0, transient_lastDdlTime=1521447397, last_modified_by=root}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.032 seconds, Fetched: 7 row(s)
We usually use formatted instead of extended, which produces far more readable output:
hive> desc formatted employees;
OK
# col_name data_type comment
name string
salary float
subdinates array<string>
deducation map<string,float>
address struct<street:string,city:string,state:string,zip:int>
# Detailed Table Information
Database: gamedw
Owner: root
CreateTime: Fri Mar 16 15:08:38 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees
Table Type: MANAGED_TABLE
Table Parameters:
creator tianyongtao
last_modified_by root
last_modified_time 1521447397
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1521447397
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.032 seconds, Fetched: 36 row(s)
Tables created in Hive are managed tables (also called internal tables) by default. Their storage location is determined by the Hive configuration, and when a managed table is dropped, Hive deletes the table's data as well.
Managed tables make it awkward to share data with other tools. To query an existing dataset with Hive without taking ownership of it, we can create an external table that points at that data.
Create an external table
There is a file on HDFS:
[root@host ~]# hdfs dfs -cat /tmp/20180321/name.csv
1,爷爷 ,1
2,大爷 ,1
3,叔叔 ,1
4,堂哥 ,2
5,堂姐 ,2
6,堂妹 ,3
Create the external table:
hive> create external table if not exists t_name
> (id int,name string,level int)
> row format delimited fields terminated by ','
> location '/tmp/20180321';
OK
Time taken: 0.108 seconds
hive> select * from t_name;
OK
1 爷爷 1
2 大爷 1
3 叔叔 1
4 堂哥 2
5 堂姐 2
6 堂妹 3
Time taken: 0.127 seconds, Fetched: 6 row(s)
[root@host ~]# hdfs dfs -cp /tmp/20180321/name.csv /tmp/20180321/name1.csv
[root@host ~]# hdfs dfs -ls /tmp/2018*
Found 2 items
-rw-r--r-- 1 root supergroup 123 2017-10-26 11:10 /tmp/20180321/name.csv
-rw-r--r-- 1 root supergroup 123 2018-03-21 17:57 /tmp/20180321/name1.csv
hive> select * from t_name;
OK
1 爷爷 1
2 大爷 1
3 叔叔 1
4 堂哥 2
5 堂姐 2
6 堂妹 3
1 爷爷 1
2 大爷 1
3 叔叔 1
4 堂哥 2
5 堂姐 2
6 堂妹 3
hive> set hive.cli.print.header=true;
hive> select * from t_name where id=1;
OK
t_name.id t_name.name t_name.level
1 爷爷 1
1 爷爷 1
Time taken: 0.131 seconds, Fetched: 2 row(s)
hive> desc formatted t_name;
OK
col_name data_type comment
# col_name data_type comment
id int
name string
level int
# Detailed Table Information
Database: gamedw
Owner: root
CreateTime: Wed Mar 21 17:52:22 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://localhost:9000/tmp/20180321
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
numFiles 1
totalSize 123
transient_lastDdlTime 1521625942
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.034 seconds, Fetched: 31 row(s)
You can copy an external table's schema to create a managed (internal) table, or to create another external table; the latter requires the external keyword.
The following copies an external table's schema into a managed table:
hive> create table t_name1 like t_name;
OK
Time taken: 0.15 seconds
To instead copy the schema into a new external table, add the external keyword; you can also specify the location of the external data:
hive> create external table if not exists t_name2 like t_name location '/tmp/20180321'
> ;
OK
Time taken: 0.297 seconds
hive> select * from t_name2;
OK
t_name2.id t_name2.name t_name2.level
1 爷爷 1
2 大爷 1
3 叔叔 1
4 堂哥 2
5 堂姐 2
6 堂妹 3
1 爷爷 1
2 大爷 1
3 叔叔 1
4 堂哥 2
5 堂姐 2
6 堂妹 3
Time taken: 0.206 seconds, Fetched: 12 row(s)
Partitioned tables (managed)
Partitioning data spreads load horizontally and can also physically move data closer to its users.
Create a partitioned table whose partition column is city:
hive> CREATE TABLE `customers`(
> `custname` string,
> `sex` int,
> `age` int)
> PARTITIONED BY (
> `city` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> LINES TERMINATED BY '\n' ;
OK
Time taken: 0.462 seconds
Add partitions (loading data creates partitions automatically, so this step can be skipped):
hive> alter table customers add partition(city='shenzhen');
OK
Time taken: 0.233 seconds
hive> alter table customers add partition(city='beijing');
OK
Time taken: 0.136 seconds
View the table's directory layout:
drwx-wx-wx - root supergroup 0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers/city=beijing
drwx-wx-wx - root supergroup 0 2018-03-23 14:49 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
Load data (the partition column city's value is already encoded in the directory name, so there is no need to store it again inside the files under those directories):
[root@host tmpdata]# cat /root/tmpdata/customers.txt
tianyongtao 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47
hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='shanghai');
Loading data to table gamedw.customers partition (city=shanghai)
OK
Time taken: 0.567 seconds
hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='beijing');
Loading data to table gamedw.customers partition (city=beijing)
OK
Time taken: 0.447 seconds
View the table's directory on HDFS:
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx 1 root supergroup 80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx 1 root supergroup 80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
Query the table:
hive> set hive.cli.print.header=true;
hive> select * from customers;
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 beijing
wangwu 1 85 beijing
zhangsan 1 20 beijing
liuqin 0 56 beijing
wangwu 0 47 beijing
liuyang 1 32 beijing
tianyongtao 1 50 shanghai
wangwu 1 85 shanghai
zhangsan 1 20 shanghai
liuqin 0 56 shanghai
wangwu 0 47 shanghai
liuyang 1 32 shanghai
Time taken: 0.152 seconds, Fetched: 12 row(s)
hive> select * from customers where city='beijing';
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 beijing
wangwu 1 85 beijing
zhangsan 1 20 beijing
liuqin 0 56 beijing
wangwu 0 47 beijing
liuyang 1 32 beijing
Time taken: 0.153 seconds, Fetched: 6 row(s)
Load data into a partition that does not exist yet (for managed tables, loading data creates the partition):
hive> load data local inpath '/root/tmpdata/customers.txt' into table customers partition(city='shenzhen');
Loading data to table gamedw.customers partition (city=shenzhen)
OK
Time taken: 0.439 seconds
Check the table's HDFS directory again:
drwx-wx-wx - root supergroup 0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx 1 root supergroup 80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx 1 root supergroup 80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
drwx-wx-wx - root supergroup 0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx 1 root supergroup 80 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt
hive> select * from customers where city='shenzhen';
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 shenzhen
wangwu 1 85 shenzhen
zhangsan 1 20 shenzhen
liuqin 0 56 shenzhen
wangwu 0 47 shenzhen
liuyang 1 32 shenzhen
Time taken: 0.169 seconds, Fetched: 6 row(s)
hive> select * from customers where city='shenzhen' limit 2;
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 shenzhen
wangwu 1 85 shenzhen
Time taken: 0.175 seconds, Fetched: 2 row(s)
List the partitions of a partitioned table:
hive> show partitions customers;
OK
partition
city=beijing
city=shanghai
city=shenzhen
Time taken: 0.067 seconds, Fetched: 3 row(s)
hive> show partitions customers partition(city='shenzhen');
OK
partition
city=shenzhen
Time taken: 0.078 seconds, Fetched: 1 row(s)
hive> desc formatted customers;
OK
col_name data_type comment
# col_name data_type comment
custname string
sex int
age int
# Partition Information
# col_name data_type comment
city string
# Detailed Table Information
Database: gamedw
Owner: root
CreateTime: Fri Mar 23 14:55:22 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1521788122
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim
line.delim \n
serialization.format
Time taken: 0.026 seconds, Fetched: 34 row(s)
Drop a partition:
hive> alter table customers drop partition(city='shanghai');
Dropped the partition city=shanghai
OK
Time taken: 0.143 seconds
hive> show partitions customers;
OK
partition
city=beijing
city=shenzhen
Time taken: 0.042 seconds, Fetched: 2 row(s)
Check the table's HDFS directory once more:
drwx-wx-wx - root supergroup 0 2018-03-23 15:17 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx 1 root supergroup 80 2018-03-23 14:55 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx - root supergroup 0 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx 1 root supergroup 80 2018-03-23 15:04 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt
External partitioned tables
External tables can be partitioned as well; this is the most common way to manage large production datasets. It gives users a way to share data with other tools while still allowing query optimization and logical data management.
Because users define the file layout themselves, they have much more flexibility in how the directories are used.
Local files:
[root@host tmpdata]# cat log_message0.txt
1 server1 1 aggaggaggewag
2 server2 1 ageggeqagegge
3 server1 2 agegagag4degg
4 server1 5 agetetgdgagggeg
5 server3 7 gahewgjaglewgjew
6 server1 8 agegagegggfewg
7 server2 2 agegwagwggew
8 server1 3 agewgagggfew
9 server1 2 geawtgggggagg
[root@host tmpdata]# cat log_message1.txt
100 server1 2 dsaggegagg
101 server1 3 gaggg
102 server2 5 ageggge
103 server1 1 sagegeg
104 server2 3 agegawggge
105 server2 5 agfeggeg
Create directories on HDFS:
[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/
[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/1
[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018/1
[root@host ~]# hdfs dfs -mkdir /tmp/20180325/2018/1/1 /tmp/20180325/2018/1/2
[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x - root supergroup 0 2018-03-26 11:56 /tmp/20180325/2018/1
drwxr-xr-x - root supergroup 0 2018-03-26 11:56 /tmp/20180325/2018/1/1
drwxr-xr-x - root supergroup 0 2018-03-26 11:56 /tmp/20180325/2018/1/2
Upload the local files to HDFS:
[root@host ~]# hdfs dfs -put /root/tmpdata/log_message0.txt /tmp/20180325/2018/1/1
[root@host ~]# hdfs dfs -put /root/tmpdata/log_message1.txt /tmp/20180325/2018/1/2
[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x - root supergroup 0 2018-03-26 11:56 /tmp/20180325/2018/1
drwxr-xr-x - root supergroup 0 2018-03-26 11:57 /tmp/20180325/2018/1/1
-rw-r--r-- 1 root supergroup 238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x - root supergroup 0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r-- 1 root supergroup 137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt
Create the external partitioned table, with a space as the field delimiter:
hive> create external table log_message
> (id int,server string,level int,message string)
> partitioned by(year int,month int,day int)
> row format delimited fields terminated by ' ';
OK
Time taken: 0.099 seconds
Add a partition:
hive> alter table log_message add partition(year=2018,month=1,day=1)
> location '/tmp/20180325/2018/1/1';
OK
Time taken: 0.066 seconds
hive> show partitions log_message;
OK
partition
year=2018/month=1/day=1
Time taken: 0.042 seconds, Fetched: 1 row(s)
hive> select * from log_message;
OK
log_message.id log_message.server log_message.level log_message.message log_message.year log_message.month log_message.day
1 server1 1 aggaggaggewag 2018 1 1
2 server2 1 ageggeqagegge 2018 1 1
3 server1 2 agegagag4degg 2018 1 1
4 server1 5 agetetgdgagggeg 2018 1 1
5 server3 7 gahewgjaglewgjew 2018 1 1
6 server1 8 agegagegggfewg 2018 1 1
7 server2 2 agegwagwggew 2018 1 1
8 server1 3 agewgagggfew 2018 1 1
9 server1 2 geawtgggggagg 2018 1 1
Time taken: 0.122 seconds, Fetched: 9 row(s)
hive> alter table log_message add partition(year=2018,month=1,day=2)
> location '/tmp/20180325/2018/1/2';
OK
Time taken: 0.116 seconds
hive> show partitions log_message;
OK
partition
year=2018/month=1/day=1
year=2018/month=1/day=2
Time taken: 0.046 seconds, Fetched: 2 row(s)
hive> select * from log_message;
OK
log_message.id log_message.server log_message.level log_message.message log_message.year log_message.month log_message.day
1 server1 1 aggaggaggewag 2018 1 1
2 server2 1 ageggeqagegge 2018 1 1
3 server1 2 agegagag4degg 2018 1 1
4 server1 5 agetetgdgagggeg 2018 1 1
5 server3 7 gahewgjaglewgjew 2018 1 1
6 server1 8 agegagegggfewg 2018 1 1
7 server2 2 agegwagwggew 2018 1 1
8 server1 3 agewgagggfew 2018 1 1
9 server1 2 geawtgggggagg 2018 1 1
100 server1 2 dsaggegagg 2018 1 2
101 server1 3 gaggg 2018 1 2
102 server2 5 ageggge 2018 1 2
103 server1 1 sagegeg 2018 1 2
104 server2 3 agegawggge 2018 1 2
105 server2 5 agfeggeg 2018 1 2
Time taken: 0.14 seconds, Fetched: 15 row(s)
Hive does not care whether the directory for a partition exists or whether it contains any files. The benefit of this behavior is that new data can be written to its own dedicated directory, clearly separated from the data in other directories.
Whether old data is archived or deleted outright, the risk of the new data being tampered with is reduced, because the new subset of data lives in a different directory.
As with non-partitioned external tables, Hive does not control this data: even if the table is dropped, the data is not deleted.
View the table's metadata:
hive> desc formatted log_message;
OK
# col_name data_type comment
id int
server string
level int
message string
# Partition Information
# col_name data_type comment
year int
month int
day int
# Detailed Table Information
Database: gamedw
Owner: root
CreateTime: Mon Mar 26 13:29:08 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://localhost:9000/user/hive/warehouse/gamedw.db/log_message
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1522042148
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim
serialization.format
Time taken: 0.048 seconds, Fetched: 37 row(s)
View the metadata of a specific partition:
hive> desc formatted log_message partition(year=2018,month=1,day=1);
OK
# col_name data_type comment
id int
server string
level int
message string
# Partition Information
# col_name data_type comment
year int
month int
day int
# Detailed Partition Information
Partition Value: [2018, 1, 1]
Database: gamedw
Table: log_message
CreateTime: Mon Mar 26 13:29:39 CST 2018
LastAccessTime: UNKNOWN
Location: hdfs://localhost:9000/tmp/20180325/2018/1/1
Partition Parameters:
numFiles 1
totalSize 238
transient_lastDdlTime 1522042179
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim
serialization.format
Time taken: 0.244 seconds, Fetched: 37 row(s)
You can likewise use the LOAD DATA or INSERT commands and specify a target partition:
[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x - root supergroup 0 2018-03-26 14:08 /tmp/20180325/2018/1
drwxr-xr-x - root supergroup 0 2018-03-26 14:24 /tmp/20180325/2018/1/1
-rw-r--r-- 1 root supergroup 238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x - root supergroup 0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r-- 1 root supergroup 137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt
drwxr-xr-x - root supergroup 0 2018-03-26 14:25 /tmp/20180325/2018/1/3
-rw-r--r-- 1 root supergroup 238 2018-03-26 14:25 /tmp/20180325/2018/1/3/log_message2.txt
hive> load data inpath '/tmp/20180325/2018/1/3' into table log_message partition (year=2018,month=1,day=3);
Loading data to table gamedw.log_message partition (year=2018, month=1, day=3)
OK
Time taken: 0.85 seconds
[root@host ~]# hdfs dfs -ls -R /tmp/20180325
drwxr-xr-x - root supergroup 0 2018-03-26 11:55 /tmp/20180325/2018
drwxr-xr-x - root supergroup 0 2018-03-26 14:25 /tmp/20180325/2018/1
drwxr-xr-x - root supergroup 0 2018-03-26 14:24 /tmp/20180325/2018/1/1
-rw-r--r-- 1 root supergroup 238 2018-03-26 11:57 /tmp/20180325/2018/1/1/log_message0.txt
drwxr-xr-x - root supergroup 0 2018-03-26 11:57 /tmp/20180325/2018/1/2
-rw-r--r-- 1 root supergroup 137 2018-03-26 11:57 /tmp/20180325/2018/1/2/log_message1.txt
hive> desc formatted log_message partition(year=2018,month=1,day=3);
OK
# col_name data_type comment
id int
server string
level int
message string
# Partition Information
# col_name data_type comment
year int
month int
day int
# Detailed Partition Information
Partition Value: [2018, 1, 3]
Database: gamedw
Table: log_message
CreateTime: Mon Mar 26 14:25:48 CST 2018
LastAccessTime: UNKNOWN
Location: hdfs://localhost:9000/user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3
Partition Parameters:
numFiles 1
numRows 0
rawDataSize 0
totalSize 238
transient_lastDdlTime 1522045548
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim
serialization.format
Time taken: 0.153 seconds, Fetched: 39 row(s)
Checking HDFS shows that log_message2.txt has been moved into the Hive warehouse:
drwx-wx-wx - root supergroup 0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message
drwx-wx-wx - root supergroup 0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message/year=2018
drwx-wx-wx - root supergroup 0 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1
drwx-wx-wx - root supergroup 0 2018-03-26 12:01 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=1
drwx-wx-wx - root supergroup 0 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3
-rwx-wx-wx 1 root supergroup 238 2018-03-26 14:25 /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3/log_message2.txt
hive> show partitions log_message;
OK
year=2018/month=1/day=1
year=2018/month=1/day=2
year=2018/month=1/day=3
Time taken: 0.063 seconds, Fetched: 3 row(s)
Drop the external partitioned table and check whether /user/hive/warehouse/gamedw.db/log_message/year=2018/month=1/day=3/log_message2.txt is deleted:
hive> drop table log_message;
OK
Time taken: 1.005 seconds
Checking HDFS shows the loaded data still exists.
Keep in mind that not all table data lives under the Hive warehouse directory, and dropping an external table does not delete its data.
An external table's data is not managed by Hive itself: even though loading may have moved files under /user/hive/warehouse/, Hive does not own them, so after dropping the external table the files under /user/hive/warehouse/ remain.
A managed (internal) table's data is managed by Hive: whether the table is partitioned or not, DROP TABLE deletes the data along with the table.
Customizing a table's storage format
Hive's default storage format is plain text, which can also be specified explicitly with the optional clause stored as textfile; users can also specify a variety of delimiters when creating a table.
Hive's default record and field delimiters are:
\n : for text, each line is one record; newlines separate records
^A : separates fields; written as the octal escape \001 in CREATE TABLE statements
^B : separates the elements of an ARRAY or STRUCT, and separates the key-value pairs of a MAP; written as \002
^C : separates a MAP key from its value; written as \003
Viewing, entering, and changing Hive's default delimiters:
View:
[root@host tmpdata]# cat employees.txt
tianyongtao100wangZHANGLIUaaa10bb5CC8HENANLUOHELINYINGTAICHEN
[root@host tmpdata]# cat -A employees.txt
tianyongtao^A100^Awang^BZHANG^BLIU^Aaaa^C10^Bbb^C5^BCC^C8^AHENAN^BLUOHE^BLINYING^BTAICHEN$
Entering and editing:
[root@host tmpdata]# vim employees.txt
In vim, :set list makes the control characters visible; in insert mode, press Ctrl+V followed by:
Ctrl+A to enter ^A ('\001')
Ctrl+B to enter ^B ('\002')
Ctrl+C to enter ^C ('\003')
The full CREATE TABLE statement is:
CREATE TABLE `employees`(
`name` string,
`salary` float,
`subdinates` array<string>,
`deducation` map<string,float>,
`address` struct<street:string,city:string,state:string,zip:int>)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;
textfile means that every field is stored as plain character data (digits, letters, symbols) and every line is treated as a separate record.
Users can replace textfile with any other file format Hive supports, including sequencefile and rcfile; both use binary encoding and (optional) compression to optimize disk space usage and I/O bandwidth.
Hive distinguishes how records are encoded into files from how columns are encoded into records, and users can customize each behavior separately.
Record encoding is controlled by an InputFormat object.
Record parsing is controlled by a serializer/deserializer, abbreviated SerDe.
For completeness, Hive also uses an OutputFormat object to write query output to files or to the console.
Hive uses an InputFormat to split the input stream into records, an OutputFormat to format records into an output stream (such as query results), and a SerDe to parse each record into columns when reading and to encode columns into a record when writing.
Users can specify third-party input/output formats and SerDes.
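As a sketch, an alternative storage format is chosen in the CREATE TABLE statement (the table name demo_seq is invented for illustration):

```sql
-- store as a binary, optionally compressed sequence file
create table demo_seq(id int, msg string)
stored as sequencefile;
```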
hive> show create table employees;
OK
CREATE TABLE `employees`(
`name` string,
`salary` float,
`subdinates` array<string>,
`deducation` map<string,float>,
`address` struct<street:string,city:string,state:string,zip:int>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/employees'
TBLPROPERTIES (
'creator'='tianyongtao',
'last_modified_by'='root',
'last_modified_time'='1521447397',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1521447397')
Time taken: 0.032 seconds, Fetched: 23 row(s)
In the statement above, ROW FORMAT SERDE names the SerDe in use, and STORED AS INPUTFORMAT ... OUTPUTFORMAT ... name the Java classes used for the input and output formats.
Renaming a table
hive> show tables;
OK
......
role
...............
Time taken: 0.013 seconds, Fetched: 5 row(s)
hive> alter table role rename to roleinfor;
OK
Time taken: 0.162 seconds
hive> show tables;
OK
.........
roleinfor
........................
Time taken: 0.013 seconds, Fetched: 5 row(s)
Changing a partition's location
Before the change:
drwx-wx-wx - root supergroup 0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx 1 root supergroup 80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
-rwx-wx-wx 1 root supergroup 80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_1.txt
-rwx-wx-wx 1 root supergroup 80 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_2.txt
drwx-wx-wx - root supergroup 0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx 1 root supergroup 80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt
The change:
hive> alter table customers partition(city='shanghai')set location 'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers/city=wudu'
> ;
OK
Time taken: 0.174 seconds
Check the result:
hive> show partitions customers;
OK
partition
city=shanghai
city=shenzhen
Time taken: 0.053 seconds, Fetched: 2 row(s)
hive> select * from customers;
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 shenzhen
wangwu 1 85 shenzhen
zhangsan 1 20 shenzhen
liuqin 0 56 shenzhen
wangwu 0 47 shenzhen
liuyang 1 32 shenzhen
Time taken: 0.115 seconds, Fetched: 6 row(s)
Because the partition now points to a new location that contains no data, the city=shanghai partition returns no rows.
Check HDFS:
drwx-wx-wx - root supergroup 0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx - root supergroup 0 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai
-rwx-wx-wx 1 root supergroup 80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers.txt
-rwx-wx-wx 1 root supergroup 80 2018-03-28 16:58 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_1.txt
-rwx-wx-wx 1 root supergroup 80 2018-03-28 17:02 /user/hive/warehouse/gamedw.db/customers/city=shanghai/customers_copy_2.txt
drwx-wx-wx - root supergroup 0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx 1 root supergroup 80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt
This shows that resetting a partition's location neither moves the old data to the new location nor deletes it.
Restore the partition's location:
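If you actually want the data to follow the partition, you have to move the files yourself with HDFS shell commands. A sketch, not verified here, using the paths from the listing above:

```shell
# Create the new partition directory, move the files, then repoint
# the partition with ALTER TABLE ... SET LOCATION as shown above.
hdfs dfs -mkdir -p /user/hive/warehouse/gamedw.db/customers/city=wudu
hdfs dfs -mv '/user/hive/warehouse/gamedw.db/customers/city=shanghai/*' \
             /user/hive/warehouse/gamedw.db/customers/city=wudu/
```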
hive> alter table customers partition(city='shanghai')set location 'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/customers/city=shanghai';
OK
Time taken: 0.151 seconds
Query the table again: the Shanghai rows are back
hive> select * from customers;
OK
customers.custname customers.sex customers.age customers.city
tianyongtao 1 50 shanghai
wangwu 1 85 shanghai
zhangsan 1 20 shanghai
liuqin 0 56 shanghai
wangwu 0 47 shanghai
liuyang 1 32 shanghai
tianyongtao 1 50 shanghai
wangwu 1 85 shanghai
zhangsan 1 20 shanghai
liuqin 0 56 shanghai
wangwu 0 47 shanghai
liuyang 1 32 shanghai
tianyongtao 1 50 shanghai
wangwu 1 85 shanghai
zhangsan 1 20 shanghai
liuqin 0 56 shanghai
wangwu 0 47 shanghai
liuyang 1 32 shanghai
tianyongtao 1 50 shenzhen
wangwu 1 85 shenzhen
zhangsan 1 20 shenzhen
liuqin 0 56 shenzhen
wangwu 0 47 shenzhen
liuyang 1 32 shenzhen
Time taken: 0.114 seconds, Fetched: 24 row(s)
Renaming a column: change age to nianling:
hive> alter table tb_cust change column age nianling int COMMENT 'nianling alter at 20180329';
OK
Time taken: 0.093 seconds
hive> select * from tb_cust;
OK
tb_cust.custname tb_cust.sex tb_cust.nianling tb_cust.city
tianyongtao 1 50 beijing
wangwu 1 85 beijing
zhangsan 1 20 beijing
liuqin 0 56 beijing
wangwu 0 47 beijing
liuyang 1 32 beijing
Time taken: 0.103 seconds, Fetched: 6 row(s)
hive> desc tb_cust;
OK
col_name data_type comment
custname string
sex int
nianling int nianling alter at 20180329
city string
# Partition Information
# col_name data_type comment
city string
Time taken: 0.015 seconds, Fetched: 9 row(s)
-----------------------------------------------------------------------
Reordering columns: move sex after nianling:
hive> alter table tb_cust change column sex sex int after nianling;
OK
Time taken: 0.113 seconds
hive> desc tb_cust;
OK
col_name data_type comment
custname string
nianling int nianling alter at 20180329
sex int
city string
# Partition Information
# col_name data_type comment
city string
Time taken: 0.015 seconds, Fetched: 9 row(s)
Compared with the previous DESCRIBE, the column positions have changed.
When altering a column you must always specify the old column name, even if the name and type are unchanged.
Importantly, reordering columns only changes the table's metadata; the underlying data is not rearranged. If a table already contains data and you reorder its columns, you must rearrange the data files to match.
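ADD COLUMNS and REPLACE COLUMNS behave the same way: they rewrite only the metadata. A sketch (the new column names here are illustrative, not from this article):

```sql
-- Append new columns at the end of the schema (metadata-only change;
-- existing data files are not touched).
ALTER TABLE tb_cust ADD COLUMNS (
  app_name   STRING COMMENT 'application name',
  session_id BIGINT COMMENT 'session id'
);
```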
Altering table properties
hive> alter table mess set tblproperties('note'='message table');
OK
Time taken: 0.173 seconds
CREATE TABLE `mess`(
`aa` string,
`cc` string,
`bb` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/mess'
TBLPROPERTIES (
'last_modified_by'='root',
'last_modified_time'='1522393964',
'note'='message table',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1522393964')
Time taken: 0.039 seconds, Fetched: 21 row(s)
Altering storage properties
hive> alter table mess set fileformat sequencefile ;
OK
Time taken: 0.12 seconds
hive> show create table mess;
OK
createtab_stmt
CREATE TABLE `mess`(
`aa` string,
`cc` string,
`bb` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/mess'
TBLPROPERTIES (
'last_modified_by'='root',
'last_modified_time'='1522394155',
'note'='message table',
'numFiles'='0',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='0',
'transient_lastDdlTime'='1522394155')
Time taken: 0.045 seconds, Fetched: 21 row(s)
Changing the file format of another table:
hive> alter table t_name2 set fileformat sequencefile;
OK
Time taken: 0.09 seconds
hive> select * from t_name2;
OK
t_name2.id t_name2.name t_name2.level
Failed with exception java.io.IOException:java.io.IOException: hdfs://localhost:9000/tmp/20180321/name.csv not a SequenceFile
Time taken: 0.104 seconds
The ALTER itself succeeded; the query fails because the data file of external table t_name2 is not in SequenceFile format.
hive> alter table tb_cust clustered by(city,custname) sorted by(nianling) into 10 buckets;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Invalid column reference // fails: city is a partition column and cannot be used in CLUSTERED BY
hive> alter table tb_cust clustered by(custname) sorted by(nianling) into 10 buckets;
OK
Time taken: 0.126 seconds
In the statement above, SORTED BY is optional; CLUSTERED BY ... INTO ... BUCKETS is required.
You can also specify a new SerDe along with its properties, or modify the properties of the existing SerDe.
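For comparison, bucketing is usually declared at table creation time, and hive.enforce.bucketing should be enabled so that inserts actually produce the declared number of bucket files. A sketch, assuming the same cust schema as in this article:

```sql
-- Make INSERT statements honor the bucket count (needed on older Hive).
SET hive.enforce.bucketing = true;

CREATE TABLE cust_bucketed (custname STRING, sex INT, nianling INT)
CLUSTERED BY (custname) SORTED BY (nianling) INTO 10 BUCKETS;
```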
Hive "execution hooks"
ALTER TABLE ... TOUCH ... is used to trigger execution hooks.
For example: alter table tb_cust touch partition(city='shenzhen'), where the PARTITION clause is for partitioned tables. The typical use case: when a table's data files have been modified outside of Hive, TOUCH fires the hooks.
If new log data has been written into a partition, you can run:
hive -e "alter table tb_cust touch partition(city='shenzhen');"
(I have not yet managed to verify this successfully.)
If the table or partition does not exist, this statement will not create it either.
ALTER TABLE ... ARCHIVE PARTITION packs the partition's files into a single Hadoop archive (HAR) file. This only reduces the number of files, easing the load on the NameNode; it does not save any storage space.
hive> set hive.archive.enabled=true;
hive> alter table tb_cust archive partition(city='shenzhen');
intermediate.archived is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen
in hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_cust/city=shenzhen/.hive-staging_hive_2018-04-02_15-54-07_256_1308807572832742090-1/-ext-10000/partlevel
Please wait... (this may take a while)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives
This errors out; to be investigated.
UNARCHIVE reverses the operation.
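The reverse operation would look like:

```sql
SET hive.archive.enabled = true;
-- Unpack the HAR file back into ordinary partition files.
ALTER TABLE tb_cust UNARCHIVE PARTITION (city='shenzhen');
```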
Hive offers protections that prevent a partition from being queried or dropped:
alter table tb_cust partition(city='shenzhen') ENABLE NO_DROP;
alter table tb_cust partition(city='shenzhen') ENABLE OFFLINE;
Use DISABLE to reverse them. (My attempts above failed verification.) These operations apply only to partitioned tables.
Data manipulation
Loading data into managed tables
Hive has no row-level insert, update, or delete. The only ways to get data into a table are bulk load operations, or writing files directly into the table's directory by other means.
hive> show partitions tb_cust;
OK
city=beijing
city=shenzhen
Time taken: 0.063 seconds, Fetched: 2 row(s)
hive> load data local inpath '/root/tmpdata/customers.txt' into table tb_cust partition(city='guangzhou');
Loading data to table gamedw.tb_cust partition (city=guangzhou)
OK
Time taken: 0.781 seconds
hive> show partitions tb_cust;
OK
city=beijing
city=guangzhou
city=shenzhen
Time taken: 0.059 seconds, Fetched: 3 row(s)
If the partition does not exist, the LOAD statement creates it first and then copies the data into its directory.
For a non-partitioned table, omit the PARTITION clause.
The path given is usually a directory rather than a single file; Hive copies all files in it into the target directory:
[root@host ~]# ls /root/tmpdata/suzhou
customers2.txt customers.txt
hive> load data local inpath '/root/tmpdata/suzhou' into table tb_cust partition(city='suzhou');
Loading data to table gamedw.tb_cust partition (city=suzhou)
OK
Time taken: 0.596 seconds
hive> show partitions tb_cust;
OK
city=beijing
city=guangzhou
city=shenzhen
city=suzhou
Time taken: 0.055 seconds, Fetched: 4 row(s)
View the suzhou partition in HDFS:
drwx-wx-wx - root supergroup 0 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou
-rwx-wx-wx 1 root supergroup 96 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou/customers.txt
-rwx-wx-wx 1 root supergroup 96 2018-04-02 16:59 /user/hive/warehouse/gamedw.db/tb_cust/city=suzhou/customers2.txt
With LOCAL, the path refers to the local filesystem; omit LOCAL and the path refers to the distributed filesystem.
Hive requires the source files and the target directory to be in the same filesystem; for example, LOAD DATA cannot move data from one cluster's HDFS to another cluster's HDFS.
You can specify a full path or a relative path.
With the OVERWRITE keyword, any data previously in the target directory is deleted. Without it, new files are simply added to the target directory and existing data is kept; however, if an existing file has the same name as a file being loaded, the old file is overwritten. (In practice, without OVERWRITE, same-named files are kept and the new file is renamed to filename_copy_N.)
The INPATH directory may not contain any subdirectories.
Hive does not verify that the loaded data matches the table's schema, but it does verify that the file format matches the table definition.
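Putting the pieces together, a load that replaces a partition's existing contents would look like this (a sketch reusing the path and table from above):

```sql
-- OVERWRITE first clears the city=guangzhou partition directory,
-- then copies the local file in.
LOAD DATA LOCAL INPATH '/root/tmpdata/customers.txt'
OVERWRITE INTO TABLE tb_cust PARTITION (city='guangzhou');
```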
Inserting data into a table with a query
hive> select * from cust;
OK
cust.custname cust.sex cust.nianling
tianyt_touch100 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47
liuyang 1 32
hello 0 100
Time taken: 0.151 seconds, Fetched: 7 row(s)
hive> create external table tb_cust_externl
> (custname string,sex int,nianling int) row format delimited fields terminated by ' '
> location '/root/tmpdata/customers.txt';
OK
Time taken: 0.288 seconds
[root@host ~]# hdfs dfs -ls -R /tmp/201804*
-rw-r--r-- 1 root supergroup 14 2018-04-12 16:50 /tmp/20180410/customer1.txt
hive> alter table tb_cust_externl set location '/tmp/20171024/customers1.txt';
OK
Time taken: 0.146 seconds
hive> select * from tb_cust_externl;
OK
tb_cust_externl.custname tb_cust_externl.sex tb_cust_externl.nianling
nihao 1 5
Time taken: 0.176 seconds, Fetched: 1 row(s)
Check HDFS:
drwxr-xr-x - root supergroup 0 2018-04-13 11:41 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x 1 root supergroup 96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
Inserting data via Spark:
scala> hivecon.sql("insert into table gamedw.cust select * from gamedw.tb_cust_externl").show
18/04/13 11:50:56 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
++
||
++
++
hive> select * from cust;
OK
cust.custname cust.sex cust.nianling
tianyt_touch100 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47
liuyang 1 32
hello 0 100
nihao 1 5
Time taken: 0.137 seconds, Fetched: 8 row(s)
The row nihao from external table tb_cust_externl was inserted successfully.
Now check HDFS again:
drwxr-xr-x - root supergroup 0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x 1 root supergroup 96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
-rwxr-xr-x 1 root supergroup 10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000
The cust directory has gained a new file; let's look at its contents:
[root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000
nihao 1 5
In other words, inserted data produces a new file.
If the INSERT uses OVERWRITE instead of INTO, the managed table's existing data is overwritten.
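From within Hive itself (rather than Spark), the same kind of insert is written with INSERT ... SELECT; Hive also supports a multi-insert form that scans the source table once. A sketch using the tables from this article (the target table cust_by_city is hypothetical):

```sql
-- Single insert: append query results to the managed table.
INSERT INTO TABLE gamedw.cust
SELECT custname, sex, nianling FROM gamedw.tb_cust_externl;

-- Multi-insert: one scan of tb_cust feeding two partitions of a
-- hypothetical partitioned table cust_by_city.
FROM gamedw.tb_cust tc
INSERT OVERWRITE TABLE cust_by_city PARTITION (city='beijing')
  SELECT tc.custname, tc.sex, tc.nianling WHERE tc.city = 'beijing'
INSERT OVERWRITE TABLE cust_by_city PARTITION (city='shenzhen')
  SELECT tc.custname, tc.sex, tc.nianling WHERE tc.city = 'shenzhen';
```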