【原创】大叔经验分享(7)创建hive表时格式如何选择
常用格式
textfile
需要定义分隔符,占用空间大,读写效率最低,非常容易发生冲突(分隔符)的一种格式,基本上只有需要导入数据的时候才会使用,比如导入csv文件;
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
json
hive3.0后官方支持json格式,之前需要使用第三方,导入jar,http://www.congiu.net/hive-json-serde/,
add jar hdfs://nn/jarpath/json-udf-1.3.8-jar-with-dependencies.jar;
add jar hdfs://nn/jarpath/json-serde-1.3.8-jar-with-dependencies.jar;
占用空间最大,读写效率低,基本上只有需要导入数据的时候才会使用,比如导入json文件;
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
lzo
相比textfile多了lzo压缩,占用空间更小;
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
orc
列式存储,占用空间最小,非常适合用来做数仓;
STORED AS ORC
压缩
STORED AS ORC TBLPROPERTIES ("orc.compression"="ZLIB")
STORED AS ORC TBLPROPERTIES ("orc.compression"="SNAPPY")
注意设置orc压缩格式前一定要先设置:
set hive.exec.orc.compression.strategy=COMPRESSION;
否则压缩不生效;
parquet
列式存储,占用空间居中,如果后期使用spark来处理,parquet是最佳格式;
STORED AS PARQUET
parquet+snappy
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY")
对比测试
测试表:test_table
测试行数:10亿
测试sql类型:aggregation
测试sql:select col_1, count(1) from test_table group by col_1;
测试结果
|
fs |
hdfs |
kudu |
|||||
|
format |
textfile |
lzo |
parquet |
parquet snappy |
orc |
orc snappy |
|
|
capacity |
464.0 G |
169.4 G |
177.2 G |
111.3 G |
71.5 G |
65.7G |
184 G |
|
100% |
36% |
37% |
23% |
15% |
14% |
39% |
|
|
Hive2.3.4 |
816 s |
711 s |
250 s |
158 s |
130 s |
127 s |
|
|
Hive2.3.4 Tuning |
251 s |
163 s |
109 s |
96 s |
|||
|
Hive2.3.4 On spark2.4.0 |
54 s |
47 s |
149 s |
138 s |
|||
|
Spark2.1.1 |
371 s |
293 s |
17 s |
16 s |
51 s |
||
|
Spark2.4.0 |
496 s |
297 s |
16 s |
16 s |
21 s |
21 s |
|
|
Drill1.15.0 |
59 s |
57 s |
75 s |
45 s |
|||
|
Impala2.12 |
15 s |
16 s |
|||||
|
Presto0.215 |
25 s |
21 s |
13 s |
12 s |
|||
|
|
|||||||
|
|
|||||||
- 从数据大小和查询效率上看,表现最好的是presto+orc+snappy;
- hive下最佳格式为orc snappy,数据大小最小,并且查询最快;
- hive切换engine为spark后,对parquet格式的查询有一些提升,但是占用相同资源的情况下,远不如直接使用spark sql快;
- spark2.3以后对orc格式相比之前有很大优化,已经很接近parquet格式;
- impala+parquet+hdfs的性能和impala+kudu差不多,kudu的好处是支持实时更新;
- drill看起来没有必要;
- spark2.4.0中的parquet为2.4,parquet从2.5开始支持column index,预计以后的spark版本对parquet的查询会更快;
- impala对orc的支持从3.1开始作为实验功能的一部分;
详细数据
yarn 200g 50core
1 hive-2.3.4
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3072m;
hive-textfile:
Time taken: 816.202 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 1831 Reduce: 1009 Cumulative CPU: 27614.77 sec HDFS Read: 498267775168 HDFS Write: 88861 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 40 minutes 14 seconds 770 msec
hive-lzo:
Time taken: 711.266 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 183 Reduce: 711 Cumulative CPU: 13949.24 sec HDFS Read: 181881436157 HDFS Write: 62935 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 52 minutes 29 seconds 240 msec
hive-orc:
Time taken: 130.194 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 4368.67 sec HDFS Read: 626004573 HDFS Write: 27178 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 48 seconds 670 msec
hive-orc snappy:
Time taken: 127.803 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 4374.74 sec HDFS Read: 580889407 HDFS Write: 25090 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 54 seconds 740 msec
hive-orc-tuning:
Time taken: 109.539 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 3051.67 sec HDFS Read: 627064673 HDFS Write: 40321 SUCCESS
Total MapReduce CPU Time Spent: 50 minutes 51 seconds 670 msec
hive-orc snappy-tuning:
Time taken: 94.135 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 2393.92 sec HDFS Read: 581727151 HDFS Write: 37201 SUCCESS
Total MapReduce CPU Time Spent: 39 minutes 53 seconds 920 msec
hive-parquet:
Time taken: 250.786 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 642 Reduce: 744 Cumulative CPU: 10919.85 sec HDFS Read: 873784253 HDFS Write: 65806 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 1 minutes 59 seconds 850 msec
hive-parquet snappy:
Time taken: 158.009 seconds, Fetched: 32 row(s)
Stage-Stage-1: Map: 367 Reduce: 467 Cumulative CPU: 6246.0 sec HDFS Read: 721915438 HDFS Write: 41707 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 44 minutes 6 seconds 0 msec
2 hive-2.3.4 on spark-2.4.0
set spark.driver.memory=4g;
set spark.executor.memory=4g;
set spark.executor.instances=10;
hive on spark-parquet:
Time taken: 54.446 seconds, Fetched: 32 row(s)
hive on spark-parquet snappy:
Time taken: 47.364 seconds, Fetched: 32 row(s)
hive on spark-orc:
Time taken: 149.901 seconds, Fetched: 32 row(s)
hive on spark-orc snappy:
Time taken: 138.844 seconds, Fetched: 32 row(s)
3 impala-2.12
MEM_LIMIT=20g * 3
impala-parquet snappy:
Fetched 32 row(s) in 15.10s
+--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+
| 04:EXCHANGE | 1 | 211.45us | 211.45us | 32 | 50 | 208.00 KB | 0 B | UNPARTITIONED |
| 03:AGGREGATE | 3 | 2.58ms | 2.91ms | 32 | 50 | 34.03 MB | 128.00 MB | FINALIZE |
| 02:EXCHANGE | 3 | 29.23us | 30.92us | 96 | 1.04B | 32.00 KB | 0 B | HASH(cpp_addr_province) |
| 01:AGGREGATE | 3 | 13.29s | 13.97s | 96 | 1.04B | 34.05 MB | 128.00 MB | STREAMING |
| 00:SCAN HDFS | 3 | 723.09ms | 760.01ms | 1.04B | 1.04B | 36.55 MB | 88.00 MB | temp.app_ba_userprofile_prop_nonpolar_view_ext_ps |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+
impala-kudu:
Fetched 32 row(s) in 15.61s
4 drill-1.15
10g+10g+1g+1g * 3
drill-parquet:
32 rows selected (59.501 seconds)
drill-parquet snappy:
32 rows selected (57.653 seconds)
drill-orc:
32 rows selected (75.749 seconds)
drill-orc snappy:
32 rows selected (45.323 seconds)
5 spark-sql --master yarn --num-executors 10 --executor-memory 4g --driver-memory 4g
5.1 spark-2.1.1
spark sql-textfile:
Time taken: 371.77 seconds, Fetched 32 row(s)
spark sql-lzo:
Time taken: 293.391 seconds, Fetched 32 row(s)
spark sql-parquet:
Time taken: 17.338 seconds, Fetched 32 row(s)
spark sql-parquet snappy:
Time taken: 16.609 seconds, Fetched 32 row(s)
spark sql-orc:
Time taken: 51.959 seconds, Fetched 32 row(s)
5.2 spark-2.4.0
spark sql-textfile:
Time taken: 496.395 seconds, Fetched 32 row(s)
spark sql-lzo:
Time taken: 297.142 seconds, Fetched 32 row(s)
spark sql-parquet:
Time taken: 16.728 seconds, Fetched 32 row(s)
spark sql-parquet snappy:
Time taken: 16.879 seconds, Fetched 32 row(s)
spark sql-orc:
Time taken: 21.432 seconds, Fetched 32 row(s)
spark sql-orc snappy:
Time taken: 21.935 seconds, Fetched 32 row(s)
6 presto
presto-parquet:
Splits: 3,182 total, 3,182 done (100.00%)
0:25 [1.04B rows, 612MB] [42.2M rows/s, 24.9MB/s]
presto-parquet snappy:
Splits: 2,088 total, 2,088 done (100.00%)
0:21 [1.04B rows, 584MB] [49.3M rows/s, 27.8MB/s
presto-orc:
Splits: 1,532 total, 1,532 done (100.00%)
0:13 [1.04B rows, 850MB] [81.7M rows/s, 66.8MB/s]
presto-orc snappy:
Splits: 1,353 total, 1,353 done (100.00%)
0:12 [1.04B rows, 1.13GB] [87.5M rows/s, 97.4MB/s]
【原创】大叔经验分享(7)创建hive表时格式如何选择的更多相关文章
- 【原创】大叔经验分享(8)创建hive表时用内部表还是外部表
内部表和外部表最主要的一个差别就是删除表或者删除分区时,底层的文件是否自动删除,内部表会自动删除,外部表不会自动删除,所以基础数据表一定要用外部表,即使误删表或分区之后,还可以很容易的恢复回来. 虽然 ...
- 【原】创建Hive表,分号分隔符“;”引起的异常
[障碍再现] 在创建支持Map数据结构的Hive表时,抛出如下异常 hive> create table tab_map(name string,info map<string,strin ...
- 【原创】经验分享:一个小小emoji尽然牵扯出来这么多东西?
前言 之前也分享过很多工作中踩坑的经验: 一个线上问题的思考:Eureka注册中心集群如何实现客户端请求负载及故障转移? [原创]经验分享:一个Content-Length引发的血案(almost.. ...
- 【原创】大叔经验分享(25)hive通过外部表读写hbase数据
在hive中创建外部表: CREATE EXTERNAL TABLE hive_hbase_table(key string, name string,desc string) STORED BY ' ...
- 【原创】大叔经验分享(65)spark读取不到hive表
spark 2.4.3 spark读取hive表,步骤: 1)hive-site.xml hive-site.xml放到$SPARK_HOME/conf下 2)enableHiveSupport Sp ...
- 【原创】大叔经验分享(60)hive和spark读取kudu表
从impala中创建kudu表之后,如果想从hive或spark sql直接读取,会报错: Caused by: java.lang.ClassNotFoundException: com.cloud ...
- 【原创】大叔经验分享(62)kudu副本数量
kudu的副本数量是在表上设置,可以通过命令查看 # sudo -u kudu kudu cluster ksck $master ... Summary by table Name | RF | S ...
- 【原创】大叔经验分享(52)ClouderaManager修改配置报错
Cloudera Manager中修改配置可能报错: Incorrect string value: '\xE7\xA8\x8B\xE5\xBA\x8F...' for column 'MESSAGE ...
- [Hive]使用HDFS文件夹数据创建Hive表分区
描写叙述: Hive表pms.cross_sale_path建立以日期作为分区,将hdfs文件夹/user/pms/workspace/ouyangyewei/testUsertrack/job1Ou ...
随机推荐
- Elasticsearch 通关教程(三): 索引别名Aliases问题
业务问题 业务需求是不断变化迭代的,也许我们之前写的某个业务逻辑在下个版本就变化了,我们可能需要修改原来的设计,例如数据库可能需要添加一个字段或删减一个字段,而在搜索中也会发生这件事,即使你认为现在的 ...
- Linux(Ubuntu)使用日记------自定义命令的使用
Linux如何自定义自己的命令呢?修改 系统中的 ~/.bashrc 文件即可 在这个文件最后面使用alias命令重定义命令. 例如: # novel-git begin alias n.r='les ...
- 其他综合-使用Xshell远程连接管理Linux实践
使用Xshell远程连接管理Linux实践 1. Xshell整体优化 1)点击 工具 ,然后选择 选项 2)在 常规 选项中,下面的存放路径根据个人爱好修改(可选默认) 3)在 更新 选项中,将 √ ...
- 7.docker日志收集
默认情况下,docker logs或者docker service logs显示命令的输出,就像在终端中以交互方式运行命令时一样.UNIX和Linux命令通常开在运行时间上三个I / O流,所谓的 S ...
- codeforces660C
Hard Process CodeForces - 660C You are given an array a with n elements. Each element of a is either ...
- 后缀自动机(SAM)学习笔记
目录 定义 SAM 的状态集 一些性质 SAM 的后缀链接 SAM 的转移函数 一些性质 算法构造 构造方法 时间复杂度证明 状态的数量 转移的数量 代码实现 实际应用 统计本质不同的子串个数 计算任 ...
- 深入理解ES6箭头函数中的this
简要介绍:箭头函数中的this,指向与一般function定义的函数不同,比较容易绕晕,箭头函数this的定义:箭头函数中的this是在定义函数的时候绑定,而不是在执行函数的时候绑定. 1.何为定义时 ...
- python 类的私有变量和私有方法
#!/usr/bin/env python # -*- coding:utf-8 -*- # @Time : 2017/11/08 8:46 # @Author : lijunjiang # @Fil ...
- Mock7 moco框架重定向
新建一个startupWithRedirect.json [ { "description": "重定向到百度", "request": { ...
- django+mysql简单总结
1.工程下建立APP(以WIN10+PYTHON3.6为例) C:\Users\WYS>django-admin startproject myweb #建立项目 C:\Users\WYS&g ...