[Original] Big Data Basics: Hive (5) Performance Tuning
1 compress & mr
Hive's default execution engine is MR:
hive> set hive.execution.engine;
hive.execution.engine=mr
So tuning MR is also tuning Hive, for example compression and temporary directories.
mapred-site.xml
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
1.1 Enable compression in Hive
set hive.exec.compress.output=true;
This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) are compressed. The compression codec and other options are determined from the Hadoop configuration variables mapred.output.compress*.
set hive.exec.compress.intermediate=true;
This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop configuration variables mapred.output.compress*.
2 decompose
Decomposing table data sets into more manageable parts (files or directories) lets Hive skip data that the WHERE condition rules out. There are three ways:
2.1 partition
Hive Partitioning provides a way of segregating hive table data into multiple files/directories.
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
A common example is partitioning by time:
PARTITIONED BY(dt STRING, country STRING)
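As a concrete sketch (table and column names are hypothetical, loosely following the Hive wiki example), a partitioned table gets one directory per (dt, country) combination, and a partition filter prunes the scan to that directory:
CREATE TABLE page_view (viewTime INT, userid BIGINT, page_url STRING)
PARTITIONED BY (dt STRING, country STRING)
STORED AS ORC;
-- static partition load (page_view_staging is hypothetical)
INSERT OVERWRITE TABLE page_view PARTITION (dt='2019-03-01', country='US')
SELECT viewTime, userid, page_url FROM page_view_staging;
-- reads only .../dt=2019-03-01/country=US/ instead of the whole table
SELECT count(*) FROM page_view WHERE dt='2019-03-01' AND country='US';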
2.2 bucket
Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided.
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets.
Generally, in the table directory, each bucket is just a file, and bucket numbering is 1-based.
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
set hive.enforce.bucketing = true; (for Hive 0.x and 1.x)
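A minimal sketch (hypothetical table) of bucketed DDL together with the cheap sampling it enables:
CREATE TABLE user_views (userid BIGINT, viewTime INT, page_url STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
STORED AS ORC;
-- populate through INSERT so rows are hashed into bucket files
-- (on Hive 0.x/1.x, run set hive.enforce.bucketing=true; first)
INSERT OVERWRITE TABLE user_views SELECT userid, viewTime, page_url FROM raw_views;
-- sampling reads a single bucket file instead of scanning all 32
SELECT * FROM user_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);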
References:
https://cwiki-test.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
https://data-flair.training/blogs/bucketing-in-hive/
2.3 skewed
This feature can be used to improve performance for tables where one or more columns have skewed values. By specifying the values that appear very often (heavy skew) Hive will split those out into separate files (or directories in case of list bucketing) automatically and take this fact into account during queries so that it can skip or include the whole file (or directory in case of list bucketing) if possible.
SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES]
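For example (hypothetical table; quoting of the skewed values follows the column type):
CREATE TABLE skewed_events (key STRING, value STRING)
SKEWED BY (key) ON ('1','5','6') STORED AS DIRECTORIES;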
3 index
Indexing is removed since Hive 3.0
Indexing was removed in version 3.0. The alternatives are materialized views (familiar to anyone who knows Oracle) or columnar storage formats (e.g. Parquet, ORC).
There are alternate options which might work similarly to indexing:
- Materialized views with automatic rewriting can result in very similar results. Hive 2.3.0 adds support for materialized views.
- Using columnar file formats (Parquet, ORC) – they can do selective scanning; they may even skip entire files/blocks.
3.1 Materialized view
Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the pre-computation of relevant summaries or materialized views.
Using a materialized view, the optimizer can compare old and new tables, rewrite queries to accelerate processing, and manage maintenance of the materialized view when data updates occur. The optimizer can use a materialized view to fully or partially rewrite projections, filters, joins, and aggregations. Hive stores materialized views in the Hive warehouse or Druid.
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name AS <query>;
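A minimal sketch (hypothetical sales table; in Hive 3 the source tables must be transactional), assuming automatic rewriting is enabled (hive.materializedview.rewriting, on by default):
CREATE MATERIALIZED VIEW mv_sales_by_day AS
SELECT dt, sum(amount) AS total FROM sales GROUP BY dt;
-- the optimizer can answer this from mv_sales_by_day instead of scanning sales
SELECT dt, sum(amount) FROM sales GROUP BY dt;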
References:
https://cwiki.apache.org/confluence/display/Hive/Materialized+views
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
4 format
Data storage formats.
4.1 lzo
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
4.2 orc
STORED AS ORC
STORED AS ORC TBLPROPERTIES ("orc.compression"="ZLIB")
Note: before setting the ORC compression format, be sure to first set:
set hive.exec.orc.compression.strategy=COMPRESSION;
otherwise the compression setting does not take effect.
4.2.1 zero-copy reads
set hive.orc.zerocopy=true;
ORC can use the new HDFS Caching APIs and the ZeroCopy readers to avoid extra data copies into memory while scanning files.
4.3 parquet
STORED AS PARQUET
4.4 snappy
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY")
STORED AS ORC TBLPROPERTIES ("orc.compression"="SNAPPY")
For LZO support details, see: https://www.cnblogs.com/barneywill/p/10439181.html
For a comparison test of these formats, see: https://www.cnblogs.com/barneywill/p/10109508.html
5 vectorization
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;
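To check whether a query actually vectorizes (Hive 2.3+), EXPLAIN VECTORIZATION annotates the plan; a sketch against the hypothetical table above:
-- requires a vectorizable input format such as ORC
EXPLAIN VECTORIZATION ONLY
SELECT count(*) FROM user_views WHERE viewTime > 100;
-- the output marks map/reduce work as vectorized, or lists the reason it is not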
Reference: https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
6 join
6.1 Common Join
That is, a shuffle join.
Mappers do a parallel sort of the tables on the join keys, which are then passed on to the reducers. All tuples with the same key are given to the same reducer. A reducer may get tuples for more than one key. The key for a tuple also includes a table id, so sorted output from two different tables with the same key can be recognized. Reducers merge the sorted streams to produce the join output.
6.2 Map Join
Useful for star-schema joins, this join algorithm keeps all of the small tables (dimension tables) in memory in all of the mappers, and the big table (fact table) is streamed over them in the mapper. This avoids the shuffle cost inherent in the common join. For each small table (dimension table), a hash table is created using the join key as the hash-table key.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000000;
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:
- Local work:
- read records via standard table scan (including filters and projections) from source on local machine
- build hashtable in memory
- write hashtable to local disk
- upload hashtable to dfs
- add hashtable to distributed cache
- Map task
- read hashtable from local disk (distributed cache) into memory
- match records' keys against hashtable
- combine matches and write to output
- No reduce task
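With auto-conversion on, Hive rewrites qualifying joins by itself; the legacy hint form below still requests a map join explicitly (honored only when hive.ignore.mapjoin.hint=false). A sketch on hypothetical tables:
-- dim_users must fit under hive.auto.convert.join.noconditionaltask.size
SELECT /*+ MAPJOIN(u) */ o.order_id, u.country
FROM fact_orders o JOIN dim_users u ON o.user_id = u.user_id;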
6.3 Bucket Map Join
If the join keys of a map join are bucketed, then instead of keeping the whole small table (dimension table) in every mapper, only the matching buckets are kept. This reduces the memory footprint of the map join.
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
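A sketch (hypothetical tables): both sides bucketed on the join key, with bucket counts that are multiples of each other, so each mapper loads only the matching bucket of the small table:
CREATE TABLE fact_orders (order_id BIGINT, user_id INT, amount DOUBLE)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
CREATE TABLE dim_users (user_id INT, country STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS;
SELECT o.order_id, u.country
FROM fact_orders o JOIN dim_users u ON o.user_id = u.user_id;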
6.4 SMB Join
This is an optimization on Bucket Map Join; if data to be joined is already sorted on joining keys then hash table creation is avoided and instead a sort merge join algorithm is used.
set hive.optimize.bucketmapjoin.sortedmerge = true;
SMB joins are used wherever the tables are sorted and bucketed. The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map join. However, if the tables are partitioned, there could be a slowdown, as each mapper would need to get a very small chunk of a partition which has a single key.
6.5 Skew Join
If the distribution of data is skewed for some specific values, join performance may suffer, since some instances of the join operator (reducers in the map-reduce world) may get overloaded while others are underutilized. Given a user hint, Hive rewrites a join query around the skew value as a union of joins.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
Whether to enable skew join optimization. The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce job, process those skewed keys. The same key need not be skewed for all the tables, and so, the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a map-join.
6.6 semi join
LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore.
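A sketch (hypothetical tables) of the semi-join form next to its IN-subquery equivalent:
-- equivalent to: SELECT o.* FROM orders o WHERE o.user_id IN (SELECT user_id FROM vip_users)
SELECT o.* FROM orders o LEFT SEMI JOIN vip_users v ON (o.user_id = v.user_id);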
References:
https://medium.com/hotels-com-technology/skew-join-optimization-in-hive-b66a1f4cc6ba
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
7 merge small files (too many mappers)
set hive.hadoop.supports.splittable.combineinputformat=true;
Whether to combine small input files so that fewer mappers are spawned.
set mapreduce.input.fileinputformat.split.minsize=1;
set mapreduce.input.fileinputformat.split.maxsize=256000000;
The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.
set hive.merge.mapfiles=true;
Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true;
Merge small files at the end of a map-reduce job.
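Two related thresholds are usually tuned together (the values shown are the common defaults; treat them as a starting point to adjust per cluster):
set hive.merge.size.per.task=256000000;
-- merging kicks in when the average output file size is below this
set hive.merge.smallfiles.avgsize=16000000;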
8 Stats & CBO(Cost-Based Optimizer)
Most of the existing query optimizations in Hive are about minimizing shuffling cost. Currently the user has to submit an optimized query to Hive, with the right join order, for it to be executed efficiently. Logical optimizations in Hive are limited to filter push-down, projection pruning, and partition pruning. Cost-based logical optimizations can significantly improve Apache Hive's query latency and ease of use.
Join reordering and join algorithm selection are few of the optimizations that can benefit from a cost based optimizer. Cost based optimizer would free up user from having to rearrange joins in the right order or from having to specify join algorithm by using query hints and configuration options. This can potentially free up users to model their reporting and ETL needs close to business process without having to worry about query optimizations.
Calcite is an open source cost based query optimizer and query execution framework. Calcite currently has more than fifty query optimization rules that can rewrite query tree, and an efficient plan pruner that can select cheapest query plan in an optimal manner.
CBO will be introduced into Hive in a phased manner. In the first phase, Calcite is used to reorder joins and to pick the right join algorithm so as to reduce query latency. Table cardinality and boundary statistics are used for these cost-based optimizations.
Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.
The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, a query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.
Calcite applies various optimizations, such as query rewriting, JOIN reordering, deriving implied predicates, and JOIN elimination, to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based on heuristics over the number of distinct values.
Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, and then executed on the Hadoop cluster.
Enabling Cost-Based Optimization
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
Generating Hive Statistics
ANALYZE TABLE [table_name] COMPUTE STATISTICS;
ANALYZE TABLE [table_name] PARTITION(partition_column) COMPUTE STATISTICS;
ANALYZE TABLE [table_name] COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
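A concrete sketch against the hypothetical partitioned table from section 2.1:
ANALYZE TABLE page_view PARTITION(dt='2019-03-01', country='US') COMPUTE STATISTICS;
ANALYZE TABLE page_view COMPUTE STATISTICS FOR COLUMNS userid, page_url;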
Viewing Generated Statistics
DESCRIBE [EXTENDED] table_name;
DESCRIBE FORMATTED [db_name.]table_name.column_name;
Reference: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive
9 correlation
set hive.optimize.correlation=true;
In Hadoop environments, a SQL query submitted to Hive is evaluated in a distributed system. Thus, after generating a query operator tree representing the submitted SQL query, Hive needs to determine which operations can be executed in a single task, evaluated on a single node. Also, since a MapReduce job can shuffle data only once, Hive needs to cut the tree into multiple MapReduce jobs. It is important to cut an operator tree into multiple MapReduce jobs in a good way, so that the generated plan evaluates the query efficiently.
In a more complex query, correlation-unaware query planning can generate a very inefficient execution plan and result in poor performance.
参考:https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer
10 write good sql
10.1 execution plan
Learn to read the execution plan:
hive> explain $sql;
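A sketch of what to look for (hypothetical query): EXPLAIN prints the stage DAG and the operator tree of each stage.
EXPLAIN
SELECT u.country, count(*)
FROM fact_orders o JOIN dim_users u ON o.user_id = u.user_id
GROUP BY u.country;
-- check: number of stages, join type chosen (Map Join vs Common Join),
-- partition pruning on the TableScan, and the Statistics annotations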
10.2 good practice
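A few commonly cited practices, sketched on the hypothetical tables from earlier sections:
-- filter on partition columns so directories are pruned instead of scanned
SELECT page_url FROM page_view WHERE dt='2019-03-01';
-- select only the columns you need; with ORC/Parquet, unused columns are never read
SELECT userid, viewTime FROM user_views WHERE viewTime > 100;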
11 engine
Switch the execution engine:
set hive.execution.engine=spark;
12 other
12.1 parallel
set hive.exec.parallel=true;
Whether to execute jobs in parallel. Applies to MapReduce jobs that can run in parallel, for example jobs processing different source tables before a join. As of Hive 0.14, also applies to move tasks that can run in parallel, for example moving files to insert targets during multi-insert.
12.2 limit is very slow
By default, a LIMIT query may be converted into a local fetch task. For a LIMIT with complex predicates on a large table, the local fetch can be very slow; set hive.fetch.task.conversion=none to force the query to be submitted to YARN. The property and its supported values:
set hive.fetch.task.conversion=more;
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
set hive.fetch.task.conversion.threshold=1073741824;
Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
12.3 sample
When you just want a quick look at a subset of the data:
set hive.limit.optimize.enable=true;
Whether to enable the optimization of trying a smaller subset of data first for a simple LIMIT.
set hive.optimize.sampling.orderby=true;
Uses sampling on order-by clause for parallel execution.
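Two sampling sketches (hypothetical tables); block sampling works without buckets, and bucket sampling reuses the table from section 2.2:
-- roughly 1 percent of input blocks
SELECT * FROM page_view TABLESAMPLE(1 PERCENT);
-- read one bucket file out of 32
SELECT * FROM user_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) LIMIT 100;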
Tuning settings:
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.vectorized.execution.enabled = true;
set hive.optimize.skewjoin=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.optimize.correlation=true;
set hive.exec.parallel=true;
set hive.orc.zerocopy=true;
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask = true;
set hive.map.aggr=true;
set hive.optimize.reducededuplication=true;
set hive.optimize.distinct.rewrite=true;
set hive.optimize.groupby=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.orcfile.stripe.level=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.mapjoin.lazy.hashtable=true;
References:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
https://streever.atlassian.net/wiki/spaces/HADOOP/pages/1933314/Hive+Performance+Tips
https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hive-performance-tuning/content/ch_cost-based-optimizer.html