Spark SQL configuration
# export by:
spark.sql("SET -v").show(n=200, truncate=False)
key | value | meaning |
---|---|---|
spark.sql.adaptive.enabled | false | When true, enable adaptive query execution. |
spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 67108864b | The target post-shuffle input size in bytes of a task. |
spark.sql.autoBroadcastJoinThreshold | 10485760 | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data. |
spark.sql.broadcastTimeout | 300 | Timeout in seconds for the broadcast wait time in broadcast joins. |
spark.sql.cbo.enabled | false | Enables CBO for estimation of plan statistics when set true. |
spark.sql.cbo.joinReorder.dp.star.filter | false | Applies star-join filter heuristics to cost based join enumeration. |
spark.sql.cbo.joinReorder.dp.threshold | 12 | The maximum number of joined nodes allowed in the dynamic programming algorithm. |
spark.sql.cbo.joinReorder.enabled | false | Enables join reorder in CBO. |
spark.sql.cbo.starSchemaDetection | false | When true, it enables join reordering based on star schema detection. |
spark.sql.columnNameOfCorruptRecord | _corrupt_record | The name of internal column for storing raw/un-parsed JSON and CSV records that fail to parse. |
spark.sql.crossJoin.enabled | false | When false, we will throw an error if a query contains a cartesian product without explicit CROSS JOIN syntax. |
spark.sql.extensions | | Name of the class used to configure Spark Session extensions. The class should implement Function1[SparkSessionExtension, Unit], and must have a no-args constructor. |
spark.sql.files.ignoreCorruptFiles | false | Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. |
spark.sql.files.maxPartitionBytes | 134217728 | The maximum number of bytes to pack into a single partition when reading files. |
spark.sql.files.maxRecordsPerFile | 0 | Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit. |
spark.sql.groupByAliases | true | When true, aliases in a select list can be used in group by clauses. When false, an analysis exception is thrown in the case. |
spark.sql.groupByOrdinal | true | When true, the ordinal numbers in group by clauses are treated as the position in the select list. When false, the ordinal numbers are ignored. |
spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema instead of inferring). |
spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enable caching of partition file metadata in memory. All tables share a cache that can use up to specified num bytes for file metadata. This conf only has an effect when hive filesource partition management is enabled. |
spark.sql.hive.manageFilesourcePartitions | true | When true, enable metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partition in the Hive metastore, and use the metastore to prune partitions during query planning. |
spark.sql.hive.metastorePartitionPruning | true | When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information). |
spark.sql.hive.thriftServer.singleSession | false | When set to true, Hive Thrift server is running in a single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database. |
spark.sql.hive.verifyPartitionPath | false | When true, check all the partition paths under the table's root directory when reading data stored in HDFS. |
spark.sql.optimizer.metadataOnly | true | When true, enable the metadata-only query optimization that uses the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics. |
spark.sql.orc.filterPushdown | false | When true, enable filter pushdown for ORC files. |
spark.sql.orderByOrdinal | true | When true, the ordinal numbers are treated as the position in the select list. When false, the ordinal numbers in order/sort by clause are ignored. |
spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. |
spark.sql.parquet.cacheMetadata | true | Turns on caching of Parquet schema metadata. Can speed up querying of static data. |
spark.sql.parquet.compression.codec | snappy | Sets the compression codec to use when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo. |
spark.sql.parquet.enableVectorizedReader | true | Enables vectorized parquet decoding. |
spark.sql.parquet.filterPushdown | true | Enables Parquet filter push-down optimization when set to true. |
spark.sql.parquet.int64AsTimestampMillis | false | When true, timestamp values will be stored as INT64 with TIMESTAMP_MILLIS as the extended type. In this mode, the microsecond portion of the timestamp value will be truncated. |
spark.sql.parquet.int96AsTimestamp | true | Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems. |
spark.sql.parquet.mergeSchema | false | When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available. |
spark.sql.parquet.respectSummaryFiles | false | When true, we assume that all part-files of Parquet are consistent with summary files and we will ignore them when merging schema. Otherwise, if this is false, which is the default, we will merge all part-files. This should be considered an expert-only option, and shouldn't be enabled before knowing what it means exactly. |
spark.sql.parquet.writeLegacyFormat | false | Whether to follow Parquet's format specification when converting Parquet schema to Spark SQL schema and vice versa. |
spark.sql.pivotMaxValues | 10000 | When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error. |
spark.sql.session.timeZone | Etc/UTC | The ID of session local timezone, e.g. "GMT", "America/Los_Angeles", etc. |
spark.sql.shuffle.partitions | 80 | The default number of partitions to use when shuffling data for joins or aggregations. |
spark.sql.sources.bucketing.enabled | true | When false, we will treat bucketed tables as normal tables. |
spark.sql.sources.default | parquet | The default data source to use in input/output. |
spark.sql.sources.parallelPartitionDiscovery.threshold | 32 | The maximum number of paths allowed for listing files at driver side. If the number of detected paths exceeds this value during partition discovery, it tries to list the files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and LibSVM data sources. |
spark.sql.sources.partitionColumnTypeInference.enabled | true | When true, automatically infer the data types for partitioned columns. |
spark.sql.statistics.fallBackToHdfs | false | If table statistics are not available from the table metadata, fall back to HDFS. This is useful in determining if a table is small enough to use auto broadcast joins. |
spark.sql.streaming.checkpointLocation | | The default location for storing checkpoint data for streaming queries. |
spark.sql.streaming.metricsEnabled | false | Whether Dropwizard/Codahale metrics will be reported for active streaming queries. |
spark.sql.streaming.numRecentProgressUpdates | 100 | The number of progress updates to retain for a streaming query |
spark.sql.thriftserver.scheduler.pool | | Set a Fair Scheduler pool for a JDBC client session. |
spark.sql.thriftserver.ui.retainedSessions | 200 | The number of SQL client sessions kept in the JDBC/ODBC web UI history. |
spark.sql.thriftserver.ui.retainedStatements | 200 | The number of SQL statements kept in the JDBC/ODBC web UI history. |
spark.sql.variable.substitute | true | This enables substitution using syntax like ${var} ${system:var} and ${env:var}. |
spark.sql.warehouse.dir | file:/home/buildbot/datacalc/spark-warehouse/ | The default location for managed databases and tables. |
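
Several values in the table are raw byte counts, which are hard to read at a glance. The snippet below is a minimal sketch that converts the byte-valued settings listed above into MiB. The names `BYTE_SETTINGS`, `to_mib` and `describe` are made up for this illustration and are not part of Spark; the values are copied from the table as printed.

```python
# Byte-valued settings copied from the table above, exactly as printed by SET -v.
# BYTE_SETTINGS, to_mib and describe are illustrative names, not Spark APIs.
BYTE_SETTINGS = {
    "spark.sql.adaptive.shuffle.targetPostShuffleInputSize": "67108864b",
    "spark.sql.autoBroadcastJoinThreshold": "10485760",
    "spark.sql.files.maxPartitionBytes": "134217728",
    "spark.sql.hive.filesourcePartitionFileCacheSize": "262144000",
}


def to_mib(raw):
    """Convert a byte count such as '67108864b' or '10485760' to MiB."""
    digits = raw.rstrip("bB")  # drop the optional trailing 'b' unit suffix
    return int(digits) / (1024 * 1024)


def describe(settings):
    """Return a dict mapping each key to a human-readable size string."""
    return {key: "{:g} MiB".format(to_mib(value)) for key, value in settings.items()}


if __name__ == "__main__":
    for key, size in describe(BYTE_SETTINGS).items():
        print(key, "=", size)
```

With the defaults shown here, the broadcast threshold works out to 10 MiB and the maximum file partition size to 128 MiB. To override a setting for the current session, PySpark's `spark.conf.set(key, value)` can be used, for example `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")` to disable broadcasting as described in the table.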
Other Spark SQL configuration:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
https://github.com/unnunique/Conclusions/blob/master/AADocs/bigdata-docs/compute-components-docs/sparkbasic-docs/standalone.md