spark通过合理设置spark.default.parallelism参数提高执行效率

spark中有partition的概念（和slice是同一个概念，在spark1.2中官网已经做出了说明），一般每个partition对应一个task。在我的测试过程中，如果没有设置spark.default.parallelism参数，spark计算出来的partition非常巨大，与我的cores非常不搭。我在两台机器上（8cores *2 +6g * 2）上，spark计算出来的partition达到2.8万个，也就是2.9万个tasks，每个task完成时间都是几毫秒或者零点几毫秒，执行起来非常缓慢。在我尝试设置了 spark.default.parallelism 后，任务数减少到10，执行一次计算过程从minute降到20second。

参数可以通过spark_home/conf/spark-default.conf配置文件设置。

eg.

 spark.master                       spark://master:7077

 spark.default.parallelism

 spark.driver.memory                2g

 spark.serializer                   org.apache.spark.serializer.KryoSerializer

 spark.sql.shuffle.partitions

Property Name Default Meaning

Property Name	Default	Meaning
`spark.default.parallelism`	For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like`parallelize` with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine Mesos fine grained mode: 8 Others: total number of cores on all executor nodes or 2, whichever is larger	Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations likeparallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

from:http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config propertyspark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

Eclipse:设置自动补全，提高编程效率
一.设置自动补全 1.进入eclipse的window里的perferences页面 2.找到java->Editor->Content Assist设置界面 3.在Auto activa ...
spark系列-7、spark调优
官网说明:http://spark.apache.org/docs/2.1.1/tuning.html#data-serialization 一.JVM调优 1.1.Java虚拟机垃圾回收调优的背景 ...
spark提交命令 spark-submit 的参数 executor-memory、executor-cores、num-executors、spark.default.parallelism分析
转载:https://blog.csdn.net/zimiao552147572/article/details/96482120 nohup spark-submit --master yarn - ...
spark.sql.shuffle.partitions和spark.default.parallelism的区别
在关于spark任务并行度的设置中,有两个参数我们会经常遇到,spark.sql.shuffle.partitions 和 spark.default.parallelism, 那么这两个参数到底有什 ...
[Spark]What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used ...
streaming优化：spark.default.parallelism调整处理并行度
官方是这么说的: Cluster resources can be under-utilized if the number of parallel tasks used in any stage o ...
【Spark调优】提交job资源参数调优
[场景] Spark提交作业job的时候要指定该job可以使用的CPU.内存等资源参数,生产环境中,任务资源分配不足会导致该job执行中断.失败等问题,所以对Spark的job资源参数分配调优非常重要 ...
【Spark调优】内存模型与参数调优
[Spark内存模型] Spark在一个executor中的内存分为3块:storage内存.execution内存.other内存. 1. storage内存:存储broadcast,cache,p ...
[spark]-Spark2.x集群搭建与参数详解
在前面的Spark发展历程和基本概念中介绍了Spark的一些基本概念,熟悉了这些基本概念对于集群的搭建是很有必要的.我们可以了解到每个参数配置的作用是什么.这里将详细介绍Spark集群搭建以及xml参 ...

随机推荐

Bitcoin 比特币, LTC（litecoin）莱特币,
1.挖掘机 http://asicme.com/index 2.官方: http://bitcoin.org/zh_CN/ =============== 1\ 最好的比特币资讯网站 ht ...
tengine + lua 实现流量拷贝
环境搭建参考地址:http://www.cnblogs.com/cp-miao/p/7505910.html cp.lua local res1, res2, action action = ngx. ...
cs-SelectTree-DropTreeNode, SelectTreeList
ylbtech-Unitity: cs-SelectTree-DropTreeNode, SelectTreeList DropTreeNode.cs SelectTreeList.cs 1.A,效果 ...
阿里云RDS(云数据库)之产品简介
参考阿里产品文档:https://docs.aliyun.com/?spm=5176.100054.3.1.ywnrMX#/pub/rds/product-introduce/overview& ...
JavaScript basics: 2 ways to get child elements with JavaScript
原文: https://blog.mrfrontend.org/2017/10/2-ways-get-child-elements-javascript/ Along the lines of oth ...
Python下opencv使用笔记（二）（简单几何图像绘制）
简单几何图像一般包含点.直线.矩阵.圆.椭圆.多边形等等.首先认识一下opencv对像素点的定义. 图像的一个像素点有1或者3个值.对灰度图像有一个灰度值,对彩色图像有3个值组成一个像素值.他们表现出 ...
正则表达式表示 ja.resx 所在行
[^\n]*ja.resx[^\n]*\n?正则表达式表示 ja.resx 所在行用ultraEdit 删除关键字所在行的下一行或是上一行,所在行保留删除关键字所在行的前3行: (^.*?(\ ...
倍福TwinCAT(贝福Beckhoff)应用教程11.1 TwinCAT应用小程序1 如何读写数字量模拟量输入输出(DI,DO,AI,AO)
常见的模拟量模块(还有更高端和更低端的,使用方法都一样) EL3054和EL4024(4路模拟量输入和输出模块) 常见的数字量模块(还有更高端和更低端的,使用方法都一样) EL1809和EL280 ...
[转]SQL Server 性能调优（io）
目录诊断磁盘io问题常见的磁盘问题容量替代了性能负载隔离配置有问题分区对齐配置有问题总结关于io这一块,前面的东西如磁盘大小,磁盘带宽,随机读取写入,顺序读取写入,raid选择,DA ...
UE4 场景展示Demo
使用到的level blueprint如下:

spark通过合理设置spark.default.parallelism参数提高执行效率

Level of Parallelism

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

随机推荐

热门专题