Parallelism , Partitioner

转：spark通过合理设置spark.default.parallelism参数提高执行效率

spark中有partition的概念（和slice是同一个概念，在spark1.2中官网已经做出了说明），一般每个partition对应一个task。在我的测试过程中，如果没有设置spark.default.parallelism参数，spark计算出来的partition非常巨大，与我的cores非常不搭。我在两台机器上（8cores *2 +6g * 2）上，spark计算出来的partition达到2.8万个，也就是2.9万个tasks，每个task完成时间都是几毫秒或者零点几毫秒，执行起来非常缓慢。在我尝试设置了 spark.default.parallelism 后，任务数减少到10，执行一次计算过程从minute降到20second。

参数可以通过spark_home/conf/spark-default.conf配置文件设置。

eg.

spark.master spark://master:7077

spark.default.parallelism 10

spark.driver.memory 2g

spark.serializer org.apache.spark.serializer.KryoSerializer

spark.sql.shuffle.partitions 50

Property Name Default Meaning

Property Name	Default	Meaning
`spark.default.parallelism`	For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like`parallelize` with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine Mesos fine grained mode: 8 Others: total number of cores on all executor nodes or 2, whichever is larger	Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations likeparallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

from:http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config propertyspark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

原文地址：http://www.cnblogs.com/wrencai/p/4231966.html

Parallelism , Partitioner的更多相关文章

Concurrency != Parallelism
前段时间在公司给大家分享GO语言的一些特性,然后讲到了并发概念,大家表示很迷茫,然后分享过程中我拿来了Rob Pike大神的Slides <Concurrency is not Parallel ...
Hadoop学习笔记—9.Partitioner与自定义Partitioner
一.初步探索Partitioner 1.1 再次回顾Map阶段五大步骤在第四篇博文<初识MapReduce>中,我们认识了MapReduce的八大步凑,其中在Map阶段总共五个步骤,如下 ...
parallelism
COMPUTER ORGANIZATION AND ARCHITECTURE DESIGNING FOR PERFORMANCE NINTH EDITION Traditionally, the co ...
Concurrency vs. Parallelism
http://getakka.net/docs/concepts/terminology Terminology and Concepts In this chapter we attempt to ...
Max Degree of Parallelism最大并行度配置
由于公司的业务在急速增长中,发现数据库服务器已经基本撑不住这么多并发.一方面,要求开发人员调整并发架构,利用缓存减少查询.一方面从数据库方面改善并发.数据库的并行度可设置如下: 1)cost thre ...
MapReduce中的分区方法Partitioner
在进行MapReduce计算时,有时候需要把最终的输出数据分到不同的文件中,比如按照省份划分的话,需要把同一省份的数据放到一个文件中:按照性别划分的话,需要把同一性别的数据放到一个文件中.我们知道最终 ...
Spark自定义分区(Partitioner)
我们都知道Spark内部提供了HashPartitioner和RangePartitioner两种分区策略,这两种分区策略在很多情况下都适合我们的场景.但是有些情况下,Spark内部不能符合咱们的需求 ...
MapReduce框架Partitioner分区方法
前言:对于二次排序相信大家也是似懂非懂,我也是一样,对其中的很多方法都不理解诶,所有只有暂时放在一边,当你接触到其他的函数,你知道的越多时你对二次排序的理解也就更深入了,同时建议大家对wordcoun ...
Partitioner没有被调用的情况
map的输出,通过分区函数决定要发往哪个reducer. 有2种情况,我们自定义的Partitioner不会被调用 1) reducer个数为0 这种情况,没有reducer,不需要分区 2) red ...

随机推荐

RS报表设计采用Total汇总过滤出错
错误信息: DMR 子查询计划失败,并产生意外错误.: java.lang.NullPointerException 如图原因是在RS过滤器中添加了: total([门诊人次] for [明细科室] ...
数据库:mongodb与关系型数据库相比的优缺点zz (转)
与关系型数据库相比,MongoDB的优点:①弱一致性(最终一致),更能保证用户的访问速度:举例来说,在传统的关系型数据库中,一个COUNT类型的操作会锁定数据集,这样可以保证得到“当前”情况下的精确值 ...
时间记录APP———Time Meter
关注过时间管理的人可能都听过大名鼎鼎的柳比歇夫的时间记录法,在几年前,大多人都推荐纸笔的记录方法,但是纸笔总是会忘,越来越智能的手机可是总不会忘得,所以我始终在寻找一款手机端好用的APP. 不管是时间 ...
DAS存储未死，再次欲获重生
当我们回想存储发展形态时.我们会想到DAS是最原始最主要的存储方式,在个人电脑.server等低端市场和场景上随处可见. 存储的原始形态也来自于DAS,常见的用于连接DAS和主机系统的协议/标准主要 ...
Linux 监测网络常用的工具sar iftop netstat ping nping fping mtr
Linux 监测网络常用的工具sar iftop netstat ping nping fping mtr # sar -n DEV 1 2 # iftop # netstat -i # ping n ...
Linux cmp命令——比较二进制文件（转）
Linux cmp命令用于比较两个文件是否有差异. 当相互比较的两个文件完全一样时,则该指令不会显示任何信息.若发现有所差异,预设会标示出第一个不同之处的字符和列数编号.若不指定任何文件名称或是所给予 ...
【laravel54】创建控制器、模型
1.创建控制器(可以带上下一级目录)=>(需要带Controller后缀) > php artisan make:controller self/StudentController; 2. ...
composer error when run composer update
本篇文章由:http://xinpure.com/composer-error-when-run-composer-update/ 错误很多时候即使是常用的命令也会出现一些奇奇怪怪的错误, 难以预知 ...
LeetCode-344：Reverse String
This is a "Pick One" Problem :[Problem:344-Reverse String] Write a function that takes a ...
mybatis WARN No appenders could be found for logger的解决方法
log4j:WARN No appenders could be found for logger (org.apache.ibatis.logging.LogFactory).log4j:WARN ...

Parallelism , Partitioner

Level of Parallelism

Parallelism , Partitioner的更多相关文章

随机推荐

热门专题