Summary

Spark has no good mechanism for choosing which RDDs to cache when memory is limited. --> The paper proposes a selection algorithm by which Spark automatically chooses the RDDs whose partitions are cached in memory, based on how many times each RDD is used. --> This speeds up iterative computations.
Spark uses the least recently used (LRU) replacement algorithm to evict RDD partitions, which considers only how recently they were used. --> The paper proposes a weight replacement (WR) algorithm, which takes into account each partition's computation cost, number of uses, and size.

Preliminary Information

Cache mechanism in Spark

When RDD partitions have been cached in memory during an iterative computation, an operation that needs those partitions gets them through CacheManager.
All CacheManager operations, including reading and caching, rely mainly on the BlockManager API. BlockManager decides whether partitions are obtained from memory or from disk.
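
The lookup path described above can be sketched as a get-or-compute pattern. This is a toy model in Python, not Spark's actual code: `ToyBlockStore` and `ToyCacheManager` are hypothetical stand-ins for BlockManager and CacheManager, and the block id layout is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

BlockId = Tuple[int, int]  # assumed layout: (rdd_id, partition_index)


class ToyBlockStore:
    """Hypothetical stand-in for BlockManager: checks memory first, then disk."""

    def __init__(self) -> None:
        self.memory: Dict[BlockId, List[int]] = {}
        self.disk: Dict[BlockId, List[int]] = {}

    def get(self, block_id: BlockId) -> Optional[Tuple[str, List[int]]]:
        if block_id in self.memory:
            return "memory", self.memory[block_id]
        if block_id in self.disk:
            return "disk", self.disk[block_id]
        return None

    def put(self, block_id: BlockId, data: List[int]) -> None:
        self.memory[block_id] = data


class ToyCacheManager:
    """Hypothetical stand-in for CacheManager: delegates storage to the store."""

    def __init__(self, store: ToyBlockStore) -> None:
        self.store = store

    def get_or_compute(
        self, block_id: BlockId, compute: Callable[[], List[int]]
    ) -> Tuple[str, List[int]]:
        hit = self.store.get(block_id)
        if hit is not None:
            return hit  # served from memory or disk
        data = compute()  # cache miss: recompute the partition
        self.store.put(block_id, data)
        return "computed", data


store = ToyBlockStore()
manager = ToyCacheManager(store)
first = manager.get_or_compute((0, 0), lambda: [1, 2, 3])
second = manager.get_or_compute((0, 0), lambda: [1, 2, 3])
```

The first call computes the partition and caches it; the second call is served from the in-memory store without recomputation.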

Scheduling model

The LRU algorithm considers only whether partitions were recently used; it ignores their computation cost and their size.
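
To make this limitation concrete, here is a small Python sketch of a size-bounded LRU cache. The class, the partition names, and the sizes are all made up for illustration; it is not Spark's memory store.

```python
from collections import OrderedDict


class LRUCache:
    """Toy size-bounded LRU cache; entries are ordered oldest first."""

    def __init__(self, capacity_mb: int) -> None:
        self.capacity_mb = capacity_mb
        self.entries: "OrderedDict[str, int]" = OrderedDict()  # name -> size in MB
        self.evicted = []

    def access(self, name: str, size_mb: int) -> None:
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return
        # Evict the least recently used entries until the new one fits.
        while self.entries and sum(self.entries.values()) + size_mb > self.capacity_mb:
            victim, _ = self.entries.popitem(last=False)
            self.evicted.append(victim)
        self.entries[name] = size_mb


cache = LRUCache(capacity_mb=100)
cache.access("expensive", 50)  # imagine this one is costly to recompute
cache.access("cheap_a", 50)
cache.access("cheap_b", 50)    # does not fit, so the oldest entry is evicted
```

The partition labelled `expensive` is evicted simply because it is the oldest, regardless of how costly it would be to recompute or how often it will be needed again.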
The number of uses of each partition can be determined from the DAG before the tasks run.
Let Nij be the number of uses of the j-th partition of RDDi.
Let Sij be the size of the j-th partition of RDDi.
Computation time also matters. --> The starting time STij and finishing time FTij of each partition of RDDi roughly capture its execution and communication time.
Define the computation cost of the partition as Costij = FTij - STij.
From these quantities, we set up a scheduling model and obtain the weight of partition Pij, which can be expressed as:

where k is a correction parameter, set to a constant.
Finally, assuming RDDi has h partitions, the weight of RDDi is:
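
The two formulas did not survive in these notes. One form that is consistent with the definitions above is given below; it is an assumption for illustration, not a quotation from the paper. It assumes the weight grows with computation cost and number of uses, shrinks with size, and that the RDD weight is the sum over its h partitions.

```latex
W_{ij} = k \cdot \frac{Cost_{ij} \cdot N_{ij}}{S_{ij}},
\qquad
W_{RDD_i} = \sum_{j=1}^{h} W_{ij}
```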

Proposed Algorithm

Selection algorithm

For a given DAG, we can obtain the number of uses of each RDD, denoted NRDDi.
The pseudocode:
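
The pseudocode is not reproduced in these notes. As a rough illustration of the idea only, the following Python sketch counts how many times each RDD is read in a DAG and selects those used at least `min_uses` times. The DAG representation, the function names, and the threshold of 2 are assumptions, not the paper's algorithm.

```python
from collections import Counter
from typing import Dict, List


def count_rdd_uses(dag: Dict[str, List[str]]) -> Counter:
    """dag maps each RDD name to the list of parent RDDs it reads."""
    uses: Counter = Counter()
    for parents in dag.values():
        for parent in parents:
            uses[parent] += 1
    return uses


def select_rdds_to_cache(dag: Dict[str, List[str]], min_uses: int = 2) -> List[str]:
    """Select RDDs read at least min_uses times (threshold is an assumption)."""
    uses = count_rdd_uses(dag)
    return sorted(name for name, n in uses.items() if n >= min_uses)


# A made-up two-iteration DAG in the style of an iterative job.
dag = {
    "links": [],
    "ranks0": ["links"],
    "contribs1": ["links", "ranks0"],
    "ranks1": ["contribs1"],
    "contribs2": ["links", "ranks1"],
    "ranks2": ["contribs2"],
}
selected = select_rdds_to_cache(dag)
```

In this example only `links` is read more than once, so it is the only RDD selected for caching.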

Replacement algorithm

In this paper, the weight of a partition is used to evaluate its importance.
When many partitions are cached in memory, the QuickSort algorithm is used to sort them by weight.
The pseudocode:
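
The pseudocode is not reproduced in these notes. The Python sketch below illustrates weight-based eviction under stated assumptions: the weight formula `k * cost * uses / size` is assumed rather than taken from the paper, the lowest-weight partitions are assumed to be evicted first, and Python's built-in `sorted` (Timsort) stands in for QuickSort. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    name: str
    cost: float  # computation cost, FT - ST, in seconds
    uses: int    # number of uses, N
    size: float  # size in MB, S


def weight(p: Partition, k: float = 1.0) -> float:
    # Assumed form of the weight; the original formula is not in these notes.
    return k * p.cost * p.uses / p.size


def choose_victims(cached: List[Partition], needed_mb: float, k: float = 1.0) -> List[str]:
    """Pick lowest-weight partitions until at least needed_mb is freed."""
    victims: List[str] = []
    freed = 0.0
    for p in sorted(cached, key=lambda p: weight(p, k)):
        if freed >= needed_mb:
            break
        victims.append(p.name)
        freed += p.size
    return victims


cached = [
    Partition("a", cost=10.0, uses=3, size=50.0),   # weight 0.6
    Partition("b", cost=2.0, uses=1, size=50.0),    # weight 0.04
    Partition("c", cost=4.0, uses=2, size=100.0),   # weight 0.08
]
victims = choose_victims(cached, needed_mb=120.0)
```

Here the cheap, rarely used partitions `b` and `c` are evicted, while the expensive and frequently used partition `a` stays cached.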

Experiments

Five servers, six virtual machines; each VM has a 100 GB disk and a 2.5 GHz CPU and runs the Ubuntu 12.04 operating system. Memory is variable and is set to 1 GB, 2 GB, or 4 GB depending on the condition.
Hadoop 2.10.4 and Spark-1.1.0.
Ganglia is used to observe memory usage.
The PageRank algorithm is used for the experiment because it is iterative.
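
To show why PageRank is a natural caching benchmark, here is a minimal pure-Python sketch (not Spark code, and not the paper's experiment). The `links` graph is a made-up example; the point is that the same `links` data is read again in every iteration, which is what makes caching it worthwhile.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], iterations: int = 10, damping: float = 0.85) -> Dict[str, float]:
    """links maps each page to the pages it links to; it is reused every iteration."""
    pages = list(links)
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its current rank evenly among its targets.
                contribs[target] += ranks[page] / len(targets)
        ranks = {page: (1 - damping) + damping * contribs[page] for page in pages}
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```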

[Paper] Selection and replacement algorithm for memory performance improvement in Spark
