Summary

  • Spark does not have a good mechanism to select reasonable RDDs to cache their partitions in limited memory.  --> Propose a novel selection algorithm, by which Spark can automatically select the RDDs to cache their partitions in memory according to the number of use for RDDs.   --> speeds up iterative computations.

  • Spark use least recently used (LRU) replacement algorithm to evict RDDs, which only consider the usage of the RDDs.  --> a novel replacement algorithm called weight replacement (WR) algorithm, which takes comprehensive consideration of the partitions computation cost, the number of use for partitions, and the sizes of the partitions.

Preliminary Information

Cache mechanism in Spark

  • When RDD partitions have been cached in memory during the iterative computation, an operation which needs the partitions will get them by CacheManager.
  • All operations including reading or caching in CacheManager mainly depend on the API of BlockManager. BlockManager decides whether partitions are obtained from memory or disks.

Scheduling model

  • The LRU algorithm only considers whether those partitions are recently used while ignores the partitions computation cost and the sizes of the partitions.
  • The number of use for partitions can be known from the DAG before tasks are performed.
    Let Nij be the number of use of j-th partition of RDDi.
    Let Sij be the size of j-th partition or RDDi.
  • The computation time is also an important part. --> Each partition of RDDi starting time STij and finishing time FTij can roughly express its execution and communication time.
    Consider the computation cost of partition as Costj = FTij - STij.
  • After that, we set up a scheduling model and obtain the weight of Pij, which can be expressed as:

    where k is the correction parameter, and it's set to a constant.
  • Finally, we assume that there are h partitions in RDDi, so the weight of RDDi is:

Proposed Algorithm

Selection algorithm

  • For a given DAG graph,we can get the num of uses for each RDD, expressioned as NRDDi.
  • The pseudocode:

Replacement algorithm

  • In this paper, we use weight of partition to evaluate the importance of the partitions.
  • When many partitions are cached in memory, we use QuickSort algorithm to sort the partitions according to the value of the partitions.
  • The pseudocode:

Experiments

  • five servers, six virtual machines, each vm has 100G disk, 2.5GHZ and runs Ubuntu 12.04 operation system while memory is variable, and we set it as 1G, 2G, or 4G in different conditions.
  • Hadoop 2.10.4 and Spark-1.1.0.
  • use ganglia to observe the memory usage.
  • use pageRank algorithm to do expirement, it's iterative.

[Paper] Selection and replacement algorithm for memory performance improvement in Spark的更多相关文章

  1. Partitioned Replacement for Cache Memory

    In a particular embodiment, a circuit device includes a translation look-aside buffer (TLB) configur ...

  2. Flash-aware Page Replacement Algorithm

    1.Abstract:(1)字体太乱,单词中有空格(2) FAPRA此名词第一出现时应有“ FAPRA(Flash-aware Page Replacement Algorithm)”说明. 2.in ...

  3. Inside Amazon's Kafkaesque "Performance Improvement Plans"

    Amazon CEO and brilliant prick Jeff Bezos seems to have lost his magic touch lately. Investors, empl ...

  4. Hive-Container killed by YARN for exceeding memory limits. 9.2 GB of 9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task times, most recen ...

  5. Spring Boot Memory Performance

    The Performance Zone is brought to you in partnership with New Relic. Quickly learn how to use Docke ...

  6. 计算机系统结构总结_Memory Hierarchy and Memory Performance

    Textbook: <计算机组成与设计——硬件/软件接口>  HI <计算机体系结构——量化研究方法>       QR 这是youtube上一个非常好的memory syst ...

  7. PatentTips - Control register access virtualization performance improvement

    BACKGROUND OF THE INVENTION A conventional virtual-machine monitor (VMM) typically runs on a compute ...

  8. SQL Performance Improvement Techniques(转)

    原文地址:http://www.codeproject.com/Tips/1023621/SQL-Performance-Improvement-Techniques This article pro ...

  9. Ceilometer Polling Performance Improvement

    Ceilometer的数据采集agent会定期对nova/keystone/neutron/cinder等服务调用其API的获取信息,默认是20秒一次, # Polling interval for ...

随机推荐

  1. android--------自定义控件 之 基本实现篇

    前面简单的讲述了Android中自定义控件中的几个方法,今天通过代码来实现一个简单的案例 自定义一个扇形图 自定义控件示例: 这里先介绍继承View的方式为例 public class Circula ...

  2. android--------内存泄露分析工具—Android Monitor

    Android Studio 内置了四种性能监测工具Memory Monitor.Network Monitor.CPU Monitor.GPU Monitor,我们可以使用这些工具监测APP的状态, ...

  3. 如何阻止div中的子div触发div的事件

    <div class="sideFrame" v-on:click="hideside"> <div class="sideFram ...

  4. Vrrp和Hsrp的区别

    VRRP原理协议简述简单来说,VRRP是一种容错协议,它为具有组播或广播能力的局域网(如以太网)设计,它保证当局域网内主机的下一跳路由器出现故障时,可以及时的由另一台路由器来代替,从而保持通讯的连续性 ...

  5. 非常可乐 HDU - 1495

    大家一定觉的运动以后喝可乐是一件很惬意的事情,但是seeyou却不这么认为.因为每次当seeyou买了可乐以后,阿牛就要求和seeyou一起分享这一瓶可乐,而且一定要喝的和seeyou一样多.但see ...

  6. luffy后端之跨域corf的解决方法

    跨域CORS 我们现在为前端和后端分别设置两个不同的域名 window 系统: C:\Windows\System32\drivers\etc\host linux/mac系统: /etc/hosts ...

  7. C/S,B/S的区别与联系

    C/S 是Client/Server 的缩写.服务器通常采用高性能的PC.工作站或小型机,并采用 大型数据库系统,如Oracle.Sybase.Informix 或SQL Server.客户端需要安装 ...

  8. 深入理解php内核

    目录 第一部分 基本原理 第一章 准备工作和背景知识 第一节 环境搭建 第二节 源码布局及阅读方法 第三节 常用代码 第四节 小结 第二章 用户代码的执行 第一节 PHP生命周期 第二节 从SAPI开 ...

  9. visio画等分树状图

    一 树状图形状 Search里搜索Tree,找到Double Tree或者Multi Tree的形状 二 分出更多branch 按住主干上的黄色小方块,拖出更多分支. 三 等分分支 将每个分支和对应的 ...

  10. python中的apscheduler模块

    1.简介 apscheduler是python中的任务定时模块,它包含四个组件:触发器(trigger),作业存储(job store),执行器(executor),调度器(scheduler). 2 ...