Abstract

  • Classical strategies are not aware of recovery cost, which can degrade system performance.  -->  A cost-aware eviction strategy can reduce the total recovery cost.
  • A strategy named LCS (Least Cost Strategy)  -->  obtains the dependency information between cached data by analyzing the application, and calculates the recovery cost at runtime. By predicting how many times cached data will be reused and using that count to weight the recovery cost, LCS always evicts the data that leads to the minimum recovery cost in the future.

Introduction

  • Current eviction strategies:

    • FIFO: focuses on creation time.
    • LRU: focuses on access history for a better hit ratio.
  • Many eviction algorithms take the access history and costs of cache items into consideration. For Spark, however, the execution logic of the upcoming phase is known, so access history does not help the eviction strategy.
  • LCS has three steps:
    1. Obtains the RDD dependencies by analyzing the application, and predicts how many times cache partitions will be reused.
    2. Collects information during partition creation, and predicts the recovery cost.
    3. Maintains the eviction order using the above two pieces of information, and evicts the partition that incurs the least cost when memory is insufficient.
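
Step 1 can be illustrated with a small sketch. It assumes the analyzer already knows, for each upcoming stage, which cached RDDs that stage reads; the `stage_reads` list and the RDD names are hypothetical and not taken from the paper. Under that assumption, the predicted reuse count of a cached RDD is the number of later stages that read it.

```python
from collections import Counter


def remaining_reuses(stage_reads, current_stage):
    """Count how many later stages read each cached RDD.

    stage_reads: list indexed by stage id; each entry lists the cached
    RDD ids that the stage reads (hypothetical input format).
    """
    counts = Counter()
    for reads in stage_reads[current_stage + 1:]:
        # A stage counts once per RDD, even if it lists the RDD twice.
        counts.update(set(reads))
    return counts


# Hypothetical job with four stages.
stage_reads = [["a"], ["a", "b"], ["b"], ["a", "b"]]
reuse = remaining_reuses(stage_reads, current_stage=0)
```

An RDD with a remaining count of zero has no future use, so evicting it costs nothing in recovery.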

Design and Implementation

Overall Architecture

  • Three necessary steps:
    1. The Analyzer in the driver node analyzes the application using the DAG structures provided by DAGScheduler.
    2. The Collector in each executor node records information about each cache partition during its creation.
    3. Eviction Decision provides an efficient eviction strategy that evicts the optimal set of cache partitions when the remaining memory for cache storage is insufficient, and decides whether to remove each one from MemoryStore or serialize it to DiskStore.
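
As a rough illustration of the third step, the sketch below picks victims in ascending order of a weighted cost until enough memory is freed. The greedy ordering, the tuple layout, and the partition names are assumptions made for the example; the notes only state that an optimal partition set is evicted, not how it is computed.

```python
def pick_victims(partitions, bytes_needed):
    """Pick partitions in ascending weighted-cost order until enough is freed.

    partitions: list of (name, size, weighted_cost) tuples (hypothetical layout).
    Returns the chosen names and the total size freed.
    """
    victims = []
    freed = 0
    for name, size, _cost in sorted(partitions, key=lambda p: p[2]):
        if freed >= bytes_needed:
            break
        victims.append(name)
        freed += size
    return victims, freed


# Hypothetical cached partitions: (name, size, weighted cost).
partitions = [("rdd_3_0", 64, 9.0), ("rdd_1_2", 32, 1.5), ("rdd_2_1", 48, 4.0)]
victims, freed = pick_victims(partitions, bytes_needed=70)
```

The most expensive partition, `rdd_3_0`, stays in memory because the two cheaper ones already free enough space.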

Analyzer

  • DAG start points:

    • DFS: files on it can be read directly from local or remote disk;
    • ShuffledRDD: it can be generated by fetching remote shuffle data.

  These starting points bound the longest running path of a task: when all the cache RDDs are missing, the task needs to run from the starting points. (When a cache RDD is available, the task only needs to run the part of the path after that cache RDD, as determined by the dependencies between RDDs.)
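
  The effect of a cached ancestor on the running path can be sketched as a walk over the lineage. The `parents` dictionary and the RDD names are hypothetical; the walk stops at any ancestor that is still cached and otherwise continues back to a starting point.

```python
def rdds_to_compute(target, parents, cached):
    """List the RDDs that must be computed to rebuild `target`.

    parents: dict mapping an RDD id to the ids it depends on; a starting
             point (DFS file or shuffle data) has no parents.
    cached:  set of RDD ids currently available in the cache.
    """
    needed = []
    seen = set()
    stack = [target]
    while stack:
        rdd = stack.pop()
        if rdd in seen:
            continue
        seen.add(rdd)
        needed.append(rdd)
        for parent in parents.get(rdd, []):
            # A cached ancestor cuts the path: nothing above it is recomputed.
            if parent not in cached:
                stack.append(parent)
    return needed


# Hypothetical linear lineage: dfs_file -> map1 -> filter1 -> map2
parents = {
    "map2": ["filter1"],
    "filter1": ["map1"],
    "map1": ["dfs_file"],
    "dfs_file": [],
}
full_path = rdds_to_compute("map2", parents, cached=set())
short_path = rdds_to_compute("map2", parents, cached={"filter1"})
```

  With nothing cached, the task runs the whole path from the DFS file; with `filter1` cached, only `map2` has to be computed.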

  • The aim of the Analyzer is to classify cache RDDs and analyze the dependency information between them before each stage runs.
  • The Analyzer runs only in the driver node and transfers its result to executors when the driver schedules tasks to them.
  • By pre-registering each RDD that needs to be unpersisted and checking whether it is used in each stage, we put it in the RemovableRDDs list of the last stage that uses it. A removable partition can be evicted directly, so it will not waste memory.
  • The cache RDDs of a stage are classified into:
    • currently running cache RDDs (targetCacheRDDs)
    • RDDs that participate in the current stage (relatedCacheRDDs)
    • other cache RDDs
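
The two bookkeeping tasks above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names and input shapes are hypothetical, "target" is read as the cached RDDs the stage is materializing, and "related" as the other cached RDDs the stage reads.

```python
def removable_by_stage(stage_reads, to_unpersist):
    """Map each stage id to the pre-registered RDDs whose last use is that stage."""
    last_use = {}
    for stage_id, reads in enumerate(stage_reads):
        for rdd in reads:
            if rdd in to_unpersist:
                last_use[rdd] = stage_id  # later stages overwrite earlier ones
    removable = {}
    for rdd, stage_id in last_use.items():
        removable.setdefault(stage_id, []).append(rdd)
    return removable


def classify(all_cached, stage_targets, stage_inputs):
    """Split cached RDDs into target, related, and other sets for one stage."""
    target = all_cached & stage_targets
    related = (all_cached & stage_inputs) - target
    other = all_cached - target - related
    return target, related, other


# Hypothetical job: stage id -> cached RDDs read by that stage.
stage_reads = [["a"], ["a", "b"], ["b", "c"]]
removable = removable_by_stage(stage_reads, to_unpersist={"a", "b"})

target, related, other = classify(
    {"a", "b", "c", "d"}, stage_targets={"d"}, stage_inputs={"b", "c"}
)
```

Here `a` becomes removable after stage 1 and `b` after stage 2, while `c` is never listed because it was not registered for unpersisting.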

Collector

  • The Collector collects information about each cache partition while tasks run.
  • Information that needs to be observed:
    • Create cost: the time spent creating the partition, called Ccreate.
    • Eviction cost: the time it takes to evict a partition from memory, called Ceviction. (If the partition is serialized to disk, the eviction cost is the time spent on serializing and writing to disk, denoted as Cser. If it is removed directly, the eviction cost is 0.)
    • Recovery cost: the time it takes to restore partition data that is not found in memory, called Crecovery. If the partition was serialized to disk, the recovery cost is the time spent reading from disk and deserializing, denoted as Cdeser. Otherwise, the partition is recomputed from lineage information, and the cost is denoted as Crecompute.
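
The case analysis for the eviction and recovery costs can be written down directly. The `PartitionCosts` record and its field names are hypothetical; only the two rules (Cser or 0 on eviction, Cdeser or Crecompute on recovery) come from the notes.

```python
from dataclasses import dataclass


@dataclass
class PartitionCosts:
    """Timings observed for one cache partition (hypothetical record, in seconds)."""
    c_create: float
    c_ser: float        # serialize and write to disk
    c_deser: float      # read from disk and deserialize
    c_recompute: float  # recompute from lineage


def eviction_cost(costs, to_disk):
    """Ceviction: Cser if the partition is serialized to disk, else 0."""
    return costs.c_ser if to_disk else 0.0


def recovery_cost(costs, on_disk):
    """Crecovery: Cdeser if the partition is on disk, else Crecompute."""
    return costs.c_deser if on_disk else costs.c_recompute


# Hypothetical measurements for one partition.
costs = PartitionCosts(c_create=8.0, c_ser=2.0, c_deser=1.0, c_recompute=8.0)
spill_total = eviction_cost(costs, to_disk=True) + recovery_cost(costs, on_disk=True)
drop_total = eviction_cost(costs, to_disk=False) + recovery_cost(costs, on_disk=False)
```

For these made-up numbers, spilling to disk and reading back once (3.0) is cheaper than dropping the partition and recomputing it once (8.0).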

Eviction Decision

  • Using the information provided by the Collector, each cache partition is assigned a WCPM value:
    WCPM = min(CPM * reus, SPM + DPM * reus)
    CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size
    SPM refers to serialization, DPM refers to deserialization, and reus refers to reusability.
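
A small worked example of the two formulas follows. It reads the first argument of the `min` as "remove and recompute on every reuse" and the second as "serialize once, deserialize on every reuse"; that reading, the function names, and the numbers are assumptions. The notes do not say when CPMrenew is applied, so the sketch only evaluates the formula as written.

```python
def wcpm(cpm, spm, dpm, reus):
    """Return (WCPM, cheaper action) for one cache partition."""
    recompute_total = cpm * reus
    serialize_total = spm + dpm * reus
    if recompute_total <= serialize_total:
        return recompute_total, "remove"
    return serialize_total, "serialize"


def renewed_cpm(cpm, size, cpm_ancestor, size_ancestor):
    """CPMrenew = (CPMancestor * sizeancestor + CPM * size) / size."""
    return (cpm_ancestor * size_ancestor + cpm * size) / size


# Hypothetical values: cheap to recompute, so removing wins (2.0 vs 5.0).
cheap = wcpm(cpm=1.0, spm=3.0, dpm=1.0, reus=2)
# Expensive to recompute, so serializing wins (8.0 vs 5.0).
costly = wcpm(cpm=4.0, spm=3.0, dpm=1.0, reus=2)
# (2.0 * 10.0 + 4.0 * 5.0) / 5.0
renewed = renewed_cpm(cpm=4.0, size=5.0, cpm_ancestor=2.0, size_ancestor=10.0)
```

The partition with the smallest WCPM is the cheapest one to lose, and the branch that produced the minimum indicates whether it would be removed or serialized.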

Evaluation

Evaluation Environment and Method

  • PR, CC, KMeans algorithms...
  • LCS is compared with LRU and FIFO.

[Paper] LCS: An Efficient Data Eviction Strategy for Spark