Spark persistence mechanism

Spark's persistence mechanism is relatively implicit: there is no explicit call that triggers the persistence itself.

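First, rdd.persist(newLevel: StorageLevel) assigns a StorageLevel to the RDD. As with checkpoint, this call does not persist anything by itself. The actual persistence happens in the first subsequent action, when the iterator method is called: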

  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

An RDD on which persist has been called has a StorageLevel other than NONE, so the getOrCompute branch is taken:

    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    }
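The lazy behaviour described above can be modelled without Spark. The following is a minimal sketch, not Spark's implementation: MiniRDD, NoLevel, MemoryOnly and computeCount are hypothetical names introduced only for illustration, and the "RDD" has a single partition held in a plain Seq.

```scala
// Simplified, Spark-free model of RDD.persist and RDD.iterator.
// All names here are hypothetical stand-ins, not Spark classes.
object LazyPersistSketch {
  sealed trait Level
  case object NoLevel extends Level    // plays the role of StorageLevel.NONE
  case object MemoryOnly extends Level // plays the role of a real storage level

  class MiniRDD[T](compute: () => Seq[T]) {
    private var storageLevel: Level = NoLevel
    private var cached: Option[Seq[T]] = None
    var computeCount: Int = 0 // how many times the partition was computed

    // Like rdd.persist: only records the level, stores nothing yet.
    def persist(level: Level): this.type = {
      storageLevel = level
      this
    }

    def isCached: Boolean = cached.isDefined

    // Like rdd.iterator: dispatches on the storage level.
    def iterator(): Iterator[T] =
      if (storageLevel != NoLevel) getOrCompute() else computeNow().iterator

    private def computeNow(): Seq[T] = {
      computeCount += 1
      compute()
    }

    // First call computes and stores the data; later calls read the stored copy.
    private def getOrCompute(): Iterator[T] = {
      if (cached.isEmpty) cached = Some(computeNow())
      cached.get.iterator
    }
  }
}
```

In this model, calling persist leaves isCached false; only the first iterator() call stores the data, and later calls reuse it without recomputing.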

  /**
    * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
    */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    // TODO: relationship between block and partition
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

Inside getOrCompute, blockManager.getOrElseUpdate is called to read the block and, if it is not stored yet, to compute and persist it:

    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    })
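The Left/Right result and the readCachedBlock flag are easier to follow in a reduced model. The sketch below is an assumption-laden illustration, not BlockManager's code: MiniBlockManager, its element-count capacity and the read helper are hypothetical, and blocks are fully materialized into a Vector, which the real store avoids.

```scala
import scala.collection.mutable

// Spark-free sketch of the getOrElseUpdate contract used by getOrCompute.
// MiniBlockManager and read are hypothetical names for illustration only.
object GetOrElseUpdateSketch {
  // Toy block store whose capacity is counted in elements, not bytes.
  class MiniBlockManager(capacity: Int) {
    private val blocks = mutable.Map.empty[String, Vector[Any]]
    private var used = 0

    // Left(values): the block is stored (already or just now) and read from the store.
    // Right(iter): the block did not fit, so the computed values are handed back as-is.
    def getOrElseUpdate[T](
        blockId: String,
        makeIterator: () => Iterator[T]): Either[Vector[T], Iterator[T]] =
      blocks.get(blockId) match {
        case Some(values) => Left(values.asInstanceOf[Vector[T]])
        case None =>
          val values = makeIterator().toVector
          if (used + values.size <= capacity) {
            blocks(blockId) = values
            used += values.size
            Left(values)
          } else {
            Right(values.iterator)
          }
      }
  }

  // Mirrors the shape of getOrCompute: the flag flips only when the
  // compute closure actually runs, i.e. on a cache miss.
  def read[T](
      bm: MiniBlockManager,
      blockId: String,
      compute: () => Iterator[T]): (List[T], Boolean) = {
    var readCachedBlock = true
    val result = bm.getOrElseUpdate[T](blockId, () => {
      readCachedBlock = false
      compute()
    })
    val data = result match {
      case Left(values) => values.toList
      case Right(iter)  => iter.toList
    }
    (data, readCachedBlock)
  }
}
```

The closure is only invoked on a miss, which is why the flag can tell a cached read apart from a fresh computation even though both may come back as Left.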

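getOrElseUpdate in turn calls doPutIterator, which decides how to store the block based on the storage level and then delegates to the corresponding store implementation, MemoryStore or DiskStore: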

  private def doPutIterator[T](
      blockId: BlockId,
      iterator: () => Iterator[T],
      level: StorageLevel,
      classTag: ClassTag[T],
      tellMaster: Boolean = true,
      keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
    doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
      val startTimeMs = System.currentTimeMillis
      var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
      // Size of the block in bytes
      var size = 0L
      if (level.useMemory) {
        // Put it in memory first, even if it also has useDisk set to true;
        // We will drop it to disk later if the memory store can't hold it.
        if (level.deserialized) {
          memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
            case Right(s) =>
              size = s
            case Left(iter) =>
              // Not enough space to unroll this block; drop to disk if applicable
              if (level.useDisk) {
                logWarning(s"Persisting block $blockId to disk instead.")
                diskStore.put(blockId) { fileOutputStream =>
                  serializerManager.dataSerializeStream(blockId, fileOutputStream, iter)(classTag)
                }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(iter)
              }
          }
        } else { // !level.deserialized
          memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
            case Right(s) =>
              size = s
            case Left(partiallySerializedValues) =>
              // Not enough space to unroll this block; drop to disk if applicable
              if (level.useDisk) {
                logWarning(s"Persisting block $blockId to disk instead.")
                diskStore.put(blockId) { fileOutputStream =>
                  partiallySerializedValues.finishWritingToStream(fileOutputStream)
                }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
              }
          }
        }
      } else if (level.useDisk) {
        diskStore.put(blockId) { fileOutputStream =>
          serializerManager.dataSerializeStream(blockId, fileOutputStream, iterator())(classTag)
        }
        size = diskStore.getSize(blockId)
      }
      val putBlockStatus = getCurrentBlockStatus(blockId, info)
      val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
      if (blockWasSuccessfullyStored) {
        // Now that the block is in either the memory or disk store, tell the master about it.
        info.size = size
        if (tellMaster && info.tellMaster) {
          reportBlockStatus(blockId, putBlockStatus)
        }
        addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
        logDebug("Put block %s locally took %s".format(blockId, Utils.getUsedTimeMs(startTimeMs)))
        if (level.replication > 1) {
          val remoteStartTime = System.currentTimeMillis
          val bytesToReplicate = doGetLocalBytes(blockId, info)
          // [SPARK-16550] Erase the typed classTag when using default serialization, since
          // NettyBlockRpcServer crashes when deserializing repl-defined classes.
          // TODO(ekl) remove this once the classloader issue on the remote end is fixed.
          val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
            scala.reflect.classTag[Any]
          } else {
            classTag
          }
          try {
            replicate(blockId, bytesToReplicate, level, remoteClassTag)
          } finally {
            bytesToReplicate.unmap()
          }
          logDebug("Put block %s remotely took %s"
            .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
        }
      }
      assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
      iteratorFromFailedMemoryStorePut
    }
  }
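Stripped of serialization, metrics and replication, the branch structure of doPutIterator reduces to a small decision function. This is only a sketch of that branching under stated assumptions: Level, Outcome, decide and the memoryHasRoom flag are hypothetical names, the flag stands in for whether the memory store managed to unroll the whole block, and the level is assumed to use at least one store.

```scala
// Spark-free sketch of where doPutIterator ends up storing a block.
// Level, Outcome and decide are hypothetical names for illustration only.
object PutDecisionSketch {
  // Only the two StorageLevel flags that drive the branching.
  final case class Level(useMemory: Boolean, useDisk: Boolean)

  sealed trait Outcome
  case object StoredInMemory extends Outcome
  case object StoredOnDisk extends Outcome
  case object NotStored extends Outcome // the values go back to the caller as an iterator

  def decide(level: Level, memoryHasRoom: Boolean): Outcome = {
    // Assumption of this sketch: the level uses at least one store.
    require(level.useMemory || level.useDisk, "level must use memory or disk")
    if (level.useMemory) {
      // Memory is always tried first, even when useDisk is also set.
      if (memoryHasRoom) StoredInMemory
      else if (level.useDisk) StoredOnDisk // fall back to disk
      else NotStored                       // memory-only level and no room
    } else {
      StoredOnDisk // disk-only level writes straight to disk
    }
  }
}
```

The NotStored case corresponds to the Some(iterator) result in the source above: the block is not persisted, but the task can still consume the computed values.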
