Spark Streaming Checkpoint

The Checkpoint Mechanism

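From our earlier look at Spark Streaming, we know that a Spark Streaming application keeps running until it is stopped manually, and in practice such applications typically run 24 hours a day, 7 days a week. Streaming must therefore be resilient to failures that are unrelated to application logic, such as system errors and JVM crashes. Spark Streaming's checkpoint mechanism is designed for exactly this: it checkpoints enough information to a fault-tolerant storage system such as HDFS that the application can recover quickly after a failure. Two types of data can be checkpointed: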

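(1) Metadata checkpointing
Saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. Metadata checkpointing is used to recover when the node running the streaming application's driver fails. The metadata includes:
Configuration - the configuration used to create the streaming application
DStream operations - the set of DStream operations defined in the streaming application
Incomplete batches - batches whose jobs are queued but have not yet completed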

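(2) Data checkpointing
Saves the generated RDDs to reliable external storage. This is necessary for stateful transformations that combine data across multiple batches: the RDDs these transformations generate depend on the RDDs of previous batches, so the dependency chain keeps growing over time. Checkpointing cuts the chain by periodically saving intermediate RDDs to reliable storage, so that after a failure, recovery can start directly from the most recent checkpoint.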

In short, metadata checkpointing is mainly needed to recover from driver failures, while data checkpointing is needed when stateful transformations are used.
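As a rough illustration of data checkpointing, the sketch below keeps a running word count with updateStateByKey, a stateful transformation whose state RDD in each batch depends on the state RDD of the previous batch. It is only a minimal sketch under stated assumptions: the object name StatefulWordCountSketch, the helper updateCount, the checkpoint path, the host and port, and the 10-second checkpoint interval are all made up for illustration, and it assumes Spark 1.3 or later so that the pair-DStream operations are available without extra imports.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCountSketch {
  // Hypothetical helper: adds this batch's counts for one word to its running total.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A checkpoint directory must be set before using stateful transformations.
    // The path is a placeholder; any HDFS-compatible location would do.
    ssc.checkpoint("hdfs:///tmp/checkpoint-sketch")

    // Assumed data source: a text server on localhost:9999 (e.g. `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val totals = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateCount _)

    // Periodically save the state RDDs so the dependency chain is cut.
    // 10 seconds is an arbitrary choice for this sketch.
    totals.checkpoint(Seconds(10))
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without the call to ssc.checkpoint, Spark Streaming refuses to start a job that uses updateStateByKey, which is how the requirement described above shows up in practice.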

http://blog.csdn.net/wisgood/article/details/55667612

http://www.cnblogs.com/dt-zhw/p/5664663.html

import java.io.File
import java.nio.charset.Charset

import com.google.common.io.Files

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
 * Counts words in text encoded with UTF8 received from the network every second.
 *
 * Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
 *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
 *   data. <checkpoint-directory> directory in an HDFS-compatible file system where checkpoint
 *   data is stored. <output-file> file to which the word counts will be appended
 *
 * <checkpoint-directory> and <output-file> must be absolute paths
 *
 * To run this on your local machine, you need to first run a Netcat server
 *
 *      `$ nc -lk 9999`
 *
 * and run the example as
 *
 *      `$ ./bin/run-example org.apache.spark.examples.streaming.RecoverableNetworkWordCount \
 *              localhost 9999 ~/checkpoint/ ~/out`
 *
 * If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
 * a new StreamingContext (will print "Creating new context" to the console). Otherwise, if
 * checkpoint data exists in ~/checkpoint/, then it will create StreamingContext from
 * the checkpoint data.
 *
 * Refer to the online documentation for more details.
 */

object RecoverableNetworkWordCount {

  def createContext(ip: String, port: Int, outputPath: String, checkpointDirectory: String)
    : StreamingContext = {
    // Printed only when a new context is created (e.g. on the first run). When the
    // application is recovered from a checkpoint, createContext is not called and
    // this line does not execute.
    println("Creating new context")
    val outputFile = new File(outputPath)
    if (outputFile.exists()) outputFile.delete()
    val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount").setMaster("local[4]")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(checkpointDirectory)

    // Use a socket as the data source
    val lines = ssc.socketTextStream(ip, port)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
      val counts = "Counts at time " + time + " " + rdd.collect().mkString("[", ", ", "]")
      println(counts)
      println("Appending to " + outputFile.getAbsolutePath)
      Files.append(counts + "\n", outputFile, Charset.defaultCharset())
    })
    ssc
  }

  // Extractor that converts a String to an Int; yields None if the string is not a valid integer
  private object IntParam {
    def unapply(str: String): Option[Int] = {
      try {
        Some(str.toInt)
      } catch {
        case e: NumberFormatException => None
      }
    }
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      System.err.println("Your arguments were " + args.mkString("[", ", ", "]"))
      System.err.println(
        """
          |Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>
          |     <output-file>. <hostname> and <port> describe the TCP server that Spark
          |     Streaming would connect to receive data. <checkpoint-directory> directory in an
          |     HDFS-compatible file system where checkpoint data is stored. <output-file> file
          |     to which the word counts will be appended
          |
          |In local mode, <master> should be 'local[n]' with n > 1
          |Both <checkpoint-directory> and <output-file> must be absolute paths
        """.stripMargin
      )
      System.exit(1)
    }
    val Array(ip, IntParam(port), checkpointDirectory, outputPath) = args
    // getOrCreate either rebuilds the StreamingContext from the checkpoint data
    // or, if none exists, creates a new StreamingContext
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => {
        createContext(ip, port, outputPath, checkpointDirectory)
      })
    ssc.start()
    ssc.awaitTermination()
  }
}
