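Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk. In other words: unless the logic that produces the RDD is very expensive, avoid spilling to disk, because disk reads are slow and recomputing lost data can be just as fast as reading it back.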
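As a rough illustration of this advice, the sketch below keeps a cheap-to-recompute RDD in memory only and allows a disk-backed level only for an RDD produced by an expensive step. The object name, the RDD names and the slowParse helper are hypothetical stand-ins for a real workload, and a local master is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Hypothetical stand-in for a costly per-record computation.
  def slowParse(x: Int): Int = { Thread.sleep(1); x * 2 }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("storage-level-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Cheap to recompute: memory only, lost partitions are simply recomputed.
    val cheap = sc.parallelize(1 to 1000).map(_ + 1).persist(StorageLevel.MEMORY_ONLY)

    // Expensive to recompute: partitions that do not fit in memory may spill to disk.
    val expensive = sc.parallelize(1 to 1000).map(slowParse).persist(StorageLevel.MEMORY_AND_DISK)

    println(cheap.count())
    println(expensive.count())
    spark.stop()
  }
}
```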
2: New features of Spark accumulators
While this code used the built-in support for accumulators of type Long, programmers can also create their own types by subclassing AccumulatorV2. The AccumulatorV2 abstract class has several methods which one has to override: reset for resetting the accumulator to zero, add for adding another value into the accumulator, and merge for merging another same-type accumulator into this one. Other methods that must be overridden are listed in the API documentation. For example, supposing we had a MyVector class representing mathematical vectors, we could write:
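A minimal sketch of such a subclass follows. The text only names MyVector, so the Array[Double]-backed MyVector here, its createZeroVector(dim) factory and the accumulator's dim parameter are all assumptions made for illustration. The overridden members (isZero, copy, reset, add, merge, value) are the abstract members of AccumulatorV2.

```scala
import org.apache.spark.util.AccumulatorV2

// Hypothetical minimal vector type; not part of Spark.
final class MyVector(val values: Array[Double]) extends Serializable {
  def isZero: Boolean = values.forall(_ == 0.0)
  def reset(): Unit = java.util.Arrays.fill(values, 0.0)
  def add(other: MyVector): Unit = {
    require(other.values.length == values.length, "dimension mismatch")
    var i = 0
    while (i < values.length) { values(i) += other.values(i); i += 1 }
  }
}

object MyVector {
  def createZeroVector(dim: Int): MyVector = new MyVector(new Array[Double](dim))
}

class VectorAccumulatorV2(dim: Int) extends AccumulatorV2[MyVector, MyVector] {
  private val myVector: MyVector = MyVector.createZeroVector(dim)

  override def isZero: Boolean = myVector.isZero

  // Returns an independent accumulator holding the same running sum.
  override def copy(): VectorAccumulatorV2 = {
    val acc = new VectorAccumulatorV2(dim)
    acc.myVector.add(myVector)
    acc
  }

  override def reset(): Unit = myVector.reset()

  override def add(v: MyVector): Unit = myVector.add(v)

  // Folds the partial result of another accumulator into this one.
  override def merge(other: AccumulatorV2[MyVector, MyVector]): Unit =
    myVector.add(other.value)

  override def value: MyVector = myVector
}
```

Given an existing SparkContext sc, the accumulator would then be registered before use with sc.register(new VectorAccumulatorV2(3), "MyVectorAcc").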
For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will be applied only once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map():
val accum = sc.longAccumulator
data.map { x => accum.add(x); x }
// Here, accum is still 0 because no actions have caused the map operation to be computed.
Launching Spark jobs from Java / Scala
The org.apache.spark.launcher package provides classes for launching Spark jobs as child processes using a simple Java API.
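A minimal sketch of that API is shown below. The jar path /my/app.jar and the main class my.spark.app.Main are placeholders, not real artifacts, and the launcher is assumed to find a Spark installation through SPARK_HOME.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder application jar and main class; replace with a real build artifact.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Poll until the child application reaches a terminal state.
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
    println(s"Final state: ${handle.getState}")
  }
}
```

startApplication() returns a handle that can be used to monitor the child process. SparkLauncher also offers launch(), which returns a plain java.lang.Process when no monitoring is needed.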
4: New features in the Structured Streaming documentation
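Structured Streaming is a stream processing engine built on the Spark SQL engine. It models a data stream as an unbounded table, and whenever new data arrives the result is appended to or updated. Before Spark 2.3 it ran only on micro-batches. Spark 2.3 added a new continuous processing mode with lower latency; put plainly, this mode is not based on micro-batches but is a true streaming engine.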
Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.
5: How Structured Streaming processes streaming data
Note that Structured Streaming does not materialize the entire table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps around the minimal intermediate state data required to update the result.
a: What is watermarking?
Since Spark 2.1, we have support for watermarking which allows the user to specify the threshold of late data, and allows the engine to accordingly clean up old state.
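The threshold is declared with withWatermark on the event-time column before the aggregation. The sketch below assumes the built-in rate source as a stand-in for a real event stream, and the 10-minute threshold and window sizes are illustrative values only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("watermark-sketch")
      .getOrCreate()

    // The "rate" source emits (timestamp, value) rows; it stands in for real events.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val counts = events
      .withWatermark("timestamp", "10 minutes") // how late data may arrive
      .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()

    query.awaitTermination(10000) // run for about 10 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```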
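In the figure, the late record (12:04, donkey) arrives when the watermark is 12:11. The 12:04 record belongs to the 12:00-12:10 window, but state older than 12:11 has already been cleaned up, so the state for the 12:00-12:10 window is gone. The 12:04 record therefore cannot be processed and is dropped.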
b: How does watermarking work?
In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped.
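The rule can be modelled in a few lines of plain Scala. This is a deliberately simplified sketch, not Spark's implementation: it assumes tumbling windows, measures time in minutes since midnight, and updates the watermark after every event, whereas Spark updates it once per trigger. All names here are hypothetical.

```scala
object WatermarkModel {
  // Event time is expressed as minutes since midnight.
  final case class Event(minute: Int, word: String)

  /** Splits events into (kept, dropped) under a simple watermark rule. */
  def process(events: Seq[Event], thresholdMin: Int, windowSize: Int): (Seq[Event], Seq[Event]) = {
    var maxSeen = Int.MinValue
    val kept = Seq.newBuilder[Event]
    val dropped = Seq.newBuilder[Event]
    for (e <- events) {
      // Watermark = latest event time seen so far minus the lateness threshold.
      val watermark = if (maxSeen == Int.MinValue) Int.MinValue else maxSeen - thresholdMin
      // End of the tumbling window that contains this event.
      val windowEnd = (e.minute / windowSize + 1) * windowSize
      // A window whose end is at or before the watermark has had its state cleaned up.
      if (windowEnd <= watermark) dropped += e else kept += e
      maxSeen = math.max(maxSeen, e.minute)
    }
    (kept.result(), dropped.result())
  }
}
```

With a 10-minute threshold and 10-minute windows, feeding in events at 12:21, 12:13 and 12:04 in that order gives a watermark of 12:11 after the first event. The 12:13 event falls in a window ending at 12:20 and is kept, while the 12:04 event falls in the window ending at 12:10 and is dropped.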
7: Continuous Processing
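Continuous Processing is a new execution engine: a true continuous stream processing engine that offers fault tolerance and low latency. Its delivery guarantee, however, is at-least-once rather than exactly-once.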
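Switching a query to this mode only requires a different trigger. The sketch below assumes the built-in rate source and console sink, a local master with four cores, and a one-second checkpoint interval; all of these are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("continuous-sketch")
      .getOrCreate()

    val query = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .option("numPartitions", "2") // each partition needs its own long-running task
      .load()
      .selectExpr("value", "value * 2 AS doubled") // projection only, no aggregation
      .writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second")) // checkpoint interval, not a batch interval
      .start()

    query.awaitTermination(10000) // run for about 10 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

Replacing the trigger with Trigger.ProcessingTime("1 second") would run the same query in the default micro-batch mode.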
