Spark Structured Streaming: A Comprehensive Usage Summary
一、Introduction to Spark Structured Streaming
As we all know, Spark Streaming entered maintenance mode after v2.4.5 and no longer gets major new releases. Spark Streaming has always processed data in micro-batches, so it only achieves near-real-time latency and cannot process data in true real time the way Flink does. With Spark Streaming in maintenance-only mode, Spark has pushed Structured Streaming forward to compete with Flink; Structured Streaming supports window computations, watermarks, state, and other Flink-like features.
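To make these features concrete, here is a minimal sketch of an event-time windowed count with a watermark, written against the standard Structured Streaming API; the input streaming DataFrame events, its columns eventTime/word, and the 10-minute thresholds are assumptions for illustration only.
import org.apache.spark.sql.functions.{window, col}

// `events` is assumed to be a streaming DataFrame with columns eventTime: Timestamp and word: String.
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")                                // tolerate events up to 10 minutes late
  .groupBy(window(col("eventTime"), "10 minutes", "5 minutes"), col("word"))
  .count()

windowedCounts.writeStream
  .outputMode("update")                                                    // emit only the rows changed by this trigger
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/windowed-wordcount")     // state and offsets are persisted here
  .start()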
二、Connecting Spark Structured Streaming to a Kafka source
JAR dependency:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.2.0
1、Reading data from a Kafka source (streaming queries)
// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to 1 topic, with headers
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]

// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
2、Creating a Kafka Source for Batch Queries
// Subscribe to 1 topic, defaults to the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern, at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
schema:
Column | Type |
---|---|
key | binary |
value | binary |
topic | string |
partition | int |
offset | long |
timestamp | timestamp |
timestampType | int |
headers (optional) | array |
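Since key and value arrive as binary, a typical first step is to cast them to strings and, when the payload is JSON, parse it with from_json. A small sketch applied to the df read above, where the payload schema (userId/amount) is purely a hypothetical example:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical payload schema -- replace with the actual layout of your topic's JSON messages.
val payloadSchema = new StructType()
  .add("userId", StringType)
  .add("amount", LongType)

val parsed = df
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS json", "topic", "partition", "offset")
  .withColumn("data", from_json(col("json"), payloadSchema))
  .select("key", "data.userId", "data.amount", "topic", "partition", "offset")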
Required options:
Option | value | meaning |
---|---|---|
assign | json string {"topicA":[0,1],"topicB":[2,4]} | Specific TopicPartitions to consume. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source. |
subscribe | A comma-separated list of topics | The topic list to subscribe. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source. |
subscribePattern | Java regex string | The pattern used to subscribe to topic(s). Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source. |
kafka.bootstrap.servers | A comma-separated list of host:port | The Kafka "bootstrap.servers" configuration. |
Optional configurations:
Option | value | default | query type | meaning |
---|---|---|---|---|
startingTimestamp | timestamp string e.g. "1000" | none (next preference is startingOffsetsByTimestamp) | streaming and batch | The start point of timestamp when a query is started, a string specifying a starting timestamp for all partitions in topics being subscribed. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior follows the value of the option startingOffsetsByTimestampStrategy. Note: For streaming queries, this only applies when a new query is started, and resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
startingOffsetsByTimestamp | json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ | none (next preference is startingOffsets) | streaming and batch | The start point of timestamp when a query is started, a json string specifying a starting timestamp for each TopicPartition. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior follows the value of the option startingOffsetsByTimestampStrategy. Note: For streaming queries, this only applies when a new query is started, and resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
startingOffsets | "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """ | "latest" for streaming, "earliest" for batch | streaming and batch | The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
endingTimestamp | timestamp string e.g. "1000" | none (next preference is endingOffsetsByTimestamp) | batch query | The end point when a batch query is ended, a string specifying an ending timestamp for all partitions in topics being subscribed. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. |
endingOffsetsByTimestamp | json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ | none (next preference is endingOffsets) | batch query | The end point when a batch query is ended, a json string specifying an ending timestamp for each TopicPartition. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. |
endingOffsets | latest or json string {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}} | latest | batch query | The end point when a batch query is ended, either "latest" which is just referred to the latest, or a json string specifying an ending offset for each TopicPartition. In the json, -1 as an offset can be used to refer to latest, and -2 (earliest) as an offset is not allowed. |
failOnDataLoss | true or false | true | streaming and batch | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. |
kafkaConsumer.pollTimeoutMs | long | 120000 | streaming and batch | The timeout in milliseconds to poll data from Kafka in executors. When not defined it falls back to spark.network.timeout . |
fetchOffset.numRetries | int | 3 | streaming and batch | Number of times to retry before giving up fetching Kafka offsets. |
fetchOffset.retryIntervalMs | long | 10 | streaming and batch | milliseconds to wait before retrying to fetch Kafka offsets |
maxOffsetsPerTrigger | long | none | streaming query | Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. |
minOffsetsPerTrigger | long | none | streaming query | Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger. |
maxTriggerDelay | time with units | 15m | streaming query | Maximum amount of time for which trigger can be delayed between two triggers provided some data is available from the source. This option is only applicable if minOffsetsPerTrigger is set. |
minPartitions | int | none | streaming and batch | Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint : the number of Spark tasks will be approximately minPartitions . It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data. |
groupIdPrefix | string | spark-kafka-source | streaming and batch | Prefix of consumer group identifiers (group.id ) that are generated by structured streaming queries. If "kafka.group.id" is set, this option will be ignored. |
kafka.group.id | string | none | streaming and batch | The Kafka group id to use in Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both, batch and streaming) or sources with the same group id are likely interfere with each other causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored. |
includeHeaders | boolean | false | streaming and batch | Whether to include the Kafka headers in the row. |
startingOffsetsByTimestampStrategy | "error" or "latest" | "error" | streaming and batch | The strategy used when the specified starting offset by timestamp (either global or per partition) doesn't match the offset Kafka returned. "error": fail the query; end users have to deal with workarounds requiring manual steps. "latest": assign the latest offset for these partitions, so that Spark can read newer records from these partitions in subsequent micro-batches. |
Consumer Caching
It’s time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time is a key factor. Because of this, Spark pools Kafka consumers on executors, by leveraging Apache Commons Pool.
The caching key is built up from the following information:
- Topic name
- Topic partition
- Group ID
The following properties are available to configure the consumer pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.consumer.cache.capacity | 64 | The maximum number of consumers cached. Please note that it's a soft limit. | 3.0.0 |
spark.kafka.consumer.cache.timeout | 5m (5 minutes) | The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor. | 3.0.0 |
spark.kafka.consumer.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for consumer pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
spark.kafka.consumer.cache.jmx.enable | false | Enable or disable JMX for pools created with this configuration instance. Statistics of the pool are available via JMX instance. The prefix of JMX name is set to "kafka010-cached-simple-kafka-consumer-pool". | 3.0.0 |
The size of the pool is limited by spark.kafka.consumer.cache.capacity, but it works as a "soft limit" so as not to block Spark tasks.
The idle eviction thread periodically removes consumers that have been idle for longer than the given timeout. If this threshold is reached when borrowing, it tries to remove the least-used entry that is currently not in use.
If it cannot be removed, then the pool will keep growing. In the worst case, the pool will grow to the max number of concurrent tasks that can run in the executor (that is, the number of task slots).
If a task fails for any reason, the new task is executed with a newly created Kafka consumer for safety reasons. At the same time, all consumers in the pool that have the same caching key are invalidated, to remove the consumer that was used in the failed execution. Consumers which any other tasks are using will not be closed, but will be invalidated as well when they are returned to the pool.
Along with consumers, Spark pools the records fetched from Kafka separately, to keep Kafka consumers stateless from Spark's point of view and to maximize the efficiency of pooling. It uses the same cache key as the Kafka consumer pool. Note that it doesn't leverage Apache Commons Pool due to the difference in characteristics.
The following properties are available to configure the fetched data pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.consumer.fetchedData.cache.timeout | 5m (5 minutes) | The minimum amount of time a fetched data may sit idle in the pool before it is eligible for eviction by the evictor. | 3.0.0 |
spark.kafka.consumer.fetchedData.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for fetched data pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
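These pool properties are Spark configurations rather than source options, so they are set on the SparkSession (or via spark-submit --conf); the values below are only illustrative:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  .config("spark.kafka.consumer.cache.capacity", "128")               // allow more cached consumers per executor
  .config("spark.kafka.consumer.cache.timeout", "10m")                // keep idle consumers around longer
  .config("spark.kafka.consumer.fetchedData.cache.timeout", "10m")    // same idea for the fetched-data pool
  .getOrCreate()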
Important notes:
// "kafka.auto.offset.reset" -> "latest", //latest自动重置偏移量为最新的偏移量earliest,structured-streaming not support
"auto.offset.reset" -> "latest", //latest自动重置偏移量为最新的偏移量earliest 无效,应该写为 kafka.auto.offset.reset
// "kafka.enable.auto.commit" -> "true", //如果是true,则这个消费者的偏移量会在后台自动提交 structured-streaming not support ,通过.option("startingOffsets","latest")
"enable.auto.commit" -> "true", //如果是true,则这个消费者的偏移量会在后台自动提交 无效,应该写为kafka.enable.auto.commit
Some other optional parameters, for example when connecting to Kafka over SSL:
"kafka.security.protocol" -> "SSL",
"kafka.ssl.keystore.location" -> "/dbfs/FileStore/xxxxx/client_keystore.jks",
"kafka.ssl.keystore.password" -> "xxxxx",
"kafka.ssl.key.password" -> "xxxxxx",
"kafka.ssl.truststore.location" -> "/dbfs/FileStore/xxxxx/client_truststore.jks",
"kafka.ssl.truststore.password" -> "kafka",
"kafka.ssl.endpoint.identification.algorithm" -> "",
"kafka.request.timeout.ms" -> "100000",
"kafka.max.poll.records" -> "100",
"kafka.fetch.max.bytes" -> "12428800"
1、You can use .option("maxOffsetsPerTrigger", "30000") to limit the maximum number of records read from Kafka per trigger interval.
2、.option("startingOffsets", "latest") only takes effect the first time a query runs, before anything has been consumed. If offsets have already been recorded, the query will not consume from latest again but will resume from the last recorded offsets.
3、Keep in mind that Structured Streaming effectively cannot support "enable.auto.commit" -> "true"; offsets are maintained by Structured Streaming itself. When writing the stream to a sink, set .option("checkpointLocation", "..."), much like Flink's checkpointing. Once set, Structured Streaming takes care of maintaining and committing offsets under the checkpointLocation path.
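Putting these notes together, a typical streaming read plus sink configuration might look like the following sketch; the servers, topic and checkpoint path are placeholders:
val kafkaStream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest")        // honored only on the very first run of the query
  .option("maxOffsetsPerTrigger", "30000")    // cap the number of records per micro-batch
  .load()

kafkaStream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint/kafka-demo")   // offsets are tracked here, much like a Flink checkpoint
  .start()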
3、Writing streaming data to a Kafka sink
Kafka Sink for Streaming
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .start()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
val ds = df
  .selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .start()
Writing the output of Batch Queries to Kafka
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .save()
The Dataframe being written to Kafka should have the following columns in schema:
Column | Type |
---|---|
key (optional) | string or binary |
value (required) | string or binary |
headers (optional) | array |
topic (*optional) | string |
partition (optional) | int |
* The topic column is required if the “topic” configuration option is not specified.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the "topic" configuration option is set, i.e., the "topic" configuration option overrides the topic column. If a "partition" column is not specified (or its value is null) then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the kafka.partitioner.class option. If not present, the Kafka default partitioner will be used.
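For example, the sketch below builds the key/value/topic columns from the data itself, so each row is routed to its own topic and the key drives the default partitioner; the source DataFrame parsedDF and its columns level/userId/json are hypothetical:
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical routing: error events go to an "alerts" topic, everything else to "events".
val routed = parsedDF.select(
  when(col("level") === "ERROR", lit("alerts")).otherwise(lit("events")).as("topic"),
  col("userId").cast("string").as("key"),     // key is used by the Kafka default partitioner
  col("json").cast("string").as("value")
)

routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("checkpointLocation", "/path/to/checkpoint/kafka-sink")
  .start()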
The following options must be set for the Kafka sink for both batch and streaming queries.
Option | value | meaning |
---|---|---|
kafka.bootstrap.servers | A comma-separated list of host:port | The Kafka "bootstrap.servers" configuration. |
The following configurations are optional:
Option | value | default | query type | meaning |
---|---|---|---|---|
topic | string | none | streaming and batch | Sets the topic that all rows will be written to in Kafka. This option overrides any topic column that may exist in the data. |
includeHeaders | boolean | false | streaming and batch | Whether to include the Kafka headers in the row. |
Producer Caching
Since the Kafka producer instance is designed to be thread-safe, Spark initializes a Kafka producer instance and shares it across tasks that use the same caching key.
The caching key is built up from the following information:
- Kafka producer configuration
This includes configuration for authorization, which Spark will automatically include when delegation tokens are being used. Even taking authorization into account, you can expect the same Kafka producer instance to be used for the same Kafka producer configuration. A different Kafka producer is used when the delegation token is renewed; the Kafka producer instance for the old delegation token will be evicted according to the cache policy.
The following properties are available to configure the producer pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.producer.cache.timeout | 10m (10 minutes) | The minimum amount of time a producer may sit idle in the pool before it is eligible for eviction by the evictor. | 2.2.1 |
spark.kafka.producer.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
The idle eviction thread periodically removes producers that have been idle for longer than the given timeout. Note that the producer is shared and used concurrently, so the last used timestamp is determined by the moment the producer instance is returned and the reference count is 0.
Kafka Specific Configurations
Kafka's own configurations can be set via DataStreamReader.option with the kafka. prefix, e.g. stream.option("kafka.bootstrap.servers", "host:port"). For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
- group.id: Kafka source will create a unique group id for each query automatically. The user can set the prefix of the automatically generated group.id's via the optional source option groupIdPrefix, default value "spark-kafka-source". You can also set "kafka.group.id" to force Spark to use a special group id; however, please read the warnings for this option and use it with caution.
- auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off. Note that when the offsets consumed by a streaming application no longer exist in Kafka (e.g., topics are deleted, offsets are out of range, or offsets are removed after the retention period), the offsets will not be reset and the streaming application will see data loss. In extreme cases, for example when the throughput of the streaming application cannot catch up with the retention speed of Kafka, the input rows of a batch might be gradually reduced until zero when the offset ranges of the batch are completely not in Kafka. Enabling the failOnDataLoss option can ask Structured Streaming to fail the query for such cases.
- key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
- value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
- key.serializer: Keys are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
- value.serializer: values are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the values into either strings or byte arrays.
- enable.auto.commit: Kafka source doesn’t commit any offset.
- interceptor.classes: Kafka source always read keys and values as byte arrays. It’s not safe to use ConsumerInterceptor as it may break the query.
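As a quick sketch of what is and is not allowed: ordinary Kafka client properties keep the kafka. prefix, while the forbidden ones are replaced by their Spark-level equivalents (servers, topic and values below are placeholders):
val ds = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("groupIdPrefix", "my-app")                // instead of setting group.id directly
  .option("startingOffsets", "earliest")            // instead of auto.offset.reset
  .option("kafka.session.timeout.ms", "10000")      // plain Kafka properties are passed with the kafka. prefix
  .option("kafka.max.poll.records", "500")
  .load()
// .option("kafka.enable.auto.commit", "true")      // would throw: offsets are managed via the checkpoint, not the consumer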
三、Connecting Spark Structured Streaming to an Azure Event Hubs source
Maven dependency:
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-eventhubs-spark_${scala.binary.version}</artifactId>
<version>2.3.21</version>
<scope>compile</scope>
</dependency>
1、Reading data from an Event Hubs source
// See https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf
import java.time.Duration
import org.apache.spark.sql.functions.col

val eventHubsConf = EventHubsConf("xxxx") // connection string
  .setConsumerGroup("xxxx")
  .setMaxEventsPerTrigger(3000)
  // .setStartingPosition(EventPosition.fromEndOfStream) // or EventPosition.fromOffset("246812")
  // .setStartingPosition(EventPosition.fromStartOfStream)
  .setReceiverTimeout(Duration.ofSeconds(30))

val df = spark
  .readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
  .withColumnRenamed("value", "body")
  .select(col("body").cast("string"))

import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }

// To connect to an Event Hub, EntityPath is required as part of the connection string.
// Here, we assume that the connection string from the Azure portal does not have the EntityPath part.
val connectionString = ConnectionStringBuilder("{EVENT HUB CONNECTION STRING FROM AZURE PORTAL}")
  .setEventHubName("{EVENT HUB NAME}")
  .build

val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)

var streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
2、Writing streaming data to Event Hubs
df.writeStream
  .format("eventhubs")
  .outputMode("append")
  .options(eventHubsConfWrite.toMap)
  .option("checkpointLocation", "xxxxxxxxxxx")
  .trigger(ProcessingTime("25 seconds"))
  .start()
  .awaitTermination()

import org.apache.spark.sql.streaming.Trigger.ProcessingTime

val query =
  streamingSelectDF
    .writeStream
    .format("parquet")
    .option("path", "/mnt/sample/test-data")
    .option("checkpointLocation", "/mnt/sample/check")
    .partitionBy("zip", "day")
    .trigger(ProcessingTime("25 seconds"))
    .start()

import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// The connection string for the Event Hub you will WRITE to.
val connString = "{EVENT HUB CONNECTION STRING}"
val eventHubsConfWrite = EventHubsConf(connString)

val source =
  spark.readStream
    .format("rate")
    .option("rowsPerSecond", 100)
    .load()
    .withColumnRenamed("value", "body")
    .select($"body" cast "string")

val query =
  source
    .writeStream
    .format("eventhubs")
    .outputMode("update")
    .options(eventHubsConfWrite.toMap)
    .trigger(ProcessingTime("25 seconds"))
    .option("checkpointLocation", "/checkpoint/")
    .start()
JDBC Sink
import java.sql._
import org.apache.spark.sql.ForeachWriter

class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[(String, String)] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: (String, String)): Unit = {
    statement.executeUpdate("INSERT INTO zip_test " +
      "VALUES (" + value._1 + "," + value._2 + ")")
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}

val url = "jdbc:mysql://<mysqlserver>:3306/test"
val user = "user"
val pwd = "pwd"

val writer = new JDBCSink(url, user, pwd)
val query =
  streamingSelectDF
    .writeStream
    .foreach(writer)
    .outputMode("update")
    .trigger(ProcessingTime("25 seconds"))
    .start()
For more details, see: https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf
https://docs.microsoft.com/zh-cn/azure/databricks/_static/notebooks/structured-streaming-event-hubs-integration.html
Note: you can use .setMaxEventsPerTrigger(3000) to control how many events are read from Event Hubs per trigger.
四、Output modes for Spark Structured Streaming writes
There are a few types of output modes.
Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported only for those queries where rows added to the Result Table are never going to change. Hence, this mode guarantees that each row will be output only once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.

Complete mode - The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries.

Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. More information to be added in future releases.
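The mode is chosen per query on the writer. A brief sketch for an aggregation query without a watermark, where the console sink and trigger interval are arbitrary choices and inputDF is an assumed streaming DataFrame:
import org.apache.spark.sql.streaming.Trigger

val deviceCounts = inputDF.groupBy("device").count()   // streaming aggregation

deviceCounts.writeStream
  .outputMode("update")           // "complete" would also work; "append" is rejected for this aggregation without a watermark
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()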
Different types of streaming queries support different output modes. Here is the compatibility matrix.
Query Type | | Supported Output Modes | Notes |
---|---|---|---|
Queries with aggregation | Aggregation on event-time with watermark | Append, Update, Complete | Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed by the late threshold specified in withWatermark() as, by the mode's semantics, rows can be added to the Result Table only once after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details. Update mode uses watermark to drop old aggregation state. Complete mode does not drop old aggregation state since by definition this mode preserves all data in the Result Table. |
Queries with aggregation | Other aggregations | Complete, Update | Since no watermark is defined (only defined in the other category), old aggregation state is not dropped. Append mode is not supported as aggregates can update, thus violating the semantics of this mode. |
Queries with mapGroupsWithState | | Update | Aggregations not allowed in a query with mapGroupsWithState. |
Queries with flatMapGroupsWithState | Append operation mode | Append | Aggregations are allowed after flatMapGroupsWithState. |
Queries with flatMapGroupsWithState | Update operation mode | Update | Aggregations not allowed in a query with flatMapGroupsWithState. |
Queries with joins | | Append | Update and Complete mode not supported yet. See the support matrix in the Join Operations section for more details on what types of joins are supported. |
Other queries | | Append, Update | Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table. |
五、Other sinks for Spark Structured Streaming
Sink | Supported Output Modes | Options | Fault-tolerant | Notes |
---|---|---|---|---|
File Sink | Append | path: path to the output directory, must be specified. retention: time to live (TTL) for output files. Output files whose batches were committed longer ago than the TTL will eventually be excluded from the metadata log, which means reader queries that read the sink's output directory may not process them. You can provide the value as a string time format (like "12h", "7d", etc.); by default it's disabled. For file-format-specific options, see the related methods in DataFrameWriter (Scala/Java/Python/R), e.g. for the "parquet" format see DataFrameWriter.parquet() | Yes (exactly-once) | Supports writes to partitioned tables. Partitioning by time may be useful. |
Kafka Sink | Append, Update, Complete | See the Kafka Integration Guide | Yes (at-least-once) | More details in the Kafka Integration Guide |
Foreach Sink | Append, Update, Complete | None | Yes (at-least-once) | More details in the next section |
ForeachBatch Sink | Append, Update, Complete | None | Depends on the implementation | More details in the next section |
Console Sink | Append, Update, Complete | numRows: Number of rows to print every trigger (default: 20). truncate: Whether to truncate the output if too long (default: true) | No | |
Memory Sink | Append, Complete | None | No. But in Complete Mode, restarted query will recreate the full table. | Table name is the query name. |
// ========== DF with no aggregations ==========
val noAggDF = deviceDataDf.select("device").where("signal > 10")

// Print new data to console
noAggDF
  .writeStream
  .format("console")
  .start()

// Write new data to Parquet files
noAggDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start()

// ========== DF with aggregation ==========
val aggDF = df.groupBy("device").count()

// Print updated aggregations to console
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Have all the aggregates in an in-memory table
aggDF
  .writeStream
  .queryName("aggregates")    // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()

// Interactively query the in-memory table
spark.sql("select * from aggregates").show()
Using Foreach and ForeachBatch
The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch. Let's understand their usages in more detail.
ForeachBatch
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Transform and write batchDF
}.start()

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
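A common use of foreachBatch is reusing an existing batch writer such as the built-in JDBC writer; the sketch below uses placeholder connection details and leaves idempotency (e.g. upserts keyed by batchId) to the reader:
import java.util.Properties
import org.apache.spark.sql.DataFrame

val jdbcProps = new Properties()
jdbcProps.setProperty("user", "user")
jdbcProps.setProperty("password", "pwd")
jdbcProps.setProperty("driver", "com.mysql.jdbc.Driver")

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Each micro-batch is written with the ordinary batch JDBC writer.
  batchDF.write
    .mode("append")
    .jdbc("jdbc:mysql://<mysqlserver>:3306/test", "zip_test", jdbcProps)
}.start()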
Foreach
If foreachBatch is not an option (for example, corresponding batch data writer does not exist, or continuous processing mode), then you can express your custom writer logic using foreach. Specifically, you can express the data writing logic by dividing it into three methods: open, process, and close. Since Spark 2.4, foreach is available in Scala, Java and Python.
streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()

val spark: SparkSession = ...

// Create a streaming DataFrame
val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// Write the streaming DataFrame to a table
df.writeStream
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .toTable("myTable")

// Check the table result
spark.read.table("myTable").show()

// Transform the source dataset and write to a new table
spark.readStream
  .table("myTable")
  .select("value")
  .writeStream
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .format("parquet")
  .toTable("newTable")

// Check the new table result
spark.read.table("newTable").show()
六、State Store in Spark Structured Streaming
State store is a versioned key-value store which provides both read and write operations. In Structured Streaming, we use the state store provider to handle the stateful operations across batches. There are two built-in state store provider implementations. End users can also implement their own state store provider by extending StateStoreProvider interface.
HDFS state store provider
The HDFS backend state store provider is the default implementation of [[StateStoreProvider]] and [[StateStore]] in which all the data is stored in memory map in the first stage, and then backed by files in an HDFS-compatible file system. All updates to the store have to be done in sets transactionally, and each set of updates increments the store’s version. These versions can be used to re-execute the updates (by retries in RDD operations) on the correct version of the store, and regenerate the store version.
RocksDB state store implementation
As of Spark 3.2, we add a new built-in state store implementation, RocksDB state store provider.
If you have stateful operations in your streaming query (for example, streaming aggregation, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState) and you want to maintain millions of keys in the state, then you may face issues related to large JVM garbage collection (GC) pauses causing high variations in the micro-batch processing times. This occurs because, by the implementation of HDFSBackedStateStore, the state data is maintained in the JVM memory of the executors and large number of state objects puts memory pressure on the JVM causing high GC pauses.
In such cases, you can choose to use a more optimized state management solution based on RocksDB. Rather than keeping the state in the JVM memory, this solution uses RocksDB to efficiently manage the state in the native memory and the local disk. Furthermore, any changes to this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus providing full fault-tolerance guarantees (the same as default state management).
To enable the new built-in state store implementation, set spark.sql.streaming.stateStore.providerClass to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.
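A sketch of enabling the RocksDB provider when building the session; the extra tuning value is illustrative, not a recommendation:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-state-store-demo")
  .config("spark.sql.streaming.stateStore.providerClass",
          "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .config("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")   // illustrative tuning of the block cache
  .getOrCreate()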
Here are the configs regarding the RocksDB instance of the state store provider:
Config Name | Description | Default Value |
---|---|---|
spark.sql.streaming.stateStore.rocksdb.compactOnCommit | Whether we perform a range compaction of RocksDB instance for commit operation | False |
spark.sql.streaming.stateStore.rocksdb.blockSizeKB | Approximate size in KB of user data packed per block for a RocksDB BlockBasedTable, which is a RocksDB's default SST file format. | 4 |
spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB | The size capacity in MB for a cache of blocks. | 8 |
spark.sql.streaming.stateStore.rocksdb.lockAcquireTimeoutMs | The waiting time in millisecond for acquiring lock in the load operation for RocksDB instance. | 60000 |
spark.sql.streaming.stateStore.rocksdb.resetStatsOnLoad | Whether we reset all ticker and histogram stats for RocksDB on load. | True |
State Store and task locality
The stateful operations store states for events in state stores of executors. State stores occupy resources such as memory and disk space to store the states. So it is more efficient to keep a state store provider running in the same executor across different streaming batches. Changing the location of a state store provider requires the extra overhead of loading checkpointed states. The overhead of loading state from checkpoint depends on the external storage and the size of the state, which tends to hurt the latency of micro-batch run. For some use cases such as processing very large state data, loading new state store providers from checkpointed states can be very time-consuming and inefficient.
The stateful operations in Structured Streaming queries rely on the preferred location feature of Spark’s RDD to run the state store provider on the same executor. If in the next batch the corresponding state store provider is scheduled on this executor again, it could reuse the previous states and save the time of loading checkpointed states.
However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to executors other than the preferred ones. In this case, Spark will load state store providers from checkpointed states on new executors. The state store providers that ran in the previous batch will not be unloaded immediately. Spark runs a maintenance task which checks and unloads the state store providers that are inactive on the executors.
By changing the Spark configurations related to task scheduling, for example spark.locality.wait, users can configure how long Spark waits to launch a data-local task. For stateful operations in Structured Streaming, this can be used to keep state store providers running on the same executors across batches.
Specifically for the built-in HDFS state store provider, users can check the state store metrics such as loadedMapCacheHitCount and loadedMapCacheMissCount. Ideally, the cache miss count should be minimized, which means Spark won't waste too much time loading checkpointed state. Users can increase the Spark locality waiting configuration to avoid loading state store providers on different executors across batches.
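As a sketch, the locality wait can be raised when building the session, and the cache counters mentioned above can be inspected from the query progress of a running stateful query; the 10s value is only an example:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("stateful-streaming")
  .config("spark.locality.wait", "10s")   // give the scheduler more time to place tasks on the executor that already holds the state
  .getOrCreate()

// After a stateful query has run a few micro-batches:
// query.lastProgress.stateOperators.foreach(op => println(op.customMetrics))
// look for loadedMapCacheHitCount / loadedMapCacheMissCount in the printed metrics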
七、Joins in Spark Structured Streaming
To be continued.