Sending Avro-format data to a Kafka topic

Step 1: Define the Avro schema

{
    "type": "record",
    "name": "userlog",
    "fields": [
        {"name": "ip", "type": "string"},
        {"name": "identity", "type": "string"},
        {"name": "userid", "type": "int"},
        {"name": "time", "type": "string"},
        {"name": "requestinfo", "type": "string"},
        {"name": "state", "type": "int"},
        {"name": "responce", "type": "string"},
        {"name": "referer", "type": "string"},
        {"name": "useragent", "type": "string"},
        {"name": "timestamp", "type": "long"}
    ]
}

Create an Avro schema file named userlog.avsc with the content above.

The schema contains the fields ip:string, identity:string, userid:int, time:string, requestinfo:string, state:int, responce:string, referer:string, useragent:string, and timestamp:long, which together describe one web request log entry.
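Before wiring the schema into Kafka, it can be useful to confirm that the .avsc file parses cleanly. The sketch below is illustrative only (the class name SchemaCheck and the local file location are assumptions, not part of the original article); step 3 shows the HDFS-based variant actually used later.

import java.io.File;

import org.apache.avro.Schema;

public class SchemaCheck {
    public static void main(String[] args) throws Exception {
        // Assumes userlog.avsc is in the working directory; adjust the path as needed.
        Schema schema = new Schema.Parser().parse(new File("userlog.avsc"));
        System.out.println(schema.getName());    // userlog
        System.out.println(schema.getFields());  // the ten fields listed above
    }
}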

Step 2: Create the producer object that sends data to the topic

To send data to Kafka, we first have to create a producer object through the Kafka API, which is used to produce data to Kafka:

    private static Producer<String, byte[]> createProducer() {
        Properties props = new Properties();
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        // declare the kafka brokers
        props.put("bootstrap.servers", "192.168.0.121:9092");
        Producer<String, byte[]> producer = new KafkaProducer<String, byte[]>(props);
        return producer;
    }

This requires the Kafka client library on the classpath: kafka-clients-0.10.0.1.jar.

Step 3: Parse the Avro schema file into a Schema object, and use it to create a record (GenericRecord)

Parsing the Avro schema file into a Schema object requires the dependency avro-1.7.5.jar.

Here we define a SchemaUtil.java class that provides a getAvroSchemaFromHDFSFile method, which reads the Avro schema file from HDFS and parses it into a Schema object.

package com.dx.streaming.producer;

import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaUtil {
    public static Schema getAvroSchemaFromHDFSFile(String hdfsAvroFile) throws Exception {
        InputStream inputStream;
        Path pt = new Path(hdfsAvroFile);
        Schema schema = null;
        FileSystem fs = null;
        try {
            fs = FileSystem.get(new Configuration());
            if (!fs.exists(pt)) {
                throw new Exception(pt + " file does not exist");
            }
            inputStream = fs.open(pt);
            Schema.Parser parser = new Schema.Parser();
            schema = parser.parse(inputStream);
        } catch (IOException e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (fs != null) {
                try {
                    fs.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return schema;
    }
}

Next, use the Schema object to create a record (GenericRecord); this record holds the actual data to be produced.

            Random random = new Random();
            String ip = random.nextInt(255) + ":" + random.nextInt(255) + ":" + random.nextInt(255) + ":" + random.nextInt(255);
            String identity = UUID.randomUUID().toString();
            int userid = random.nextInt(100);
            SimpleDateFormat dfs = new SimpleDateFormat("yyyy-MM-dd ");
            Date date = new Date();
            String yyyyMMdd = dfs.format(date);
            String time = yyyyMMdd + random.nextInt(24) + ":" + random.nextInt(60) + ":" + random.nextInt(60);
            String requestInfo = "....";
            int state = random.nextInt(600);
            String responce = "...";
            String referer = "...";
            String useragent = "...";
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

            GenericRecord record = new GenericData.Record(schema);
            record.put("ip", ip);
            record.put("identity", identity);
            record.put("userid", userid);
            record.put("time", time);
            record.put("requestinfo", requestInfo);
            record.put("state", state);
            record.put("responce", responce);
            record.put("referer", referer);
            record.put("useragent", useragent);
            record.put("timestamp", format.parse(time).getTime());

Note: the code above creates a GenericRecord according to the schema; this GenericRecord holds the actual data.

The record can then be converted to a byte[] through an Injection<GenericRecord, byte[]>, which makes it easier to transmit while producing data.

String avroFilePath = "/user/dx/conf/avro/userlog.avsc";
Schema schema = SchemaUtil.getAvroSchemaFromHDFSFile(avroFilePath);
Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);
byte[] bytes = recordInjection.apply(record);

On the consumer side, when a byte[] message is received, the same Injection<GenericRecord, byte[]> can be used to decode it back into a GenericRecord:

        Logger logger = LoggerFactory.getLogger("AvroKafkaConsumer");
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.0.121:9092,192.168.0.122:9092");
        props.put("group.id", "testgroup");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<String, byte[]>(props);
        consumer.subscribe(Collections.singletonList("topic name"));

        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse("avro schema file path");
        Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);

        try {
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<String, byte[]> record : records) {
                    GenericRecord genericRecord = recordInjection.invert(record.value()).get();
                    String info = String.format("topic = %s, partition = %s, offset = %d, customer = %s, country = %s\n",
                            record.topic(), record.partition(), record.offset(), record.key(), genericRecord.get("str1"));
                    logger.info(info);
                }
            }
        } finally {
            consumer.close();
        }

Note: for more details, see 《Kafka:ZK+Kafka+Spark Streaming集群环境搭建(十三)定义一个avro schema使用comsumer发送avro字符流,producer接受avro字符流并解析》.

Step 4: Send the data to the topic through the producer

To send byte[] data to Kafka: first create a producer object through the Kafka API, convert the record above into byte[] form, and call the producer's send method to send the data.

Producer<String, byte[]> producer = createProducer();
// Parse the avro schema file into a Schema object.
// Use the Schema object to build a record and populate it with data.
// Use the Schema object to build the Injection<GenericRecord, byte[]> that converts the record to byte[].
try {
    byte[] bytes = recordInjection.apply(record);
    ProducerRecord<String, byte[]> msg = new ProducerRecord<String, byte[]>(topic, bytes);
    producer.send(msg);
} catch (Exception e) {
    e.printStackTrace();
}
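producer.send() is asynchronous. If delivery confirmation is needed, send() can also be given a Callback (part of the kafka-clients API); the snippet below is an optional sketch, not part of the original code:

// Optional: pass a Callback to send() to be notified when the broker acknowledges the record.
producer.send(msg, new org.apache.kafka.clients.producer.Callback() {
    public void onCompletion(org.apache.kafka.clients.producer.RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            exception.printStackTrace(); // the send failed
        } else {
            System.out.println("sent to partition " + metadata.partition() + ", offset " + metadata.offset());
        }
    }
});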

The four steps above outline how to turn a piece of data into a record, convert the record to byte[], and send it to Kafka, along with the reasoning behind each step. The code below is a complete implementation:

package com.dx.streaming.producer;

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Properties;
import java.util.Random;
import java.util.UUID;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;

public class TestProducer {
    private static final String avroFilePath = "D:\\Java_Study\\workspace\\kafka-streaming-learn\\conf\\avro\\userlog.avsc";
    //private static final String avroFilePath = "/user/dx/conf/avro/userlog.avsc";
    private static final String topic = "t-my";

    public static void main(String[] args) throws Exception {
        int size = 0;
        String appName = "Test Avro";
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName(appName);
        SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();

        Schema schema = SchemaUtil.getAvroSchemaFromHDFSFile(avroFilePath);
        Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);
        Producer<String, byte[]> producer = createProducer();

        while (true) {
            Random random = new Random();
            String ip = random.nextInt(255) + ":" + random.nextInt(255) + ":" + random.nextInt(255) + ":" + random.nextInt(255);
            String identity = UUID.randomUUID().toString();
            int userid = random.nextInt(100);
            SimpleDateFormat dfs = new SimpleDateFormat("yyyy-MM-dd ");
            Date date = new Date();
            String yyyyMMdd = dfs.format(date);
            String time = yyyyMMdd + random.nextInt(24) + ":" + random.nextInt(60) + ":" + random.nextInt(60);
            String requestInfo = "....";
            int state = random.nextInt(600);
            String responce = "...";
            String referer = "...";
            String useragent = "...";
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

            GenericRecord record = new GenericData.Record(schema);
            record.put("ip", ip);
            record.put("identity", identity);
            record.put("userid", userid);
            record.put("time", time);
            record.put("requestinfo", requestInfo);
            record.put("state", state);
            record.put("responce", responce);
            record.put("referer", referer);
            record.put("useragent", useragent);
            record.put("timestamp", format.parse(time).getTime());

            System.out.println("ip:" + ip + ",identity:" + identity + ",userid:" + userid + ",time:" + time + ",timestamp:" + format.parse(time).getTime() + "\r\n");

            try {
                byte[] bytes = recordInjection.apply(record);
                ProducerRecord<String, byte[]> msg = new ProducerRecord<String, byte[]>(topic, bytes);
                producer.send(msg);
            } catch (Exception e) {
                e.printStackTrace();
            }

            size++;
            if (size % 100 == 0) {
                Thread.sleep(100);
                if (size > 1000) {
                    break;
                }
            }
        }

        // List the topic's partition information.
        List<PartitionInfo> partitions = new ArrayList<PartitionInfo>();
        partitions = producer.partitionsFor(topic);
        for (PartitionInfo p : partitions) {
            System.out.println(p);
        }

        System.out.println("send message over.");
        producer.close(100, java.util.concurrent.TimeUnit.MILLISECONDS);
    }

    private static Producer<String, byte[]> createProducer() {
        Properties props = new Properties();
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        // declare the kafka brokers
        props.put("bootstrap.servers", "192.168.0.121:9092");
        Producer<String, byte[]> producer = new KafkaProducer<String, byte[]>(props);
        return producer;
    }
}

The pom.xml configuration at this point is as follows:

        <dependency>
            <groupId>com.twitter</groupId>
            <artifactId>bijection-avro_2.11</artifactId>
            <version>0.9.5</version>
        </dependency>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-avro_2.11</artifactId>
            <version>3.2.0</version>
            <type>jar</type>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

Note: a few of the dependencies above are not strictly required by the code so far, but they are needed by the Structured Streaming side below.

Sample test output:

ip:229:21:203:40,identity:ae6fde10-4687-4682-a760-d9076892eb45,userid:9,time:2018-07-12 12:57:24,timestamp:1531371444000
ip:105:224:103:61,identity:edef8c93-da4e-46d4-bfd3-551b74e6f4df,userid:1,time:2018-07-12 23:57:23,timestamp:1531411043000
ip:252:230:234:213,identity:80e00a81-f6dd-4bf6-93a1-95154babdd08,userid:59,time:2018-07-12 9:36:37,timestamp:1531359397000
ip:76:63:136:50,identity:630b66fb-95d7-4c63-a638-6f24396987d0,userid:33,time:2018-07-12 19:18:18,timestamp:1531394298000
Partition(topic = t-my, partition = 0, leader = 0, replicas = [0,], isr = [0,]
send message over.

Reading the Kafka data with Structured Streaming

Notes:

The code below is written against the Structured Streaming API, not the Spark Streaming (DStream) API.

The two differ in the APIs they expose and, to some extent, in how they work internally, so developers should be clear about which one they are using (see the sketch below).
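To make that API difference concrete, here is a minimal comparison sketch (not from the original article) of how each API subscribes to the same topic. The class name ApiComparison is hypothetical, and the DStream half additionally needs the spark-streaming and spark-streaming-kafka-0-10 artifacts that are commented out in the pom below; neither stream is started here, the snippet only contrasts the entry points.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ApiComparison {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("api-comparison");

        // Spark Streaming: micro-batches of ConsumerRecord objects exposed as a DStream.
        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", "192.168.0.121:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", ByteArrayDeserializer.class);
        kafkaParams.put("group.id", "testgroup");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        JavaInputDStream<ConsumerRecord<String, byte[]>> dstream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, byte[]>Subscribe(Arrays.asList("t-my"), kafkaParams));

        // Structured Streaming: an unbounded Dataset<Row> with key/value/topic/... columns,
        // which is the approach used in the rest of this article.
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "192.168.0.121:9092")
                .option("subscribe", "t-my")
                .load();
    }
}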

When programming with Structured Streaming against Kafka + Spark, the Maven dependencies you need are as follows:

<!-- spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

<!-- spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

<!-- Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
     http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

<!-- Spark Streaming Programming Guide
     http://spark.apache.org/docs/latest/streaming-programming-guide.html#spark-streaming-programming-guide -->
<!--
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
-->

<!-- Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
     http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html -->
<!--
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
-->

<!-- kafka client -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.0.1</version>
</dependency>

<!-- avro -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.21</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.0</version>
</dependency>
<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>bijection-avro_2.10</artifactId>
    <version>0.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.4</version>
</dependency>

Since what we read from Kafka is the Avro record serialized as byte[], it has to be parsed back into rows: the byte[] is first inverted into a GenericRecord, the record is then flattened into an Object[], and finally RowFactory.create(object[]) turns it into a Row. The parsing logic is packaged in a standalone UDF class:

package com.dx.streaming.producer;

import java.text.SimpleDateFormat;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF1;

import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;

public class AvroParserUDF implements UDF1<byte[], Row> {
    private static final long serialVersionUID = -2369806025607566774L;
    private String avroSchemaFilePath = null;
    private transient Schema schema = null;
    private transient Injection<GenericRecord, byte[]> recordInjection = null;

    public AvroParserUDF(String avroSchemaFilePath) {
        this.avroSchemaFilePath = avroSchemaFilePath;
    }

    public Row call(byte[] data) throws Exception {
        if (this.recordInjection == null) {
            this.schema = SchemaUtil.getAvroSchemaFromHDFSFile(this.avroSchemaFilePath);
            this.recordInjection = GenericAvroCodecs.toBinary(schema);
        }

        GenericRecord record = this.recordInjection.invert(data).get();
        int timeIndex = record.getSchema().getFields().indexOf(record.getSchema().getField("time"));
        int iColumns = record.getSchema().getFields().size();
        Object[] values = new Object[iColumns];
        for (int i = 0; i < iColumns; i++) {
            values[i] = record.get(i);
            if (values[i] instanceof org.apache.avro.util.Utf8) {
                values[i] = values[i].toString();
            }
        }

        // Truncate the "time" column to the start of its day, e.g.:
        // SimpleDateFormat dfs = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        // SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd 00:00:00");
        // System.out.println(df.format(dfs.parse("2018-07-03 21:23:58")));
        // output: 2018-07-03 00:00:00
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd 00:00:00");
        values[timeIndex] = df.format(sdf.parse((String) values[timeIndex]));

        return RowFactory.create(values);
    }
}

The approach: read the specified topic with sparkSession.readStream().format("kafka"), convert the byte[] values from Kafka into Rows, and then operate on the Rows.

package com.dx.streaming.producer;

import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import com.databricks.spark.avro.SchemaConverters;

public class TestConsumer {
    //private static final String avroFilePath = "D:\\Java_Study\\workspace\\kafka-streaming-learn\\conf\\avro\\userlog.avsc";
    private static final String avroFilePath = "/user/dx/conf/avro/userlog.avsc";
    private static final String topic = "t-my";

    public static void main(String[] args) throws Exception {
        String appName = "Test Avro";
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName(appName);
        SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();

        Map<String, String> kafkaOptions = new HashMap<String, String>();
        kafkaOptions.put("kafka.bootstrap.servers", "192.168.0.121:9092");

        Schema schema = SchemaUtil.getAvroSchemaFromHDFSFile(avroFilePath);
        AvroParserUDF udf = new AvroParserUDF(avroFilePath);
        StructType type = (StructType) SchemaConverters.toSqlType(schema).dataType();
        sparkSession.udf().register("deserialize", udf, DataTypes.createStructType(type.fields()));

        Dataset<Row> stream = sparkSession.readStream()
                .format("kafka")
                .options(kafkaOptions)
                .option("subscribe", topic)
                .option("startingOffsets", "earliest")
                .load()
                .select("value").as(Encoders.BINARY())
                .selectExpr("deserialize(value) as row")
                .select("row.*");

        stream.printSchema();

        // Print new data to console
        StreamingQuery query = stream.writeStream().format("console").start();
        try {
            query.awaitTermination();
            sparkSession.streams().awaitAnyTermination();
        } catch (StreamingQueryException e) {
            e.printStackTrace();
        }
    }
}

Package the jar and submit it with spark-submit:

[spark@master work]$ more submit.sh
#! /bin/bash
jars=""
for file in `ls /home/spark/work/jars/*.jar`
do
    jars=$file,$jars
    #echo $jars
done

echo "------------------------------------"
echo $jars
echo "------------------------------------"

/opt/spark-2.2.1-bin-hadoop2.7/bin/spark-submit \
--jars $jars \
--master yarn \
--verbose \
--driver-java-options "-XX:+TraceClassPaths" \
--num-executors 2 \
--executor-memory 1G \
--executor-cores 1 \
--driver-memory 1G \
--class com.dx.streaming.producer.TestConsumer \
/home/spark/work/kafka-streaming-test.jar
#--properties-file /home/spark/work/conf/spark-properties.conf \

jars:

[spark@master work]$ cd jars
[spark@master jars]$ ls
bijection-avro_2.-0.9..jar kafka-clients-0.10.0.1.jar spark-sql_2.-2.2..jar spark-streaming_2.-2.2..jar
bijection-core_2.-0.9..jar spark-avro_2.-3.2..jar spark-sql-kafka--10_2.-2.2..jar spark-streaming-kafka--10_2.-2.2..jar

Output (note: this run was submitted with spark-submit):

+--------------+--------------------+------+-------------------+-----------+-----+--------+-------+---------+-------------+
| ip| identity|userid| time|requestinfo|state|responce|referer|useragent| timestamp|
+--------------+--------------------+------+-------------------+-----------+-----+--------+-------+---------+-------------+
|36:177:233:179|27be47c9-bcbc-4cd...| 27|2019-11-03 00:00:00| ....| 88| ...| ...| ...|1530624238000|
|251:92:177:212|d711ca29-e2a7-4fb...| 24|2020-04-03 00:00:00| ....| 129| ...| ...| ...|1530570507000|
|26:177:105:119|a98020dd-4fcb-4a0...| 4|2018-11-03 00:00:00| ....| 322| ...| ...| ...|1530619861000|
|161:25:246:252|11bd7af7-b9db-428...| 3|2021-10-03 00:00:00| ....| 249| ...| ...| ...|1530582412000|
| 48:131:47:112|c519b7cb-0265-4db...| 6|2021-09-03 00:00:00| ....| 234| ...| ...| ...|1530578717000|
| 43:74:113:73|e5888022-97ad-425...| 99|2019-02-03 00:00:00| ....| 406| ...| ...| ...|1530584052000|
|230:162:238:87|ae9ecc0d-6df5-418...| 55|2022-09-03 00:00:00| ....| 128| ...| ...| ...|1530561467000|
| 0:138:183:88|2565b673-baed-4c9...| 85|2019-03-03 00:00:00| ....| 460| ...| ...| ...|1530548103000|
|210:30:157:209|59a0f81c-7dfc-444...| 31|2021-07-03 00:00:00| ....| 179| ...| ...| ...|1530632595000|
| 129:251:8:241|5483365c-79ef-429...| 96|2022-03-03 00:00:00| ....| 368| ...| ...| ...|1530600670000|
| 32:70:106:42|d1dfa208-2a3f-4fe...| 40|2020-01-03 00:00:00| ....| 184| ...| ...| ...|1530559512000|
|95:109:238:129|709eebbc-13fc-4e9...| 11|2019-02-03 00:00:00| ....| 463| ...| ...| ...|1530623652000|
|123:171:142:15|0a4cc7d1-bdac-442...| 79|2022-08-03 00:00:00| ....| 417| ...| ...| ...|1530590205000|
| 72:141:54:221|b94d268a-a464-4d7...| 94|2021-07-03 00:00:00| ....| 1| ...| ...| ...|1530567806000|
|201:79:234:119|f1ca2db5-1688-459...| 66|2018-07-03 00:00:00| ....| 531| ...| ...| ...|1530565671000|
|188:41:197:190|fe3d9faf-5376-4bb...| 86|2022-08-03 00:00:00| ....| 522| ...| ...| ...|1530568567000|
| 197:115:58:51|1c9494e2-5dcc-4a4...| 73|2018-11-03 00:00:00| ....| 214| ...| ...| ...|1530630682000|
| 213:242:0:177|e06cd131-da6d-499...| 11|2022-05-03 00:00:00| ....| 530| ...| ...| ...|1530604390000|
| 70:109:32:120|37c95b44-d692-48e...| 66|2018-07-03 00:00:00| ....| 7| ...| ...| ...|1530576459000|
|100:203:217:78|cff08213-b679-4b2...| 51|2020-04-03 00:00:00| ....| 128| ...| ...| ...|1530548883000|
+--------------+--------------------+------+-------------------+-----------+-----+--------+-------+---------+-------------+
only showing top 20 rows

18/07/13 05:58:36 INFO streaming.StreamExecution: Streaming query made progress: {
  "id" : "efd34a20-36ae-48a5-89c3-2107bab3cbca",
  "runId" : "a73386c3-34cf-43ec-abe8-904671e269c8",
  "name" : null,
  "timestamp" : "2018-07-12T21:58:31.590Z",
  "numInputRows" : 19800,
  "processedRowsPerSecond" : 3887.6889848812093,
  "durationMs" : {
    "addBatch" : 3595,
    "getBatch" : 252,
    "getOffset" : 612,
    "queryPlanning" : 122,
    "triggerExecution" : 5092,
    "walCommit" : 487
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[t-my]]",
    "startOffset" : null,
    "endOffset" : {
      "t-my" : {
        "0" : 19800
      }
    },
    "numInputRows" : 19800,
    "processedRowsPerSecond" : 3887.6889848812093
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@20bb170f"
  }
}

References:

在Spark结构化流readStream、writeStream 输入输出,及过程ETL

Spark Structured Streaming入门编程指南

Structured Streaming 实现思路与实现概述

Spark结构式流编程指南

Kafka 如何读取offset topic内容 (__consumer_offsets)
