Integrating Spark with Kafka requires the spark-streaming-kafka jar, which comes in two branches depending on the Kafka version: spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10.

How to choose between the branches: for Kafka versions >= 0.8.2.1 and < 0.10.0, use spark-streaming-kafka-0-8; for Kafka versions >= 0.10.0, use spark-streaming-kafka-0-10.
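
For reference, the 0-8 branch is published under a parallel artifact. A minimal sketch of its Maven coordinate is shown below; the version and Scala suffix here simply mirror the 0-10 snippet later in this article, so adjust both to your own build.

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.1</version>
</dependency>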

The Kafka releases from 0.8.2.1 onward are, in order: 0.8.2.1 (released March 11, 2015), 0.8.2.2 (October 2, 2015), 0.9.x, 0.10.x (0.10.0.0 released May 22, 2016), 0.11.x, 1.0.x (1.0.0 released November 1, 2017), 1.1.x and 2.0.x (2.0.0 released July 30, 2018).

This article uses Kafka 1.0.0, so the spark-streaming-kafka-0-10 jar is needed, as follows:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.1</version>
</dependency>

PS: as the groupId shows, this jar is developed by the Spark project team.

Simple example 1: verified in a Spark 2.4.0 (Scala 2.12) / Kafka 2.2.0 (Scala 2.12) environment:

import org.apache.commons.collections.MapUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import com.alibaba.fastjson.JSON;

import java.util.*;

public class SparkConsumerTest {
    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "192.168.56.100:9092");
        props.setProperty("group.id", "my-test-consumer-group");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "1000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Map kafkaParams = new HashMap(8);
        kafkaParams.putAll(props);
        JavaInputDStream<ConsumerRecord<String, String>> javaInputDStream = KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(Arrays.asList("test"), kafkaParams));
        javaInputDStream.persist(StorageLevel.MEMORY_AND_DISK_SER());
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        javaInputDStream.foreachRDD(rdd -> {
            Dataset<Row> df = spark.createDataFrame(rdd.map(consumerRecord -> {
                Map testMap = JSON.parseObject(consumerRecord.value(), Map.class);
                return new DemoBean(MapUtils.getString(testMap, "id"),
                        MapUtils.getString(testMap, "name"),
                        MapUtils.getIntValue(testMap, "age"));
            }), DemoBean.class);
            DataFrameWriter writer = df.write();
            String url = "jdbc:postgresql://192.168.56.100/postgres";
            String table = "test";
            Properties connectionProperties = new Properties();
            connectionProperties.put("user", "postgres");
            connectionProperties.put("password", "abc123");
            connectionProperties.put("driver", "org.postgresql.Driver");
            connectionProperties.put("batchsize", "3000");
            writer.mode(SaveMode.Append).jdbc(url, table, connectionProperties);
        });
        jssc.start();
        jssc.awaitTermination();
    }
}

DemoBean is a separate entity class.
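
DemoBean is referenced but not listed in the article. Below is a minimal sketch of what it presumably looks like, with the fields and constructor inferred from the code above; createDataFrame(rdd, DemoBean.class) maps bean properties to columns, so getters and setters are needed, and implementing Serializable keeps the bean safe to use inside Spark closures.

import java.io.Serializable;

public class DemoBean implements Serializable {

    private String id;
    private String name;
    private int age;

    public DemoBean() {
    }

    public DemoBean(String id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }

    public String getId() { return id; }

    public void setId(String id) { this.id = id; }

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }

    public void setAge(int age) { this.age = age; }
}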

The corresponding pom.xml:

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.12</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-streams -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-streams</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.9</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>2.4.0</version>
            <!-- <scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.7.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.thoughtworks.paranamer/paranamer -->
        <dependency>
            <groupId>com.thoughtworks.paranamer</groupId>
            <artifactId>paranamer</artifactId>
            <version>2.8</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>42.2.5</version>
        </dependency>
    </dependencies>

Simple example 2: verified in a Spark 1.6.0 (Scala 2.11) / Kafka 0.10.2.0 (Scala 2.11) environment:

import com.alibaba.fastjson.JSON;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import kafka.utils.ZKGroupTopicDirs;
import org.apache.commons.lang3.StringUtils;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaCluster;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.collection.JavaConversions;
import scala.collection.Map$;
import scala.collection.immutable.Set;
import scala.collection.mutable.ArrayBuffer;
import scala.util.Either;

import java.nio.charset.StandardCharsets;
import java.util.*;

public class SparkConsumerTest {

    public static CuratorFramework curatorFramework;

    static {
        curatorFramework = CuratorFrameworkFactory.builder().connectString("192.168.56.103:2181")
                .connectionTimeoutMs(30000)
                .sessionTimeoutMs(30000)
                .retryPolicy(new RetryUntilElapsed(1000, 1000))
                .build();
        curatorFramework.start();
    }

    public static void main(String[] args) throws Exception {
        String topic = "test";
        String groupId = "spark-test-consumer-group";
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // one batch every 30 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        Map<String, String> kafkaParams = new HashMap(4);
        kafkaParams.put("bootstrap.servers", "192.168.56.103:9092");
        // build fromOffsets, which KafkaUtils.createDirectStream needs
        Map<TopicAndPartition, Long> fromOffsets = getFromOffsets(kafkaParams, topic, groupId);
        SQLContext sqlContext = new SQLContext(jssc.sparkContext());
        // Function is Spark's org.apache.spark.api.java.function.Function, not a JDK class
        Function<MessageAndMetadata<String, String>, String> function = MessageAndMetadata::message;
        kafkaParams.put("group.id", groupId);
        JavaInputDStream<String> messages = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                String.class,
                kafkaParams,
                fromOffsets,
                function
        );
        messages.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                String firstValue = rdd.first();
                System.out.println("message:" + firstValue);
                JavaRDD<Person> personJavaRDD = rdd.mapPartitions(it -> {
                    List<String> list = new ArrayList();
                    while (it.hasNext()) {
                        list.add(it.next());
                    }
                    return list;
                }).map(p -> {
                    try {
                        Person person = JSON.parseObject(StringUtils.deleteWhitespace(p), Person.class);
                        return person;
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return new Person();
                }).filter(p -> StringUtils.isNotBlank(p.getId())
                        || StringUtils.isNotBlank(p.getName())
                        || p.getAge() != 0
                );
                if (!personJavaRDD.isEmpty()) {
                    DataFrame df = sqlContext.createDataFrame(personJavaRDD, Person.class).select("id", "name", "age");
                    df.show(5);
                }
                // write the consumed offsets back to ZooKeeper
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                Arrays.asList(offsetRanges).forEach(offsetRange -> {
                    String consumerOffsetDir = new ZKGroupTopicDirs(groupId, topic).consumerOffsetDir()
                            + "/" + offsetRange.partition();
                    try {
                        curatorFramework.setData().forPath(consumerOffsetDir, String.valueOf(offsetRange.untilOffset()).getBytes(StandardCharsets.UTF_8));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        });
        jssc.start();
        jssc.awaitTermination();
    }

    public static scala.collection.immutable.Map jMap2sMap(Map<String, String> map) {
        scala.collection.mutable.Map mapTest = JavaConversions.mapAsScalaMap(map);
        Object objTest = Map$.MODULE$.newBuilder().$plus$plus$eq(mapTest.toSeq());
        Object resultTest = ((scala.collection.mutable.Builder) objTest).result();
        scala.collection.immutable.Map resultTest2 = (scala.collection.immutable.Map) resultTest;
        return resultTest2;
    }

    public static Map<TopicAndPartition, Long> getFromOffsets(Map kafkaParams, String topic, String groupId) throws Exception {
        // kafkaParams only contains bootstrap.servers -> broker list
        KafkaCluster kc = new KafkaCluster(jMap2sMap(kafkaParams));
        ArrayBuffer<String> arrayBuffer = new ArrayBuffer();
        arrayBuffer.$plus$eq(topic);
        Either<ArrayBuffer<Throwable>, Set<TopicAndPartition>> either = kc.getPartitions(arrayBuffer.toSet());
        if (either.isLeft()) {
            throw new SparkException("get partitions failed", either.left().toOption().get().last());
        }
        scala.collection.immutable.Set<TopicAndPartition> topicAndPartitions = either.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either2 = kc.getEarliestLeaderOffsets(topicAndPartitions);
        if (either2.isLeft()) {
            throw new SparkException("get earliestLeaderOffsets failed", either2.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> earliestLeaderOffsets = either2.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either3 = kc.getLatestLeaderOffsets(topicAndPartitions);
        if (either3.isLeft()) {
            throw new SparkException("get latestLeaderOffsets failed", either3.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> latestLeaderOffsets = either3.right().get();
        Map<TopicAndPartition, Long> fromOffsets = new HashMap();
        ZKGroupTopicDirs zKGroupTopicDirs = new ZKGroupTopicDirs(groupId, topic);
        // iterate partitions starting from 0
        for (int i = 0; i < topicAndPartitions.size(); i++) {
            TopicAndPartition topicAndPartition = new TopicAndPartition(topic, i);
            // the path is /consumers/$group/offsets/$topic
            String consumerOffsetDir = zKGroupTopicDirs.consumerOffsetDir() + "/" + i;
            long zookeeperConsumerOffset = 0;
            // if the consumer group's offset directory does not exist, consumption has not started yet
            if (curatorFramework.checkExists().forPath(consumerOffsetDir) == null) {
                System.out.println(consumerOffsetDir + " does not exist");
                // create the directory and initialize the offset to 0
                curatorFramework.create().creatingParentsIfNeeded().forPath(consumerOffsetDir, "0".getBytes(StandardCharsets.UTF_8));
            } else {
                // read the offset stored in the ZooKeeper node
                byte[] zookeeperConsumerOffsetBytes = curatorFramework.getData().forPath(consumerOffsetDir);
                if (zookeeperConsumerOffsetBytes != null) {
                    zookeeperConsumerOffset = Long.parseLong(new String(zookeeperConsumerOffsetBytes, StandardCharsets.UTF_8));
                }
            }
            long earliestLeaderOffset = earliestLeaderOffsets.get(topicAndPartition).get().offset();
            long latestLeaderOffset = latestLeaderOffsets.get(topicAndPartition).get().offset();
            long fromOffset;
            if (zookeeperConsumerOffset < earliestLeaderOffset) {
                fromOffset = earliestLeaderOffset;
            } else if (zookeeperConsumerOffset > latestLeaderOffset) {
                fromOffset = latestLeaderOffset;
            } else {
                fromOffset = zookeeperConsumerOffset;
            }
            fromOffsets.put(topicAndPartition, fromOffset);
        }
        return fromOffsets;
    }
}

pom.xml:

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>1.6.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>1.6.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

This example uses spark-streaming-kafka_2.11-1.6.0.jar rather than spark-streaming-kafka-assembly_2.11-1.6.0.jar. The two jars are functionally the same, but no source jar could be found for the assembly artifact. Note that although the Kafka broker here runs 0.10.2.0, kafka_2.11-0.10.2.0.jar is deliberately not added as a dependency, because in practice it causes java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker. The root cause is a change in the definition of kafka.api.PartitionMetadata. In kafka_2.11-0.8.2.1.jar it is declared as
case class PartitionMetadata(partitionId: Int, val leader: Option[Broker], replicas: Seq[Broker], isr: Seq[Broker] = Seq.empty, errorCode: Short = ErrorMapping.NoError)
but starting with kafka_2.11-0.10.0.0.jar it became
case class PartitionMetadata(partitionId: Int, leader: Option[BrokerEndPoint], replicas: Seq[BrokerEndPoint], isr: Seq[BrokerEndPoint] = Seq.empty, errorCode: Short = Errors.NONE.code)
that is, the types of its members changed. spark-streaming-kafka_2.11-1.6.0.jar is compatible with kafka_2.11-0.8.2.1.jar, so putting kafka_2.11-0.10.2.0.jar on the classpath triggers the class cast error.

One special caveat: in spark-streaming-kafka_2.11-1.6.0.jar the inner classes of KafkaCluster are private, so referencing KafkaCluster.LeaderOffset keeps failing to compile. After a lot of searching, it turns out the inner classes only stopped being private from spark-streaming-kafka_2.11 2.0 onward. The workaround is to create a package named org.apache.spark.streaming.kafka in the project, copy the KafkaCluster class from spark-streaming-kafka_2.11-1.6.0.jar into it, and edit the source to remove the private[spark] modifier from LeaderOffset. The KafkaCluster being referenced is then our own copy rather than the one inside spark-streaming-kafka_2.11-1.6.0.jar, and KafkaCluster.LeaderOffset becomes usable.

The scenarios above all map one Kafka message to one database record. What if a single Kafka message corresponds to multiple records?
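
For instance, a single message could then carry a JSON array. The payload below is made up, but it matches the Person fields (id, name, age) that example 3 parses with JSON.parseArray:

[{"id":"1","name":"tom","age":25},{"id":"2","name":"jerry","age":30}]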

Simple example 3: same environment and same pom as example 2:

import com.alibaba.fastjson.JSON;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import kafka.utils.ZKGroupTopicDirs;
import org.apache.commons.lang3.StringUtils;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaCluster;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.collection.JavaConversions;
import scala.collection.Map$;
import scala.collection.immutable.Set;
import scala.collection.mutable.ArrayBuffer;
import scala.util.Either;

import java.nio.charset.StandardCharsets;
import java.util.*;

public class SparkConsumerTest {

    public static CuratorFramework curatorFramework;

    static {
        curatorFramework = CuratorFrameworkFactory.builder().connectString("192.168.56.103:2181")
                .connectionTimeoutMs(30000)
                .sessionTimeoutMs(30000)
                .retryPolicy(new RetryUntilElapsed(1000, 1000))
                .build();
        curatorFramework.start();
    }

    public static void main(String[] args) throws Exception {
        String topic = "test";
        String groupId = "spark-test-consumer-group";
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        conf.set("spark.streaming.kafka.maxRatePerPartition", "10000");
        // one batch every 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        Map<String, String> kafkaParams = new HashMap(4);
        kafkaParams.put("bootstrap.servers", "192.168.56.103:9092");
        // build fromOffsets, which KafkaUtils.createDirectStream needs
        Map<TopicAndPartition, Long> fromOffsets = getFromOffsets(kafkaParams, topic, groupId);
        JavaSparkContext jsc = jssc.sparkContext();
        // Function is Spark's org.apache.spark.api.java.function.Function, not a JDK class
        Function<MessageAndMetadata<String, String>, String> messageHandler = MessageAndMetadata::message;
        kafkaParams.put("group.id", groupId);
        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.StringType, false, Metadata.empty()),
                new StructField("msg", DataTypes.StringType, false, Metadata.empty()),
                new StructField("createdDate", DataTypes.StringType, false, Metadata.empty()),
                new StructField("updatedDate", DataTypes.StringType, false, Metadata.empty())
        });
        SQLContext sqlContext = new SQLContext(jssc.sparkContext());
        JavaInputDStream<String> messages = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                String.class,
                kafkaParams,
                fromOffsets,
                messageHandler
        );
        messages.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                JavaRDD<Object> javaRDD = rdd.mapPartitions(it -> {
                    List<Object> list = new ArrayList<>();
                    while (it.hasNext()) {
                        String str = it.next();
                        if (StringUtils.isNotBlank(str)) {
                            str = StringUtils.deleteWhitespace(str);
                            try {
                                List<Person> personList = JSON.parseArray(str, Person.class);
                                list.addAll(personList);
                            } catch (Exception e) {
                                list.add(str);
                            }
                        }
                    }
                    return list;
                });
                JavaRDD<Person> personRDD = javaRDD.filter(p -> p instanceof Person).map(p -> (Person) p);
                DataFrame df;
                DataFrameWriter writer;
                if (!personRDD.isEmpty()) {
                    df = sqlContext.createDataFrame(personRDD, Person.class)
                            .select("id", "name", "age");
                    df.show(2);
                }
                JavaRDD<Row> rowRDD = javaRDD.filter(p -> p instanceof String)
                        .map(p -> RowFactory.create(UUID.randomUUID().toString(), p, String.valueOf(System.currentTimeMillis()), String.valueOf(System.currentTimeMillis())));
                if (!rowRDD.isEmpty()) {
                    df = sqlContext.createDataFrame(rowRDD, schema);
                    df.show(2);
                }
                // write the consumed offsets back to ZooKeeper
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                Arrays.asList(offsetRanges).forEach(offsetRange -> {
                    String consumerOffsetDir = new ZKGroupTopicDirs(groupId, topic).consumerOffsetDir()
                            + "/" + offsetRange.partition();
                    try {
                        curatorFramework.setData().forPath(consumerOffsetDir, String.valueOf(offsetRange.untilOffset()).getBytes(StandardCharsets.UTF_8));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        });
        jssc.start();
        jssc.awaitTermination();
    }

    public static scala.collection.immutable.Map jMap2sMap(Map<String, String> map) {
        scala.collection.mutable.Map mapTest = JavaConversions.mapAsScalaMap(map);
        Object objTest = Map$.MODULE$.newBuilder().$plus$plus$eq(mapTest.toSeq());
        Object resultTest = ((scala.collection.mutable.Builder) objTest).result();
        scala.collection.immutable.Map resultTest2 = (scala.collection.immutable.Map) resultTest;
        return resultTest2;
    }

    public static Map<TopicAndPartition, Long> getFromOffsets(Map kafkaParams, String topic, String groupId) throws Exception {
        // kafkaParams only contains bootstrap.servers -> broker list
        KafkaCluster kc = new KafkaCluster(jMap2sMap(kafkaParams));
        ArrayBuffer<String> arrayBuffer = new ArrayBuffer();
        arrayBuffer.$plus$eq(topic);
        Either<ArrayBuffer<Throwable>, Set<TopicAndPartition>> either = kc.getPartitions(arrayBuffer.toSet());
        if (either.isLeft()) {
            throw new SparkException("get partitions failed", either.left().toOption().get().last());
        }
        scala.collection.immutable.Set<TopicAndPartition> topicAndPartitions = either.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either2 = kc.getEarliestLeaderOffsets(topicAndPartitions);
        if (either2.isLeft()) {
            throw new SparkException("get earliestLeaderOffsets failed", either2.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> earliestLeaderOffsets = either2.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either3 = kc.getLatestLeaderOffsets(topicAndPartitions);
        if (either3.isLeft()) {
            throw new SparkException("get latestLeaderOffsets failed", either3.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> latestLeaderOffsets = either3.right().get();
        Map<TopicAndPartition, Long> fromOffsets = new HashMap();
        ZKGroupTopicDirs zKGroupTopicDirs = new ZKGroupTopicDirs(groupId, topic);
        // iterate partitions starting from 0
        for (int i = 0; i < topicAndPartitions.size(); i++) {
            TopicAndPartition topicAndPartition = new TopicAndPartition(topic, i);
            // the path is /consumers/$group/offsets/$topic
            String consumerOffsetDir = zKGroupTopicDirs.consumerOffsetDir() + "/" + i;
            long zookeeperConsumerOffset = 0;
            // if the consumer group's offset directory does not exist, consumption has not started yet
            if (curatorFramework.checkExists().forPath(consumerOffsetDir) == null) {
                System.out.println(consumerOffsetDir + " does not exist");
                // create the directory and initialize the offset to 0
                curatorFramework.create().creatingParentsIfNeeded().forPath(consumerOffsetDir, "0".getBytes(StandardCharsets.UTF_8));
            } else {
                // read the offset stored in the ZooKeeper node
                byte[] zookeeperConsumerOffsetBytes = curatorFramework.getData().forPath(consumerOffsetDir);
                if (zookeeperConsumerOffsetBytes != null) {
                    String zookeeperConsumerOffsetStr = new String(zookeeperConsumerOffsetBytes, StandardCharsets.UTF_8);
                    zookeeperConsumerOffset = Long.parseLong(zookeeperConsumerOffsetStr);
                }
            }
            long earliestLeaderOffset = earliestLeaderOffsets.get(topicAndPartition).get().offset();
            long latestLeaderOffset = latestLeaderOffsets.get(topicAndPartition).get().offset();
            long fromOffset;
            if (zookeeperConsumerOffset < earliestLeaderOffset) {
                fromOffset = earliestLeaderOffset;
            } else if (zookeeperConsumerOffset > latestLeaderOffset) {
                fromOffset = latestLeaderOffset;
            } else {
                fromOffset = zookeeperConsumerOffset;
            }
            fromOffsets.put(topicAndPartition, fromOffset);
        }
        System.out.println("fromOffsets= " + fromOffsets);
        return fromOffsets;
    }
}

What are a Receiver DStream and a Direct DStream? In which ways can Spark pull data from Kafka? These questions came up in an interview at OPPO.
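
For contrast with the Direct approach (KafkaUtils.createDirectStream) used throughout this article, here is a minimal, untested sketch of the Receiver-based approach from the 0-8 API; the ZooKeeper address, group id, topic and class name are placeholders.

import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class ReceiverDStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("receiver-sketch").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // topic -> number of receiver threads
        Map<String, Integer> topics = Collections.singletonMap("test", 1);
        // Receiver-based stream: the high-level consumer connects via ZooKeeper and commits offsets there
        JavaPairReceiverInputDStream<String, String> stream =
                KafkaUtils.createStream(jssc, "192.168.56.103:2181", "receiver-test-group", topics);
        // keep only the message values (the key is typically null for plain string messages)
        stream.map(tuple -> tuple._2()).print();
        jssc.start();
        jssc.awaitTermination();
    }
}

With the Receiver-based approach, offsets are tracked in ZooKeeper by Kafka's high-level consumer and the received data is first stored by Spark's receivers; the Direct approach used above computes an offset range per batch and reads from the brokers without a receiver, which is why the examples manage offsets themselves.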
