  Here I use the kafka-connect-hive plugin developed by Landoop; the project documentation is at Hive Sink. Let's look at how to use the sink part of this plugin.
  Apache Kafka 2.11-2.1.0
  Apache Hadoop 2.6.3
  Apache Hive 1.2.1
  Java 1.8
  cd kafka_2.11-2.1.0
  bin/kafka-server-start.sh config/server.properties &
  cd confluent-5.1.0
  bin/schema-registry-start etc/schema-registry/schema-registry.properties &
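  The schema-registry component manages the schemas of Kafka topics. It stores every version of a schema as it evolves, which helps resolve compatibility problems between old and new data. Here the Kafka keys and values are serialized with the Apache Avro library, so schema-registry is required. It runs with its default configuration.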
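To see which schema versions the registry holds for a topic, its REST API can be queried. The following is a minimal Scala sketch and not part of the original setup. It assumes the registry runs at its default address http://localhost:8081 and that subjects follow the default naming scheme (topic name plus -key or -value). The object and method names are made up for illustration.

```scala
import scala.io.Source
import scala.util.Try

object SchemaRegistrySketch {
  // Assumption: the registry listens on its default address.
  val DefaultBaseUrl = "http://localhost:8081"

  // With the default subject naming, a topic's subjects are "<topic>-key" and "<topic>-value".
  def subjectFor(topic: String, isKey: Boolean): String =
    if (isKey) s"$topic-key" else s"$topic-value"

  // URL of the endpoint that lists the registered versions of a subject.
  def versionsUrl(baseUrl: String, subject: String): String =
    s"${baseUrl.stripSuffix("/")}/subjects/$subject/versions"

  // Fetches the raw JSON response; yields a Failure if the registry is unreachable
  // or the subject does not exist yet.
  def fetchVersions(baseUrl: String, subject: String): Try[String] = Try {
    val src = Source.fromURL(versionsUrl(baseUrl, subject), "UTF-8")
    try src.mkString finally src.close()
  }
}
```

After the producer further down has written to hive_sink_orc, calling SchemaRegistrySketch.fetchVersions(SchemaRegistrySketch.DefaultBaseUrl, "hive_sink_orc-value") should return the list of versions registered for the value schema.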
  # Sample configuration for a distributed Kafka Connect worker that uses Avro serialization and
  # integrates the Schema Registry. This sample configuration assumes a local installation of
  # Confluent Platform with all services running on their default ports.
  # Bootstrap Kafka servers. If multiple servers are specified, they should be comma-separated.
  # The group ID is a unique identifier for the set of workers that form a single Kafka Connect
  # cluster
  # The converters specify the format of data in Kafka and how to translate it into Connect data.
  # Every Connect user will need to configure these based on the format they want their data in
  # when loaded from or stored into Kafka
  # Internal Storage Topics.
  # Kafka Connect distributed workers store the connector and task configurations, connector offsets,
  # and connector statuses in three internal topics. These topics MUST be compacted.
  # When the Kafka Connect distributed worker starts, it will check for these topics and attempt to create them
  # as compacted topics if they don't yet exist, using the topic name, replication factor, and number of partitions
  # as specified in these properties, and other topic-specific settings inherited from your brokers'
  # auto-creation settings. If you need more control over these other topic-specific settings, you may want to
  # manually create these topics before starting Kafka Connect distributed workers.
  # The following properties set the names of these three internal topics for storing configs, offsets, and status.
  # The following properties set the replication factor for the three internal topics, defaulting to 3 for each
  # and therefore requiring a minimum of 3 brokers in the cluster. Since we want the examples to run with
  # only a single broker, we set the replication factor here to just 1. That's okay for the examples, but
  # ALWAYS use a replication factor of AT LEAST 3 for production environments to reduce the risk of
  # losing connector offsets, configurations, and status.
  # The config storage topic must have a single partition, and this cannot be changed via properties.
  # Offsets for all connectors and tasks are written quite frequently and therefore the offset topic
  # should be highly partitioned; by default it is created with 25 partitions, but adjust accordingly
  # with the number of connector tasks deployed to a distributed worker cluster. Kafka Connect records
  # the status less frequently, and so by default the topic is created with 5 partitions.
  # The offsets, status, and configurations are written to the topics using converters specified through
  # the following required properties. Most users will always want to use the JSON converter without schemas.
  # Offset and config data is never visible outside of Connect in this format.
  # Confluent Control Center Integration -- uncomment these lines to enable Kafka client interceptors
  # that will report audit data that can be displayed and analyzed in Confluent Control Center
  # producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
  # consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
  # These are provided to inform the user about the presence of the REST host and port configs
  # Hostname & Port for the REST API to listen on. If this is set, it will bind to the interface used to listen to requests.
  # The Hostname & Port that will be given out to other workers to connect to i.e. URLs that are routable from other servers.
  # Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
  # (connectors, converters, transformations). The list should consist of top level directories that include
  # any combination of:
  # a) directories immediately containing jars with plugins and their dependencies
  # b) uber-jars with plugins and their dependencies
  # c) directories immediately containing the package directory structure of classes of plugins and their dependencies
  # Examples:
  # plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
  # Replace the relative path below with an absolute path if you are planning to start Kafka Connect from within a
  # directory other than the home directory of Confluent Platform.
  cd confluent-5.1.0
  bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
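Once the worker is up, it is worth confirming that it actually loaded the Hive sink plugin from plugin.path. The sketch below is one rough way to do that in Scala. It assumes the Connect REST API listens on its default port 8083, and it checks the response with a plain substring match rather than a JSON parser. The object and method names are hypothetical.

```scala
import scala.io.Source
import scala.util.Try

object ConnectPluginCheck {
  // Assumption: the Connect worker exposes its REST API on the default port.
  val DefaultConnectUrl = "http://localhost:8083"
  val HiveSinkClass = "com.landoop.streamreactor.connect.hive.sink.HiveSinkConnector"

  // Endpoint that lists the connector plugins installed on the worker.
  def pluginsUrl(baseUrl: String): String =
    s"${baseUrl.stripSuffix("/")}/connector-plugins"

  // Crude check: looks for the quoted class name anywhere in the JSON text.
  def listsPlugin(pluginsJson: String, className: String): Boolean =
    pluginsJson.contains("\"" + className + "\"")

  // Queries the worker; yields a Failure if it cannot be reached.
  def hiveSinkInstalled(baseUrl: String): Try[Boolean] = Try {
    val src = Source.fromURL(pluginsUrl(baseUrl), "UTF-8")
    try listsPlugin(src.mkString, HiveSinkClass) finally src.close()
  }
}
```

If hiveSinkInstalled returns Success(false), the plugin jar is probably not under any directory listed in plugin.path.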
  -- Create the hive_connect database
  create database hive_connect;
  -- Create the cities_orc table
  use hive_connect;
  create table cities_orc (city string, state string, population int, country string) stored as orc;
  2. Use Postman to add the kafka-connect-hive sink configuration to Kafka Connect:
  {
    "name": "hive-sink-example",
    "config": {
      "name": "hive-sink-example",
      "connector.class": "com.landoop.streamreactor.connect.hive.sink.HiveSinkConnector",
      "tasks.max": 1,
      "topics": "hive_sink_orc",
      "connect.hive.kcql": "insert into cities_orc select * from hive_sink_orc AUTOCREATE PARTITIONBY state STOREAS ORC WITH_FLUSH_INTERVAL = 10 WITH_PARTITIONING = DYNAMIC",
      "connect.hive.database.name": "hive_connect",
      "connect.hive.hive.metastore": "thrift",
      "connect.hive.hive.metastore.uris": "thrift://quickstart.cloudera:9083",
      "connect.hive.fs.defaultFS": "hdfs://quickstart.cloudera:9001",
      "connect.hive.error.policy": "NOOP",
      "connect.progress.enabled": true
    }
  }
  Start a Kafka producer and write some test data. The Scala test code is as follows:
  import java.util.Properties

  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericData
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

  class AvroTest {
    /**
     * Test producing data to Kafka in Avro format.
     * See https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
     */
    def testProducer(): Unit = {
      // Set the Kafka broker address, the serializers, and the Schema Registry address
      val props = new Properties()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[io.confluent.kafka.serializers.KafkaAvroSerializer])
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[io.confluent.kafka.serializers.KafkaAvroSerializer])
      props.put("schema.registry.url", "http://localhost:8081")
      // Define the schema
      val schemaStr = "{\"type\":\"record\",\"name\":\"myrecord\",\"fields\":[{\"name\":\"city\",\"type\":\"string\"},{\"name\":\"state\",\"type\":\"string\"},{\"name\":\"population\",\"type\":\"int\"},{\"name\":\"country\",\"type\":\"string\"}]}"
      val parser = new Schema.Parser()
      val schema = parser.parse(schemaStr)
      // Build the test records
      val avroRecord1 = new GenericData.Record(schema)
      avroRecord1.put("city", "Philadelphia")
      avroRecord1.put("state", "PA")
      avroRecord1.put("population", 1568000)
      avroRecord1.put("country", "USA")
      val avroRecord2 = new GenericData.Record(schema)
      avroRecord2.put("city", "Chicago")
      avroRecord2.put("state", "IL")
      avroRecord2.put("population", 2705000)
      avroRecord2.put("country", "USA")
      val avroRecord3 = new GenericData.Record(schema)
      avroRecord3.put("city", "New York")
      avroRecord3.put("state", "NY")
      avroRecord3.put("population", 8538000)
      avroRecord3.put("country", "USA")
      // Produce the data
      val producer = new KafkaProducer[String, GenericData.Record](props)
      try {
        val recordList = List(avroRecord1, avroRecord2, avroRecord3)
        val key = "key1"
        for (elem <- recordList) {
          val record = new ProducerRecord("hive_sink_orc", key, elem)
          // Send each record 101 times (0 to 100 inclusive), waiting for every acknowledgement
          for (i <- 0 to 100) {
            val ack = producer.send(record).get()
            println(s"${ack.toString} written to partition ${ack.partition.toString}")
          }
        }
      } catch {
        case e: Throwable => e.printStackTrace()
      } finally {
        // When you're finished producing records, flush the producer to ensure
        // everything has been written to Kafka, then close it to free its resources.
        producer.flush()
        producer.close()
      }
    }
  }
  use hive_connect;
  select * from cities_orc;
  | cities_orc.city | cities_orc.population | cities_orc.country | cities_orc.state |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Chicago | 2705000 | USA | IL |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  | Philadelphia | 1568000 | USA | PA |
  WITH_SCHEMA_EVOLUTION: string, default MATCH. The compatibility policy between the Hive schema and the schema of the Kafka topic records. The Hive connector uses this policy to add or remove fields.
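KCQL statements such as the one in the connector configuration are easy to mistype, so it can help to assemble them from named parts. The sketch below is a hypothetical Scala helper, not something the plugin provides. It only reproduces the clause order seen in the example statement. Two things are assumptions: the form WITH_SCHEMA_EVOLUTION = value, inferred by analogy with WITH_PARTITIONING = DYNAMIC, and the comma used to join several partition columns.

```scala
object KcqlSketch {
  // Hypothetical model of the KCQL clauses used in this post.
  final case class HiveSinkKcql(
      table: String,
      topic: String,
      autoCreate: Boolean = false,
      partitionBy: Seq[String] = Nil,
      storeAs: String = "ORC",
      flushInterval: Option[Int] = None,
      partitioning: Option[String] = None,
      schemaEvolution: Option[String] = None) {

    // Renders the clauses in the same order as the example statement.
    def render: String = {
      val parts: Seq[Option[String]] = Seq(
        Some(s"insert into $table select * from $topic"),
        if (autoCreate) Some("AUTOCREATE") else None,
        // Assumption: multiple partition columns are comma-separated.
        if (partitionBy.nonEmpty) Some("PARTITIONBY " + partitionBy.mkString(",")) else None,
        Some(s"STOREAS $storeAs"),
        flushInterval.map(n => s"WITH_FLUSH_INTERVAL = $n"),
        partitioning.map(p => s"WITH_PARTITIONING = $p"),
        // Assumption: same "NAME = value" form as the other WITH_ options.
        schemaEvolution.map(e => s"WITH_SCHEMA_EVOLUTION = $e")
      )
      parts.flatten.mkString(" ")
    }
  }
}
```

For example, KcqlSketch.HiveSinkKcql("cities_orc", "hive_sink_orc", autoCreate = true, partitionBy = Seq("state"), flushInterval = Some(10), partitioning = Some("DYNAMIC")).render yields the statement used in the sink configuration. Passing schemaEvolution = Some("MATCH") appends the schema evolution clause.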
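  Kafka Connect configuration
  The Kafka Connect configuration options are as follows:
  tasks.max: int, default 1. The number of tasks for the connector.
  connector.class: string. The name of the connector class; the value must be com.landoop.streamreactor.connect.hive.sink.HiveSinkConnector.
  connect.hive.hive.metastore: string. The network protocol used to connect to the Hive metastore.
  connect.hive.hive.metastore.uris: string. The connection address of the Hive metastore.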
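The options above can be sanity-checked before the configuration is sent to Kafka Connect, and the registration itself can be scripted instead of going through Postman. The following Scala sketch does both under stated assumptions: the Connect REST API is reachable (by default on port 8083), all values are passed as strings, and the list of expected keys simply mirrors the options described here rather than anything verified against the connector. The object and method names are hypothetical.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.util.Try

object HiveSinkRegistration {
  val ExpectedConnectorClass = "com.landoop.streamreactor.connect.hive.sink.HiveSinkConnector"

  // Assumption: these are just the options documented above; whether the
  // connector treats every one of them as mandatory is not verified here.
  val ExpectedKeys = Seq("connector.class", "connect.hive.hive.metastore", "connect.hive.hive.metastore.uris")

  // Returns human-readable problems; an empty result means the checks passed.
  def problems(config: Seq[(String, String)]): Seq[String] = {
    val byKey = config.toMap
    val missing = ExpectedKeys.filterNot(byKey.contains).map(k => s"missing key: $k")
    val wrongClass = byKey.get("connector.class").toSeq
      .filter(_ != ExpectedConnectorClass)
      .map(c => s"unexpected connector.class: $c")
    val badTasks = byKey.get("tasks.max").toSeq
      .filter(v => Try(v.toInt).toOption.forall(_ < 1))
      .map(v => s"tasks.max must be a positive integer: $v")
    missing ++ wrongClass ++ badTasks
  }

  // Wraps a string in double quotes, escaping backslashes and quotes for JSON.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Builds a body of the form {"name": ..., "config": {...}} with string values.
  def requestBody(name: String, config: Seq[(String, String)]): String = {
    val entries = config.map { case (k, v) => quote(k) + ": " + quote(v) }
    "{" + quote("name") + ": " + quote(name) + ", " +
      quote("config") + ": {" + entries.mkString(", ") + "}}"
  }

  // POSTs the body to <connectUrl>/connectors and returns the HTTP status code.
  def register(connectUrl: String, body: String): Int = {
    val url = new URL(connectUrl.stripSuffix("/") + "/connectors")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      try out.write(body.getBytes(StandardCharsets.UTF_8)) finally out.close()
      conn.getResponseCode
    } finally conn.disconnect()
  }
}
```

A typical flow would be to call problems on the configuration first and only call register(connectUrl, requestBody(name, config)) when it returns an empty list. The check is deliberately strict about the class name, since connector.class is case-sensitive.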

