kafka-connect-hive is a Hive read/write plugin built on the Kafka Connect platform. It consists of two parts, source and sink: the source part reads data from Hive tables so that Kafka Connect can write it to other storage layers (for example, streaming data from Hive into Elasticsearch), while the sink part writes data into Hive tables, letting Kafka Connect read from third-party sources such as MySQL and load the data into Hive.
  
Here I use the kafka-connect-hive plugin developed by Landoop; the project documentation is at Hive Sink. Let's look at how to use the sink part of this plugin.
  
Environment
  
  Apache Kafka 2.11-2.1.0
  
  Confluent-5.1.0
  
  Apache Hadoop 2.6.3
  
  Apache Hive 1.2.1
  
  Java 1.8
  
Features

Supports KCQL routing queries, allowing all or only some of the fields in a Kafka topic to be written to a Hive table

Supports dynamic partitioning based on a chosen field

Supports full and incremental data synchronization; partial updates are not supported
  
Getting started

Start the dependencies

1. Start Kafka:
  
  cd kafka_2.11-2.1.0
  
  bin/kafka-server-start.sh config/server.properties &
  
2. Start the Schema Registry:
  
  cd confluent-5.1.0
  
  bin/schema-registry-start etc/schema-registry/schema-registry.properties &
  
The schema-registry component provides schema management for Kafka topics: it stores every version of a schema as it evolves and helps us handle compatibility between old and new data schemas. Because we serialize Kafka keys and values with the Apache Avro library, the schema-registry component is required; it runs with its default configuration here.
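A quick way to confirm the Schema Registry is reachable is to query its REST API (a sketch, assuming the default port 8081; the subject list stays empty until a producer registers a schema):

curl http://localhost:8081/subjects

# After the producer shown later has run, this returns something like ["hive_sink_orc-key","hive_sink_orc-value"]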
  
3. Start Kafka Connect:

Edit the connect-avro-distributed.properties file in the confluent-5.1.0/etc/schema-registry directory; after the changes its content is as follows:
  
  # Sample configuration for a distributed Kafka Connect worker that uses Avro serialization and
  
# integrates with the Schema Registry. This sample configuration assumes a local installation of
  
  # Confluent Platform with all services running on their default ports.
  
  # Bootstrap Kafka servers. If multiple servers are specified, they should be comma-separated.
  
  bootstrap.servers=localhost:9092
  
  # The group ID is a unique identifier for the set of workers that form a single Kafka Connect
  
  # cluster
  
  group.id=connect-cluster
  
  # The converters specify the format of data in Kafka and how to translate it into Connect data.
  
  # Every Connect user will need to configure these based on the format they want their data in
  
  # when loaded from or stored into Kafka
  
  key.converter=io.confluent.connect.avro.AvroConverter
  
  key.converter.schema.registry.url=http://localhost:8081
  
  value.converter=io.confluent.connect.avro.AvroConverter
  
  value.converter.schema.registry.url=http://localhost:8081
  
  # Internal Storage Topics.
  
  #
  
  # Kafka Connect distributed workers store the connector and task configurations, connector offsets,
  
  # and connector statuses in three internal topics. These topics MUST be compacted.
  
  # When the Kafka Connect distributed worker starts, it will check for these topics and attempt to create them
  
  # as compacted topics if they don't yet exist, using the topic name, replication factor, and number of partitions
  
  # as specified in these properties, and other topic-specific settings inherited from your brokers'
  
  # auto-creation settings. If you need more control over these other topic-specific settings, you may want to
  
  # manually create these topics before starting Kafka Connect distributed workers.
  
  #
  
  # The following properties set the names of these three internal topics for storing configs, offsets, and status.
  
  config.storage.topic=connect-configs
  
  offset.storage.topic=connect-offsets
  
  status.storage.topic=connect-statuses
  
  # The following properties set the replication factor for the three internal topics, defaulting to 3 for each
  
  # and therefore requiring a minimum of 3 brokers in the cluster. Since we want the examples to run with
  
  # only a single broker, we set the replication factor here to just 1. That's okay for the examples, but
  
  # ALWAYS use a replication factor of AT LEAST 3 for production environments to reduce the risk of
  
  # losing connector offsets, configurations, and status.
  
  config.storage.replication.factor=1
  
  offset.storage.replication.factor=1
  
  status.storage.replication.factor=1
  
  # The config storage topic must have a single partition, and this cannot be changed via properties.
  
  # Offsets for all connectors and tasks are written quite frequently and therefore the offset topic
  
  # should be highly partitioned; by default it is created with 25 partitions, but adjust accordingly
  
  # with the number of connector tasks deployed to a distributed worker cluster. Kafka Connect records
  
  # the status less frequently, and so by default the topic is created with 5 partitions.
  
  #offset.storage.partitions=25
  
  #status.storage.partitions=5
  
  # The offsets, status, and configurations are written to the topics using converters specified through
  
  # the following required properties. Most users will always want to use the JSON converter without schemas.
  
  # Offset and config data is never visible outside of Connect in this format.
  
  internal.key.converter=org.apache.kafka.connect.json.JsonConverter
  
  internal.value.converter=org.apache.kafka.connect.json.JsonConverter
  
  internal.key.converter.schemas.enable=false
  
  internal.value.converter.schemas.enable=false
  
  # Confluent Control Center Integration -- uncomment these lines to enable Kafka client interceptors
  
  # that will report audit data that can be displayed and analyzed in Confluent Control Center
  
  # producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
  
  # consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
  
  # These are provided to inform the user about the presence of the REST host and port configs
  
  # Hostname & Port for the REST API to listen on. If this is set, it will bind to the interface used to listen to requests.
  
  #rest.host.name=0.0.0.0
  
  #rest.port=8083
  
# The Hostname & Port that will be given out to other workers to connect to i.e. URLs that are routable from other servers.
  
  #rest.advertised.host.name=0.0.0.0
  
  #rest.advertised.port=8083
  
  # Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
  
  # (connectors, converters, transformations). The list should consist of top level directories that include
  
  # any combination of:
  
  # a) directories immediately containing jars with plugins and their dependencies
  
  # b) uber-jars with plugins and their dependencies
  
  # c) directories immediately containing the package directory structure of classes of plugins and their dependencies
  
  # Examples:
  
  # plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
  
  # Replace the relative path below with an absolute path if you are planning to start Kafka Connect from within a
  
  # directory other than the home directory of Confluent Platform.
  
  plugin.path=/kafka/confluent-5.1.0/plugins/lib
  
The plugin.path parameter must be set here; it tells Kafka Connect where the plugin packages are stored.

Download kafka-connect-hive-1.2.1-2.1.0-all.tar.gz, extract it, and place kafka-connect-hive-1.2.1-2.1.0-all.jar in the directory specified by plugin.path, then start kafka-connect with the following commands:
  
  cd confluent-5.1.0
  
  bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
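Once the worker is up (default REST port 8083), you can check that it picked up the plugin by listing the installed connector plugins; the HiveSinkConnector class should appear in the response (a sketch, assuming default ports):

curl -s http://localhost:8083/connector-plugins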
  
Prepare the test data

1. On the Hive server, run the following statements with beeline:

# Create the hive_connect database
  
  create database hive_connect;
  
# Create the cities_orc table
  
  use hive_connect;
  
  create table cities_orc (city string, state string, population int, country string) stored as orc;
  
2. Use Postman to add the kafka-connect-hive sink configuration to Kafka Connect:

URL: localhost:8083/connectors/

Method: POST

Request body:
  
  {
  
  "name": "hive-sink-example",
  
  "config": {
  
  "name": "hive-sink-example",
  
  "connector.class": "com.landoop.streamreactor.connect.hive.sink.hiveSinkConnector",
  
  "tasks.max": 1,
  
  "topics": "hive_sink_orc",
  
  "connect.hive.kcql": "insert into cities_orc select * from hive_sink_orc AUTOCREATE PARTITIONBY state STOREAS ORC WITH_FLUSH_INTERVAL = 10 WITH_PARTITIONING = DYNAMIC",
  
  "connect.hive.database.name": "hive_connect",
  
  "connect.hive.hive.metastore": "thrift",
  
  "connect.hive.hive.metastore.uris": "thrift://quickstart.cloudera:9083",
  
  "connect.hive.fs.defaultFS": "hdfs://www.michenggw.com quickstart.cloudera:9001",
  
  "connect.hive.error.policy": "NOOP",
  
  "connect.progress.enabled": true
  
  }
  
  }
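If you prefer the command line to Postman, the same request can be sent with curl (a sketch; it assumes the JSON body above has been saved to a file named hive-sink-example.json):

curl -X POST -H "Content-Type: application/json" --data @hive-sink-example.json http://localhost:8083/connectors/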
  
Run the test and check the results

3. Start a Kafka producer and write some test data; the Scala test code is shown below:
  
  class AvroTest {
  
  /**
  
* Tests producing data to Kafka with Avro serialization

* Reference: https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
  
  */
  
  @Test
  
  def testProducer: Unit = {
  
// Set the Kafka broker address, the serializers, and the Schema Registry URL
  
  val props = new Properties()
  
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[io.confluent.kafka.serializers.KafkaAvroSerializer])
  
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[io.confluent.kafka.serializers.KafkaAvroSerializer])
  
  props.put("schema.registry.url", "http://localhost:8081")
  
// Define the Avro schema for the records

val schemaString = "{\"type\":\"record\",\"name\":\"myrecord\",\"fields\":[{\"name\":\"city\",\"type\":\"string\"},{\"name\":\"state\",\"type\":\"string\"},{\"name\":\"population\",\"type\":\"int\"},{\"name\":\"country\",\"type\":\"string\"}]}"

val parser = new Schema.Parser()

val schema = parser.parse(schemaString)
  
// Build the test records
  
  val avroRecord1 = new GenericData.Record(schema)
  
  avroRecord1.put("city", "Philadelphia")
  
  avroRecord1.put("state", "PA")
  
  avroRecord1.put("population", 1568000)
  
  avroRecord1.put("country", "USA")
  
  val avroRecord2 = new GenericData.Record(schema)
  
  avroRecord2.put("city", "Chicago")
  
  avroRecord2.put("state", "IL")
  
  avroRecord2.put("population", 2705000)
  
  avroRecord2.put("country"www.lezongyule.com, "USA")
  
  val avroRecord3 = new GenericData.Record(schema)
  
  avroRecord3.put("city", "New York")
  
  avroRecord3.put("state", "NY")
  
  avroRecord3.put("population", 8538000)
  
  avroRecord3.put("country", "USA")
  
// Produce the records
  
  val producer = new KafkaProducer[String, GenericData.Record](props)
  
  try {
  
  val recordList = List(avroRecord1, avroRecord2, avroRecord3)
  
  val key = "key1"
  
  for (elem <- recordList) {
  
  val record = new ProducerRecord("hive_sink_orc", key, elem)
  
  for (i <- 0 to 100) {
  
  val ack = producer.send(record).get()
  
  println(s"${ack.toString} written to partition ${ack.partition.toString}")
  
  }
  
  }
  
  } catch {
  
  case e: Throwable => e.printStackTrace()
  
  } finally {
  
  // When you're finished producing records, you can flush the producer to ensure it has all been written to Kafka and
  
  // then close the producer to free its resources.
  
// Flush to make sure all records have been written to Kafka
  
  producer.flush()
  
// Close the producer to release its resources
  
  producer.close()
  
  }
  
  }
  
  }
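Before querying Hive, you can confirm that the Avro records actually reached the topic with the Avro console consumer shipped with Confluent Platform (a sketch, run from the confluent-5.1.0 directory with default ports):

bin/kafka-avro-console-consumer --topic hive_sink_orc --bootstrap-server localhost:9092 --property schema.registry.url=http://localhost:8081 --from-beginning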
  
4. Query the Hive data with beeline:
  
  use hive_connect;
  
  select * from cities_orc;
  
Part of the output is shown below:
  
  +------------------+------------------------+---------------------+-------------------+--+
  
  | cities_orc.city | cities_orc.population | cities_orc.country | cities_orc.state |
  
  +------------------+------------------------+---------------------+-------------------+--+
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Chicago | 2705000 | USA | IL |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
  
  | Philadelphia | 1568000 | USA | PA |
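Since the KCQL statement partitions by state with WITH_PARTITIONING = DYNAMIC, the sink should also have created one Hive partition per distinct state value, which can be verified in beeline (a sketch):

# List the partitions created by the sink

use hive_connect;

show partitions cities_orc;

# Expected to include partitions such as state=IL, state=NY and state=PA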
  
Configuration reference

KCQL configuration

The options available in connect.hive.kcql are described below; an illustrative combination of them follows the list:
  
WITH_FLUSH_INTERVAL: long; the interval, in milliseconds, between file commits

WITH_FLUSH_SIZE: long; the number of records written to HDFS before a commit is performed

WITH_FLUSH_COUNT: long; the number of records held uncommitted to HDFS before a commit is performed

WITH_SCHEMA_EVOLUTION: string, defaults to MATCH; the compatibility policy between the Hive schema and the schema of the Kafka topic records, which the Hive connector uses to add or remove fields

WITH_TABLE_LOCATION: string; the storage location of the Hive table in HDFS; if not specified, Hive's default location is used

WITH_OVERWRITE: boolean; whether to overwrite existing records in the Hive table; with this policy the existing table is dropped first and then recreated

PARTITIONBY: List<String>; the partition fields; when specified, the partition values are taken from the listed columns

WITH_PARTITIONING: string, defaults to STRICT; how partitions are created. The two modes are DYNAMIC and STRICT: DYNAMIC creates partitions on the fly from the fields listed in PARTITIONBY, while STRICT requires all partitions to have been created in advance

AUTOCREATE: boolean; whether to create the table automatically
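As an illustration of how these options combine, the following KCQL (a hypothetical variant of the statement used earlier in this guide, not one the example actually runs) writes only three of the fields, flushes every 1000 records, and partitions dynamically by state:

insert into cities_orc select city, state, population from hive_sink_orc AUTOCREATE PARTITIONBY state STOREAS ORC WITH_FLUSH_COUNT = 1000 WITH_PARTITIONING = DYNAMIC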
  
Kafka Connect configuration

The Kafka Connect options are described below:

name: string; the name of the connector, unique within the whole kafka-connect cluster

topics: string; the name of the topic holding the data, which must match the topic name in the KCQL statement

tasks.max: int, defaults to 1; the number of tasks for the connector

connector.class: string; the connector class name, which must be com.landoop.streamreactor.connect.hive.sink.HiveSinkConnector

connect.hive.kcql: string; the kafka-connect query statement

connect.hive.database.name: string; the name of the Hive database

connect.hive.hive.metastore: string; the network protocol used to connect to the Hive metastore

connect.hive.hive.metastore.uris: string; the connection address of the Hive metastore

connect.hive.fs.defaultFS: string; the HDFS address
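Once the connector has been created, its health and task state can be checked through the Kafka Connect REST API (assuming the worker's default REST port 8083 and the connector name used in this guide):

curl -s http://localhost:8083/connectors/hive-sink-example/status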
