This article is an in-depth tutorial for using Kafka to move data from PostgreSQL to Hadoop HDFS via JDBC connections.

Read this eGuide to discover the fundamental differences between iPaaS and dPaaS and how the innovative approach of dPaaS gets to the heart of today’s most pressing integration problems, brought to you in partnership with Liaison.

Tutorial: Discover how to build a pipeline with Kafka leveraging DataDirect PostgreSQL JDBC driver to move the data from PostgreSQL to HDFS. Let’s go streaming!

Apache Kafka is an open source distributed streaming platform which enables you to build streaming data pipelines between different applications. You can also build real-time streaming applications that interact with streams of data, focusing on providing a scalable, high throughput and low latency platform to interact with data streams.

Earlier this year, Apache Kafka announced a new tool called Kafka Connect which can helps users to easily move datasets in and out of Kafka using connectors, and it has support for JDBC connectors out of the box! One of the major benefits for DataDirect customers is that you can now easily build an ETL pipeline using Kafka leveraging your DataDirect JDBC drivers. Now you can easily connect and get the data from your data sources into Kafka and export the data from there to another data source.

Image From https://kafka.apache.org/

Environment Setup

Before proceeding any further with this tutorial, make sure that you have installed the following and are configured properly. This tutorial is written assuming you are also working on Ubuntu 16.04 LTS, you have PostgreSQL, Apache Hadoop, and Hive installed.

  1. Installing Apache Kafka and required toolsTo make the installation process easier for people trying this out for the first time, we will be installing Confluent Platform. This takes care of installing Apache Kafka, Schema Registry and Kafka Connect which includes connectors for moving files, JDBC connectors and HDFS connector for Hadoop.
    1. To begin with, install Confluent’s public key by running the command: wget -qO -http://packages.confluent.io/deb/2.0/archive.key | sudo apt-key add -
    2. Now add the repository to your sources.list by running the following command: sudo add-apt-repository "deb http://packages.confluent.io/deb/2.0 stable main"
    3. Update your package lists and then install the Confluent platform by running the following commands: sudo apt-get updatesudo apt-get install confluent-platform-2.11.7
  2. Install DataDirect PostgreSQL JDBC driver
    1. Download DataDirect PostgreSQL JDBC driver by visiting here.
    2. Install the PostgreSQL JDBC driver by running the following command: java -jar PROGRESS_DATADIRECT_JDBC_POSTGRESQL_ALL.jar
    3. Follow the instructions on the screen to install the driver successfully (you can install the driver in evaluation mode where you can try it for 15 days, or in license mode, if you have bought the driver)
  3. Configuring data sources for Kafka Connect
    1. Create a new file called postgres.properties, paste the following configuration and save the file. To learn more about the modes that are being used in the below configuration file, visit this pagename=test-postgres-jdbcconnector.class=io.confluent.connect.jdbc.JdbcSourceConnectortasks.max=1connection.url=jdbc:datadirect:postgresql://<;server>:<port>;User=<user>;Password=<password>;Database=<dbname>mode=timestamp+incrementingincrementing.column.name=<id>timestamp.column.name=<modifiedtimestamp>topic.prefix=test_jdbc_table.whitelist=actor
    2. Create another file called hdfs.properties, paste the following configuration and save the file. To learn more about HDFS connector and configuration options used, visit this pagename=hdfs-sinkconnector.class=io.confluent.connect.hdfs.HdfsSinkConnectortasks.max=1topics=test_jdbc_actorhdfs.url=hdfs://<;server>:<port>flush.size=2hive.metastore.uris=thrift://<;server>:<port>hive.integration=trueschema.compatibility=BACKWARD
    3. Note that postgres.properties and hdfs.properties have basically the connection configuration details and behavior of the JDBC and HDFS connectors.
    4. Create a symbolic link for DataDirect Postgres JDBC driver in Hive lib folder by using the following command: ln -s /path/to/datadirect/lib/postgresql.jar /path/to/hive/lib/postgresql.jar
    5. Also make the DataDirect Postgres JDBC driver available on Kafka Connect process’s CLASSPATH by running the following command: export CLASSPATH=/path/to/datadirect/lib/postgresql.jar
    6. Start the Hadoop cluster by running following commands: cd /path/to/hadoop/sbin./start-dfs.sh./start-yarn.sh
  4. Configuring and running Kafka Services
  5. Download the configuration files for Kafkazookeeper and schema-registry services
  6. Start the Zookeeper service by providing the zookeeper.properties file path as a parameter by using the command: zookeeper-server-start /path/to/zookeeper.properties
  7. Start the Kafka service by providing the server.properties file path as a parameter by using the command: kafka-server-start /path/to/server.properties
  8. Start the Schema registry service by providing the schema-registry.properties file path as a parameter by using the command:               schema-registry-start /path/to/ schema-registry.properties

Ingesting Data Into HDFS using Kafka Connect

To start ingesting data from PostgreSQL, the final thing that you have to do is start Kafka Connect. You can start Kafka Connect by running the following command:

connect-standalone /path/to/connect-avro-standalone.properties \ /path/to/postgres.properties /path/to/hdfs.properties

This will import the data from PostgreSQL to Kafka using DataDirect PostgreSQL JDBC drivers and create a topic with name test_jdbc_actor. Then the data is exported from Kafka to HDFS by reading the topic test_jdbc_actor through the HDFS connector. The data stays in Kafka, so you can reuse it to export to any other data sources.

Next Steps

We hope this tutorial helped you understand on how you can build a simple ETL pipeline using Kafka Connect leveraging DataDirect PostgreSQL JDBC drivers. This tutorial is not limited to PostgreSQL. In fact, you can create ETL pipelines leveraging any of our DataDirect JDBC drivers that we offer for Relational databases like OracleDB2 and SQL Server, Cloud sources likeSalesforce and Eloqua or BigData sources like CDH HiveSpark SQL and Cassandra by following similar steps. Also, subscribe to our blog via email or RSS feed  for more awesome tutorials.

Discover the unprecedented possibilities and challenges, created by today’s fast paced data climate andwhy your current integration solution is not enough, brought to you in partnership with Liaison.

Build an ETL Pipeline With Kafka Connect via JDBC Connectors的更多相关文章

  1. 1.3 Quick Start中 Step 7: Use Kafka Connect to import/export data官网剖析(博主推荐)

    不多说,直接上干货! 一切来源于官网 http://kafka.apache.org/documentation/ Step 7: Use Kafka Connect to import/export ...

  2. Kafka connect快速构建数据ETL通道

    摘要: 作者:Syn良子 出处:http://www.cnblogs.com/cssdongl 转载请注明出处 业余时间调研了一下Kafka connect的配置和使用,记录一些自己的理解和心得,欢迎 ...

  3. Kafka Connect Architecture

    Kafka Connect's goal of copying data between systems has been tackled by a variety of frameworks, ma ...

  4. Streaming data from Oracle using Oracle GoldenGate and Kafka Connect

    This is a guest blog from Robin Moffatt. Robin Moffatt is Head of R&D (Europe) at Rittman Mead, ...

  5. 基于Kafka Connect框架DataPipeline在实时数据集成上做了哪些提升?

    在不断满足当前企业客户数据集成需求的同时,DataPipeline也基于Kafka Connect 框架做了很多非常重要的提升. 1. 系统架构层面. DataPipeline引入DataPipeli ...

  6. 打造实时数据集成平台——DataPipeline基于Kafka Connect的应用实践

    导读:传统ETL方案让企业难以承受数据集成之重,基于Kafka Connect构建的新型实时数据集成平台被寄予厚望. 在4月21日的Kafka Beijing Meetup第四场活动上,DataPip ...

  7. 替代Flume——Kafka Connect简介

    我们知道过去对于Kafka的定义是分布式,分区化的,带备份机制的日志提交服务.也就是一个分布式的消息队列,这也是他最常见的用法.但是Kafka不止于此,打开最新的官网. 我们看到Kafka最新的定义是 ...

  8. Apache Kafka Connect - 2019完整指南

    今天,我们将讨论Apache Kafka Connect.此Kafka Connect文章包含有关Kafka Connector类型的信息,Kafka Connect的功能和限制.此外,我们将了解Ka ...

  9. SQL Server CDC配合Kafka Connect监听数据变化

    写在前面 好久没更新Blog了,从CRUD Boy转型大数据开发,拉宽了不少的知识面,从今年年初开始筹备.组建.招兵买马,到现在稳定开搞中,期间踏过无数的火坑,也许除了这篇还很写上三四篇. 进入主题, ...

随机推荐

  1. Effective C++ -----条款42:了解typename的双重意义

    声明template参数时,前缀关键字class和typename可互换. 请使用关键字typename标识嵌套从属类型名称:但不得在base class lists(基类列)或member init ...

  2. linux 共享内存实现

    说起共享内存,一般来说会让人想起下面一些方法:1.多线程.线程之间的内存都是共享的.更确切的说,属于同一进程的线程使用的是同一个地址空间,而不是在不同地址空间之间进行内存共享:2.父子进程间的内存共享 ...

  3. HDU 4949 Light(插头dp、位运算)

    比赛的时候没看题,赛后看题觉得比赛看到应该可以敲的,敲了之后发现还真就会卡题.. 因为写完之后,无限TLE... 直到后来用位运算代替了我插头dp常用的decode.encode.shift三个函数以 ...

  4. 【Git】笔记4 分支管理1

    1.创建与合并分支 一开始的时候,master分支是一条线,Git用master指向最新的提交,再用HEAD指向master,就能确定当前分支,以及当前分支的提交点: 每次提交,master分支都会向 ...

  5. C#中DataTable排序、检索、合并等操作实例

    转载引用至:http://www.jb51.net/article/49222.htm     一.排序1.获取DataTable的默认视图2.对视图设置排序表达式3.用排序后的视图导出的新DataT ...

  6. SSH框架中json传递失败

    错误截图: 这个错误乍一看无从下手,报的都是框架底层的错误,于是查阅资料得到了答案. 错误原因:struts会将action中定义的一些变量序列化转换成json格式,需要调用对象的一系列get方法,并 ...

  7. GCD的使用

    什么是 GCD Grand Central Dispatch (GCD) 是 Apple 开发的一个多核编程的解决方法.该方法在 Mac OS X 10.6 雪豹中首次推出,并随后被引入到了 iOS4 ...

  8. [Android Rro] SDK JAR

    cd ../../../outputs/aar/mkdir AAR_VERSIONmkdir JAR_VERSIONmv app-release.aar AAR_VERSION/${project_n ...

  9. [Android Pro] APK

    svn updatesvn status ls -alsvn log --limit 8 > RELEASE_NOTE.txt cat RELEASE_NOTE.txt chmod a+x gr ...

  10. openfire 服务器名称:后面的黄色叹号

    然后点击重新获取证书,然后重新启动服务,问题解决!