[Original] Big Data Basics: Flume (2) Application: Kafka to Kudu
Application 1: syncing Kafka data to Kudu
1 Prepare the Kafka topic
# bin/kafka-topics.sh --zookeeper $zk:2181/kafka --create --topic test_sync --partitions 2 --replication-factor 2
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "test_sync".
# bin/kafka-topics.sh --zookeeper $zk:2181/kafka --describe --topic test_sync
Topic:test_sync PartitionCount:2 ReplicationFactor:2 Configs:
Topic: test_sync Partition: 0 Leader: 112 Replicas: 112,111 Isr: 112,111
Topic: test_sync Partition: 1 Leader: 110 Replicas: 110,112 Isr: 110,112
2 Prepare the Kudu table
impala-shell
CREATE TABLE test.test_sync (
  id int,
  name string,
  description string,
  create_time timestamp,
  update_time timestamp,
  primary key (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='$kudu_master:7051');
3 Prepare Kudu support for Flume
3.1 Download the jars
# wget https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/kudu/kudu-flume-sink/1.7.0-cdh5.16.1/kudu-flume-sink-1.7.0-cdh5.16.1.jar
# mv kudu-flume-sink-1.7.0-cdh5.16.1.jar $FLUME_HOME/lib/
# wget http://central.maven.org/maven2/org/json/json/20160810/json-20160810.jar
# mv json-20160810.jar $FLUME_HOME/lib/
3.2 Development
Code repository: https://github.com/apache/kudu/tree/master/java/kudu-flume-sink
The default producer used by kudu-flume-sink is
org.apache.kudu.flume.sink.SimpleKuduOperationsProducer
public List<Operation> getOperations(Event event) throws FlumeException {
  try {
    Insert insert = table.newInsert();
    PartialRow row = insert.getRow();
    row.addBinary(payloadColumn, event.getBody());
    return Collections.singletonList((Operation) insert);
  } catch (Exception e) {
    throw new FlumeException("Failed to create Kudu Insert object", e);
  }
}
It writes the raw message body into a single payload column.
To support JSON-formatted data, a custom producer has to be developed:
package com.cloudera.kudu;
public class JsonKuduOperationsProducer implements KuduOperationsProducer {
For the full code, see: https://www.cnblogs.com/barneywill/p/10573221.html
Package it as a jar and place it under $FLUME_HOME/lib.
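The core of a JSON-aware producer is mapping each JSON field onto the typed column of the same name. The sketch below illustrates only that mapping step in plain Java, with no Kudu or Flume dependency; it is not the linked JsonKuduOperationsProducer. The ColumnType enum, the coerce and toRow helpers, and the assumption that timestamps arrive either as epoch milliseconds or as "yyyy-MM-dd HH:mm:ss" text are all hypothetical choices made for illustration.

```java
import java.sql.Timestamp;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: maps already-parsed JSON values onto the column
// types of test.test_sync. A real producer would call PartialRow setters instead.
final class JsonRowMapper {

    // Hypothetical stand-in for the Kudu column types used by test.test_sync.
    enum ColumnType { INT32, STRING, TIMESTAMP }

    // Column name -> type, mirroring the CREATE TABLE statement in step 2.
    static final Map<String, ColumnType> SCHEMA = new LinkedHashMap<>();
    static {
        SCHEMA.put("id", ColumnType.INT32);
        SCHEMA.put("name", ColumnType.STRING);
        SCHEMA.put("description", ColumnType.STRING);
        SCHEMA.put("create_time", ColumnType.TIMESTAMP);
        SCHEMA.put("update_time", ColumnType.TIMESTAMP);
    }

    // Converts one parsed JSON value to the Java type expected for the column.
    static Object coerce(ColumnType type, Object value) {
        if (value == null) {
            return null;
        }
        switch (type) {
            case INT32:
                return (value instanceof Number)
                        ? ((Number) value).intValue()
                        : Integer.parseInt(value.toString().trim());
            case STRING:
                return value.toString();
            case TIMESTAMP:
                // Assumption: epoch milliseconds as a number, or "yyyy-MM-dd HH:mm:ss" text.
                return (value instanceof Number)
                        ? new Timestamp(((Number) value).longValue())
                        : Timestamp.valueOf(value.toString().trim());
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Builds a typed row from a JSON object; keys that are not columns are ignored.
    static Map<String, Object> toRow(Map<String, Object> json) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, ColumnType> col : SCHEMA.entrySet()) {
            if (json.containsKey(col.getKey())) {
                row.put(col.getKey(), coerce(col.getValue(), json.get(col.getKey())));
            }
        }
        if (row.get("id") == null) {
            throw new IllegalArgumentException("primary key 'id' is required");
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("id", "1");
        json.put("name", "test");
        json.put("create_time", "2019-03-21 10:00:00");
        json.put("extra", "not a column, so it is skipped");
        System.out.println(toRow(json));
    }
}
```

In an actual producer the typed values would be written to the row obtained from the Kudu operation rather than collected into a map; the map is used here only so the sketch runs on its own.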
4 Prepare the Flume conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = 192.168.0.1:9092
a1.sources.r1.kafka.topics = test_sync
a1.sources.r1.kafka.consumer.group.id = flume-consumer
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Describe the sink (replaces the logger sink of the stock example)
a1.sinks.k1.type = org.apache.kudu.flume.sink.KuduSink
a1.sinks.k1.producer = com.cloudera.kudu.JsonKuduOperationsProducer
a1.sinks.k1.masterAddresses = 192.168.0.1:7051
a1.sinks.k1.tableName = impala::test.test_sync
a1.sinks.k1.batchSize = 50
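One thing worth checking in a config like this is how the batch sizes relate to the memory channel: the channel's transactionCapacity must not exceed its capacity, and a source or sink batch larger than transactionCapacity cannot fit into a single channel transaction. The following is a minimal sketch of such a check in plain Java. The class FlumeBatchCheck and its helpers are hypothetical, and the property keys are hard-coded to the agent, source, channel and sink names used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: sanity-checks batch sizes against the memory channel limits.
final class FlumeBatchCheck {

    // Parses Flume-style "key = value" text using java.util.Properties.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Reads a required integer property.
    static int intProp(Properties p, String key) {
        String v = p.getProperty(key);
        if (v == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return Integer.parseInt(v.trim());
    }

    // Returns a list of human-readable problems; an empty list means the sizes are consistent.
    static List<String> check(Properties p) {
        int capacity = intProp(p, "a1.channels.c1.capacity");
        int txCapacity = intProp(p, "a1.channels.c1.transactionCapacity");
        int sourceBatch = intProp(p, "a1.sources.r1.batchSize");
        int sinkBatch = intProp(p, "a1.sinks.k1.batchSize");

        List<String> problems = new ArrayList<>();
        if (txCapacity > capacity) {
            problems.add("transactionCapacity " + txCapacity + " exceeds capacity " + capacity);
        }
        if (sourceBatch > txCapacity) {
            problems.add("source batchSize " + sourceBatch + " exceeds transactionCapacity " + txCapacity);
        }
        if (sinkBatch > txCapacity) {
            problems.add("sink batchSize " + sinkBatch + " exceeds transactionCapacity " + txCapacity);
        }
        return problems;
    }

    public static void main(String[] args) {
        String conf = String.join("\n",
                "a1.sources.r1.batchSize = 5000",
                "a1.channels.c1.capacity = 10000",
                "a1.channels.c1.transactionCapacity = 10000",
                "a1.sinks.k1.batchSize = 50");
        List<String> problems = check(parse(conf));
        System.out.println(problems.isEmpty() ? "batch sizes are consistent" : String.join("; ", problems));
    }
}
```

With the values from this article (5000 and 50 against a transactionCapacity of 10000) none of the three conditions is violated.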
5 Start Flume
bin/flume-ng agent --conf conf --conf-file conf/order.properties --name a1
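Once the agent is running, the pipeline needs at least one message on the topic before anything shows up in Kudu. The sketch below builds a single-line JSON message whose keys match the columns of test.test_sync. The exact message format expected by the custom producer is defined in the linked code and is not shown in this article, so the key names, the timestamp format and the SampleMessage class itself are assumptions.

```java
// Hypothetical helper: prints one JSON test message for the test_sync topic.
final class SampleMessage {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Assumption: JSON keys equal the column names and timestamps are plain strings.
    static String toJson(int id, String name, String description,
                         String createTime, String updateTime) {
        return "{\"id\":" + id
                + ",\"name\":\"" + escape(name) + "\""
                + ",\"description\":\"" + escape(description) + "\""
                + ",\"create_time\":\"" + escape(createTime) + "\""
                + ",\"update_time\":\"" + escape(updateTime) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson(1, "test", "first row",
                "2019-03-21 10:00:00", "2019-03-21 10:00:00"));
    }
}
```

The printed line can then be piped into a console producer, for example bin/kafka-console-producer.sh --broker-list 192.168.0.1:9092 --topic test_sync on Kafka releases that still accept --broker-list.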
6 Verify in Kudu
impala-shell
select * from test.test_sync limit 10;
Reference: https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html