The Flume sink lives in the Kudu repo at:

https://github.com/apache/kudu/tree/master/java/kudu-flume-sink

The operations producer that kudu-flume-sink uses by default is

org.apache.kudu.flume.sink.SimpleKuduOperationsProducer

  public List<Operation> getOperations(Event event) throws FlumeException {
    try {
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addBinary(payloadColumn, event.getBody());
      return Collections.singletonList((Operation) insert);
    } catch (Exception e) {
      throw new FlumeException("Failed to create Kudu Insert object", e);
    }
  }

It writes each event body verbatim into a single payload column (the column name comes from the producer's payloadColumn property).

To ingest JSON data instead, where each event body is a JSON document such as {"id": 1, "name": "test", "ts": "2018-01-01 00:00:00"} and each field maps to a Kudu column, you need a custom operations producer:

package com.cloudera.kudu;

public class JsonKuduOperationsProducer implements KuduOperationsProducer {

An implementation has already been shared online: https://cloud.tencent.com/developer/article/1158194

That code has a few shortcomings, though: 1) it does not accept null values; 2) its handling of timestamp types is limited; 3) every JSON value must be a string, which is then parsed according to the Kudu column type, so {"id": "123"} works but {"id": 123} does not; you have to keep this in mind when producing the data, or modify the code yourself.

The version below addresses these issues:

JsonKuduOperationsProducer.java

package com.cloudera.kudu;

import com.google.common.base.Preconditions;
import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.annotations.InterfaceAudience;
import org.apache.flume.annotations.InterfaceStability;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.flume.sink.KuduOperationsProducer;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.Charset;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.TimeZone;
import java.util.function.Function;

@InterfaceAudience.Public
@InterfaceStability.Evolving
public class JsonKuduOperationsProducer implements KuduOperationsProducer {
  private static final Logger logger = LoggerFactory.getLogger(JsonKuduOperationsProducer.class);

  private static final String INSERT = "insert";
  private static final String UPSERT = "upsert";
  private static final List<String> validOperations = Lists.newArrayList(UPSERT, INSERT);

  public static final String ENCODING_PROP = "encoding";
  public static final String DEFAULT_ENCODING = "utf-8";
  public static final String OPERATION_PROP = "operation";
  public static final String DEFAULT_OPERATION = UPSERT;
  public static final String SKIP_MISSING_COLUMN_PROP = "skipMissingColumn";
  public static final boolean DEFAULT_SKIP_MISSING_COLUMN = false;
  public static final String SKIP_BAD_COLUMN_VALUE_PROP = "skipBadColumnValue";
  public static final boolean DEFAULT_SKIP_BAD_COLUMN_VALUE = false;
  public static final String WARN_UNMATCHED_ROWS_PROP = "skipUnmatchedRows";
  public static final boolean DEFAULT_WARN_UNMATCHED_ROWS = true;

  private KuduTable table;
  private Charset charset;
  private String operation;
  private boolean skipMissingColumn;
  private boolean skipBadColumnValue;
  private boolean warnUnmatchedRows;

  public JsonKuduOperationsProducer() {
  }

  @Override
  public void configure(Context context) {
    String charsetName = context.getString(ENCODING_PROP, DEFAULT_ENCODING);
    try {
      charset = Charset.forName(charsetName);
    } catch (IllegalArgumentException e) {
      throw new FlumeException(
          String.format("Invalid or unsupported charset %s", charsetName), e);
    }
    operation = context.getString(OPERATION_PROP, DEFAULT_OPERATION).toLowerCase();
    Preconditions.checkArgument(
        validOperations.contains(operation),
        "Unrecognized operation '%s'",
        operation);
    skipMissingColumn = context.getBoolean(SKIP_MISSING_COLUMN_PROP,
        DEFAULT_SKIP_MISSING_COLUMN);
    skipBadColumnValue = context.getBoolean(SKIP_BAD_COLUMN_VALUE_PROP,
        DEFAULT_SKIP_BAD_COLUMN_VALUE);
    warnUnmatchedRows = context.getBoolean(WARN_UNMATCHED_ROWS_PROP,
        DEFAULT_WARN_UNMATCHED_ROWS);
  }

  @Override
  public void initialize(KuduTable table) {
    this.table = table;
  }

  @Override
  public List<Operation> getOperations(Event event) throws FlumeException {
    String raw = new String(event.getBody(), charset);
    logger.info("get raw: " + raw);
    List<Operation> ops = Lists.newArrayList();
    if (raw != null && !raw.isEmpty()) {
      JSONObject json = null;
      // just pass if the body is not valid JSON
      try {
        json = new JSONObject(raw);
      } catch (Exception e) {
        logger.warn("event body is not valid JSON: " + raw, e);
      }
      if (json != null) {
        Schema schema = table.getSchema();
        Operation op;
        switch (operation) {
          case UPSERT:
            op = table.newUpsert();
            break;
          case INSERT:
            op = table.newInsert();
            break;
          default:
            throw new FlumeException(
                String.format("Unrecognized operation type '%s' in getOperations(): " +
                    "this should never happen!", operation));
        }
        // just record the error event in the log and pass
        try {
          PartialRow row = op.getRow();
          for (ColumnSchema col : schema.getColumns()) {
            try {
              // isNull() is true both when the key is absent and when the value is
              // the JSON null literal (JSONObject.NULL, which is non-null in Java)
              if (json.has(col.getName()) && !json.isNull(col.getName())) {
                coerceAndSet(json.get(col.getName()), col.getName(), col.getType(),
                    col.isKey(), col.isNullable(), col.getDefaultValue(), row);
              } else if (col.isKey() || !col.isNullable()) {
                throw new RuntimeException(
                    "column: " + col.getName() + " is null or missing in " + raw);
              }
            } catch (NumberFormatException e) {
              String msg = String.format(
                  "Raw value '%s' couldn't be parsed to type %s for column '%s'",
                  raw, col.getType(), col.getName());
              logOrThrow(skipBadColumnValue, msg, e);
            } catch (IllegalArgumentException e) {
              String msg = String.format(
                  "Column '%s' has no matching group in '%s'",
                  col.getName(), raw);
              logOrThrow(skipMissingColumn, msg, e);
            }
          }
          ops.add(op);
        } catch (Exception e) {
          logger.error("get error [" + e.getMessage() + "]: " + raw, e);
        }
      }
    }
    return ops;
  }

  @SuppressWarnings("unchecked")
  protected <T> T getValue(T defaultValue, Object val, boolean isKey, boolean isNullable,
                           Object columnDefaultValue, boolean compressException,
                           Function<String, T> fromStr) {
    T result = defaultValue;
    try {
      if (val == null) {
        if (isKey || !isNullable) {
          throw new RuntimeException("column is key or not nullable");
        }
        // fall back to the column's default value if it has one
        if (columnDefaultValue != null && !"null".equals(columnDefaultValue)) {
          if (columnDefaultValue instanceof String) {
            result = fromStr.apply((String) columnDefaultValue);
          } else {
            result = (T) columnDefaultValue;
          }
        }
      } else {
        // values of any JSON type are accepted: parse from the string form
        result = fromStr.apply(val.toString());
      }
    } catch (Exception e) {
      if (compressException) {
        logger.warn("failed to convert value '" + val + "'", e);
      } else {
        throw e;
      }
    }
    return result;
  }

  // SimpleDateFormat is not thread-safe; this is safe only as long as a single
  // sink thread drives this producer
  private final SimpleDateFormat[] sdfs = new SimpleDateFormat[]{
      new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
      new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'"),
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  };

  {
    // parse message timestamps as UTC so the value written to Kudu matches
    // the value in the message (see the note below)
    for (SimpleDateFormat sdf : sdfs) {
      sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    }
  }

  private void coerceAndSet(Object rawVal, String colName, Type type, boolean isKey,
                            boolean isNullable, Object defaultValue, PartialRow row)
      throws NumberFormatException {
    switch (type) {
      case INT8:
        // booleans are accepted for INT8 columns: true -> 1, false -> 0
        row.addByte(colName, rawVal instanceof Boolean
            ? ((Boolean) rawVal ? (byte) 1 : (byte) 0)
            : this.getValue((byte) 0, rawVal, isKey, isNullable, defaultValue,
                this.skipBadColumnValue, Byte::parseByte));
        break;
      case INT16:
        row.addShort(colName, this.getValue((short) 0, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Short::parseShort));
        break;
      case INT32:
        row.addInt(colName, this.getValue(0, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Integer::parseInt));
        break;
      case INT64:
        row.addLong(colName, this.getValue(0L, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Long::parseLong));
        break;
      case BINARY:
        row.addBinary(colName, rawVal == null ? new byte[0] : rawVal.toString().getBytes(charset));
        break;
      case STRING:
        row.addString(colName, rawVal == null ? "" : rawVal.toString());
        break;
      case BOOL:
        row.addBoolean(colName, this.getValue(false, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Boolean::parseBoolean));
        break;
      case FLOAT:
        row.addFloat(colName, this.getValue(0f, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Float::parseFloat));
        break;
      case DOUBLE:
        row.addDouble(colName, this.getValue(0d, rawVal, isKey, isNullable,
            defaultValue, this.skipBadColumnValue, Double::parseDouble));
        break;
      case UNIXTIME_MICROS:
        Long value = this.<Long>getValue(null, rawVal, isKey, isNullable, defaultValue,
            this.skipBadColumnValue, (String str) -> {
              Long result = null;
              if (str != null && !"".equals(str)) {
                boolean isPatternOk = false;
                // handle date-time strings such as yyyy-MM-dd HH:mm:ss
                if (str.contains("-") && str.contains(":")) {
                  for (SimpleDateFormat sdf : sdfs) {
                    try {
                      // getTime() is milliseconds; Kudu expects microseconds
                      result = sdf.parse(str).getTime() * 1000;
                      isPatternOk = true;
                      break;
                    } catch (Exception e) {
                      // try the next pattern
                    }
                  }
                }
                // handle epoch values in seconds (10 digits), milliseconds (13)
                // or microseconds (16), normalizing everything to microseconds
                if (!isPatternOk && (str.length() == 10 || str.length() == 13 || str.length() == 16)) {
                  result = Long.parseLong(str);
                  if (str.length() == 10) result *= 1000000;
                  if (str.length() == 13) result *= 1000;
                }
              }
              return result;
            });
        if (value != null) row.addLong(colName, value);
        break;
      default:
        logger.warn("got unknown type {} for column '{}' -- ignoring this column", type, colName);
    }
  }

  private void logOrThrow(boolean log, String msg, Exception e) throws FlumeException {
    if (log) {
      logger.warn(msg, e);
    } else {
      throw new FlumeException(msg, e);
    }
  }

  @Override
  public void close() {
  }
}

Compared to the shared version, the JsonStr2Map class is gone; getValue and coerceAndSet now do the work together. They support column default values, null values, and JSON values of any type (converted automatically from their string form), convert booleans to bytes for INT8 columns, and for timestamp columns accept date-time patterns such as yyyy-MM-dd HH:mm:ss as well as epoch values in seconds, milliseconds, and microseconds, automatically normalizing seconds and milliseconds to microseconds.
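To see what the timestamp handling does, here is a minimal standalone sketch of the same normalization (class and method names are illustrative, not part of the producer; unlike the producer, it only tries one date pattern and passes any non-10/13-digit epoch through unchanged):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampCoercionDemo {

    // same idea as the UNIXTIME_MICROS branch above: date-time strings are
    // parsed as UTC; bare epoch numbers are classified by digit count and
    // normalized to microseconds
    static long toMicros(String str) throws ParseException {
        if (str.contains("-") && str.contains(":")) {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
            return sdf.parse(str).getTime() * 1000;  // millis -> micros
        }
        long v = Long.parseLong(str);
        if (str.length() == 10) return v * 1000000;  // seconds -> micros
        if (str.length() == 13) return v * 1000;     // millis  -> micros
        return v;                                    // 16 digits: already micros
    }

    public static void main(String[] args) throws ParseException {
        // all four inputs denote the same instant: 1514764800000000
        System.out.println(toMicros("2018-01-01 00:00:00"));
        System.out.println(toMicros("1514764800"));
        System.out.println(toMicros("1514764800000"));
        System.out.println(toMicros("1514764800000000"));
    }
}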

Note that the SimpleDateFormat timezone is set to UTC. This keeps the time written to Kudu identical to the time in the message; otherwise parsing uses the JVM's default timezone and the value is shifted by the zone offset, e.g. with timezone Asia/Shanghai the time written to Kudu ends up 8 hours off from the time in the message.
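A quick sketch of the effect (illustrative class name):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimezoneDemo {
    public static void main(String[] args) throws ParseException {
        String ts = "2018-01-01 00:00:00";

        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));

        SimpleDateFormat shanghai = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        shanghai.setTimeZone(TimeZone.getTimeZone("Asia/Shanghai"));

        // Asia/Shanghai is UTC+8, so the same string parses to an epoch value
        // 8 hours (28,800,000 ms) apart from the UTC interpretation
        System.out.println(utc.parse(ts).getTime());       // 1514764800000
        System.out.println(shanghai.parse(ts).getTime());  // 1514736000000
    }
}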

Package the class into a jar and put it in $FLUME_HOME/lib.
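Then point the Kudu sink at the new producer in the Flume agent configuration. A minimal sketch, assuming agent a1, sink k1, channel c1, and placeholder master address and table name; properties prefixed with producer. are passed through to the producer's configure():

a1.sinks.k1.type = org.apache.kudu.flume.sink.KuduSink
a1.sinks.k1.channel = c1
a1.sinks.k1.masterAddresses = kudu-master-1:7051
a1.sinks.k1.tableName = impala::default.test_kudu_table
a1.sinks.k1.producer = com.cloudera.kudu.JsonKuduOperationsProducer
a1.sinks.k1.producer.operation = upsert
a1.sinks.k1.producer.skipBadColumnValue = true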
