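1. Overview

Druid supports two main categories of data ingestion:
1. Real-time ingestion, in either Pull or Push mode
- Pull: start a Realtime Node, which ingests different kinds of data sources through different Firehoses.
- Push: start Tranquility or the Kafka indexing service, and ingest data through HTTP calls.
2. Offline (batch) ingestion: data can be ingested through a Realtime Node, or by launching tasks on the indexing nodes.

The demonstrations in this article are based on the cluster deployed in the previous chapter.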

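2. Real-time data ingestion

2.1 Pull

Because the Realtime Node offers neither high availability nor scalability, Tranquility Server or Tranquility Kafka, together with the indexing service, is recommended for more important scenarios.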

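2.2 Push

The indexing service was introduced in an earlier article. Tranquility is a Scala library that ingests data in real time through the indexing service. It exists because the indexing service API is low-level: Tranquility wraps the indexing service in a higher-level abstraction and hides task creation, partitioning, replication, service discovery, and schema rollover from the user.

Data can be ingested through Tranquility in two ways:

Tranquility Server: the sender posts data to Druid through the HTTP interface that Tranquility Server exposes.
Tranquility Kafka: the sender first writes data to Kafka; Tranquility Kafka then reads from Kafka according to its configuration and writes the data to Druid.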

2.2.1 Tranquility Server configuration

The configuration steps are as follows:
1. Enable Tranquility Server: on the data node, edit conf/supervise/data-with-query.conf and uncomment the Tranquility Server entry:

# Uncomment to use Tranquility Server

!p95 tranquility-server bin/tranquility server -configFile conf/tranquility/server.json

2. Copy server.json from the quickstart configuration (conf-quickstart):

root@druid:~/imply-2.3.# cp conf-quickstart/tranquility/server.json conf/tranquility/

3. Start the service

root@druid:~/imply-2.3.# bin/supervise -c conf/supervise/data-with-query.conf

The startup output looks like this:

[Fri Dec   :: ] Running command[tranquility-server], logging to[/root/imply-2.3./var/sv/tranquility-server.log]: bin/tranquility server -configFile conf/tranquility/server.json

4. Send data

bin/generate-example-metrics | curl -XPOST -H'Content-Type: application/json' --data-binary @- http://localhost:8200/v1/post/tutorial-tranquility-server

On success, the following response is printed, indicating that 25 records were sent to Druid:

{"result":{"received":,"sent":}}
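The same HTTP push can be sketched in Python. This is only a minimal illustration, assuming the endpoint from the curl command above and assuming that Tranquility Server accepts newline-delimited JSON objects; the helper names build_post_request and send_events are hypothetical and not part of Tranquility.

```python
import json
import urllib.request

# URL taken from the curl command above; port and datasource must match server.json.
TRANQUILITY_URL = "http://localhost:8200/v1/post/tutorial-tranquility-server"


def build_post_request(events, url=TRANQUILITY_URL):
    """Build (but do not send) an HTTP POST carrying newline-delimited JSON events."""
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_events(events, url=TRANQUILITY_URL):
    """Send the events and return the decoded JSON response from the server."""
    with urllib.request.urlopen(build_post_request(events, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling send_events with a list of event dictionaries against a running Tranquility Server should return a result object with received and sent counters like the one shown above.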

5. Query the data

root@druid:~/imply-2.3./bin#./plyql -h localhost -p  -q "SELECT server, SUM("count") AS "events", COUNT(*) AS "rows" FROM "tutorial-tranquility-server" GROUP BY server;"

┌──────────────────┬────────┬──────┐
│ server           │ events │ rows │
├──────────────────┼────────┼──────┤
│ www1.example.com │        │      │
│ www2.example.com │        │      │
│ www3.example.com │        │      │
│ www4.example.com │        │      │
│ www5.example.com │        │      │
└──────────────────┴────────┴──────┘
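For readers who prefer Druid's native JSON query language, the SQL above corresponds roughly to a groupBy query. The sketch below only builds the query body; the interval is a hypothetical placeholder, the metric name count is taken from the SQL statement, and this is not necessarily the exact query plyql generates. Such a body is normally POSTed to the Broker's /druid/v2/ endpoint.

```python
import json


def build_group_by_query(datasource, interval):
    """Return a native groupBy query that sums events and counts rows per server."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": ["server"],
        "aggregations": [
            # SUM("count") AS "events"
            {"type": "longSum", "name": "events", "fieldName": "count"},
            # COUNT(*) AS "rows"
            {"type": "count", "name": "rows"},
        ],
        "intervals": [interval],
    }


# The interval below is an assumption chosen only for illustration.
query = build_group_by_query("tutorial-tranquility-server", "2017-12-01/2018-01-01")
print(json.dumps(query, indent=2))
```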

6. Restart Tranquility Server:

bin/service --restart tranquility-server

2.2.2 Tranquility Kafka configuration

The configuration steps are as follows:
1. Enable Tranquility Kafka: on the data node, edit conf/supervise/data-with-query.conf and uncomment the Tranquility Kafka entry:

# Uncomment to use Tranquility Kafka

!p95 tranquility-kafka bin/tranquility kafka -configFile conf/tranquility/kafka.json

2. Copy kafka.json from the quickstart configuration (conf-quickstart):

root@druid:~/imply-2.3.# cp conf-quickstart/tranquility/kafka.json conf/tranquility/

For detailed configuration, see: http://druid.io/docs/0.12.1/tutorials/tutorial-kafka.html

3. Create the topic in the Kafka cluster

root@druid:/opt/PaaS/Talas/lib/Kafka/bin#./kafka-topics.sh --create --zookeeper native-lufanfeng----:,native-lufanfeng----:,native-lufanfeng----: --replication-factor  --partitions  --topic tutorial-tranquility-kafka

4. Start the service

root@druid:~/imply-2.3.# bin/supervise -c conf/supervise/data-with-query.conf

The startup output looks like this:

[Tue Dec  :: ] Running command[tranquility-kafka], logging to[/root/imply-2.3./var/sv/tranquility-kafka.log]: bin/tranquility kafka -configFile conf/tranquility/kafka.json

5. Send data with the tools that ship with Kafka

root@druid:/opt/PaaS/Talas/lib/Kafka/bin# ./kafka-console-producer.sh --broker-list native-lufanfeng----:,native-lufanfeng----:,native-lufanfeng----: --topic tutorial-tranquility-kafka

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/list", "metricType": "request/latency", "server": "www1.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/list", "metricType": "request/latency", "server": "www5.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/list", "metricType": "request/latency", "server": "www4.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/", "metricType": "request/latency", "server": "www3.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/list", "metricType": "request/latency", "server": "www2.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/", "metricType": "request/latency", "server": "www5.example.com"}

{"unit": "milliseconds", "http_method": "GET", "value": , "timestamp": "2017-12-12T05:55:59Z", "http_code": "", "page": "/", "metricType": "request/latency", "server": "www3.example.com"}

Watching kafka-server.log at this point shows output similar to the following:

-- ::, [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {tutorial-tranquility-kafka={receivedCount=, sentCount=,droppedCount=, unparseableCount=}} pending messages in 0ms and committed offsets in 0ms.

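In the datasource configuration, windowPeriod is set to PT10M, so any event whose timestamp is more than 10 minutes away from the current time is filtered out. The timestamps in the data above differ from the execution time by roughly 26 minutes, so all of those events are dropped. To make the demonstration work, the bin/generate-example-metrics-main script can be adjusted as follows: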

# Copyright 2017 Imply Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import random
import sys
from datetime import datetime

from kafka import KafkaProducer

# Kafka brokers and target topic; adjust these to your own cluster.
hosts = "native-lufanfeng-2-5-24-138:9092,native-lufanfeng-3-5-24-139:9092,native-lufanfeng-4-5-24-140:9092"
# hosts = "10.48.253.104:9092"
topic = 'tutorial-tranquility-kafka'


class KafkaSender(object):
    """Thin wrapper that sends each record to the tutorial topic."""

    def __init__(self):
        # The topic is assumed to exist already (it was created in step 3), so the
        # legacy KafkaClient.ensure_topic_exists() call is not needed here.
        self.producer = KafkaProducer(bootstrap_servers=hosts)

    def send_messages(self, msg):
        # KafkaProducer expects bytes, so encode the JSON string first.
        self.producer.send(topic, msg.encode('utf-8'))
        # The original line was truncated after "self.producer."; flush() is
        # assumed here so that each record is delivered before continuing.
        self.producer.flush()


def main():
    parser = argparse.ArgumentParser(description='Generate example page request latency metrics.')
    parser.add_argument('--count', '-c', type=int, default=25, help='Number of events to generate (negative for unlimited)')
    args = parser.parse_args()

    count = 0
    sender = KafkaSender()
    while args.count < 0 or count < args.count:
        # Use the current UTC time so that events fall inside windowPeriod.
        timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")

        # Pick a page: '/' half of the time, '/list' or '/get/<n>' otherwise.
        r = random.randint(1, 4)
        if r == 1 or r == 2:
            page = '/'
        elif r == 3:
            page = '/list'
        else:
            page = '/get/' + str(random.randint(1, 99))

        server = 'www' + str(random.randint(1, 5)) + '.example.com'
        latency = max(1, random.gauss(80, 40))

        record = json.dumps({
            'timestamp': timestamp,
            'metricType': 'request/latency',
            'value': int(latency),
            # Additional dimensions
            'page': page,
            'server': server,
            'http_method': 'GET',
            'http_code': '',
            'unit': 'milliseconds'
        })

        sender.send_messages(record)
        print('Send:%s Successful!' % record)
        count += 1


try:
    main()
except KeyboardInterrupt:
    sys.exit(1)
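To see why the hand-typed events were dropped, the windowPeriod check can be modeled in a few lines. This is a deliberately simplified sketch: it assumes an event is kept only when its timestamp lies within 10 minutes of the current time, whereas the real Tranquility logic also takes segment granularity into account. The function names and the value of now are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW_PERIOD = timedelta(minutes=10)  # mirrors the 10-minute windowPeriod discussed above


def parse_ts(ts):
    """Parse the ISO-8601 UTC timestamps used in the example events."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def within_window(event_ts, now, window=WINDOW_PERIOD):
    """Simplified model: keep the event only if it is within `window` of `now`."""
    return abs(now - parse_ts(event_ts)) <= window


now = datetime(2017, 12, 12, 6, 22, 0)              # assumed time at which the events were sent
print(within_window("2017-12-12T05:55:59Z", now))   # False: about 26 minutes old, dropped
print(within_window("2017-12-12T06:20:00Z", now))   # True: 2 minutes old, accepted
```

Generating the timestamp at send time, as the adjusted script does, keeps every event inside the window.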

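3. Offline data ingestion

3.1 Static file ingestion

With the built-in ingestion mechanism, local files can be ingested on the data node as follows:

bin/post-index-task --file quickstart/wikiticker-index.json

The wikiticker-index.json file contains both the datasource definition and the location of the data files.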
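Submitting an index task amounts to POSTing the task spec to the Overlord's task endpoint. The sketch below illustrates that idea in Python; it assumes the Overlord listens on localhost:8090, the helper names are hypothetical, and it does not reproduce everything the bundled script does (for example, waiting for the task to finish).

```python
import json
import urllib.request

# Assumption: the Overlord runs on localhost:8090; adjust to your deployment.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_request(spec, url=OVERLORD_TASK_URL):
    """Build (but do not send) the HTTP POST that submits an index task spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task_file(path, url=OVERLORD_TASK_URL):
    """Load a task spec such as wikiticker-index.json and submit it."""
    with open(path) as f:
        spec = json.load(f)
    with urllib.request.urlopen(build_task_request(spec, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as submit_task_file("quickstart/wikiticker-index.json") would be expected to return a small JSON object identifying the created task.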

3.2 HDFS file ingestion

For the configuration procedure, see: http://druid.io/docs/0.12.1/ingestion/batch-ingestion.html

4. Configuration references

General configuration: https://github.com/druid-io/tranquility/blob/master/docs/configuration.md
General ingestion configuration: http://druid.io/docs/latest/ingestion/index.html
Tranquility Kafka: https://github.com/druid-io/tranquility/blob/master/docs/kafka.md

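5. Other notes

5.1 Data sharding

In Druid, sharding is mostly configured through tuningConfig, and the configuration differs somewhat between real-time and batch ingestion.

Real-time ingestion supports the following two types:
- Linear shards:
  - When new nodes are added, the configuration of existing nodes does not need to change
  - Data can be queried even when only some of the shards are present
- Numbered shards:
  - Data can be queried only when all shards are present
  - The total number of shards must be specified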
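The difference between the two shard types can be illustrated with a toy model. The shardSpec dictionaries follow the shape used in Druid's real-time tuningConfig; the is_queryable function is a hypothetical helper that only mirrors the rules listed above and is not Druid code.

```python
# Example shardSpec fragments as they would appear inside tuningConfig.
LINEAR_SHARD = {"type": "linear", "partitionNum": 0}
NUMBERED_SHARD = {"type": "numbered", "partitionNum": 0, "partitions": 2}


def is_queryable(shard_type, present, total=None):
    """Toy model of when a time chunk becomes queryable.

    present: partition numbers of the shards that are currently available.
    total:   total number of shards (required for numbered shards only).
    """
    if shard_type == "linear":
        # Any available shard can be queried, regardless of how many exist.
        return len(present) > 0
    if shard_type == "numbered":
        if total is None:
            raise ValueError("numbered shards require the total shard count")
        # Every partition 0..total-1 must be present before querying.
        return set(range(total)) <= set(present)
    raise ValueError("unknown shard type: %s" % shard_type)


print(is_queryable("linear", [0]))                 # True
print(is_queryable("numbered", [0], total=2))      # False
print(is_queryable("numbered", [0, 1], total=2))   # True
```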

Local file ingestion supports the following two types:
- Sharding by partition size
- Setting the total number of shards
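The two options for local files are alternatives: either a target partition size is given and the shard count follows from the data volume, or the shard count is fixed directly. A rough sketch of that relationship, with a hypothetical helper and parameter names modeled on the targetPartitionSize and numShards settings, could look like this:

```python
import math


def shard_count(total_rows, target_partition_size=None, num_shards=None):
    """Return the number of shards under one of the two strategies.

    Exactly one of target_partition_size (rows per shard) or num_shards
    must be given; this mirrors the either/or nature of the two options.
    """
    if (target_partition_size is None) == (num_shards is None):
        raise ValueError("set exactly one of target_partition_size or num_shards")
    if num_shards is not None:
        return num_shards
    return max(1, math.ceil(total_rows / target_partition_size))


print(shard_count(12000000, target_partition_size=5000000))  # 3
print(shard_count(12000000, num_shards=4))                   # 4
```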

Hadoop file ingestion supports the following two types:
- Hash partitioning
- Range partitioning
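The general idea behind the two strategies can be shown with a small, self-contained sketch. It is an illustration only: the hash function (CRC32) and the boundary values are assumptions chosen for the example and are not what Druid uses internally.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Hash partitioning: spread keys evenly using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Range partitioning: boundaries is a sorted list of split points.

    Partition i holds keys k with boundaries[i-1] <= k < boundaries[i].
    """
    return bisect.bisect_right(boundaries, key)


boundaries = ["h", "p"]                    # hypothetical split points: 3 partitions
print(range_partition("a", boundaries))    # 0
print(range_partition("m", boundaries))    # 1
print(range_partition("z", boundaries))    # 2
print(hash_partition("www1.example.com", 4) in range(4))  # True
```

Hash partitioning tends to give evenly sized shards, while range partitioning keeps neighbouring values of a dimension together.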

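5.2 Optimizing high-cardinality dimensions

When the cardinality of a dimension needs to be counted and that cardinality is very large, problems can arise. There are two main ways to count dimension cardinality, each with its own trade-offs:
- Cardinality: based on the HyperLogLog algorithm; it is optimized only at query time, does not reduce storage, and may be inefficient when the cardinality is large
- HyperUnique: optimized at ingestion time; suitable for business scenarios that do not need to filter or group on the high-cardinality dimension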
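The two approaches show up in different places in the configuration. The sketch below writes the relevant aggregator definitions as Python dictionaries; the dimension name user and the metric names are hypothetical and chosen only for illustration.

```python
import json

# Ingestion time: pre-aggregate the (hypothetical) "user" field into a
# HyperLogLog sketch, so the raw dimension does not need to be stored.
INGEST_METRICS_SPEC = [
    {"type": "count", "name": "count"},
    {"type": "hyperUnique", "name": "user_unique", "fieldName": "user"},
]

# Query time, option 1: cardinality aggregator over a stored dimension.
QUERY_CARDINALITY = {
    "type": "cardinality",
    "name": "distinct_users",
    "fields": ["user"],
}

# Query time, option 2: read the sketch column built during ingestion.
QUERY_HYPERUNIQUE = {
    "type": "hyperUnique",
    "name": "distinct_users",
    "fieldName": "user_unique",
}

print(json.dumps(INGEST_METRICS_SPEC, indent=2))
```

With the hyperUnique metric, the user values themselves are no longer available for filtering or grouping, which is why it fits only the scenarios described above.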
