Introduction to Trident

(I) Theoretical background

More theory will be added later; in the meantime, see the reference books.

1. What is Trident?

Trident is a high-level abstraction for doing realtime computing on top of Storm. It allows you to seamlessly intermix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with
high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar – Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing
on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.

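Put simply, Trident is a higher-level abstraction over Storm. Compared with plain Storm, it offers three main advantages:

(1) A higher level of abstraction: common operations such as count and sum are packaged as ready-made methods that can be called directly, so you do not have to implement them yourself.

(2) Built-in primitives such as groupBy.

(3) Transaction support, which ensures that every piece of data is processed, and processed exactly once.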

2. Trident always processes messages in units of a batch, that is, it handles multiple tuples at a time.

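3. Transaction types

With transaction types, two concepts are easy to confuse: the transaction type of the spout and the transaction type of the state.

Each comes in three kinds: transactional, non-transactional, and opaque transactional.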

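(1) spout

The spout type determines how tuples are re-sent when a downstream failure forces them to be replayed.

Transactional spout: when a batch is replayed, it contains exactly the same tuples as the original batch. Every tuple is emitted, and emitted exactly once.

Non-transactional spout: no guarantees of any kind; tuples are sent and forgotten.

Opaque transactional spout: a replayed batch may contain different tuples. Every tuple is still emitted exactly once, but the contents of a given batch are not guaranteed to be the same on replay. This is especially useful under partial failure. Suppose Kafka is the data source and one partition of a topic becomes unavailable: an opaque spout can form a batch from the remaining partitions and send it, whereas a transactional spout must wait for that partition to recover before it can continue.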

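In Trident these types are defined by implementing the corresponding spout interface: IPartitionedTridentSpout (transactional, for a partitioned source), IBatchSpout (non-transactional), or IOpaquePartitionedTridentSpout (opaque transactional, for a partitioned source). ITridentSpout is the most general interface and can support either transactional or opaque transactional semantics. ITransactionalSpout belongs to the older transactional-topologies API rather than to Trident.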

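(2) state

The state type determines how Storm's intermediate or final output is persisted somewhere (for example, in memory), and how the state should be updated when a batch is replayed. State is especially useful when errors occur downstream.

Transactional state: a given batch of tuples always produces the same result.

Non-transactional state: no rollback capability; updates are permanent.

Opaque transactional state: updates are based on the previous value, so if the contents of a batch change on replay, the result changes accordingly. Besides the current value, an opaque transactional state also stores the value from before the latest batch, so that when the batch is replayed the update can be recomputed from that previous value.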
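The difference between the two transactional state types can be sketched in plain Java, without any Storm dependency. This is a minimal model, not Storm's implementation: the class and method names (StateUpdateSketch, TransactionalCount, OpaqueCount, apply) are hypothetical, and the state is assumed to be a single counter that remembers the txid of the last batch applied.

```java
public class StateUpdateSketch {

    /** Transactional state (assumed model): the value plus the txid of the last batch applied. */
    static final class TransactionalCount {
        long txid = -1;
        long count = 0;

        void apply(long batchTxid, long partialCount) {
            if (batchTxid == txid) {
                return; // replay of a batch already applied; its tuples are identical, so skip it
            }
            count += partialCount;
            txid = batchTxid;
        }
    }

    /** Opaque state (assumed model): current value, previous value, and the last txid. */
    static final class OpaqueCount {
        long txid = -1;
        long prev = 0;
        long curr = 0;

        void apply(long batchTxid, long partialCount) {
            if (batchTxid == txid) {
                // replay: the batch contents may have changed, so redo the update from prev
                curr = prev + partialCount;
            } else {
                prev = curr;
                curr = curr + partialCount;
                txid = batchTxid;
            }
        }
    }

    static long transactionalDemo() {
        TransactionalCount t = new TransactionalCount();
        t.apply(1, 3);
        t.apply(2, 2);
        t.apply(2, 2); // batch 2 replayed with the same tuples
        return t.count;
    }

    static long opaqueDemo() {
        OpaqueCount o = new OpaqueCount();
        o.apply(1, 3);
        o.apply(2, 2);
        o.apply(2, 4); // batch 2 replayed with different tuples
        return o.curr;
    }

    public static void main(String[] args) {
        System.out.println("transactional count = " + transactionalDemo());
        System.out.println("opaque count = " + opaqueDemo());
    }
}
```

In the transactional case the replay is ignored because the batch is known to be identical; in the opaque case the replay overwrites the earlier attempt, which is why the previous value has to be kept.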

(II) Walking through the official example

package org.ljh.tridentdemo;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.FilterNull;
import storm.trident.operation.builtin.MapGet;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentWordCount {

    // Splits a sentence on spaces and emits one tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout =
                new FixedBatchSpout(new Fields("sentence"), 3,
                        new Values("the cow jumped over the moon"),
                        new Values("the man went to the store and bought some candy"),
                        new Values("four score and seven years ago"),
                        new Values("how many apples can you eat"),
                        new Values("to be or not to be the person"));
        spout.setCycle(true);

        // Create the topology object
        TridentTopology topology = new TridentTopology();

        // This flow counts words; the result is stored in wordCounts
        TridentState wordCounts =
                topology.newStream("spout1", spout)
                        .parallelismHint(16)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                new Fields("count")).parallelismHint(16);

        // This flow queries the counts produced above
        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

        return topology.build();
    }

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        if (args.length == 0) {
            // No arguments: run on a local cluster and issue DRPC calls ourselves
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
            for (int i = 0; i < 100; i++) {
                System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
                Thread.sleep(1000);
            }
        } else {
            // With arguments: submit to a real cluster under the name given in args[0]
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, buildTopology(null));
        }
    }
}

The example implements the most basic word-count function and then outputs the result.

The key steps are as follows:

1. Define the input stream

        FixedBatchSpout spout =
                new FixedBatchSpout(new Fields("sentence"), 3,
                        new Values("the cow jumped over the moon"),
                        new Values("the man went to the store and bought some candy"),
                        new Values("four score and seven years ago"),
                        new Values("how many apples can you eat"),
                        new Values("to be or not to be the person"));
        spout.setCycle(true);

(1) A FixedBatchSpout is created as the input spout. Its output field is sentence, and each batch holds at most 3 tuples.

(2) setCycle(true) makes the spout emit the same data over and over.
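How five sentences fall into batches of at most 3 can be illustrated with a small standalone model. This is a simplified sketch rather than the FixedBatchSpout source: the names BatchingSketch and batches are hypothetical, and it assumes that the last batch before wrapping around is simply shorter and that cycling restarts from the first tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchingSketch {

    /** Produces batchCount consecutive batches, wrapping to the start when the input runs out. */
    static List<List<String>> batches(List<String> tuples, int maxBatchSize, int batchCount) {
        List<List<String>> result = new ArrayList<>();
        int index = 0;
        for (int b = 0; b < batchCount; b++) {
            if (index >= tuples.size()) {
                index = 0; // assumed cycling behaviour: start again from the first tuple
            }
            List<String> batch = new ArrayList<>();
            for (int i = 0; index < tuples.size() && i < maxBatchSize; index++, i++) {
                batch.add(tuples.get(index));
            }
            result.add(batch);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // Under the assumptions above: [s1, s2, s3], then [s4, s5], then [s1, s2, s3] again
        for (List<String> batch : batches(sentences, 3, 3)) {
            System.out.println(batch);
        }
    }
}
```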

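2. Count the words

        TridentState wordCounts =
                topology.newStream("spout1", spout)
                        .parallelismHint(16)
                        .each(new Fields("sentence"), new Split(), new Fields("word"))
                        .groupBy(new Fields("word"))
                        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                new Fields("count")).parallelismHint(16);

This flow counts the words, and the result is stored in wordCounts. The six lines mean the following:

(1) Read messages from the spout. spout1 names the ZooKeeper node used to store this stream's metadata.

(2) Set the parallelism to 16, so that 16 threads read from the spout at the same time.

(3) each takes three arguments: the input field names, the processing function, and the output field names. Here it reads from the field named sentence, runs the value through new Split(), and emits the result under the field name word. Split is described later; it simply splits its input on spaces.

(4) Group the stream by the word field, so that tuples with the same value end up in the same group.

(5) Count the grouped data and store the result in a MemoryMapState, then emit the result under the field name count. This step stores both the data and the state, and returns a TridentState object.

(6) Set the parallelism.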
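What the split, groupBy and persistentAggregate steps compute together can be mimicked in plain Java. This is only a single-threaded model of the result, not of how Trident executes it: WordCountSketch and processBatch are hypothetical names, and an ordinary HashMap stands in for MemoryMapState.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    /** Counts the words of one batch, then folds the partial counts into the state. */
    static void processBatch(List<String> sentences, Map<String, Long> state) {
        Map<String, Long> partial = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {   // the Split step
                partial.merge(word, 1L, Long::sum);      // groupBy word + Count within the batch
            }
        }
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            state.merge(e.getKey(), e.getValue(), Long::sum); // update the persistent state
        }
    }

    /** Runs one pass over the five sample sentences, in batches of at most 3. */
    static Map<String, Long> demo() {
        Map<String, Long> state = new HashMap<>();
        processBatch(Arrays.asList(
                "the cow jumped over the moon",
                "the man went to the store and bought some candy",
                "four score and seven years ago"), state);
        processBatch(Arrays.asList(
                "how many apples can you eat",
                "to be or not to be the person"), state);
        return state;
    }

    public static void main(String[] args) {
        Map<String, Long> state = demo();
        System.out.println("the=" + state.get("the") + ", jumped=" + state.get("jumped"));
    }
}
```

After one pass over the sample sentences, the appears five times and jumped once; because the spout cycles, the real counts keep growing in that ratio.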

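3. Output the counts

        topology.newDRPCStream("words", drpc)
                .each(new Fields("args"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
                .each(new Fields("count"), new FilterNull())
                .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

This flow reads the results from the wordCounts object above and returns them. The six lines mean the following:

(1) Wait for a DRPC call: the stream receives calls to the words function from the DRPC server as its input.

The calling code looks like this:

drpc.execute("words", "cat the dog jumped")

(2) The input is the argument supplied in the call above. It is passed through Split() and emitted under the field name word.

(3) Group by the value of word.

(4) Query the wordCounts object for the results.

The four arguments are: the data source, the input fields, a built-in function (MapGet, which looks up a value in the map by key), and the output field name.

(5) Filter out empty query results. In this example, cat and dog have no result.

(6) Aggregate the results and emit them under the field name sum. This is what the DRPC call returns. Without this line, the final output is:

DRPC RESULT: [["cat the dog jumped","the",2310],["cat the dog jumped","jumped",462]]

With this line, the result is:

DRPC RESULT: [[180]]

(The two sample outputs were presumably captured at different moments of a run. Because the spout cycles, the counts keep growing, so 180 is not expected to equal the sum of the counts shown in the first output.)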
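The query path (split the arguments, look each word up, drop misses, sum the rest) can be modelled in a few lines of plain Java. This is a sketch under stated assumptions: QuerySketch and query are hypothetical names, a HashMap stands in for the wordCounts state, and the two counts are copied from the first sample output above purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class QuerySketch {

    /** Models Split -> MapGet -> FilterNull -> Sum for one DRPC request. */
    static long query(String args, Map<String, Long> wordCounts) {
        long sum = 0;
        for (String word : args.split(" ")) {
            Long count = wordCounts.get(word); // MapGet: look up the value by key
            if (count == null) {
                continue;                      // FilterNull: words never seen are dropped
            }
            sum += count;                      // Sum: add up the remaining counts
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> wordCounts = new HashMap<>();
        wordCounts.put("the", 2310L);   // illustrative snapshot values
        wordCounts.put("jumped", 462L);
        // cat and dog are absent from the map, so only the and jumped contribute
        System.out.println(query("cat the dog jumped", wordCounts));
    }
}
```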

4. Definition of Split

    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

Note that the function ends by emitting data: each word is sent downstream through the collector.
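One detail of splitting on a single space is worth seeing in isolation: consecutive spaces yield empty strings, which Split would emit as words. The sketch below uses only java.lang.String; the class and method names (SplitSketch, words, wordsTrimmed) are hypothetical, and the trimmed variant is one possible alternative, not something the original example does.

```java
import java.util.Arrays;

public class SplitSketch {

    /** Same splitting rule as the Split function above. */
    static String[] words(String sentence) {
        return sentence.split(" ");
    }

    /** Alternative that treats any run of whitespace as one separator. */
    static String[] wordsTrimmed(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "the  cow"; // note the two spaces
        System.out.println(Arrays.toString(words(sentence)));        // contains an empty string
        System.out.println(Arrays.toString(wordsTrimmed(sentence))); // only the two words
    }
}
```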

5. Create and start the topology

    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setMaxSpoutPending(20);
        if (args.length == 0) {
            LocalDRPC drpc = new LocalDRPC();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("wordCounter", conf, buildTopology(drpc));
            for (int i = 0; i < 100; i++) {
                System.out.println("DRPC RESULT: " + drpc.execute("words", "cat the dog jumped"));
                Thread.sleep(1000);
            }
        } else {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, buildTopology(null));
        }
    }

(1) When run without arguments, the program starts a local cluster and creates its own DRPC object to supply the input.

(2) When run with arguments, it sets the number of workers to 3, submits the topology to the cluster, and waits for remote DRPC calls.

(III) An example using Kafka as the data source

package com.netease.sytopology;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import storm.kafka.BrokerHosts;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.generated.StormTopology;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

/*
 * This class does the following:
 */
public class SyTopology {

    public static final Logger LOG = LoggerFactory.getLogger(SyTopology.class);

    private final BrokerHosts brokerHosts;

    public SyTopology(String kafkaZookeeper) {
        brokerHosts = new ZkHosts(kafkaZookeeper);
    }

    public StormTopology buildTopology() {
        TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, "ma30", "storm");
        kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        // TransactionalTridentKafkaSpout kafkaSpout = new
        // TransactionalTridentKafkaSpout(kafkaConfig);
        OpaqueTridentKafkaSpout kafkaSpout = new OpaqueTridentKafkaSpout(kafkaConfig);
        TridentTopology topology = new TridentTopology();

        // TridentState wordCounts =
        topology.newStream("kafka4", kafkaSpout)
                .each(new Fields("str"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count")).parallelismHint(16);
        // .persistentAggregate(new HazelCastStateFactory(), new Count(),
        // new Fields("aggregates_words")).parallelismHint(2);
        return topology.build();
    }

    // Arguments: args[0] = Kafka ZooKeeper address, args[1] = topology name, args[2] = Nimbus/ZooKeeper host
    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
        String kafkaZk = args[0];
        SyTopology topology = new SyTopology(kafkaZk);
        Config config = new Config();
        config.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 2000);

        String name = args[1];
        String dockerIp = args[2];

        config.setNumWorkers(9);
        config.setMaxTaskParallelism(5);
        config.put(Config.NIMBUS_HOST, dockerIp);
        config.put(Config.NIMBUS_THRIFT_PORT, 6627);
        config.put(Config.STORM_ZOOKEEPER_PORT, 2181);
        config.put(Config.STORM_ZOOKEEPER_SERVERS, Arrays.asList(dockerIp));
        StormSubmitter.submitTopology(name, config, topology.buildTopology());
    }

    // Splits each message on commas, appends every piece to a local file, and emits it.
    static class Split extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(",")) {
                try {
                    // Opened in append mode; the word is written without any separator
                    FileWriter fw = new FileWriter(new File("/home/data/test/ma30/ma30.txt"), true);
                    fw.write(word);
                    fw.flush();
                    fw.close();
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
                collector.emit(new Values(word));
            }
        }
    }
}

This example reads messages from Kafka, splits each message on "," and writes the pieces to a local file.

1. Define the Kafka-related configuration

TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, "ma30", "storm");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
OpaqueTridentKafkaSpout kafkaSpout = new OpaqueTridentKafkaSpout(kafkaConfig);

Here ma30 is the name of the subscribed topic.

2. Read messages from Kafka and process them

        topology.newStream("kafka4", kafkaSpout)
                .each(new Fields("str"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count")).parallelismHint(16);

(1) Specifies the data source and the location in ZooKeeper where its data is kept, namely /transactional/kafka4.

(2) Specifies the processing function and the fields it emits.

(3) Groups by word.

(4) Counts the words and writes the state to a MemoryMapState.

Submit the topology:

storm jar target/sytopology2-0.0.1-SNAPSHOT.jar com.netease.sytopology.SyTopology 192.168.172.98:2181/kafka test3 192.168.172.98

The output of Split can then be seen in /home/data/test/ma30/ma30.txt.
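When inspecting that file, keep in mind that Split appends each word with no separator, so the pieces run together. The standalone sketch below reproduces just the file-writing part against a temporary file; AppendSketch and appendWords are hypothetical names, and try-with-resources is used here in place of the explicit flush and close calls.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class AppendSketch {

    /** Appends each comma-separated piece to a temp file and returns the file's contents. */
    static String appendWords(String line) {
        try {
            File file = File.createTempFile("split-sketch", ".txt");
            file.deleteOnExit();
            for (String word : line.split(",")) {
                // Append mode, one open/close per word, no delimiter written
                try (FileWriter fw = new FileWriter(file, true)) {
                    fw.write(word);
                }
            }
            return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The three pieces end up concatenated in the file
        System.out.println(appendWords("a,b,c"));
    }
}
```

Writing a newline after each word would make the file easier to read, at the cost of changing what the original example produces.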
