以下为官方文档:

Multithreaded Pipeline Overview

A multithreaded pipeline is a pipeline with an origin that supports parallel execution, enabling one pipeline to run in multiple threads.

Multithreaded pipelines enable processing high volumes of data in a single pipeline on one Data Collector, thus taking full advantage of all available CPUs on the Data Collector machine. When using multithreaded pipelines, make sure to allocate sufficient resources to the pipeline and Data Collector.

A multithreaded pipeline honors the configured delivery guarantee for the pipeline, but does not guarantee the order in which batches of data are processed.

How It Works

When you configure a multithreaded pipeline, you specify the number of threads that the origin should use to generate batches of data. You can also configure the maximum number of pipeline runners that Data Collector uses to perform pipeline processing.

A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors and destinations in the pipeline and represents all pipeline processing after the origin.

Origins perform multithreaded processing based on the origin systems they work with, but the following is true for all origins that generate multithreaded pipelines:

When you start the pipeline, the origin creates a number of threads based on the multithreaded property configured in the origin. And Data Collector creates a number of pipeline runners based on the pipeline Max Runners property to perform pipeline processing. Each thread connects to the origin system and creates a batch of data, and passes the batch to an available pipeline runner.

Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property specify the interval or to opt out of empty batch generation.

Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline instances, the order that batches are written to destinations is not ensured.

For example, take the following multithreaded pipeline. The HTTP Server origin processes HTTP POST and PUT requests passed from HTTP clients. When you configure the origin, you specify the number of threads to use - in this case, the Max Concurrent Requests property:

Let's say you configure the pipeline to opt out of the Max Runners property. When you do this, Data Collector generates a matching number of pipeline runners for the number of threads.

With Max Concurrent Requests set to 5, when you start the pipeline the origin creates five threads and Data Collector creates five pipeline runners. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.

Conceptually, the multithreaded pipeline looks like this:

Each pipeline runner performs the processing associated with the rest of the pipeline. After a batch is written to pipeline destinations - in this case, Azure Data Lake Store 1 and 2 - the pipeline runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independently from batches processed by other pipeline runners, so the write-order of the batches can differ from the read-order.

At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.

 
 
 
 

StreamSets 多线程 Pipelines的更多相关文章

  1. StreamSets 部署 Pipelines 到 SDC Edge

    可以使用如下方法: 下载edge 运行包并包含pipeline定义文件. 直接发布到edge 设备. 在data colelctor 机器配置并配置了edge server 地址(主要需要网络可访问) ...

  2. StreamSets 相关文章

    相关streamsets 文章(不按顺序) 学习视频-百度网盘 StreamSets 设计Edge pipeline StreamSets Data Collector Edge 说明 streams ...

  3. StreamSets SDC RPC Pipelines说明

    主要目的是进行跨pipeline 数据的通信,而不仅仅是内部pipeline 的通信,之间不同网络进行通信 一个参考图 pipeline 类型 origin destination 部署架构 使用多个 ...

  4. StreamSets使用指南

    StreamSets使用指南 最近在调研Streamsets,照猫画虎做了几个最简单的Demo鉴于网络上相关资料非常少,做个记录. 1.简介 Streamsets是一款大数据实时采集和ETL工具,可以 ...

  5. python多线程爬虫设计及实现示例

    爬虫的基本步骤分为:获取,解析,存储.假设这里获取和存储为io密集型(访问网络和数据存储),解析为cpu密集型.那么在设计多线程爬虫时主要有两种方案:第一种方案是一个线程完成三个步骤,然后运行多个线程 ...

  6. 抓包分析、多线程爬虫及xpath学习

    1.抓包分析 1.1 Fiddler安装及基本操作 由于很多网站采用的是HTTPS协议,而fiddler默认不支持HTTPS,先通过设置使fiddler能抓取HTTPS网站,过程可参考(https:/ ...

  7. StreamSets学习系列之StreamSets的Core Tarball方式安装(图文详解)

    不多说,直接上干货! 前期博客 StreamSets学习系列之StreamSets支持多种安装方式[Core Tarball.Cloudera Parcel .Full Tarball .Full R ...

  8. StreamSets 设计Edge pipeline

    edge pipeline 运行在edge 执行模式,我们可以使用 data collector UI 进行edge pipeline 设计, 设计完成之后,你可以部署对应的pipeline到edge ...

  9. streamsets origin 说明

    origin 是streamsets pipeline的soure 入口,只能应用一个origin 在pipeline中, 对于运行在不同执行模式的pipeline 可以应用不同的origin 独立模 ...

随机推荐

  1. 3-5 回顾,快速二分法的疑点解惑:为啥先右j移动?因为设定a[left]为基准点

    快速二分法的疑点解惑:为啥先右j移动?因为设定a[left]为基准数 , 1] [91, 86, 42, 46, 9, 68, 77, 46, 7, 1] [91, 86, 42, 46, 9, 68 ...

  2. 『PyTorch』第五弹_深入理解autograd_中:Variable梯度探究

    查看非叶节点梯度的两种方法 在反向传播过程中非叶子节点的导数计算完之后即被清空.若想查看这些变量的梯度,有两种方法: 使用autograd.grad函数 使用hook autograd.grad和ho ...

  3. python-day33--进程间通信(IPC)

    方式:队列(推荐使用) 一.基本情况 1.可以往队列里放任意类型的数据 2. 队列:先进先出 3. q=Queue(3) #可以设置队列中最多可以进入多少个值,也可以不设置 q.put('first' ...

  4. hdu2159完全背包

    md心里有事的时候不能写题操 FATE Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Othe ...

  5. POJ-1753 Flip Game (BFS+状态压缩)

    Description Flip game is played on a rectangular 4x4 field with two-sided pieces placed on each of i ...

  6. OAF 通过个性化 在标准事件上添加验证

    在实际的开发过程中,我们经常会遇到以下情况: 在执行标准的功能之前要对个性化的内容进行校验. 比如:在某个标准页面通过个性化添加了一个勾选框,在点击下一步的时候必须去验证此勾选框是否勾选. 具体实现如 ...

  7. 信号的发送kill,raise,alarm,setitimer,abort,sigqueue

    1.kill函数 int kill(pid_t pid, int sig); 发送信号给指定的进程. (1) If pid is positive, then signal sig is sent t ...

  8. URAL 1941

    比赛的时候三个点没有优化成功.其实也没有想到哈希成数.然后就变成了只要一个长度和scary相等的区间内所有数字个数都是相等的.那么就是符合题意的.于是.为了不TLE我们不能对txt每个位置遍历 的同时 ...

  9. POJ 1754 线段树

    e,应该是线段树里的水题.线段树单点更新.查询区间最值. 代码套用模板 PS :模板有些地方不太懂. #include<stdio.h>#include<iostream>#i ...

  10. Dash:程序员的好帮手(转载)

    作为一名死coder,每天最常见的动作就是查看各种API文档,你一定也有过同时打开N个窗口(HTML.PDF.CHM),不停的在编辑器与文档之间切换的感受吧?怎么说呢,其实我很讨厌这种枯燥无味的动作, ...