I. Execution Tree

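An execution tree is a logical grouping of data flow components (transformations and adapters) built on their synchronous relationships. Each group marks the start and end of one execution tree. Equivalently, an execution tree can be understood as the start and end of a buffer, that is, the buffer's whole life cycle.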

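As is well known, an asynchronous transformation terminates its input buffer and creates a new output buffer, so execution trees are in effect delimited by asynchronous transformations: each one marks the end of the upstream execution tree and the start of the downstream one. Once the data passes through an asynchronous transformation into a new execution tree, the previous tree's buffers and the data they hold are no longer needed, because the data has been handed to a new execution tree and a new set of buffers.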
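The grouping rule above can be sketched in a few lines of Python. This is only an illustration, not how the SSIS engine is implemented: the `Component` class, the `asynchronous` flag, the "(output)" naming, and the restriction to a single linear path are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for a data flow component."""
    name: str
    asynchronous: bool = False  # True if its output uses a new set of buffers


def split_into_execution_trees(path):
    """Group a linear data flow path into execution trees.

    A tree starts at the source or at the output of an asynchronous
    component, and ends at the next asynchronous component or at the
    destination.
    """
    trees = []
    current = []
    for component in path:
        current.append(component.name)
        if component.asynchronous:
            # The asynchronous component closes the upstream tree ...
            trees.append(current)
            # ... and its output opens the downstream tree.
            current = [component.name + " (output)"]
    trees.append(current)
    return trees


path = [
    Component("OLE DB Source"),
    Component("Derived Column"),
    Component("Sort", asynchronous=True),
    Component("Lookup"),
    Component("OLE DB Destination"),
]
trees = split_into_execution_trees(path)
for number, tree in enumerate(trees, start=1):
    print(f"Execution tree {number}: {' -> '.join(tree)}")
```

In this sketch the Sort component appears in both trees: as the last consumer of the upstream buffers and as the producer of the downstream ones.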

1. Buffer profile

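When a package executes, the buffer manager defines a buffer profile for each execution tree in the package. All components in a given execution tree share the same buffer profile. To define it, the SSIS buffer manager looks at every transformation in the tree and includes in the buffer every column that any of them needs. As a result, the buffer still allocates space for a column even when the initial transformations or the adapters (source and destination components) never use it.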

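Optimizing a data flow can be viewed like optimizing a relational table: the narrower and fewer the columns, the more rows a buffer can hold.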
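The arithmetic behind this can be sketched as follows. The two constants mirror the default values of the Data Flow Task properties DefaultBufferSize (10 MB) and DefaultBufferMaxRows (10,000 rows). The function `rows_per_buffer` and the column widths are hypothetical, and the real engine estimates row size in a more involved way, so treat this as a simplified model.

```python
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024  # bytes, default of DefaultBufferSize
DEFAULT_BUFFER_MAX_ROWS = 10_000        # rows, default of DefaultBufferMaxRows


def rows_per_buffer(column_widths,
                    buffer_size=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Simplified estimate of how many rows fit in one buffer.

    The buffer carries every column in the profile, so the row width
    is the sum of all column widths, used or not.
    """
    row_width = sum(column_widths)
    if row_width <= 0:
        raise ValueError("row width must be positive")
    return min(max_rows, buffer_size // row_width)


wide_row = [4, 8, 4000, 4000]  # two wide text columns (made-up widths)
narrow_row = [4, 8, 50, 50]    # the same columns, trimmed to 50 bytes

print(rows_per_buffer(wide_row))    # limited by the buffer size
print(rows_per_buffer(narrow_row))  # limited by the row cap
```

With the wide row, the buffer fills up long before the row cap is reached. With the narrow row, the cap becomes the limit instead.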

2. The EngineThreads property

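If threads are available and an execution tree is CPU-intensive, the thread scheduler can assign several threads to that single tree. Every transformation and adapter (source or destination) can receive one thread, so an execution tree with N components can have at most N threads.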

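The Data Flow Task has an EngineThreads property, which sets the maximum number of threads that all execution trees in the task can use at the same time. Assigning one or more threads to an execution tree according to its complexity can make the data flow run more efficiently.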
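As a loose analogy only, the effect of a thread cap shared by all execution trees can be imitated with a Python thread pool. Here `engine_threads` is treated as a hard limit for simplicity, one task stands for one execution tree, and `run_trees` is a hypothetical helper. The actual SSIS scheduler is more sophisticated than this.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_trees(tree_names, engine_threads):
    """Run one task per execution tree on at most `engine_threads` threads.

    Returns the finished tree names and the peak number of trees that
    were running at the same moment.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def run_tree(name):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for the tree's real work
        with lock:
            active -= 1
        return name

    with ThreadPoolExecutor(max_workers=engine_threads) as pool:
        finished = list(pool.map(run_tree, tree_names))
    return finished, peak


names = [f"tree-{i}" for i in range(1, 6)]
finished, peak = run_trees(names, engine_threads=2)
print(f"finished {len(finished)} trees, peak concurrency {peak}")
```

However many trees there are, the peak concurrency never exceeds the cap. The remaining trees wait until a thread is free.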

II. Data Pipeline

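1. A pipeline is a "chain model": different programs or components are connected in a fixed order to form a linear workflow. A complete input is processed by each component in turn, and the result is a single final output.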

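In diagram form the pipeline model is a flow: like a production line, each component independently completes one specific job and hands its result to the next stage.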
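The chain model can be sketched with Python generators, where each stage does one job and passes its rows to the next. The stage names, the sample rows, and the `run_pipeline` helper are all made up for illustration. Nothing here is SSIS-specific.

```python
def read_source(rows):
    """Source stage: emit rows one at a time."""
    for row in rows:
        yield row


def trim_names(rows):
    """Stage 1: strip surrounding whitespace from the name column."""
    for row in rows:
        yield {**row, "name": row["name"].strip()}


def add_total(rows):
    """Stage 2: derive a total column from price and quantity."""
    for row in rows:
        yield {**row, "total": row["price"] * row["quantity"]}


def run_pipeline(source, *stages):
    """Chain the stages in order and collect the final output."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


raw = [
    {"name": " apple ", "price": 2, "quantity": 3},
    {"name": "pear", "price": 5, "quantity": 1},
]
result = run_pipeline(read_source(raw), trim_names, add_total)
for row in result:
    print(row)
```

Because the stages are generators, a row flows through the whole chain before the next row is read, much like items moving along a production line.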

2,SSIS Data Pipeline

MSDN explains the SSIS data pipeline as follows, under "SSIS Process Control":

The SSIS data flow process control component and its tasks are processed by the data flow engine within SSIS. A key feature of the SSIS data flow engine is the data pipeline, shown in Figure 1-1, which uses memory buffers to improve processing performance. The data pipeline enables parallel data processing options and reduces or eliminates multiple passes of reading and writing of the data during package execution and processing. This level of efficiency means you can process significantly more data in shorter periods of time than is possible if you rely simply on stored procedures for your ETL processes.

Figure 1-1 The SSIS data flow data pipeline

Maximum data processing performance for SSIS packages is achieved because the data pipeline uses buffers to manipulate data in memory. Source data, whether it’s relational, structured as XML data, or stored in flat files like spreadsheets or comma-delimited text files, is converted into table-like structures containing columns and rows and loaded directly into memory buffers without the need of staging the data first in temporary tables. Transformations within a data flow operate on the in-memory buffered data as well as on sorting, merging, modifying, and enhancing the data before sending it to the next transformation or on to its final destination. By avoiding the overhead of re-reading from and writing to disk, the processes required to move and manipulate data can operate at optimal speed.

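Even after reading the MSDN explanation I was still in a fog. The article "SSIS Architecture and Internals Interview Questions" gave me a rough picture of what the SSIS data flow pipeline is designed to do. The data pipeline is an in-memory chain structure that executes the data flow defined by a Data Flow Task. Every component in the Data Flow Task (source, transformation, or destination) is a node in the chain, and the nodes do their work in parallel or in series along the paths defined by the Data Flow Paths.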

Quoted from "SSIS Architecture and Internals Interview Questions":

What are the different components in the SSIS architecture?

  • The SSIS architecture comprises four main components:

    • The SSIS runtime engine manages the workflow of the package
    • The data flow pipeline engine manages the flow of data from source to destination and in-memory transformations
    • The SSIS object model is used for programmatically creating, managing and monitoring SSIS packages
    • The SSIS windows service allows managing and monitoring packages

How is SSIS runtime engine different from the SSIS dataflow pipeline engine?

  • The SSIS Runtime Engine manages the workflow of the packages during runtime, which means its role is to execute the tasks in a defined sequence.  As you know, you can define the sequence using precedence constraints. This engine is also responsible for providing support for event logging, breakpoints in the BIDS designer, package configuration, transactions and connections. The SSIS Runtime engine has been designed to support concurrent/parallel execution of tasks in the package.
  • The Dataflow Pipeline Engine is responsible for executing the data flow tasks of the package. It creates a dataflow pipeline by allocating in-memory structure for storing data in-transit. This means, the engine pulls data from source, stores it in memory, executes the required transformation in the data stored in memory and finally loads the data to the destination. Like the SSIS runtime engine, the Dataflow pipeline has been designed to do its work in parallel by creating multiple threads and enabling them to run multiple execution trees/units in parallel.

How is a synchronous (non-blocking) transformation different from an asynchronous (blocking) transformation in SQL Server Integration Services?

  • A transformation changes the data into the required format before loading it to the destination or passing it down the path. Transformations can be categorized as synchronous or asynchronous.
  • A transformation is called synchronous when it processes each incoming row (modifying the data in place so that the layout of the result set remains the same) and passes it down the path. Output rows are therefore in sync with input rows (a 1:1 relationship between input and output rows), so the transformation reuses the same allocated buffers and needs no additional memory. These transformations have lower memory requirements because they work row by row (and hence run faster), and they do not block the data flow in the pipeline. Examples include Lookup, Derived Column, Data Conversion, Copy Column, Multicast, and Row Count.
  • A transformation is called asynchronous when it requires all incoming rows to be stored locally in memory before it can start producing output rows. For example, an Aggregate transformation requires all the rows to be loaded and stored in memory before it can aggregate them and produce output rows. The input rows are therefore not in sync with the output rows, and more memory is required to hold the whole data set (no memory reuse) for both input and output. These kinds of transformations have higher memory requirements (with a high chance of buffers spooling to disk if memory is insufficient) and generally run slower. Asynchronous transformations are also called "blocking transformations" because they hold back output rows until all input rows have been read into memory.

What is the difference between a partially blocking transformation versus a fully blocking transformation in SQL Server Integration Services?

  • Asynchronous transformations, as discussed in the last question, can be further divided into two categories depending on their blocking behavior:

    • Partially blocking transformations do not wait for a full read of the inputs before producing output. However, they require new buffers to be allocated for the newly created result set, because their output differs from the input set. For example, the Merge Join transformation joins two sorted inputs and produces a merged output. Here the data flow pipeline engine creates two sets of input buffers, but the merged output requires another set of output buffers because the structure of the output rows differs from that of the input rows. The memory requirement for this type of transformation is therefore higher than for synchronous transformations, where the work is done in place.
    • Fully blocking transformations, apart from requiring an additional set of output buffers, also block the output completely until the whole input set has been read. For example, the Sort transformation requires all input rows to be available before it can start sorting and pass rows down the output path. These kinds of transformations are the most expensive and should be used only as needed. For example, if you can get sorted data from the source system, use that instead of a Sort transformation that sorts the data in transit in memory.
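The three behaviors can be contrasted with a small Python sketch that counts how many input rows have been read by the time the first output row appears. The generator functions are analogies chosen for the example (`heapq.merge` stands in for a merge of sorted inputs, `sorted` for a sort), and the helper names are hypothetical. This is not SSIS code.

```python
import heapq


def traced(rows, log):
    """Yield rows while recording each read in `log`."""
    for row in rows:
        log.append(row)
        yield row


def derived_column(rows):
    """Synchronous: one output row per input row, as soon as it arrives."""
    for row in rows:
        yield row * 10


def merge_sorted(left, right):
    """Partially blocking: emits rows while the sorted inputs are still being read."""
    yield from heapq.merge(left, right)


def sort_rows(rows):
    """Fully blocking: must read every input row before emitting anything."""
    yield from sorted(rows)


def rows_read_before_first_output(build):
    """Build a transformation over traced inputs and pull one output row."""
    log = []
    output = build(log)
    next(output)
    return len(log)


sync_reads = rows_read_before_first_output(
    lambda log: derived_column(traced([3, 1, 2], log)))
merge_reads = rows_read_before_first_output(
    lambda log: merge_sorted(traced([1, 3, 5], log), traced([2, 4, 6], log)))
sort_reads = rows_read_before_first_output(
    lambda log: sort_rows(traced([3, 1, 2], log)))

print("synchronous:", sync_reads)
print("partially blocking:", merge_reads)
print("fully blocking:", sort_reads)
```

The synchronous stage needs a single row to produce output, the merge needs only part of its inputs, and the sort has to consume everything first.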

What is an SSIS execution tree and how can I analyze the execution trees of a data flow task?

  • The dataflow pipeline engine divides the work to be done in the data flow task into multiple chunks, called execution units, each of which represents a group of transformations. An individual execution unit is called an execution tree, and it can be executed by a separate thread in parallel with other execution trees. The in-memory structure that the data flow pipeline engine creates is called a data buffer, and its scope is an individual execution tree. An execution tree normally starts at either the source or an asynchronous transformation and ends at the first asynchronous transformation or a destination. While an execution tree runs, the source reads the data and stores it in a buffer, the transformations are executed on the buffer, and the buffer is handed to the next execution tree in the path by passing pointers to it.
  • To see how many execution trees are created and how many rows are stored in each buffer for an individual data flow task, you can enable logging of these data flow task events: PipelineExecutionTrees, PipelineComponentTime, PipelineInitialization, BufferSizeTuning, etc.

How can an SSIS package be scheduled to execute at a defined time or at a defined interval per day?

  • You can configure a SQL Server Agent Job with a job step type of SQL Server Integration Services Package. The job invokes the dtexec command line utility internally to execute the package. You can run the job (and in turn the SSIS package) on demand, or you can create a schedule for a one-time run or on a recurring basis.

What is an SSIS Proxy account and why would you create it?

  • When we try to execute an SSIS package from a SQL Server Agent Job, it can fail with the message "Non-SysAdmins have been denied permission to run DTS Execution job steps without a proxy account". This error message is generated if the account under which the SQL Server Agent service is running and the job owner is not a sysadmin on the instance, or if the job step is not set to run under a proxy account associated with the SSIS subsystem.

How can you configure your SSIS package to run in 32-bit mode on 64-bit machine when using some data providers which are not available on the 64-bit platform?

  • In order to run an SSIS package in 32-bit mode the SSIS project property Run64BitRuntime needs to be set to "False".  The default configuration for this property is "True".  This configuration is an instruction to load the 32-bit runtime environment rather than 64-bit, and your packages will still run without any additional changes. The property can be found under SSIS Project Property Pages -> Configuration Properties -> Debugging.
