PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of different type of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.
Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual
INSERT
statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source. - Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t create one individual batch in parallel. This means that replicating extremely large tables with millions of only
INSERTS
andUPDATES
can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in source.
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.
PipelineWise illustrates the power of Singer的更多相关文章
- pipelinewise 学习二 创建一个简单的pipeline
pipelinewise 提供了方便的创建简单pipeline的命令,可以简化pipeline 的创建,同时也可以帮我们学习 生成demo pipeline pipelinewise init --n ...
- xv6课本翻译之——第0章 操作系统接口
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- NCE3
Lesson1 A puma at large Pumas are large, cat-like animals which are found in America. When reports ...
- vim 的寄存器
If you've been following my series on Vim, it should be clear now that Vim has a pretty clear philos ...
- 2 Advanced Read/Write Splitting with PHP’s MySQLnd
原文地址需FQ才能看 https://blog.engineyard.com/2014/advanced-read-write-splitting-with-phps-mysqlnd In part ...
- New Concept English three (45)
31w/m 65error In democratic countries any efforts to restrict the freedom of the press are rightly c ...
- book-rev8 Chapter 0 Operating system interfaces
Chapter 0 第0章 Operating system interfaces 操作系统接口 The job of an operating system is to share a comput ...
- pipelinewise 基于singer 指南的的数据pipeline 工具
pipelinewise 是基于开源singer 指南开发的数据pipeline工具,与singer tap 以及target 兼容 支持的特性 内置的elt 特性 轻量级 支持多种复制方法,cdc( ...
- 无线电源传输 Wireless Power Consortium (WPC) Communication
Universally Compatible Wireless Power Using the Qi Protocol Wireless charging of portable electronic ...
随机推荐
- 如何解决Requests的SSLError(转)
add by zhj: 我使用方法2“更新系统的certificate”解决了问题 原文:https://www.jianshu.com/p/8deb13738d2c 这两天在Linux上爬Googl ...
- 大数据技术 - 为什么是SQL
在大数据处理以及分析中 SQL 的普及率非常高,几乎是每一个大数据工程师必须掌握的语言,甚至非数据处理岗位的人也在学习使用 SQL.今天这篇文章就聊聊 SQL 在数据分析中作用以及掌握 SQL 的必要 ...
- java中各种常见的异常
一.各种常见的异常 在上一节中程序如果你注意留意,程序抛出的异常是:java.lang.ArithmeticException.这个异常是在lang包中已经定义的.在lang包中还定义了一些我们非常常 ...
- Linux学习笔记之Linux系统的swap分区
0x00 什么是swap分区 Swap分区在系统的物理内存不够用的时候,把物理内存中的一部分空间释放出来,以供当前运行的程序使用.那些被释放的空间可能来自一些很长时间没有什么操作的程序,这些被释放的空 ...
- jQuery实现form表单基于ajax无刷新提交方法详解
本文实例讲述了jQuery实现form表单基于ajax无刷新提交方法.分享给大家供大家参考,具体如下: 首先,新建Login.html页面: <!DOCTYPE html PUBLIC &quo ...
- Managing C++ Objects: 管理C++对象 —— 一些建议准则
原文链接: Managing C++ Objects Here are some guidelines I have found useful for writing C++ classes. The ...
- Mycat分布式数据库架构解决方案--Mycat实现数据库分表
echo编辑整理,欢迎转载,转载请声明文章来源.欢迎添加echo微信(微信号:t2421499075)交流学习. 百战不败,依不自称常胜,百败不颓,依能奋力前行.--这才是真正的堪称强大!!! 准备工 ...
- 26、router.beforeEach路由拦截
为了防止用户未登录直接修改路径来访问页面,解决办法: 在main.js文件中加入以下代码: // 路由拦截 router.beforeEach((to, from, next) => { if ...
- react 爬坑记录
1.父子组件优化其一发生render条件:数据改变(state或者props改变),有时子组件会过多render.这时可在子组件里面的生命周期钩子里执行 shouldComponentUpdate(n ...
- error C4996: 'AVStream::codec': was declared deprecated
关闭VS的SDL检查 工程 属性=>C/C++ =>General=> SDL checks 改为 No(/sdl).