PipelineWise illustrates the power of Singer

Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
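To make the specification concrete: a Singer tap is any program that writes JSON-encoded SCHEMA, RECORD, and STATE messages to stdout, and a target is any program that reads them from stdin, so any tap can be piped into any target. Below is a minimal tap sketch in plain Python; the users stream and its fields are invented for the example.

```python
# A minimal Singer tap: emit SCHEMA, RECORD, and STATE messages as JSON
# lines on stdout, where any Singer target can read them from a pipe.
import json
import sys

def emit(message):
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream once, before any records.
emit({
    "type": "SCHEMA",
    "stream": "users",  # hypothetical stream name
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string"},
        },
    },
    "key_properties": ["id"],
})

# One RECORD message per row extracted from the source.
for row in [{"id": 1, "email": "a@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# STATE lets the next run resume where this one stopped.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```

A pipeline is then just a tap process piped into a target process, which is the composition model PipelineWise orchestrates at scale.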
TransferWise uses more than a hundred microservices, which means we have hundreds of different types of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code (see the sketch after this list)
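To illustrate the configuration-as-code requirement, here is a sketch that builds a tap definition in Python and renders it to YAML so it can be reviewed and versioned like any other code. The field names (id, type, target, db_conn, schemas, replication_method) are modelled on PipelineWise's YAML configs but simplified; treat the exact shape as an assumption, not the real schema.

```python
# Illustrative configuration-as-code: a tap definition built in Python and
# rendered to YAML. Field names are assumptions modelled on PipelineWise's
# documented YAML configs, not the exact schema.
import yaml  # pip install pyyaml

tap_config = {
    "id": "orders_db",            # hypothetical source identifier
    "type": "tap-postgres",       # Singer tap used for extraction
    "target": "snowflake",        # destination connector
    "db_conn": {
        "host": "orders.example.internal",
        "port": 5432,
        "dbname": "orders",
    },
    "schemas": [
        {
            "source_schema": "public",
            "target_schema": "orders",
            "tables": [
                # CDC where possible, per the requirements above
                {"table_name": "payments", "replication_method": "LOG_BASED"},
            ],
        }
    ],
}

with open("tap_orders_db.yml", "w") as f:
    yaml.safe_dump(tap_config, f, sort_keys=False)
```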
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.
(Figure: Monitoring with Grafana. Replicating 120 data sources and 1,500+ tables into Snowflake with PipelineWise on three c5.2xlarge EC2 nodes.)
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual INSERT statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes, depending on the data source.
- Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can't create one individual batch in parallel. This means that replicating extremely large tables that receive millions of INSERTs and UPDATEs (but no DELETEs) can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in the source.
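To make the first limitation concrete, here is a sketch of the microbatch pattern it describes: buffer incoming records, stage each batch on S3 as one file, and load it with a single COPY statement. All names (bucket, table, connection settings) are placeholders, and this shows the general pattern rather than PipelineWise's actual target code.

```python
# Sketch of the microbatch pattern behind the "not real-time" limitation:
# buffer records, stage each batch on S3, then load it with one COPY
# statement instead of millions of INSERTs. All names are placeholders.
import csv
import io

import boto3                 # pip install boto3
import snowflake.connector   # pip install snowflake-connector-python

BATCH_SIZE = 100_000  # rows buffered before a flush

def flush_batch(rows, batch_id):
    # Serialise the buffered rows to CSV in memory.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)

    # Stage the whole batch as a single object on S3.
    s3 = boto3.client("s3")
    key = f"batches/users/{batch_id}.csv"
    s3.put_object(
        Bucket="my-staging-bucket",
        Key=key,
        Body=buf.getvalue().encode("utf-8"),
    )

    # One COPY loads the entire batch in a single bulk operation.
    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="...",
        warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
    )
    try:
        conn.cursor().execute(
            f"COPY INTO users FROM 's3://my-staging-bucket/{key}' "
            "CREDENTIALS=(AWS_KEY_ID='...' AWS_SECRET_KEY='...') "
            "FILE_FORMAT=(TYPE=CSV)"
        )
    finally:
        conn.close()
```

The 5-to-30-minute lag quoted above falls directly out of this design: rows sit in the buffer until a batch fills (or a timer fires) and the COPY completes.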
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.