转自:http://blog.163.com/guaiguai_family/blog/static/20078414520138911393767/ http://sites.computer.org/debull/A12june/pipeline.pdf这一套可以成为互联网公司的标准基础架构了,摘要如下: 把数据的 source of truth 放在数据总线里,而非 Hadoop 和数据仓库里.这是个很违反直觉的做法,但得益与 Kafka 巧妙的数据持久性以及分区.备份的设计,数据总线成了…
一,Execution Tree 执行树是数据流组件(转换和适配器)基于同步关系所建立的逻辑分组,每一个分组都是一个执行树的开始和结束,也可以将执行树理解为一个缓冲区的开始和结束,即缓冲区的整个生命周期. 大家知道,异步转换组件会结束输入缓冲区,创建新的输出缓冲区,所以,执行树的分组实际上通过异步转换组件来划分的,一个异步转换组件意味着上游执行树的结束和下游执行树的开始.当数据流经过异步转换组件,进入一个新的执行树,上一个执行树的缓冲区和相同数据就不再需要了,因为数据已经被传递到一个新的执行树和…
一.理论介绍(一)相关资料1.官方资料,非常详细:   http://kafka.apache.org/documentation.html#quickstart2.有一篇翻译版,基本一致,有些细节不同,建议入门时先读此文,再读官方文档.若自认英语很强,请忽视:   http://www.linuxidc.com/Linux/2014-07/104470.htm3.还有一文也可以:http://www.sxt.cn/info-2871-u-324.html其主要内容来源于以下三篇文章:日志:每个…
转自:https://www.stitchdata.com/blog/pipelinewise-singer/ Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to wr…
转自: http://www.confluent.io/blog/stream-data-platform-1/ These days you hear a lot about "stream processing", "event data", and "real-time", often related to technologies like Kafka, Storm, Samza, or Spark's Streaming module.…
How to build an ML pipeline for Data Science 垃圾信息分类 Ref:Develop a NLP Model in Python & Deploy It with Flask, Step by Step 其中使用naive bayes模型 做分类,此文不做表述. 重点来啦:Turning the Spam Message Classifier into a Web Application 其实就是http request 对接模型的 prediction…
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying 主要的思想, 将所有的系统都可以看作两部分,真正的数据log系统和各种各样的query engine 所有的一致性由log系统来保证,其他各种query engine不需要考虑一致性,安全性,只需要不停的从log系统来同步数据,如果数据丢失或c…
This is a guest blog from Robin Moffatt. Robin Moffatt is Head of R&D (Europe) at Rittman Mead, and an Oracle ACE. His particular interests are analytics, systems architecture, administration, and performance optimization. This blog is also posted on…
https://github.com/onurakpolat/awesome-bigdata A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data. Your contributions are always welco…
zhuan :https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan Big Data technology has been extremely disruptive with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptiv…