Big Data Ingestion and Streaming Product Introduction
Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
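To make the “hop-by-hop” idea more concrete, here is a minimal sketch in plain Python. It does not use Flume itself. The Hop class, its method names, and the rule that an event leaves a hop's buffer only after the next hop has accepted it are assumptions made for illustration, not Flume's actual API or exact delivery protocol.

```python
from collections import deque


class Hop:
    """Hypothetical relay stage: buffers events and forwards them one hop on."""

    def __init__(self, name, next_hop=None):
        self.name = name
        self.next_hop = next_hop
        self.buffer = deque()
        self.delivered = []  # used only by the last hop (stands in for HDFS)

    def accept(self, event):
        # Store the event locally and acknowledge receipt to the sender.
        self.buffer.append(event)
        return True

    def forward(self):
        # Hand each buffered event to the next hop; drop it locally
        # only after the next hop has acknowledged it.
        while self.buffer:
            event = self.buffer[0]
            if self.next_hop is None:
                self.delivered.append(event)
                acked = True
            else:
                acked = self.next_hop.accept(event)
            if not acked:
                break
            self.buffer.popleft()


collector = Hop("collector")
agent = Hop("agent", next_hop=collector)

for line in ["GET /a", "GET /b"]:
    agent.accept(line)

agent.forward()      # first hop: agent -> collector
collector.forward()  # second hop: collector -> final store
print(collector.delivered)
```

Each stage only knows about the stage directly after it, which is the property the hop-by-hop description refers to.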
Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Its durability features include agent-side replaying of data to recover from errors. See also Flume.
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
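The difference between a snapshot and an incremental update can be sketched with Python's built-in sqlite3 module. This example does not call Sqoop. The orders table, the fetch_rows helper, and the choice of the id column as the change marker are all hypothetical and only illustrate the idea of pulling rows newer than the last value seen.

```python
import sqlite3


def fetch_rows(conn, last_id=None):
    """Return every row (snapshot) or only rows with id > last_id (incremental)."""
    if last_id is None:
        cur = conn.execute("SELECT id, name FROM orders ORDER BY id")
    else:
        cur = conn.execute(
            "SELECT id, name FROM orders WHERE id > ? ORDER BY id", (last_id,)
        )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b")])

# First transfer: a full snapshot of the table.
snapshot = fetch_rows(conn)
last_id = snapshot[-1][0]

# A new row arrives in the source database afterwards.
conn.execute("INSERT INTO orders VALUES (3, 'c')")

# Second transfer: only the rows added since the snapshot.
delta = fetch_rows(conn, last_id)
print(snapshot, delta)
```

Remembering the highest id already transferred is what keeps the second transfer small.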
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize I/O performance, and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios.
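As a rough illustration of persistent publish-subscribe messaging with compression, the following sketch keeps an append-only log of zlib-compressed messages and lets each subscriber track its own read position. It is an in-memory toy and not the Kafka client API. TopicLog, Subscriber, and their methods are hypothetical names.

```python
import zlib


class TopicLog:
    """Hypothetical append-only log that stores messages compressed."""

    def __init__(self):
        self._entries = []

    def publish(self, message):
        # Compress before storing; return the position of the new entry.
        self._entries.append(zlib.compress(message.encode("utf-8")))
        return len(self._entries) - 1

    def read_from(self, offset):
        # Messages stay in the log after being read, so any subscriber
        # can start from any position.
        return [zlib.decompress(e).decode("utf-8") for e in self._entries[offset:]]


class Subscriber:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        messages = self.log.read_from(self.offset)
        self.offset += len(messages)
        return messages


log = TopicLog()
fast = Subscriber(log)
slow = Subscriber(log)

log.publish("event-1")
log.publish("event-2")
first_batch = fast.poll()

log.publish("event-3")
second_batch = fast.poll()
slow_batch = slow.poll()  # the slow subscriber still sees everything
print(first_batch, second_batch, slow_batch)
```

Because the log retains messages and each subscriber keeps its own position, several readers can load the same data in parallel without interfering with one another.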
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology, you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts, and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
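The spout-and-bolt structure can be imitated with Python generators. The sketch below does not use Storm and runs in a single process over a short list, whereas a real topology runs continuously on a cluster. The function names and the word-count task are assumptions chosen for illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Hypothetical spout: emits one incoming event at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Hypothetical bolt: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Hypothetical bolt: emits a running count as each word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


events = ["storm processes streams", "Hadoop processes batches"]

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout(events)))
results = list(topology)
print(results)
```

Each word produces an updated count as soon as it arrives, rather than after the whole input has been read, which is the contrast with a batch job that the paragraph above describes.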