Big Data Ingestion and Streaming Product Introduction
Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features, but there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, at the time Chukwa was introduced, HDFS did not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Its durability features include agent-side replaying of data to recover from errors. See also Flume.
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
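The difference between a snapshot and an incremental update can be illustrated with a small sketch. This is not Sqoop's API: it uses Python's built-in sqlite3 module as a stand-in for the relational database, and the `users` table, its columns, and the helper names `snapshot` and `incremental` are all hypothetical. It assumes rows are only ever appended with an increasing `id`.

```python
import sqlite3

def snapshot(conn):
    """Full snapshot: copy every row of the (hypothetical) users table."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()

def incremental(conn, last_id):
    """Incremental update: copy only rows whose id exceeds the last imported id."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_id
    return rows, new_last

# An in-memory database stands in for the source datastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# First transfer: take a snapshot and remember the highest id seen.
imported = snapshot(conn)
last_id = imported[-1][0]

# Later, a new row arrives; only the delta is transferred.
conn.execute("INSERT INTO users VALUES (?, ?)", (3, "cy"))
delta, last_id = incremental(conn, last_id)
imported.extend(delta)
```

Tracking a "last seen" value keeps each later transfer proportional to the amount of new data rather than to the size of the whole table. Handling updated or deleted rows would need a different check, such as a modification timestamp.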
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize I/O performance, and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios.
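The idea of persistent publish-subscribe messaging can be sketched with a toy in-memory model. This is not Kafka's client API: `ToyBroker` and `ToyConsumer` are hypothetical names, the "log" is a Python list rather than a file on disk, and zlib stands in for whatever compression a real deployment would use. The point it illustrates is that messages stay in the log after being read, so each subscriber can consume at its own pace.

```python
import zlib
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        # Store the payload compressed and return its offset in the log.
        self._logs[topic].append(zlib.compress(message.encode("utf-8")))
        return len(self._logs[topic]) - 1

    def fetch(self, topic, offset):
        # Return all messages at or after `offset`; reading does not remove them.
        return [zlib.decompress(m).decode("utf-8") for m in self._logs[topic][offset:]]


class ToyConsumer:
    """Each subscriber tracks its own read position, so readers do not interfere."""

    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        messages = self.broker.fetch(self.topic, self.offset)
        self.offset += len(messages)
        return messages


broker = ToyBroker()
a = ToyConsumer(broker, "logs")
b = ToyConsumer(broker, "logs")

broker.publish("logs", "line 1")
broker.publish("logs", "line 2")
first_a = a.poll()       # consumer a reads the first two messages
broker.publish("logs", "line 3")
second_a = a.poll()      # a resumes from its own offset
all_b = b.poll()         # b starts from the beginning and sees everything
```

Because the broker only appends and consumers only remember an offset, adding another subscriber costs the broker nothing beyond serving its reads.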
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology, you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts, and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
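The spout-and-bolt structure described above can be sketched with plain Python generators. This is not Storm's API and runs in a single process. The names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and the spout reads from a finite list only so that the example terminates, whereas a real topology runs continuously.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: the source of the stream. A real spout would not run out of input."""
    for sentence in sentences:
        yield sentence

def split_bolt(stream):
    """Bolt: turn each incoming sentence into one event per word."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word

def count_bolt(stream):
    """Bolt: keep running counts and emit an updated (word, count) pair per event."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

def run_topology(sentences):
    # Wire spout -> split bolt -> count bolt; events flow through one at a time.
    return list(count_bolt(split_bolt(sentence_spout(sentences))))

updates = run_topology(["the cow jumped", "the moon"])
```

Each event produces an updated count as soon as it arrives, rather than after the whole input has been read. That is the contrast with a batch job that only reports results when it terminates.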