Flume Flume isdistributed system for collecting log data from many sources, aggregating it,and writing it to HDFS. It is designed to be reliable and highly available, whileproviding a simple, flexible, and intuitive programming model based onstreamin…
Last week ,I lead the meeting for new project. i'm  very excited. The meeting is divided into the following sections. 1 project introduce. The project will be an upgraded version of the existing Yitoa GPTimer. Additional features are required to faci…
References: 1. Stanford University CS97SI by Jaehyun Park 2. Introduction to Algorithms 3. Kuangbin's ACM Template 4. Data Structures by Dayou Liu 5. Euler's Totient Function Getting Started: 1) What is a good algorithm? The answer could be about cor…
Text mining is the application of natural language processing techniques and analytical methods to text data in order to derive relevant information. Text mining is getting a lot attention these last years, due to an exponential increase in digital t…
Introduction Optimization is always the ultimate goal whether you are dealing with a real life problem or building a software product. I, as a computer science student, always fiddled with optimizing my code to the extent that I could brag about its…
本文转自:http://msdn.microsoft.com/en-us/library/bb445504.aspx Scott Mitchell April 2007 Summary: This is the Visual C# tutorial. (Switch to the Visual Basic tutorial.) The default-paging option of a data presentation control is unsuitable when working w…
Accumulators and Broadcast Variables 这些不能从checkpoint重新恢复 如果想启动检查点的时候使用这两个变量,就需要创建这写变量的懒惰的singleton实例. 下面是一个例子: def getWordBlacklist(sparkContext): if ('wordBlacklist' not in globals()): globals()['wordBlacklist'] = sparkContext.broadcast(["a", &…
Are you a interested in taking a course with us? Learn about our programs or contact us at hello@zipfianacademy.com. There are plenty of articles and discussions on the web about what data science is, what qualitiesdefine a data scientist, how to nur…
zhuan :https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan Big Data technology has been extremely disruptive with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptiv…
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Input DStreams 和 Receivers(接收器) DStreams 上的 Transformations(转换) DStreams 上的输出操作 DataFrame 和 SQL 操作 MLlib 操作 缓存 / 持久性 Checkpointing Accumulators, Broadcas…