A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Hadoop

  • Apache Hadoop - Open-source framework for distributed storage and processing of large data sets
  • Apache Tez - A framework for YARN-based data processing applications in Hadoop
  • SpatialHadoop - A MapReduce extension to Apache Hadoop designed specifically to work with spatial data
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • dumbo - Python module that allows you to easily write and run Hadoop programs.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - A Python 2.5+ package that helps you write and run Hadoop Streaming jobs (a word-count sketch follows this list)
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Kiji Project
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
  • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
  • Apache Ignite - Distributed in-memory platform
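
Several of the Python entries above drive Hadoop MapReduce from plain Python classes. As a rough illustration, here is a minimal mrjob word-count job; it is a sketch only, and the input file and any Hadoop configuration are assumed to be supplied on the command line.

```python
# Minimal mrjob word-count sketch.
# Run locally with: python wordcount.py input.txt
# Add "-r hadoop" to submit it as a Hadoop Streaming job instead.
from mrjob.job import MRJob


class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated token.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by the mappers.
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```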

YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on YARN

NoSQL

Next-generation databases, mostly addressing some of these points: non-relational, distributed, open-source and horizontally scalable.

  • Apache HBase - The Hadoop database, a distributed, scalable, big data store
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • happybase - A developer-friendly Python library to interact with Apache HBase (a usage sketch follows this list)
  • Hannibal - A tool to help monitor and maintain HBase clusters that are configured for manual splitting
  • Haeinsa - A linearly scalable multi-row, multi-table transaction library for HBase
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra
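
The HBase client libraries above expose the store through simple put/get primitives. Below is a minimal happybase sketch, assuming an HBase Thrift gateway is running; the host name, table and column names are placeholders.

```python
# Minimal happybase sketch: write and read one row over HBase's Thrift gateway.
# 'hbase-thrift-host' and the 'users' table are placeholders for this example.
import happybase

connection = happybase.Connection('hbase-thrift-host')
table = connection.table('users')

# Store a single cell in column family 'info'.
table.put(b'row-1', {b'info:name': b'Ada'})

# Read the row back as a dict of {column: value}.
print(table.row(b'row-1'))
connection.close()
```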

SQL on Hadoop

  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL (a client query sketch follows this list)
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
  • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • Cloudera Impala
  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
  • Apache Tajo - Data warehouse system for Apache Hadoop
  • Apache Drill - Schema-free SQL Query Engine
  • Apache Trafodion
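
All of the engines above expose SQL (or a SQL dialect) over data stored in Hadoop. As one illustration of what client code can look like, the sketch below issues a HiveQL query against a HiveServer2 endpoint using PyHive, a client library that is not part of this list; the host, port and table name are placeholder assumptions.

```python
# Sketch: query Hive through HiveServer2 with PyHive (library not listed above).
# 'hive-server-host' and the 'page_views' table are placeholders.
from pyhive import hive

conn = hive.connect(host='hive-server-host', port=10000)
cursor = conn.cursor()

# A typical aggregate over a fact table, expressed in HiveQL.
cursor.execute(
    "SELECT dt, COUNT(*) AS views FROM page_views GROUP BY dt ORDER BY dt"
)
for dt, views in cursor.fetchall():
    print(dt, views)
```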

Data Management

  • Apache Calcite - A Dynamic Data Management Framework
  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

  • Apache Oozie - Workflow scheduler system to manage Apache Hadoop jobs
  • Azkaban
  • Apache Falcon - Data management and processing platform
  • Apache NiFi - A dataflow system
  • Apache Airflow - A workflow automation and scheduling system that can be used to author and manage data pipelines
  • Luigi - Python package that helps you build complex pipelines of batch jobs (a minimal task sketch follows this list)
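
Most of these schedulers model a pipeline as a graph of dependent tasks with explicit outputs. As a minimal illustration, here is a single Luigi task; the file names are placeholders, and real pipelines would chain tasks together via requires() and use HDFS targets.

```python
# Minimal Luigi sketch: one task that counts lines in a local file.
# 'input.txt' and 'line_count.txt' are placeholder paths.
import luigi


class CountLines(luigi.Task):
    input_path = luigi.Parameter(default='input.txt')

    def output(self):
        # Luigi uses the existence of this target to decide whether the task ran.
        return luigi.LocalTarget('line_count.txt')

    def run(self):
        with open(self.input_path) as f:
            count = sum(1 for _ in f)
        with self.output().open('w') as out:
            out.write(str(count))


if __name__ == '__main__':
    luigi.run()
```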

Data Ingestion and Integration

DSL

  • Apache Pig - A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
  • varaha - Machine learning and natural language processing with Apache Pig
  • packetpig - Open Source Big Data Security Analytics
  • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g., in bioinformatics) in Hadoop
  • Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

  • Apache Spark - Fast and general engine for large-scale data processing (a PySpark sketch follows this list)

    • Spark Packages - A community index of packages for Apache Spark
    • SparkHub - A community site for Apache Spark
  • Apache Crunch
  • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
  • Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
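
These engines all expose a higher-level programming model than raw MapReduce. As a small point of comparison with the mrjob example earlier, here is the same word count written against the PySpark RDD API; the HDFS paths are placeholders and the cluster configuration is assumed to come from spark-submit.

```python
# PySpark word-count sketch; paths are placeholders and the Spark
# configuration is expected to be supplied via spark-submit.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("hdfs:///path/to/input")
      .flatMap(lambda line: line.split())   # split lines into words
      .map(lambda word: (word, 1))          # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)      # sum counts per word
)

counts.saveAsTextFile("hdfs:///path/to/output")
sc.stop()
```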

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Packaging and tests of the Apache Hadoop ecosystem
  • Apache Ambari - A tool for provisioning, managing, and monitoring Apache Hadoop clusters
  • Ganglia Monitoring System
  • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache ZooKeeper - A centralized service for maintaining configuration information, naming, and providing distributed synchronization
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • Buildoop - Hadoop Ecosystem Builder
  • Deploop - The Hadoop Deploy System
  • Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
  • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

  • Big Data Benchmark
  • HiBench
  • Big-Bench
  • hive-benchmarks
  • hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
  • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

  • Apache Mahout
  • Oryx 2 - Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
  • MLlib - Apache Spark's scalable machine learning library (a k-means sketch follows this list)
  • R - R is a free software environment for statistical computing and graphics.
  • RHadoop - including RHDFS, RHBase, RMR2 and plyrmr
  • RHive - For launching Hive queries from R
  • Apache Lens
  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
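
As a sketch of what the Spark-based libraries above look like in practice, here is a minimal MLlib k-means run using the RDD-based API; the input path, delimiter and value of k are placeholder assumptions.

```python
# Minimal MLlib k-means sketch (RDD-based API).
# The HDFS path and k=3 are placeholders for this example.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansSketch")

# Each input line is assumed to be a comma-separated numeric vector.
points = (
    sc.textFile("hdfs:///path/to/points.csv")
      .map(lambda line: [float(x) for x in line.split(",")])
)

model = KMeans.train(points, k=3, maxIterations=10)
print(model.clusterCenters)

sc.stop()
```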

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Presentations

Books

Hadoop and Big Data Events
