A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Hadoop

  • Apache Hadoop - Open-source framework for distributed storage and processing of large data sets
  • Apache Tez - A framework for YARN-based data processing applications in Hadoop
  • SpatialHadoop - A MapReduce extension to Apache Hadoop designed specifically to work with spatial data
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • dumbo - Python module that allows you to easily write and run Hadoop programs.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - A Python 2.5+ package that helps you write and run Hadoop Streaming jobs (a word-count sketch follows this list)
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Kiji Project
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
  • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
  • Apache Ignite - Distributed in-memory platform
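
Several of the Python entries above drive Hadoop MapReduce from plain Python classes. As a rough illustration, here is a minimal mrjob word-count job; it is a sketch only, and the input file and any Hadoop configuration are assumed to be supplied on the command line.

```python
# Minimal mrjob word-count sketch.
# Run locally with: python wordcount.py input.txt
# Add "-r hadoop" to submit it as a Hadoop Streaming job instead.
from mrjob.job import MRJob


class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated token.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by the mappers.
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```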

YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on YARN

NoSQL

Next-generation databases, mostly addressing some of these points: non-relational, distributed, open-source and horizontally scalable.

  • Apache HBase - The Hadoop database, a distributed, scalable, big data store
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • happybase - A developer-friendly Python library to interact with Apache HBase (a usage sketch follows this list)
  • Hannibal - A tool to help monitor and maintain HBase clusters that are configured for manual splitting
  • Haeinsa - A linearly scalable multi-row, multi-table transaction library for HBase
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra
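
The HBase client libraries above expose the store through simple put/get primitives. Below is a minimal happybase sketch, assuming an HBase Thrift gateway is running; the host name, table and column names are placeholders.

```python
# Minimal happybase sketch: write and read one row over HBase's Thrift gateway.
# 'hbase-thrift-host' and the 'users' table are placeholders for this example.
import happybase

connection = happybase.Connection('hbase-thrift-host')
table = connection.table('users')

# Store a single cell in column family 'info'.
table.put(b'row-1', {b'info:name': b'Ada'})

# Read the row back as a dict of {column: value}.
print(table.row(b'row-1'))
connection.close()
```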

SQL on Hadoop

  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL (a client query sketch follows this list)
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
  • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • Cloudera Impala
  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
  • Apache Tajo - Data warehouse system for Apache Hadoop
  • Apache Drill - Schema-free SQL Query Engine
  • Apache Trafodion
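
All of the engines above expose SQL (or a SQL dialect) over data stored in Hadoop. As one illustration of what client code can look like, the sketch below issues a HiveQL query against a HiveServer2 endpoint using PyHive, a client library that is not part of this list; the host, port and table name are placeholder assumptions.

```python
# Sketch: query Hive through HiveServer2 with PyHive (library not listed above).
# 'hive-server-host' and the 'page_views' table are placeholders.
from pyhive import hive

conn = hive.connect(host='hive-server-host', port=10000)
cursor = conn.cursor()

# A typical aggregate over a fact table, expressed in HiveQL.
cursor.execute(
    "SELECT dt, COUNT(*) AS views FROM page_views GROUP BY dt ORDER BY dt"
)
for dt, views in cursor.fetchall():
    print(dt, views)
```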

Data Management

  • Apache Calcite - A Dynamic Data Management Framework
  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

  • Apache Oozie - Workflow scheduler system to manage Apache Hadoop jobs
  • Azkaban
  • Apache Falcon - Data management and processing platform
  • Apache NiFi - A dataflow system
  • Apache Airflow - A workflow automation and scheduling system that can be used to author and manage data pipelines
  • Luigi - Python package that helps you build complex pipelines of batch jobs (a minimal task sketch follows this list)
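
Most of these schedulers model a pipeline as a graph of dependent tasks with explicit outputs. As a minimal illustration, here is a single Luigi task; the file names are placeholders, and real pipelines would chain tasks together via requires() and use HDFS targets.

```python
# Minimal Luigi sketch: one task that counts lines in a local file.
# 'input.txt' and 'line_count.txt' are placeholder paths.
import luigi


class CountLines(luigi.Task):
    input_path = luigi.Parameter(default='input.txt')

    def output(self):
        # Luigi uses the existence of this target to decide whether the task ran.
        return luigi.LocalTarget('line_count.txt')

    def run(self):
        with open(self.input_path) as f:
            count = sum(1 for _ in f)
        with self.output().open('w') as out:
            out.write(str(count))


if __name__ == '__main__':
    luigi.run()
```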

Data Ingestion and Integration

DSL

  • Apache Pig - A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
  • varaha - Machine learning and natural language processing with Apache Pig
  • packetpig - Open Source Big Data Security Analytics
  • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g., in bioinformatics) in Hadoop
  • Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

  • Apache Spark - Fast and general engine for large-scale data processing (a PySpark sketch follows this list)

    • Spark Packages - A community index of packages for Apache Spark
    • SparkHub - A community site for Apache Spark
  • Apache Crunch
  • Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
  • Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
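
These engines all expose a higher-level programming model than raw MapReduce. As a small point of comparison with the mrjob example earlier, here is the same word count written against the PySpark RDD API; the HDFS paths are placeholders and the cluster configuration is assumed to come from spark-submit.

```python
# PySpark word-count sketch; paths are placeholders and the Spark
# configuration is expected to be supplied via spark-submit.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("hdfs:///path/to/input")
      .flatMap(lambda line: line.split())   # split lines into words
      .map(lambda word: (word, 1))          # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)      # sum counts per word
)

counts.saveAsTextFile("hdfs:///path/to/output")
sc.stop()
```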

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Packaging and tests of the Apache Hadoop ecosystem
  • Apache Ambari - A tool for provisioning, managing, and monitoring Apache Hadoop clusters
  • Ganglia Monitoring System
  • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache ZooKeeper - A centralized service for maintaining configuration information, naming, and providing distributed synchronization
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • Buildoop - Hadoop Ecosystem Builder
  • Deploop - The Hadoop Deploy System
  • Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
  • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

  • Big Data Benchmark
  • HiBench
  • Big-Bench
  • hive-benchmarks
  • hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
  • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

  • Apache Mahout
  • Oryx 2 - Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning
  • MLlib - Apache Spark's scalable machine learning library (a k-means sketch follows this list)
  • R - R is a free software environment for statistical computing and graphics.
  • RHadoop - including RHDFS, RHBase, RMR2 and plyrmr
  • RHive - For launching Hive queries from R
  • Apache Lens
  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
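
As a sketch of what the Spark-based libraries above look like in practice, here is a minimal MLlib k-means run using the RDD-based API; the input path, delimiter and value of k are placeholder assumptions.

```python
# Minimal MLlib k-means sketch (RDD-based API).
# The HDFS path and k=3 are placeholders for this example.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansSketch")

# Each input line is assumed to be a comma-separated numeric vector.
points = (
    sc.textFile("hdfs:///path/to/points.csv")
      .map(lambda line: [float(x) for x in line.split(",")])
)

model = KMeans.train(points, k=3, maxIterations=10)
print(model.clusterCenters)

sc.stop()
```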

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Presentations

Books

Hadoop and Big Data Events
