MongoDB Connector for Hadoop
https://github.com/mongodb/mongo-hadoop
Purpose
The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce jobs. It is designed to provide greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.
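As an illustration, a minimal job setup might look like the following sketch, which wires MongoDB in as both input source and output destination through the connector's MongoInputFormat and MongoOutputFormat classes (the connection URIs and collection names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical URIs: read from db.in, write results to db.out.
        conf.set("mongo.input.uri", "mongodb://localhost:27017/db.in");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/db.out");

        Job job = new Job(conf, "mongo-hadoop sketch");
        job.setJarByClass(MongoJobSketch.class);
        // Mapper and Reducer classes are omitted here; set them as in
        // any other MapReduce job.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```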
Current stable release: 1.2.0
Features
- Can create data splits to read from standalone, replica set, or sharded configurations
- Source data can be filtered with queries written in the MongoDB query language (see the sketch after this list)
- Supports Hadoop Streaming, allowing job code to be written in any language (Python, Ruby, and Node.js are currently supported)
- Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
- Can write data out in .bson format, which can then be imported into any MongoDB database with mongorestore
- Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive
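To illustrate the query-filtering feature above, here is a sketch of restricting job input through the mongo.input.query setting; the query string and its "status" field are illustrative, not taken from the project's documentation:

```java
import org.apache.hadoop.conf.Configuration;

public class QueryFilterSketch {
    static Configuration filteredInput() {
        Configuration conf = new Configuration();
        // Only documents matching this MongoDB query (a JSON string)
        // are used as job input; "status" is an illustrative field name.
        conf.set("mongo.input.query", "{\"status\": \"A\"}");
        return conf;
    }
}
```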
Download
See the release page.
Building
To build, first edit the value for hadoopRelease in ThisBuild in the build.sbt file to select the Hadoop distribution you want to build against. For example, to build for CDH4:
hadoopRelease in ThisBuild := "cdh4"
or for Hadoop 1.0.x:
hadoopRelease in ThisBuild := "1.0"
To determine which value you need to set in this file, refer to the list of distributions below. Then run ./sbt package to build the jars, which will be generated in the core/target/ directory.
After a successful build, copy the jars to the lib directory on each node in your Hadoop cluster. Depending on which Hadoop release you are using, this is usually one of the following locations:
- $HADOOP_HOME/lib/
- $HADOOP_HOME/share/hadoop/mapreduce/
- $HADOOP_HOME/share/hadoop/lib/
Supported Distributions of Hadoop
Apache Hadoop 1.0
- Does not support Hadoop Streaming.
- Build using "1.0" or "1.0.x"

Apache Hadoop 1.1
- Includes support for Hadoop Streaming.
- Build using "1.1" or "1.1.x"

Apache Hadoop 0.20.*
- Does not support Hadoop Streaming.
- Includes Pig 0.9.2.
- Build using "0.20" or "0.20.x"

Apache Hadoop 0.23
- Includes Pig 0.9.2.
- Includes support for Streaming.
- Build using "0.23" or "0.23.x"

Cloudera Distribution for Hadoop Release 4 (CDH4)
- The newest release from Cloudera, based on Apache Hadoop 2.0. The newer MR2/YARN APIs are not yet supported, but MR1 is still fully compatible.
- Includes support for Streaming and Pig 0.11.1.
- Build using "cdh4"

Apache Hadoop 2.2
- Includes Pig 0.9.2.
- Includes support for Streaming.
- Build using "2.2" or "2.2.x"
Configuration
Streaming
Examples
Usage with static .bson (mongo backup) files
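A minimal sketch of pointing a job at static .bson backup files instead of a live MongoDB instance, assuming the connector's BSONFileInputFormat (the HDFS path shown is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonInputSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "bson-input sketch");
        job.setJarByClass(BsonInputSketch.class);
        // Read a dump produced by mongodump; the path is hypothetical.
        job.setInputFormatClass(BSONFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///backups/db/collection.bson"));
        // Mapper/Reducer and output configuration are omitted for brevity.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```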
Usage with Amazon Elastic MapReduce
Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.
Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.
Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires bootstrap actions that fetch the dependencies (the MongoDB Java driver, the mongo-hadoop-core libraries, etc.) and place them in the Hadoop distribution's lib folders.
For a full example (running the enron example on Elastic MapReduce) please see here.
Usage with Pig
Documentation on Pig with the MongoDB Connector for Hadoop.
For examples on using Pig with the MongoDB Connector for Hadoop, also refer to the examples section.
Notes for Contributors
If your code introduces new features, please add tests that cover them where possible, and make sure the existing test suite still passes. If you're not sure how to write a test for a feature, or you have trouble with a test failure, please post to the Google Group with details and we will try to help.
Maintainers
Mike O'Brien (mikeo@10gen.com)
Contributors
- Brendan McAdams brendan@10gen.com
- Eliot Horowitz erh@10gen.com
- Ryan Nitz ryan@10gen.com
- Russell Jurney (@rjurney) (Lots of significant Pig improvements)
- Sarthak Dudhara sarthak.83@gmail.com (BSONWritable comparable interface)
- Priya Manda priyakanth024@gmail.com (Test Harness Code)
- Rushin Shah rushin10@gmail.com (Test Harness Code)
- Joseph Shraibman jks@iname.com (Sharded Input Splits)
- Sumin Xia xiasumin1984@gmail.com (Sharded Input Splits)
- Jeremy Karn
- bpfoster
- Ross Lawley
- Carsten Hufe
- Asya Kamsky
- Thomas Millar
Support
Issue tracking: https://jira.mongodb.org/browse/HADOOP/
Discussion: http://groups.google.com/group/mongodb-user/