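Installing and Configuring a Hadoop 2.6.0 Pseudo-Distributed Environment on Ubuntu 15.10, and Trying Out Hadoop Streaming
I am using Ubuntu 15.10 Beta 2; the final release is apparently not due until the 22nd of this month.
My main references are http://www.powerxing.com/install-hadoop-cluster/ and the book 《Hadoop基础教程》.
My username is wuyouwulv, so wherever wuyouwulv appears in the files and commands below, replace it with your own username.
A pseudo-distributed Hadoop setup does not require a dedicated group and user, so I use the wuyouwulv account throughout.
All the files I need are in the hadoop2.6 directory at the root of my USB drive: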
core-site.xml
hadoop-2.6.0.tar.gz
hadoop-env.sh
hdfs-site.xml
mapred-site.xml
onenodeinstall.sh
readme.txt
The contents of the main files are as follows:
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/wuyouwulv/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hadoop-env.sh (the only real change here is JAVA_HOME)
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/default-java

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/wuyouwulv/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/wuyouwulv/hadoop/tmp/dfs/data</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
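Before copying site files like the three above into hadoop/etc/hadoop, it can help to confirm that they parse and contain the properties you expect. The following is only a minimal sketch in Python 3 using the standard library; the function name read_hadoop_properties and the SAMPLE string are made up for illustration and are not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Illustrative input that mirrors the structure of the files above.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, val in sorted(read_hadoop_properties(SAMPLE).items()):
        print(key + " = " + val)
```

For a real file, the text can be read with open(path).read() first; a malformed file raises xml.etree.ElementTree.ParseError, which is usually easier to interpret than a daemon failing at startup.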
onenodeinstall.sh
#!/bin/bash

# enable ssh localhost
ssh-keygen
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh

# install hadoop (the relative paths below assume the home directory)
cd /home/wuyouwulv/
tar -zxvf /mnt/usb/hadoop2.6/hadoop-2.6.0.tar.gz -C /home/wuyouwulv/
mv hadoop-2.6.0/ hadoop

# hadoop environment setting
cp /mnt/usb/hadoop2.6/core-site.xml hadoop/etc/hadoop/core-site.xml
cp /mnt/usb/hadoop2.6/hdfs-site.xml hadoop/etc/hadoop/hdfs-site.xml
cp /mnt/usb/hadoop2.6/mapred-site.xml hadoop/etc/hadoop/mapred-site.xml
cp /mnt/usb/hadoop2.6/hadoop-env.sh hadoop/etc/hadoop/hadoop-env.sh
hadoop-2.6.0.tar.gz can be downloaded online.
Java must be installed before Hadoop. I installed the default JDK version:
$ echo $JAVA_HOME
/usr/lib/jvm/default-java
SSH also needs to be configured so that ssh localhost works (the first part of onenodeinstall.sh does this).
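If ssh localhost still asks for a password after the key has been copied, overly open permissions on ~/.ssh are a common cause. The sketch below, in Python 3, checks for the conventional layout in which only the owner can access ~/.ssh and authorized_keys; the helper names too_permissive and check_ssh_dir are hypothetical, and the check is deliberately strict rather than an exact copy of what the SSH server enforces.

```python
import os
import stat


def too_permissive(path):
    """Return True if group or others hold any permission bit on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def check_ssh_dir(ssh_dir):
    """List the paths under ssh_dir that are missing or too open."""
    problems = []
    for path in (ssh_dir, os.path.join(ssh_dir, "authorized_keys")):
        if not os.path.exists(path):
            problems.append(path + ": missing")
        elif too_permissive(path):
            problems.append(path + ": accessible by group/others")
    return problems


if __name__ == "__main__":
    # Prints nothing when both paths exist and are owner-only.
    for problem in check_ssh_dir(os.path.expanduser("~/.ssh")):
        print(problem)
```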
Because all my files are in the hadoop2.6 directory on the USB drive, I first need to mount the drive at /mnt/usb/ before installing Hadoop:
$ sudo mkdir /mnt/usb
$ sudo mount -t vfat /dev/sdb1 /mnt/usb/
I want to install Hadoop in ~/hadoop. The commands are:
# install hadoop
~$ tar -zxvf /mnt/usb/hadoop2.6/hadoop-2.6.0.tar.gz -C /home/wuyouwulv/
~$ mv hadoop-2.6.0/ hadoop
# hadoop environment setting
~$ cp /mnt/usb/hadoop2.6/core-site.xml hadoop/etc/hadoop/core-site.xml
~$ cp /mnt/usb/hadoop2.6/hdfs-site.xml hadoop/etc/hadoop/hdfs-site.xml
~$ cp /mnt/usb/hadoop2.6/mapred-site.xml hadoop/etc/hadoop/mapred-site.xml
~$ cp /mnt/usb/hadoop2.6/hadoop-env.sh hadoop/etc/hadoop/hadoop-env.sh
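That completes the installation. Now Hadoop can be started:
~$ cd ~/hadoop
~/hadoop$ bin/hdfs namenode -format # format the NameNode
~/hadoop$ sbin/start-dfs.sh # start the daemons
~/hadoop$ jps # check whether startup succeeded
If startup succeeded, the following processes are listed: NameNode, DataNode, and SecondaryNameNode.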
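The jps check can also be automated. This is a rough Python 3 sketch that takes the text printed by jps and reports which of the three daemons are absent; it assumes each output line has the form "<pid> <process name>", and the function name missing_daemons and the sample PIDs are invented for illustration.

```python
REQUIRED = {"NameNode", "DataNode", "SecondaryNameNode"}


def missing_daemons(jps_output, required=REQUIRED):
    """Return the sorted names in `required` that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            running.add(parts[1])  # second column is the process name
    return sorted(set(required) - running)


if __name__ == "__main__":
    # Illustrative output only; real PIDs will differ.
    sample = "4001 NameNode\n4120 DataNode\n4310 SecondaryNameNode\n4450 Jps\n"
    print(missing_daemons(sample))
```

In practice the input string could come from subprocess.check_output(["jps"]) decoded to text; an empty list means all three daemons are up.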
Running a Python WordCount MapReduce program with Hadoop Streaming:
~/hadoop$ bin/hdfs dfs -mkdir -p /user/wuyouwulv # create the user's HDFS home directory
~/hadoop$ bin/hdfs dfs -mkdir input
~/hadoop$ bin/hdfs dfs -copyFromLocal test.txt input # test.txt contains some words
~/hadoop$ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
> -file wcmapper.py -mapper wcmapper.py -file wcreducer.py -reducer wcreducer.py \
> -input input -output output
When the job finishes, the results are written to the output directory.
~/hadoop$ bin/hdfs dfs -cat output/* # view the output
wcmapper.py:
#!/usr/bin/python
import sys

# Emit "<word>\t1" for every whitespace-separated word on stdin.
for line in sys.stdin:
    a = line.split()
    for x in a:
        print(x + "\t1")
wcreducer.py:
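#!/usr/bin/python
import sys

current = ""
count = 0

# Hadoop Streaming sorts the mapper output by key,
# so identical words arrive on consecutive lines.
for line in sys.stdin:
    word, c = line.rstrip("\n").split("\t")
    if word == current:
        count += int(c)
    else:
        if current != "":
            print(current + "\t" + str(count))
        current = word
        count = int(c)

# Emit the last word, unless there was no input at all.
if current != "":
    print(current + "\t" + str(count))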
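Note that the wuyouwulv in bin/hdfs dfs -mkdir -p /user/wuyouwulv must be the name of the current user; see http://stackoverflow.com/questions/20821584/hadoop-2-2-installation-no-such-file-or-directory
The input and output arguments refer to directories in HDFS, not local directories. Relative HDFS paths such as input resolve under /user/wuyouwulv, which is why that directory has to exist first.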
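To make the path rule concrete, here is a small Python 3 sketch of how a relative HDFS path maps onto the current user's home directory under /user. The helpers hdfs_home and resolve_hdfs_path are hypothetical names written for this note; they only mimic the default resolution and do not talk to HDFS.

```python
import getpass
import posixpath


def hdfs_home(user=None):
    """Default HDFS home directory for a user (the current login user if None)."""
    return posixpath.join("/user", user or getpass.getuser())


def resolve_hdfs_path(path, user=None):
    """Absolute paths are kept; relative ones are placed under the HDFS home."""
    if path.startswith("/"):
        return path
    return posixpath.join(hdfs_home(user), path)


if __name__ == "__main__":
    print(resolve_hdfs_path("input", "wuyouwulv"))   # /user/wuyouwulv/input
    print(resolve_hdfs_path("output", "wuyouwulv"))  # /user/wuyouwulv/output
```

If the job is submitted by a different user than the one whose directory was created, the relative input path points somewhere else, which matches the "No such file or directory" symptom discussed in the linked question.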
In the end, this program implements WordCount.
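The map, sort, and reduce stages can be tried out locally before submitting anything to Hadoop. The Python 3 sketch below imitates that pipeline in a single process; it is not the code Hadoop Streaming runs, the function names are made up, and the reduce step uses itertools.groupby instead of the hand-written current/count loop in wcreducer.py.

```python
from itertools import groupby


def map_words(lines):
    """Mapper stage: yield (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer stage: sum the counts of consecutive pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, then sort by key (standing in for the shuffle), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```

The sorted() call matters: like the reducer script, groupby only merges adjacent equal keys, so unsorted input would produce several partial counts for the same word.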
Both Python scripts must be given execute permission before the job is submitted:
~/hadoop$ chmod a+x *.py