Haha — within a few days I got two phone calls from Baidu, both with good news, and without really noticing it my productivity went up as well. After several days of tinkering I finally got Hadoop configured on a single machine and successfully ran an example job. Yay!

Reposted from: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Running Hadoop on Ubuntu Linux (Single-Node Cluster)

In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

Are you looking for the multi-node cluster tutorial? Just head over there.

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.

This tutorial has been tested with the following software versions:

  • Ubuntu Linux 10.04 LTS (deprecated: 8.10 LTS, 8.04, 7.10, 7.04)
  • Hadoop 1.0.3, released May 2012

Figure 1: Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)

Prerequisites

Sun Java 6

Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using Java 1.6 (aka Java 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.

Important Note: The apt instructions below are taken from this SuperUser.com thread. I got notified that the previous instructions that I provided no longer work. Please be aware that adding a third-party repository to your Ubuntu configuration is considered a security risk. If you do not want to proceed with the apt instructions below, feel free to install Sun JDK 6 via alternative means (e.g. by downloading the binary package from Oracle) and then continue with the next section in the tutorial.
  # Add the Ferramosca Roberto's repository to your apt repositories
  # See https://launchpad.net/~ferramroberto/
  #
  $ sudo apt-get install python-software-properties
  $ sudo add-apt-repository ppa:ferramroberto/java

  # Update the source list
  $ sudo apt-get update

  # Install Sun Java 6 JDK
  $ sudo apt-get install sun-java6-jdk

  # Select Sun's Java as the default on your machine.
  # See 'sudo update-alternatives --config java' for more information.
  #
  $ sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu).

After installation, make a quick check whether Sun’s JDK is correctly set up:

  user@ubuntu:~# java -version
  java version "1.6.0_20"
  Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
  Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

  $ sudo addgroup hadoop
  $ sudo adduser --ingroup hadoop hduser

This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several online guides available.
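If SSH is not installed yet, on Ubuntu the OpenSSH server can usually be installed with a single apt command; this is a minimal sketch, assuming the stock Ubuntu package names:

  $ sudo apt-get install openssh-server
  # verify that the SSH daemon is running
  $ pgrep -l sshd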

First, we have to generate an SSH key for the hduser user.

  user@ubuntu:~$ su - hduser
  hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
  Generating public/private rsa key pair.
  Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
  Created directory '/home/hduser/.ssh'.
  Your identification has been saved in /home/hduser/.ssh/id_rsa.
  Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
  The key fingerprint is:
  9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
  The key's randomart image is:
  [...snipp...]
  hduser@ubuntu:~$

The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.

  hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
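For example, a host-specific entry for a non-standard SSH port might look like the sketch below; the port number and key path are illustrative assumptions, not values this tutorial requires:

  # $HOME/.ssh/config (hypothetical example)
  Host localhost
      Port 2222
      IdentityFile ~/.ssh/id_rsa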

  hduser@ubuntu:~$ ssh localhost
  The authenticity of host 'localhost (::1)' can't be established.
  RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
  Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
  Ubuntu 10.04 LTS
  [...snipp...]
  hduser@ubuntu:~$

If the SSH connection fails, these general tips might help:

  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload. A quick way to inspect these options is sketched right after this list.
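The following one-liner is a small convenience sketch; it assumes the standard Ubuntu location of sshd_config used throughout this tutorial:

  # show the PubkeyAuthentication and AllowUsers settings, if they are set
  $ grep -E '^(PubkeyAuthentication|AllowUsers)' /etc/ssh/sshd_config
  # reload the SSH server after editing the file
  $ sudo /etc/init.d/ssh reload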

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

/etc/sysctl.conf

  # disable ipv6
  net.ipv6.conf.all.disable_ipv6 = 1
  net.ipv6.conf.default.disable_ipv6 = 1
  net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.
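If you want to avoid an immediate reboot, on most Ubuntu systems the settings from /etc/sysctl.conf can also be applied in place; treat this as an optional convenience rather than a step from the original walkthrough:

  # re-read /etc/sysctl.conf and apply the new settings without rebooting
  $ sudo sysctl -p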

You can check whether IPv6 is enabled on your machine with the following command:

  $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).

Alternative

You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:

conf/hadoop-env.sh

  export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Hadoop

Installation

Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

  $ cd /usr/local
  $ sudo tar xzf hadoop-1.0.3.tar.gz
  $ sudo mv hadoop-1.0.3 hadoop
  $ sudo chown -R hduser:hadoop hadoop

(Just to give you the idea, YMMV – personally, I create a symlink from hadoop-1.0.3 to hadoop.)
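Just as an illustration of that symlink variant (the paths follow the /usr/local convention used above; none of this is required for the rest of the tutorial):

  $ cd /usr/local
  $ sudo tar xzf hadoop-1.0.3.tar.gz
  $ sudo ln -s /usr/local/hadoop-1.0.3 /usr/local/hadoop
  $ sudo chown -R hduser:hadoop /usr/local/hadoop-1.0.3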

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

$HOME/.bashrc

  # Set Hadoop-related environment variables
  export HADOOP_HOME=/usr/local/hadoop

  # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
  export JAVA_HOME=/usr/lib/jvm/java-6-sun

  # Some convenient aliases and functions for running Hadoop-related commands
  unalias fs &> /dev/null
  alias fs="hadoop fs"
  unalias hls &> /dev/null
  alias hls="fs -ls"

  # If you have LZO compression enabled in your Hadoop cluster and
  # compress job outputs with LZOP (not covered in this tutorial):
  # Conveniently inspect an LZOP compressed file from the command
  # line; run via:
  #
  # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
  #
  # Requires installed 'lzop' command.
  #
  lzohead () {
      hadoop fs -cat $1 | lzop -dc | head -1000 | less
  }

  # Add Hadoop bin/ directory to PATH
  export PATH=$PATH:$HADOOP_HOME/bin

You can repeat this exercise also for other users who want to use Hadoop.
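To pick up the new settings in your current shell without logging out, you can re-read .bashrc and run a quick sanity check; hadoop version only needs the binaries on the PATH, so it works even before the cluster is started (a small sketch assuming the paths configured above):

  hduser@ubuntu:~$ source $HOME/.bashrc
  hduser@ubuntu:~$ hadoop version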

Excursus: Hadoop Distributed File System (HDFS)

Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.

The Hadoop Distributed File System: Architecture and Design hadoop.apache.org/hdfs/docs/…

The following picture gives an overview of the most important HDFS components.

Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information of what we do in this section is available on the Hadoop Wiki.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change

conf/hadoop-env.sh

  # The java implementation to use. Required.
  # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

conf/hadoop-env.sh

  # The java implementation to use. Required.
  export JAVA_HOME=/usr/lib/jvm/java-6-sun

Note: If you are on a Mac with OS X 10.7 you can use the following line to set up JAVA_HOME in conf/hadoop-env.sh.

conf/hadoop-env.sh (on Mac systems)

  # for our Mac users
  export JAVA_HOME=`/usr/libexec/java_home`

conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

Now we create the directory and set the required ownerships and permissions:

  $ sudo mkdir -p /app/hadoop/tmp
  $ sudo chown hduser:hadoop /app/hadoop/tmp
  # ...and if you want to tighten up security, chmod from 755 to 750...
  $ sudo chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

conf/core-site.xml

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>

In file conf/mapred-site.xml:

conf/mapred-site.xml

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>

In file conf/hdfs-site.xml:

conf/hdfs-site.xml

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>

See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.

Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

  hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
  10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
  /************************************************************
  STARTUP_MSG: Starting NameNode
  STARTUP_MSG:   host = ubuntu/127.0.1.1
  STARTUP_MSG:   args = [-format]
  STARTUP_MSG:   version = 0.20.2
  STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
  ************************************************************/
  10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
  10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
  10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
  10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
  10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
  10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
  ************************************************************/
  hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster

Run the command:

  hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine.

The output will look like this:

  hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
  starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
  localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
  localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
  starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
  localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
  hduser@ubuntu:/usr/local/hadoop$

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.

  hduser@ubuntu:/usr/local/hadoop$ jps
  2287 TaskTracker
  2149 JobTracker
  1938 DataNode
  2085 SecondaryNameNode
  2349 Jps
  1788 NameNode

You can also check with netstat if Hadoop is listening on the configured ports.

  hduser@ubuntu:~$ sudo netstat -plten | grep java
  tcp 0 0 0.0.0.0:50070   0.0.0.0:* LISTEN 1001 9236 2471/java
  tcp 0 0 0.0.0.0:50010   0.0.0.0:* LISTEN 1001 9998 2628/java
  tcp 0 0 0.0.0.0:48159   0.0.0.0:* LISTEN 1001 8496 2628/java
  tcp 0 0 0.0.0.0:53121   0.0.0.0:* LISTEN 1001 9228 2857/java
  tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
  tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
  tcp 0 0 0.0.0.0:59305   0.0.0.0:* LISTEN 1001 8141 2471/java
  tcp 0 0 0.0.0.0:50060   0.0.0.0:* LISTEN 1001 9857 3005/java
  tcp 0 0 0.0.0.0:49900   0.0.0.0:* LISTEN 1001 9037 2785/java
  tcp 0 0 0.0.0.0:50030   0.0.0.0:* LISTEN 1001 9773 2857/java
  hduser@ubuntu:~$

If there are any errors, examine the log files in the /logs/ directory.
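For example, with the installation path used in this tutorial the daemon logs live under /usr/local/hadoop/logs; the exact file names follow the hadoop-<user>-<daemon>-<hostname> pattern visible in the start-up output above, so the NameNode log name below is an assumption based on that pattern:

  hduser@ubuntu:~$ ls /usr/local/hadoop/logs/
  hduser@ubuntu:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-ubuntu.log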

Stopping your single-node cluster

Run the command

  hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

Example output:

  hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
  stopping jobtracker
  localhost: stopping tasktracker
  stopping namenode
  localhost: stopping datanode
  localhost: stopping secondarynamenode
  hduser@ubuntu:/usr/local/hadoop$

Running a MapReduce job

We will now run your first Hadoop MapReduce job. We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information of what happens behind the scenes is available at the Hadoop Wiki.

Download example input data

We will use three ebooks from Project Gutenberg for this example; once downloaded they appear as pg20417.txt, pg4300.txt, and pg5000.txt in the listings below.

Download each ebook as text files in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
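One possible way to fetch the files from the command line is sketched below; the download URL is a placeholder, so substitute the actual "Plain Text UTF-8" link shown on each book's page at gutenberg.org:

  $ mkdir -p /tmp/gutenberg
  # repeat for each of the three ebooks; <ebook-url> is a placeholder, not a real link
  $ wget -O /tmp/gutenberg/pg20417.txt <ebook-url>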

  hduser@ubuntu:~$ ls -l /tmp/gutenberg/
  total 3604
  -rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
  -rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
  -rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
  hduser@ubuntu:~$

Restart the Hadoop cluster

Restart your Hadoop cluster if it’s not running already.

  hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

Copy local example data to HDFS

Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
  Found 1 items
  drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
  Found 3 items
  -rw-r--r-- 3 hduser supergroup  674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
  -rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
  -rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
  hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now, we actually run the WordCount example job.

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.

Note: Some people run the command above and get the following error message:

  Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
  at org.apache.hadoop.util.RunJar.main (RunJar.java: 90)
  Caused by: java.util.zip.ZipException: error in opening zip file

In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Example output of the previous command in the console:

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
  10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
  10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
  10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
  10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
  10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
  10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
  10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
  10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
  10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters
  10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
  10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
  10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
  10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
  10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
  10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
  10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
  10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
  10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
  10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
  10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
  10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
  10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
  10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
  10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
  10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
  10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
  10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
  10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
  Found 2 items
  drwxr-xr-x - hduser supergroup      0 2010-05-08 17:40 /user/hduser/gutenberg
  drwxr-xr-x - hduser supergroup      0 2010-05-08 17:43 /user/hduser/gutenberg-output
  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
  Found 2 items
  drwxr-xr-x - hduser supergroup      0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
  -rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
  hduser@ubuntu:/usr/local/hadoop$

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output

An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint, but it does accept the user-specified mapred.reduce.tasks and does not manipulate that. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

  hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
  hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
  hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
  "(Lo)cra"   1
  "1490       1
  "1498,"     1
  "35"        1
  "40,"       1
  "A          2
  "AS-IS".    1
  "A_         1
  "Absoluti   1
  "Alack!     1
  hduser@ubuntu:/usr/local/hadoop$

Note that in this specific output the quote signs (") enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.

The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
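If you prefer a sorted view, you can post-process the merged file locally. The sketch below sorts by the count column; it assumes a bash shell (for the $'\t' tab separator) and GNU sort:

  # show the ten most frequent words from the merged output
  hduser@ubuntu:/usr/local/hadoop$ sort -t $'\t' -k2,2nr /tmp/gutenberg-output/gutenberg-output | head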

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at the locations listed in the following subsections.

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

NameNode Web Interface (HDFS layer)

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50070/.

JobTracker Web Interface (MapReduce layer)

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine on which the web UI is running).

By default, it’s available at http://localhost:50030/.

TaskTracker Web Interface (MapReduce layer)

The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50060/.
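As a quick sanity check from the shell, you can probe the three web UIs with curl; this assumes curl is installed and that you kept the default ports:

  $ curl -s -o /dev/null -w "NameNode UI:    HTTP %{http_code}\n" http://localhost:50070/
  $ curl -s -o /dev/null -w "JobTracker UI:  HTTP %{http_code}\n" http://localhost:50030/
  $ curl -s -o /dev/null -w "TaskTracker UI: HTTP %{http_code}\n" http://localhost:50060/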

What’s next?

If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster) where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100%, heh).

In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language which can serve as the basis for writing your own MapReduce programs.


Change Log

Only important changes to this article are listed here:

  • 2011-07-17: Renamed the Hadoop user from hadoop to hduser based on readers’ feedback. This should make the distinction between the local Hadoop user (now hduser), the local Hadoop group (hadoop), and the Hadoop CLI tool (hadoop) more clear.
