/×××××××××××××××××××××××××××××××××××××××××/

Author:xxx0624

HomePage:http://www.cnblogs.com/xxx0624/

/×××××××××××××××××××××××××××××××××××××××××/

Hadoop pseudo-distributed configuration walkthrough. Versions used:

Hadoop: 1.2.1

HBase: 0.94.25

Nutch: 2.2.1

Java: 1.8.0

SSH: 1.0.1j

Tomcat: 7.0.57

ZooKeeper: 3.4.6

(1) Configure the Java environment: http://www.cnblogs.com/xxx0624/p/4164744.html

(2) Configure OpenSSH: http://www.cnblogs.com/xxx0624/p/4165252.html

(3) Configure Hadoop: http://www.cnblogs.com/xxx0624/p/4166095.html

(4) Configure Tomcat: http://www.cnblogs.com/xxx0624/p/4166840.html

(5) Configure ZooKeeper: http://www.cnblogs.com/xxx0624/p/4168440.html

(6) Configure HBase: http://www.cnblogs.com/xxx0624/p/4170468.html

(7) Configure Ant: http://www.cnblogs.com/xxx0624/p/4172277.html

(8) Configure Nutch: http://www.cnblogs.com/xxx0624/p/4172601.html

(9) Integration: http://www.cnblogs.com/xxx0624/p/4176199.html

Hadoop:

(1)hadoop/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/xxx0624/hadoop/tmp</value>
  </property>
</configuration>
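All of the Hadoop site files in this section use the same generic `<property>`/`<name>`/`<value>` layout. As a quick sanity check that a file is well-formed XML and contains the values you intended, it can be parsed with a few lines of standalone Python (an illustrative sketch, not Hadoop's own configuration API):

```python
import xml.etree.ElementTree as ET

def parse_hadoop_conf(xml_text):
    """Parse a Hadoop-style configuration file into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

# The core-site.xml content from this post, inlined for the example.
core_site = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/xxx0624/hadoop/tmp</value>
  </property>
</configuration>"""

conf = parse_hadoop_conf(core_site)
print(conf["fs.default.name"])  # -> hdfs://localhost:9000
```

A malformed file (e.g. a `<property>` left unclosed after hand-editing) fails immediately at `ET.fromstring`, which is easier to debug than a daemon that silently falls back to defaults.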

(2)hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/xxx0624/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/xxx0624/hadoop/hdfs/data</value>
  </property>
</configuration>

(3)hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

(4)hadoop/conf/hadoop-env.sh

 # Set Hadoop-specific environment variables here.

 # The only required environment variable is JAVA_HOME.  All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

# Extra Java runtime options.  Empty by default.
# export HADOOP_OPTS=-server

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
# export HADOOP_CLIENT_OPTS

# Extra ssh options.  Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"

# Where log files are stored.  $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# host:path where hadoop code should be rsync'd from.  Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the users that are going to run the hadoop daemons.  Otherwise there is
# the potential for a symlink attack.
# export HADOOP_PID_DIR=/var/hadoop/pids

# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER

# The scheduling priority for daemon processes.  See 'man nice'.
# export HADOOP_NICENESS=10

# Site-specific additions:
export JAVA_HOME=/usr/lib/jvm
export HADOOP_HOME=/home/xxx0624/hadoop
export PATH=$PATH:/home/xxx0624/hadoop/bin
export HBASE_CLASSPATH=/home/xxx0624/hadoop/conf

(5)hadoop/conf/masters

localhost

(6)hadoop/conf/slaves

localhost

HBase:

(1)hbase/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://127.0.0.1:9000/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>127.0.0.1:60000</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!--property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property-->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>127.0.0.1</value>
  </property>
  <!--property>
    <name>zookeeper.znode.parent</name>
    <value>/home/hbase</value>
  </property-->
</configuration>
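Note that `hbase.rootdir` must point into the same HDFS instance that `fs.default.name` names in core-site.xml (here port 9000, with `localhost` and `127.0.0.1` resolving to the same host), or the HMaster will fail to come up. A standalone sketch of that consistency check (illustration only, not HBase code):

```python
from urllib.parse import urlparse

# Values from core-site.xml and hbase-site.xml in this post.
# localhost and 127.0.0.1 must resolve to the same machine for the
# match to hold in practice.
fs_default_name = "hdfs://localhost:9000"
hbase_rootdir = "hdfs://127.0.0.1:9000/hbase"

fs = urlparse(fs_default_name)
hb = urlparse(hbase_rootdir)

# hbase.rootdir must name the same HDFS instance (same scheme and port
# here) that fs.default.name points at.
assert fs.scheme == hb.scheme == "hdfs"
assert fs.port == hb.port == 9000
print(hb.path)  # -> /hbase
```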

(2)hbase/conf/hbase-env.sh

 #
#/**
# * Copyright 2007 The Apache Software Foundation
# *
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */

# Set environment variables here.

# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)

# The java implementation to use.  Java 1.6 required.
# export JAVA_HOME=/usr/java/jdk1.6.0/

# Extra Java CLASSPATH elements.  Optional.
# export HBASE_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000

# Extra Java runtime options.
# Below are what we set by default.  May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

# Uncomment one of the below three options to enable java garbage collection logging for the server-side processes.

# This enables basic gc logging to the .out file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

# Uncomment one of the below three options to enable java garbage collection logging for the client processes.

# This enables basic gc logging to the .out file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

# Uncomment below if you intend to use the EXPERIMENTAL off heap cache.
# export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize="
# Set hbase.offheapcache.percentage in hbase-site.xml to a nonzero value.

# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"

# File naming hosts on which HRegionServers will run.  $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers

# File naming hosts on which backup HMaster will run.  $HBASE_HOME/conf/backup-masters by default.
# export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters

# Extra ssh options.  Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"

# Where log files are stored.  $HBASE_HOME/logs by default.
# export HBASE_LOG_DIR=${HBASE_HOME}/logs

# Enable remote JDWP debugging of major HBase processes. Meant for Core Developers
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8070"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8071"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8072"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8073"

# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER

# The scheduling priority for daemon processes.  See 'man nice'.
# export HBASE_NICENESS=10

# The directory where pid files are stored. /tmp by default.
# export HBASE_PID_DIR=/var/hadoop/pids

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1

# Tell HBase whether it should manage it's own instance of Zookeeper or not.
# export HBASE_MANAGES_ZK=true

# Site-specific additions:
export JAVA_HOME=/usr/lib/jvm
export HBASE_MANAGES_ZK=true
export HBASE_CLASSPATH=/home/xxx0624/hadoop/conf

(3)hbase/conf/regionservers

localhost

Nutch:

(1)nutch/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>xxx0624-ThinkPad-Edge</value>
  </property>
</configuration>
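Nutch refuses to fetch when `http.agent.name` is unset or empty, so it is worth verifying both overrides before launching a crawl. A standalone sketch of that check (illustration only, not Nutch code):

```python
import xml.etree.ElementTree as ET

# The nutch-site.xml content from this post, inlined for the example.
nutch_site = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>xxx0624-ThinkPad-Edge</value>
  </property>
</configuration>"""

props = {p.findtext("name"): p.findtext("value")
         for p in ET.fromstring(nutch_site).iter("property")}

# Fail fast on the two settings this integration depends on:
# an empty agent name aborts fetching, and the wrong store class
# means crawl data never reaches HBase.
assert props.get("http.agent.name"), "http.agent.name must be non-empty"
assert props["storage.data.store.class"].endswith("HBaseStore")
```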

(2)nutch/conf/gora.properties

 # Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
#gora.datastore.autocreateschema=true

###############################
# Default SqlStore properties #
###############################
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=

################################
# Default AvroStore properties #
################################
# gora.avrostore.codec.type=BINARY||JSON
# gora.avrostore.output.path=file:///tmp/gora.avrostore.test.output

################################
# DatafileAvroStore properties #
################################
# DataFileAvroStore is file based store which uses Avro's
# DataFile{Writer,Reader}'s as a backend. This datastore supports
# mapreduce.
# gora.datafileavrostore.###=

#########################
# HBaseStore properties #
#########################
# HBase requires that the Configuration has a valid "hbase.zookeeper.quorum"
# property. It should be included within hbase-site.xml on the classpath. When
# this property is omitted, it expects Zookeeper to run on localhost:2181.

# To greatly improve scan performance, increase the hbase-site Configuration
# property "hbase.client.scanner.caching". This sets the number of rows to grab
# per request.

# HBase autoflushing. Enabling autoflush decreases write performance.
# Available since Gora 0.2. Defaults to disabled.
# hbase.client.autoflush.default=false

#############################
# CassandraStore properties #
#############################
# gora.cassandrastore.servers=localhost:9160

#######################
# MemStore properties #
#######################
# This is a memory based {@link DataStore} implementation for tests.
# gora.memstore.###=

############################
# AccumuloStore properties #
############################
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
#gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
#gora.datastore.accumulo.mock=true
#gora.datastore.accumulo.instance=a14
#gora.datastore.accumulo.zookeepers=localhost
#gora.datastore.accumulo.user=root
#gora.datastore.accumulo.password=secret
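After all the commented-out alternatives above, the only line this setup activates is `gora.datastore.default=org.apache.gora.hbase.store.HBaseStore`. A minimal `.properties` reader (a standalone sketch, not Gora's loader) makes it easy to confirm which keys are actually in effect:

```python
def parse_properties(text):
    """Minimal .properties parser: skips blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# A fragment of the gora.properties shown in this post.
sample = """
# HBaseStore is the only store this setup activates:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
#gora.datastore.accumulo.mock=true
"""

props = parse_properties(sample)
print(props)  # -> {'gora.datastore.default': 'org.apache.gora.hbase.store.HBaseStore'}
```

If the dict comes back empty, the active line was probably commented out by accident while editing.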

(3)nutch/build.xml

 <?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<project name="${name}" default="runtime"
         xmlns:ivy="antlib:org.apache.ivy.ant"
         xmlns:artifact="antlib:org.apache.maven.artifact.ant">

<!-- Load all the default properties, and any the user wants -->
<!-- to contribute (without having to type -D or edit this file) -->
<property file="${user.home}/build.properties" />
<property file="${basedir}/build.properties" />
<property file="${basedir}/default.properties" />
<property name="test.junit.output.format" value="plain" />
<property name="release.dir" value="${build.dir}/release" />

<!-- define Maven coordinates, repository url and artifacts name etc -->
<property name="groupId" value="org.apache.nutch" />
<property name="artifactId" value="nutch" />
<property name="maven-repository-url" value="https://repository.apache.org/service/local/staging/deploy/maven2" />
<property name="maven-repository-id" value="apache.releases.https" />
<property name="maven-jar" value="${release.dir}/${artifactId}-${version}.jar" />
<property name="maven-javadoc-jar" value="${release.dir}/${artifactId}-${version}-javadoc.jar" />
<property name="maven-sources-jar" value="${release.dir}/${artifactId}-${version}-sources.jar" />

<!-- the normal classpath -->
<path id="classpath">
<pathelement location="${build.classes}" />
<fileset dir="${build.lib.dir}">
<include name="*.jar" />
</fileset>
</path>

<presetdef name="javac">
<javac includeantruntime="false" />
</presetdef>

<!-- the unit test classpath -->
<dirname property="plugins.classpath.dir" file="${build.plugins}" />
<path id="test.classpath">
<pathelement location="${test.build.classes}" />
<pathelement location="${conf.dir}" />
<pathelement location="${test.src.dir}" />
<pathelement location="${plugins.classpath.dir}" />
<path refid="classpath" />
<pathelement location="${build.dir}/${final.name}.job" />
<fileset dir="${build.lib.dir}">
<include name="*.jar" />
</fileset>
</path>

<!-- ====================================================== -->
<!-- Stuff needed by all targets -->
<!-- ====================================================== -->
<target name="init" depends="ivy-init" description="--> stuff required by all targets">
<mkdir dir="${build.dir}" />
<mkdir dir="${build.classes}" />
<mkdir dir="${release.dir}" />
<mkdir dir="${test.build.dir}" />
<mkdir dir="${test.build.classes}" />

<touch datetime="01/25/1971 2:00 pm">
<fileset dir="${conf.dir}" includes="**/*.template" />
</touch>

<copy todir="${conf.dir}" verbose="true">
<fileset dir="${conf.dir}" includes="**/*.template" />
<mapper type="glob" from="*.template" to="*" />
</copy>
</target>

<!-- ====================================================== -->
<!-- Compile the Java files -->
<!-- ====================================================== -->
<target name="compile" depends="compile-core, compile-plugins" description="--> compile all Java files"/>

<target name="compile-core" depends="init, resolve-default" description="--> compile core Java files only">
<javac
encoding="${build.encoding}"
srcdir="${src.dir}"
includes="org/apache/nutch/**/*.java"
destdir="${build.classes}"
debug="${javac.debug}"
optimize="${javac.optimize}"
target="${javac.version}"
source="${javac.version}"
deprecation="${javac.deprecation}">
<compilerarg value="-Xlint:-path"/>
<classpath refid="classpath" />
</javac>
</target>

<target name="compile-plugins" depends="init, resolve-default" description="--> compile plugins only">
<ant dir="src/plugin" target="deploy" inheritAll="false" />
</target>

<!-- ================================================================== -->
<!-- Make nutch.jar -->
<!-- ================================================================== -->
<!-- -->
<!-- ================================================================== -->
<target name="jar" depends="compile-core" description="--> make nutch.jar">
<copy file="${conf.dir}/nutch-default.xml" todir="${build.classes}" />
<copy file="${conf.dir}/nutch-site.xml" todir="${build.classes}" />
<jar jarfile="${build.dir}/${final.name}.jar" basedir="${build.classes}">
<manifest>
</manifest>
</jar>
</target>

<!-- ================================================================== -->
<!-- Make Maven Central Release -->
<!-- ================================================================== -->
<!-- -->
<!-- ================================================================== -->
<target name="release" depends="compile-core"
description="--> generate the release distribution">
<copy file="${conf.dir}/nutch-default.xml" todir="${build.classes}" />
<copy file="${conf.dir}/nutch-site.xml" todir="${build.classes}" />

<!-- build the main artifact -->
<jar jarfile="${maven-jar}" basedir="${build.classes}" />

<!-- build the javadoc artifact -->
<javadoc
destdir="${release.dir}/javadoc"
overview="${src.dir}/overview.html"
author="true"
version="true"
use="true"
windowtitle="${name} ${version} API"
doctitle="${name} ${version} API"
bottom="Copyright &amp;copy; ${year} The Apache Software Foundation">
<arg value="${javadoc.proxy.host}" />
<arg value="${javadoc.proxy.port}" />
<packageset dir="${src.dir}"/>
<packageset dir="${plugins.dir}/creativecommons/src/java"/>
<packageset dir="${plugins.dir}/feed/src/java"/>
<packageset dir="${plugins.dir}/index-anchor/src/java"/>
<packageset dir="${plugins.dir}/index-basic/src/java"/>
<packageset dir="${plugins.dir}/index-more/src/java"/>
<packageset dir="${plugins.dir}/language-identifier/src/java"/>
<packageset dir="${plugins.dir}/lib-http/src/java"/>
<packageset dir="${plugins.dir}/lib-regex-filter/src/java"/>
<packageset dir="${plugins.dir}/microformats-reltag/src/java"/>
<packageset dir="${plugins.dir}/parse-ext/src/java"/>
<packageset dir="${plugins.dir}/parse-html/src/java"/>
<packageset dir="${plugins.dir}/parse-js/src/java"/>
<packageset dir="${plugins.dir}/parse-swf/src/java"/>
<packageset dir="${plugins.dir}/parse-tika/src/java"/>
<packageset dir="${plugins.dir}/parse-zip/src/java"/>
<packageset dir="${plugins.dir}/protocol-file/src/java"/>
<packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
<packageset dir="${plugins.dir}/protocol-http/src/java"/>
<packageset dir="${plugins.dir}/protocol-httpclient/src/java"/>
<packageset dir="${plugins.dir}/protocol-sftp/src/java"/>
<packageset dir="${plugins.dir}/scoring-link/src/java"/>
<packageset dir="${plugins.dir}/scoring-opic/src/java"/>
<packageset dir="${plugins.dir}/subcollection/src/java"/>
<packageset dir="${plugins.dir}/tld/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-automaton/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-domain/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-prefix/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-regex/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-suffix/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-validator/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-basic/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-pass/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-regex/src/java"/>
<link href="${javadoc.link.java}" />
<link href="${javadoc.link.lucene}" />
<link href="${javadoc.link.hadoop}" />
<classpath refid="classpath" />
<classpath>
<fileset dir="${plugins.dir}">
<include name="**/*.jar" />
</fileset>
</classpath>

<group title="Core" packages="org.apache.nutch.*" />
<group title="Plugins API" packages="${plugins.api}" />
<group title="Protocol Plugins" packages="${plugins.protocol}" />
<group title="URL Filter Plugins" packages="${plugins.urlfilter}" />
<group title="Scoring Plugins" packages="${plugins.scoring}" />
<group title="Parse Plugins" packages="${plugins.parse}" />
<group title="Indexing Filter Plugins" packages="${plugins.index}" />
<group title="Misc. Plugins" packages="${plugins.misc}" />
</javadoc>
<jar jarfile="${maven-javadoc-jar}">
<fileset dir="${release.dir}/javadoc" />
</jar>

<!-- build the sources artifact -->
<jar jarfile="${maven-sources-jar}">
<fileset dir="${src.dir}" />
</jar>
</target>

<!-- ================================================================== -->
<!-- Deploy to Apache Nexus -->
<!-- ================================================================== -->
<!-- -->
<!-- ================================================================== -->
<target name="deploy" depends="release" description="--> deploy to Apache Nexus">

<!-- generate a pom file -->
<ivy:makepom ivyfile="${ivy.file}" pomfile="${basedir}/pom.xml"
templatefile="ivy/mvn.template">
<mapping conf="default" scope="compile" />
<mapping conf="runtime" scope="runtime" />
</ivy:makepom>

<!-- sign and deploy the main artifact -->
<artifact:mvn>
<arg
value="org.apache.maven.plugins:maven-gpg-plugin:1.4:sign-and-deploy-file" />
<arg value="-Durl=${maven-repository-url}" />
<arg value="-DrepositoryId=${maven-repository-id}" />
<arg value="-DpomFile=pom.xml" />
<arg value="-Dfile=${maven-jar}" />
<arg value="-Papache-release" />
</artifact:mvn>

<!-- sign and deploy the sources artifact -->
<artifact:mvn>
<arg value="org.apache.maven.plugins:maven-gpg-plugin:1.4:sign-and-deploy-file" />
<arg value="-Durl=${maven-repository-url}" />
<arg value="-DrepositoryId=${maven-repository-id}" />
<arg value="-DpomFile=pom.xml" />
<arg value="-Dfile=${maven-sources-jar}" />
<arg value="-Dclassifier=sources" />
<arg value="-Papache-release" />
</artifact:mvn>

<!-- sign and deploy the javadoc artifact -->
<artifact:mvn>
<arg value="org.apache.maven.plugins:maven-gpg-plugin:1.4:sign-and-deploy-file" />
<arg value="-Durl=${maven-repository-url}" />
<arg value="-DrepositoryId=${maven-repository-id}" />
<arg value="-DpomFile=pom.xml" />
<arg value="-Dfile=${maven-javadoc-jar}" />
<arg value="-Dclassifier=javadoc" />
<arg value="-Papache-release" />
</artifact:mvn>
</target>

<!-- ================================================================== -->
<!-- Make job jar -->
<!-- ================================================================== -->
<!-- -->
<!-- ================================================================== -->
<target name="job" depends="compile" description="--> make nutch.job jar">
<jar jarfile="${build.dir}/${final.name}.job">
<!--
If the build.classes has the nutch config files because the jar command
command has run, exclude them. The conf directory has them.
-->
<zipfileset dir="${build.classes}" excludes="nutch-default.xml,nutch-site.xml" />
<zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*" />
<!--
need to exclude hsqldb.jar due to a conflicting version already present
in Hadoop/lib.
-->
<zipfileset dir="${build.lib.dir}" prefix="lib" includes="**/*.jar"
excludes="jasper*.jar,jsp-*.jar,hadoop-*.jar,hbase*test*.jar,ant*jar,hsqldb*.jar" />
<zipfileset dir="${build.plugins}" prefix="classes/plugins" />
</jar>
</target>

<target name="runtime" depends="jar, job" description="--> default target for running Nutch">
<mkdir dir="${runtime.dir}" />
<mkdir dir="${runtime.local}" />
<mkdir dir="${runtime.deploy}" />
<!-- deploy area -->
<copy file="${build.dir}/${final.name}.job" todir="${runtime.deploy}" />
<copy todir="${runtime.deploy}/bin">
<fileset dir="src/bin" />
</copy>
<chmod perm="ugo+x" type="file">
<fileset dir="${runtime.deploy}/bin" />
</chmod>
<!-- local area -->
<copy file="${build.dir}/${final.name}.jar" todir="${runtime.local}/lib" />
<copy todir="${runtime.local}/lib/native">
<fileset dir="lib/native" />
</copy>
<copy todir="${runtime.local}/conf">
<fileset dir="${conf.dir}" excludes="*.template" />
</copy>
<copy todir="${runtime.local}/bin">
<fileset dir="src/bin" />
</copy>
<chmod perm="ugo+x" type="file">
<fileset dir="${runtime.local}/bin" />
</chmod>
<copy todir="${runtime.local}/lib">
<fileset dir="${build.dir}/lib"
excludes="ant*.jar,jasper*.jar,jsp-*.jar,hadoop*test*.jar,hbase*test*.jar" />
</copy>
<copy todir="${runtime.local}/plugins">
<fileset dir="${build.dir}/plugins" />
</copy>
<copy todir="${runtime.local}/test">
<fileset dir="${build.dir}/test" />
</copy>
</target>

<!-- ================================================================== -->
<!-- Compile test code -->
<!-- ================================================================== -->
<target name="compile-core-test" depends="compile-core, resolve-test" description="--> compile test code">
<javac
encoding="${build.encoding}"
srcdir="${test.src.dir}"
includes="org/apache/nutch*/**/*.java"
destdir="${test.build.classes}"
debug="${javac.debug}"
optimize="${javac.optimize}"
target="${javac.version}"
source="${javac.version}"
deprecation="${javac.deprecation}">
<compilerarg value="-Xlint:-path"/>
<classpath refid="test.classpath" />
</javac>
</target>

<!-- ================================================================== -->
<!-- Run Nutch proxy -->
<!-- ================================================================== -->
<target name="proxy" depends="job, compile-core-test" description="--> run nutch proxy">
<java classname="org.apache.nutch.tools.proxy.TestbedProxy" fork="true">
<classpath refid="test.classpath" />
<arg value="-fake" />
<!--
<arg value="-delay"/>
<arg value="-200"/>
-->
<jvmarg line="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl" />
</java>
</target>

<!-- ================================================================== -->
<!-- Run Nutch benchmarking analysis -->
<!-- ================================================================== -->
<target name="benchmark" description="--> run nutch benchmarking analysis">
<java classname="org.apache.nutch.tools.Benchmark" fork="true">
<classpath refid="test.classpath" />
<jvmarg line="-Xmx512m -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl" />
<arg value="-maxPerHost" />
<arg value="10" />
<arg value="-seeds" />
<arg value="1" />
<arg value="-depth" />
<arg value="5" />
</java>
</target>

<!-- ================================================================== -->
<!-- Run unit tests -->
<!-- ================================================================== -->
<target name="test" depends="test-core, test-plugins" description="--> run JUnit tests"/>

<target name="test-core" depends="job, compile-core-test" description="--> run core JUnit tests only">
<delete dir="${test.build.data}" />
<mkdir dir="${test.build.data}" />
<!--
copy resources needed in junit tests
-->
<copy todir="${test.build.data}">
<fileset dir="src/testresources" includes="**/*" />
</copy>
<copy file="${test.src.dir}/nutch-site.xml" todir="${test.build.classes}" />
<copy file="${test.src.dir}/log4j.properties" todir="${test.build.classes}" />
<copy file="${test.src.dir}/gora.properties" todir="${test.build.classes}" />
<copy file="${test.src.dir}/crawl-tests.xml" todir="${test.build.classes}"/>
<copy file="${test.src.dir}/domain-urlfilter.txt" todir="${test.build.classes}"/>
<copy file="${test.src.dir}/filter-all.txt" todir="${test.build.classes}"/>

<junit printsummary="yes" haltonfailure="no" fork="yes" dir="${basedir}"
errorProperty="tests.failed" failureProperty="tests.failed" maxmemory="1000m">
<sysproperty key="test.build.data" value="${test.build.data}" />
<sysproperty key="test.src.dir" value="${test.src.dir}" />
<sysproperty key="javax.xml.parsers.DocumentBuilderFactory" value="com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl" />
<classpath refid="test.classpath" />
<formatter type="${test.junit.output.format}" />
<batchtest todir="${test.build.dir}" unless="testcase">
<fileset dir="${test.src.dir}"
includes="**/Test*.java" excludes="**/${test.exclude}.java" />
</batchtest>
<batchtest todir="${test.build.dir}" if="testcase">
<fileset dir="${test.src.dir}" includes="**/${testcase}.java" />
</batchtest>
</junit>

<fail if="tests.failed">Tests failed!</fail>
</target>

<target name="test-plugins" depends="compile" description="--> run plugin JUnit tests only">
<ant dir="src/plugin" target="test" inheritAll="false" />
</target>

<target name="nightly" depends="test, tar-src, zip-src, javadoc" description="--> run the nightly target build">
</target>

<!-- ================================================================== -->
<!-- Ivy targets -->
<!-- ================================================================== -->

<!-- target: resolve ================================================= -->
<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
<ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
<ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
<antcall target="copy-libs" />
</target>

<target name="resolve-test" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
<ivy:resolve file="${ivy.file}" conf="test" log="download-only" />
<ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
<antcall target="copy-libs" />
</target>

<target name="copy-libs" description="--> copy the libs in lib, which are not ivy enabled">
<!-- copy the libs in lib, which are not ivy enabled -->
<copy todir="${build.lib.dir}/" failonerror="false">
<fileset dir="${lib.dir}" includes="**/*.jar" />
</copy>
</target>

<!-- target: publish-local =========================================== -->
<target name="publish-local" depends="jar" description="--> publish this project in the local ivy repository">
<ivy:publish artifactspattern="${build.dir}/[artifact]-${version}.[ext]"
resolver="local"
pubrevision="${version}"
pubdate="${now}"
status="integration"
forcedeliver="true" overwrite="true" />
<echo message="project ${ant.project.name} published locally with version ${version}" />
</target>

<!-- target: report ================================================== -->
<target name="report" depends="resolve-test" description="--> generates a report of dependencies">
<ivy:report todir="${build.dir}" />
</target>

<!-- target: ivy-init ================================================ -->
<target name="ivy-init" depends="ivy-probe-antlib, ivy-init-antlib" description="--> initialise Ivy settings">
<ivy:settings file="${ivy.dir}/ivysettings.xml" />
</target>

<!-- target: ivy-probe-antlib ======================================== -->
<target name="ivy-probe-antlib" description="--> probe the antlib library">
<condition property="ivy.found">
<typefound uri="antlib:org.apache.ivy.ant" name="cleancache" />
</condition>
</target>

<!-- target: ivy-download ============================================ -->
<target name="ivy-download" description="--> download ivy">
<available file="${ivy.jar}" property="ivy.jar.found" />
<antcall target="ivy-download-unchecked" />
</target>

<!-- target: ivy-download-unchecked ================================== -->
<target name="ivy-download-unchecked" unless="ivy.jar.found" description="--> fetch any ivy file">
<get src="${ivy.repo.url}" dest="${ivy.jar}" usetimestamp="true" />
</target>

<!-- target: ivy-init-antlib ========================================= -->
<target name="ivy-init-antlib" depends="ivy-download" unless="ivy.found" description="--> attempt to use Ivy with Antlib">
<typedef uri="antlib:org.apache.ivy.ant" onerror="fail"
loaderRef="ivyLoader">
<classpath>
<pathelement location="${ivy.jar}" />
</classpath>
</typedef>
<fail>
<condition>
<not>
<typefound uri="antlib:org.apache.ivy.ant" name="cleancache" />
</not>
</condition>
You need Apache Ivy 2.0 or later from http://ant.apache.org/
It could not be loaded from ${ivy.repo.url}
</fail>
</target>

<target name="compile-avro-schema" depends="resolve-default" description="--> compile the avro schema(s) in src/gora/*.avsc">
<typedef name="schema"
classname="org.apache.avro.specific.SchemaTask"
classpathref="classpath" />
<mkdir dir="${build.gora}" />
<schema destdir="${build.gora}">
<fileset dir="./src/gora">
<include name="**/*.avsc"/>
</fileset>
</schema>
</target>

<!-- ====================================================== -->
<!-- Generate the Java files from the GORA schemas -->
<!-- Will call this automatically later -->
<!-- ====================================================== -->
<target name="generate-gora-src" depends="init" description="--> compile the avro schema(s) in src/gora/*.avsc">
<java classname="org.apache.gora.compiler.GoraCompiler">
<classpath refid="classpath"/>
<arg value="src/gora/"/>
<arg value="${src.dir}"/>
</java>
</target>

<!-- ================================================================== -->
<!-- Documentation -->
<!-- ================================================================== -->
<target name="javadoc" depends="compile" description="--> generate Javadoc">
<mkdir dir="${build.javadoc}" />
<javadoc
overview="${src.dir}/overview.html"
destdir="${build.javadoc}"
author="true"
version="true"
use="true"
windowtitle="${name} ${version} API"
doctitle="${name} ${version} API"
bottom="Copyright &amp;copy; ${year} The Apache Software Foundation">
<arg value="${javadoc.proxy.host}" />
<arg value="${javadoc.proxy.port}" />
<packageset dir="${src.dir}"/>
<packageset dir="${plugins.dir}/creativecommons/src/java"/>
<packageset dir="${plugins.dir}/feed/src/java"/>
<packageset dir="${plugins.dir}/index-anchor/src/java"/>
<packageset dir="${plugins.dir}/index-basic/src/java"/>
<packageset dir="${plugins.dir}/index-more/src/java"/>
<packageset dir="${plugins.dir}/language-identifier/src/java"/>
<packageset dir="${plugins.dir}/lib-http/src/java"/>
<packageset dir="${plugins.dir}/lib-regex-filter/src/java"/>
<packageset dir="${plugins.dir}/microformats-reltag/src/java"/>
<packageset dir="${plugins.dir}/parse-ext/src/java"/>
<packageset dir="${plugins.dir}/parse-html/src/java"/>
<packageset dir="${plugins.dir}/parse-js/src/java"/>
<packageset dir="${plugins.dir}/parse-swf/src/java"/>
<packageset dir="${plugins.dir}/parse-tika/src/java"/>
<packageset dir="${plugins.dir}/parse-zip/src/java"/>
<packageset dir="${plugins.dir}/protocol-file/src/java"/>
<packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
<packageset dir="${plugins.dir}/protocol-http/src/java"/>
<packageset dir="${plugins.dir}/protocol-httpclient/src/java"/>
<packageset dir="${plugins.dir}/protocol-sftp/src/java"/>
<packageset dir="${plugins.dir}/scoring-link/src/java"/>
<packageset dir="${plugins.dir}/scoring-opic/src/java"/>
<packageset dir="${plugins.dir}/subcollection/src/java"/>
<packageset dir="${plugins.dir}/tld/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-automaton/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-domain/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-prefix/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-regex/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-suffix/src/java"/>
<packageset dir="${plugins.dir}/urlfilter-validator/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-basic/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-pass/src/java"/>
<packageset dir="${plugins.dir}/urlnormalizer-regex/src/java"/>
<link href="${javadoc.link.java}" />
<link href="${javadoc.link.lucene}" />
<link href="${javadoc.link.hadoop}" />
<classpath refid="classpath" />
<classpath>
<fileset dir="${plugins.dir}">
<include name="**/*.jar" />
</fileset>
</classpath>
<group title="Core" packages="org.apache.nutch.*" />
<group title="Plugins API" packages="${plugins.api}" />
<group title="Protocol Plugins" packages="${plugins.protocol}" />
<group title="URL Filter Plugins" packages="${plugins.urlfilter}" />
<group title="Scoring Plugins" packages="${plugins.scoring}" />
<group title="Parse Plugins" packages="${plugins.parse}" />
<group title="Indexing Filter Plugins" packages="${plugins.index}" />
<group title="Misc. Plugins" packages="${plugins.misc}" />
</javadoc>
<!-- Copy the plugin.dtd file to the plugin doc-files dir -->
<copy file="${plugins.dir}/plugin.dtd" todir="${build.javadoc}/org/apache/nutch/plugin/doc-files" />
</target>

<target name="default-doc" description="--> generate default Nutch documentation">
<style basedir="${conf.dir}" destdir="${docs.dir}"
includes="nutch-default.xml" style="conf/nutch-conf.xsl" />
</target>

<!-- ================================================================== -->
<!-- D I S T R I B U T I O N -->
<!-- ================================================================== -->
<!-- -->
<!-- ================================================================== -->
<target name="package-src" depends="runtime, javadoc" description="--> generate source distribution package">
<mkdir dir="${dist.dir}"/>
<mkdir dir="${src.dist.version.dir}"/>
<mkdir dir="${src.dist.version.dir}/lib"/>
<mkdir dir="${src.dist.version.dir}/docs"/>
<mkdir dir="${src.dist.version.dir}/docs/api"/>
<mkdir dir="${src.dist.version.dir}/ivy"/>

<copy todir="${src.dist.version.dir}/lib" includeEmptyDirs="false">
<fileset dir="lib"/>
</copy>
<copy todir="${src.dist.version.dir}/conf">
<fileset dir="${conf.dir}" excludes="**/*.template"/>
</copy>
<copy todir="${src.dist.version.dir}/docs/api">
<fileset dir="${build.javadoc}"/>
</copy>
<copy todir="${src.dist.version.dir}">
<fileset dir=".">
<include name="*.txt" />
<!--<include name="KEYS" />-->
</fileset>
</copy>
<copy todir="${src.dist.version.dir}/src" includeEmptyDirs="true">
<fileset dir="src"/>
</copy>
<copy todir="${src.dist.version.dir}/ivy" includeEmptyDirs="true">
<fileset dir="ivy"/>
</copy>
<copy todir="${src.dist.version.dir}/" file="build.xml"/>
<copy todir="${src.dist.version.dir}/" file="default.properties"/>
</target>

<target name="package-bin" depends="runtime, javadoc" description="--> generate binary distribution package">
<mkdir dir="${dist.dir}"/>
<mkdir dir="${bin.dist.version.dir}"/>
<mkdir dir="${bin.dist.version.dir}/lib"/>
<mkdir dir="${bin.dist.version.dir}/bin"/>
<mkdir dir="${bin.dist.version.dir}/conf"/>
<mkdir dir="${bin.dist.version.dir}/docs"/>
<mkdir dir="${bin.dist.version.dir}/docs/api"/>
<mkdir dir="${bin.dist.version.dir}/plugins"/>

<copy todir="${bin.dist.version.dir}/lib" includeEmptyDirs="false">
<fileset dir="runtime/local/lib"/>
</copy>
<copy todir="${bin.dist.version.dir}/bin">
<fileset dir="runtime/local/bin"/>
</copy>
<chmod perm="ugo+x" type="file">
<fileset dir="${bin.dist.version.dir}/bin"/>
</chmod>
<copy todir="${bin.dist.version.dir}/conf">
<fileset dir="runtime/local/conf" excludes="**/*.template"/>
</copy>
<copy todir="${bin.dist.version.dir}/docs/api">
<fileset dir="${build.javadoc}"/>
</copy>
<copy todir="${bin.dist.version.dir}">
<fileset dir=".">
<include name="*.txt" />
</fileset>
</copy>
<copy todir="${bin.dist.version.dir}/plugins" includeEmptyDirs="true">
<fileset dir="runtime/local/plugins"/>
</copy>
</target>

<!-- ================================================================== -->
<!-- Make src-release tarball -->
<!-- ================================================================== -->
<target name="tar-src" depends="package-src" description="--> generate src.tar.gz distribution package">
<tar compression="gzip" longfile="gnu"
destfile="${src.dist.version.dir}.tar.gz">
<tarfileset dir="${src.dist.version.dir}" mode="664" prefix="${final.name}">
<exclude name="src/bin/*" />
<include name="**" />
</tarfileset>
<tarfileset dir="${src.dist.version.dir}" mode="755" prefix="${final.name}">
<include name="src/bin/*" />
</tarfileset>
</tar>
</target>

<!-- ================================================================== -->
<!-- Make bin release tarball -->
<!-- ================================================================== -->
<target name="tar-bin" depends="package-bin" description="--> generate bin.tar.gz distribution package">
<tar compression="gzip" longfile="gnu"
destfile="${bin.dist.version.dir}.tar.gz">
<tarfileset dir="${bin.dist.version.dir}" mode="664" prefix="${final.name}">
<exclude name="bin/*" />
<include name="**" />
</tarfileset>
<tarfileset dir="${bin.dist.version.dir}" mode="755" prefix="${final.name}">
<include name="bin/*" />
</tarfileset>
</tar>
</target>

<!-- ================================================================== -->
<!-- Make src release zip -->
<!-- ================================================================== -->
<target name="zip-src" depends="package-src" description="--> generate src.zip distribution package">
<zip compress="true" casesensitive="yes"
destfile="${src.dist.version.dir}.zip">
<zipfileset dir="${src.dist.version.dir}" filemode="664" prefix="${final.name}">
<exclude name="src/bin/*" />
<include name="**" />
</zipfileset>
<zipfileset dir="${src.dist.version.dir}" filemode="755" prefix="${final.name}">
<include name="src/bin/*" />
</zipfileset>
</zip>
</target>

<!-- ================================================================== -->
<!-- Make bin release zip -->
<!-- ================================================================== -->
<target name="zip-bin" depends="package-bin" description="--> generate bin.zip distribution package">
<zip compress="true" casesensitive="yes"
destfile="${bin.dist.version.dir}.zip">
<zipfileset dir="${bin.dist.version.dir}" filemode="664" prefix="${final.name}">
<exclude name="bin/*" />
<include name="**" />
</zipfileset>
<zipfileset dir="${bin.dist.version.dir}" filemode="755" prefix="${final.name}">
<include name="bin/*" />
</zipfileset>
</zip>
</target>

<!-- ================================================================== -->
<!-- Clean. Delete the build files, and their directories -->
<!-- ================================================================== -->

<!-- target: clean =================================================== -->
<target name="clean" depends="clean-build, clean-lib, clean-dist, clean-runtime" description="--> clean the project" />

<!-- target: clean-local ============================================= -->
<target name="clean-local" depends="" description="--> cleans the local repository for the current module">
<delete dir="${ivy.local.default.root}/${ivy.organisation}/${ivy.module}" />
</target>

<!-- target: clean-lib =============================================== -->
<target name="clean-lib" description="--> clean the project libraries directory (dependencies)">
<delete includeemptydirs="true" dir="${build.lib.dir}" />
</target>

<!-- target: clean-build ============================================= -->
<target name="clean-build" description="--> clean the project built files">
<delete includeemptydirs="true" dir="${build.dir}" />
</target>

<!-- target: clean-dist ============================================= -->
<target name="clean-dist" description="--> clean the project dist files">
<delete includeemptydirs="true" dir="${dist.dir}" />
</target>

<!-- target: clean-cache ============================================= -->
<target name="clean-cache" depends="" description="--> delete ivy cache">
<ivy:cleancache />
</target>

<target name="clean-runtime" description="--> clean the project runtime area">
<delete includeemptydirs="true" dir="${runtime.dir}" />
</target>

<!-- ================================================================== -->
<!-- RAT targets -->
<!-- ================================================================== -->
<target name="rat-sources-typedef" description="--> run RAT antlib task">
<typedef resource="org/apache/rat/anttasks/antlib.xml">
<classpath>
<fileset dir="." includes="rat*.jar" />
</classpath>
</typedef>
</target>

<target name="rat-sources" depends="rat-sources-typedef"
description="--> runs the tasks over src/java">
<rat:report xmlns:rat="antlib:org.apache.rat.anttasks">
<fileset dir="src">
<include name="java/**/*" />
<include name="plugin/**/src/**/*" />
</fileset>
</rat:report>
</target>

<!-- ================================================================== -->
<!-- SONAR targets -->
<!-- ================================================================== -->

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
<classpath path="${ant.library.dir}" />
<classpath path="${mysql.library.dir}" />
<classpath><fileset dir="." includes="sonar*.jar" /></classpath>
</taskdef>

<!-- Add the target -->
<target name="sonar" description="--> run SONAR analysis">
<!-- list of mandatory source directories (required) -->
<property name="sonar.sources" value="${src.dir}" />
<!-- list of properties (optional) -->
<property name="sonar.projectName" value="Nutchgora Branch Sonar Analysis" />
<property name="sonar.binaries" value="${build.dir}/classes" />
<property name="sonar.binaries" value="${build.dir}/plugins" />
<property name="sonar.tests" value="${test.src.dir}" />
<sonar:sonar workDir="${base.dir}" key="org.apache.nutch:branch"
version="2.0-SNAPSHOT" xmlns:sonar="antlib:org.sonar.ant" />
</target>

<!-- ================================================================== -->
<!-- Eclipse targets -->
<!-- ================================================================== -->

<!-- classpath for generating eclipse project -->
<path id="eclipse.classpath">
<fileset dir="${build.lib.dir}">
<include name="*.jar" />
<exclude name="ant-eclipse-1.0-jvm1.2.jar" />
</fileset>
</path>

<!-- target: ant-eclipse-download =================================== -->
<target name="ant-eclipse-download" description="--> Downloads the ant-eclipse binary.">
<get src="http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2"
dest="${build.dir}/ant-eclipse-1.0.bin.tar.bz2" usetimestamp="false" />
<untar src="${build.dir}/ant-eclipse-1.0.bin.tar.bz2"
dest="${build.dir}" compression="bzip2">
<patternset>
<include name="lib/ant-eclipse-1.0-jvm1.2.jar"/>
</patternset>
</untar>
<delete file="${build.dir}/ant-eclipse-1.0.bin.tar.bz2" />
</target>

<!-- target: eclipse ================================================ -->
<target name="eclipse"
depends="clean,init,job,ant-eclipse-download"
description="--> Create eclipse project files">
<pathconvert property="eclipse.project">
<path path="${basedir}"/>
<regexpmapper from="^.*/([^/]+)$$" to="\1" handledirsep="yes"/>
</pathconvert>
<taskdef name="eclipse"
classname="prantl.ant.eclipse.EclipseTask"
classpath="${build.dir}/lib/ant-eclipse-1.0-jvm1.2.jar" />
<eclipse updatealways="true">
<project name="${eclipse.project}" />
<classpath>
<library path="${conf.dir}" exported="false" />
<library path="${basedir}/src/bin" exported="false" />
<library pathref="eclipse.classpath" exported="false" />
<library path="${basedir}/build/plugins/urlfilter-automaton/automaton-1.11-8.jar"
exported="false" />
<library path="${basedir}/src/plugin/parse-swf/lib/javaswf.jar"
exported="false" />
<library path="${basedir}/build/plugins/lib-nekohtml/nekohtml-0.9.5.jar"
exported="false" />
<library path="${basedir}/build/plugins/parse-html/tagsoup-1.2.jar"
exported="false" />
<library path="${basedir}/build/plugins/protocol-sftp/jsch-0.1.41.jar"
exported="false" />
<source path="${basedir}/src/java/" />
<source path="${basedir}/src/test/" />
<source path="${basedir}/src/plugin/creativecommons/src/java/" />
<source path="${basedir}/src/plugin/creativecommons/src/test/" />
<!-- feed is currently disabled
<source path="${basedir}/src/plugin/feed/src/java/" />
<source path="${basedir}/src/plugin/feed/src/test/" /> -->
<source path="${basedir}/src/plugin/index-anchor/src/java/" />
<source path="${basedir}/src/plugin/index-anchor/src/test/" />
<source path="${basedir}/src/plugin/index-basic/src/java/" />
<source path="${basedir}/src/plugin/index-basic/src/test/" />
<source path="${basedir}/src/plugin/index-more/src/java/" />
<source path="${basedir}/src/plugin/index-more/src/test/" />
<source path="${basedir}/src/plugin/language-identifier/src/java/" />
<source path="${basedir}/src/plugin/language-identifier/src/test/" />
<source path="${basedir}/src/plugin/lib-http/src/java/" />
<source path="${basedir}/src/plugin/lib-http/src/test/" />
<source path="${basedir}/src/plugin/lib-regex-filter/src/java/" />
<source path="${basedir}/src/plugin/lib-regex-filter/src/test/" />
<source path="${basedir}/src/plugin/microformats-reltag/src/java/" />
<source path="${basedir}/src/plugin/microformats-reltag/src/test/" />
<!-- parse-ext is currently disabled
<source path="${basedir}/src/plugin/parse-ext/src/java/" />
<source path="${basedir}/src/plugin/parse-ext/src/test/" /> -->
<source path="${basedir}/src/plugin/parse-html/src/java/" />
<source path="${basedir}/src/plugin/parse-html/src/test/" />
<source path="${basedir}/src/plugin/parse-js/src/java/" />
<source path="${basedir}/src/plugin/parse-js/src/test/" />
<!-- parse-swf and parse-zip are currently disabled
<source path="${basedir}/src/plugin/parse-swf/src/java/" />
<source path="${basedir}/src/plugin/parse-swf/src/test/" />
<source path="${basedir}/src/plugin/parse-zip/src/java/" />
<source path="${basedir}/src/plugin/parse-zip/src/test/" /> -->
<source path="${basedir}/src/plugin/parse-tika/src/java/" />
<source path="${basedir}/src/plugin/parse-tika/src/test/" />
<source path="${basedir}/src/plugin/protocol-file/src/java/" />
<source path="${basedir}/src/plugin/protocol-file/src/test/" />
<source path="${basedir}/src/plugin/protocol-ftp/src/java/" />
<source path="${basedir}/src/plugin/protocol-httpclient/src/java/" />
<source path="${basedir}/src/plugin/protocol-httpclient/src/test/" />
<source path="${basedir}/src/plugin/protocol-http/src/java/" />
<source path="${basedir}/src/plugin/protocol-sftp/src/java/" />
<source path="${basedir}/src/plugin/scoring-link/src/java/" />
<source path="${basedir}/src/plugin/scoring-opic/src/java/" />
<source path="${basedir}/src/plugin/subcollection/src/java/" />
<source path="${basedir}/src/plugin/subcollection/src/test/" />
<source path="${basedir}/src/plugin/tld/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-automaton/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-automaton/src/test/" />
<source path="${basedir}/src/plugin/urlfilter-domain/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-domain/src/test/" />
<source path="${basedir}/src/plugin/urlfilter-prefix/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-regex/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-regex/src/test/" />
<source path="${basedir}/src/plugin/urlfilter-suffix/src/java/" />
<source path="${basedir}/src/plugin/urlfilter-suffix/src/test/" />
<source path="${basedir}/src/plugin/urlfilter-validator/src/java/" />
<source path="${basedir}/src/plugin/urlnormalizer-basic/src/java/" />
<source path="${basedir}/src/plugin/urlnormalizer-basic/src/test/" />
<source path="${basedir}/src/plugin/urlnormalizer-pass/src/java/" />
<source path="${basedir}/src/plugin/urlnormalizer-pass/src/test/" />
<source path="${basedir}/src/plugin/urlnormalizer-regex/src/java/" />
<source path="${basedir}/src/plugin/urlnormalizer-regex/src/test/" />
<output path="${basedir}/build/classes" />
</classpath>
</eclipse>
</target>
</project>
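The build file above defines the entry points used throughout this series: `ant runtime` (resolve dependencies via Ivy and build `runtime/local` and `runtime/deploy`), `ant test`, `ant clean`, and `ant eclipse`. As a quick sanity check that the file is intact after editing, the targets and their dependency chains can be listed with Python's standard library (a minimal sketch; it assumes you run it from the nutch source root where `build.xml` lives):

```python
import xml.etree.ElementTree as ET

def list_targets(build_file):
    """Return {target-name: [names it depends on]} for an Ant build file."""
    root = ET.parse(build_file).getroot()
    targets = {}
    for t in root.iter("target"):
        deps = [d.strip() for d in t.get("depends", "").split(",") if d.strip()]
        targets[t.get("name")] = deps
    return targets

# Usage (from the nutch source root):
#   for name, deps in sorted(list_targets("build.xml").items()):
#       print(name, "->", deps)
# e.g. the "test" target should depend on test-core and test-plugins.
```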

(4)nutch/ivy/ivy.xml

 <?xml version="1.0" ?>

 <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
You under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->

<ivy-module version="1.0">
<info organisation="org.apache.nutch" module="nutch">
<license name="Apache 2.0"
url="http://www.apache.org/licenses/LICENSE-2.0.txt/" />
<ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org" />
<description homepage="http://nutch.apache.org">Nutch is an open source web-search
software. It builds on Hadoop, Tika and Solr, adding web-specifics, such as a crawler,
a link-graph database etc.
</description>
</info>

<configurations>
<include file="${basedir}/ivy/ivy-configurations.xml" />
</configurations>

<publications>
<!--get the artifact from our module name -->
<artifact conf="master" />
</publications>

<dependencies>
<dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4" conf="*->default"/>
<dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0" conf="*->default" />
<dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1" conf="*->master" />
<dependency org="commons-lang" name="commons-lang" rev="2.4" conf="*->default" />
<dependency org="commons-collections" name="commons-collections"
rev="3.1" conf="*->default" />
<dependency org="commons-httpclient" name="commons-httpclient"
rev="3.1" conf="*->master" />
<dependency org="commons-codec" name="commons-codec" rev="1.3" conf="*->default" />

<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default">
<exclude org="net.sf.kosmosfs" name="kfs" />
<exclude org="net.java.dev.jets3t" name="jets3t" />
<exclude org="org.eclipse.jdt" name="core" />
<exclude org="org.mortbay.jetty" name="jsp-*" />
</dependency>

<dependency org="com.ibm.icu" name="icu4j" rev="4.0.1" />
<dependency org="org.apache.tika" name="tika-core" rev="1.3" />
<dependency org="com.googlecode.juniversalchardet" name="juniversalchardet" rev="1.0.3"/>
<dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
<dependency org="xerces" name="xercesImpl" rev="2.9.1" />
<dependency org="xerces" name="xmlParserAPIs" rev="2.6.2" />
<dependency org="xalan" name="serializer" rev="2.7.1" />
<dependency org="oro" name="oro" rev="2.0.8" />
<dependency org="org.jdom" name="jdom" rev="1.1" conf="*->default" />
<dependency org="com.google.guava" name="guava" rev="11.0.2" />
<dependency org="com.google.code.crawler-commons" name="crawler-commons" rev="0.2" />

<!--Configuration: test -->
<!--artifacts needed for testing -->
<dependency org="junit" name="junit" rev="4.11" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default">
<exclude org="net.sf.kosmosfs" name="kfs" />
<exclude org="net.java.dev.jets3t" name="jets3t" />
<exclude org="org.eclipse.jdt" name="core" />
<exclude org="org.mortbay.jetty" name="jsp-*" />
</dependency>

<dependency org="org.mortbay.jetty" name="jetty" rev="6.1.26" conf="test->default" />
<dependency org="org.mortbay.jetty" name="jetty-util" rev="6.1.26" conf="test->default" />
<dependency org="org.mortbay.jetty" name="jetty-client" rev="6.1.26" />
<dependency org="org.hsqldb" name="hsqldb" rev="2.2.8" conf="*->default" />
<dependency org="org.jdom" name="jdom" rev="1.1" conf="test->default"/>
<dependency org="org.restlet.jse" name="org.restlet" rev="2.0.5" conf="*->default" />
<dependency org="org.restlet.jse" name="org.restlet.ext.jackson" rev="2.0.5" conf="*->default" />

<!--================-->
<!-- Gora artifacts -->
<!--================-->
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
<!-- Uncomment this to use SQL as Gora backend. It should be noted that the
gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3. Users should
downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
<!--
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
-->
<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<!--
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
-->
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

<!-- Uncomment this to use Accumulo as Gora backend. -->
<!--
<dependency org="org.apache.gora" name="gora-accumulo" rev="0.3" conf="*->default" />
-->
<!-- Uncomment this to use Cassandra as Gora backend. -->
<!--
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.3" conf="*->default" />
-->

<!--global exclusion -->
<exclude module="ant" />
<exclude module="slf4j-jdk14" />
<exclude module="slf4j-simple" />
<exclude org="hsqldb"/>
<exclude org="maven-plugins"/>
<exclude module="jmxtools" />
<exclude module="jms" />
<exclude module="jmxri" />
<exclude module="thrift" />
</dependencies>
</ivy-module>
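The important detail in this ivy.xml is that the revisions line up with the installed stack from the start of the post: hadoop-core and hadoop-test are pinned to 1.2.1 (matching Hadoop 1.2.1), and gora-hbase 0.3 is the release that works with HBase 0.94.x. Those pins can be checked mechanically before building (a sketch; the `ivy/ivy.xml` path and the `EXPECTED` versions are assumptions matching this particular setup):

```python
import xml.etree.ElementTree as ET

# Versions this post's environment expects (an assumption for this setup).
EXPECTED = {"hadoop-core": "1.2.1", "hadoop-test": "1.2.1", "gora-hbase": "0.3"}

def check_pins(ivy_file, expected):
    """Compare <dependency rev="..."> entries in ivy.xml against expected versions.
    Returns {name: (found_rev, expected_rev)} for every mismatch or missing entry."""
    root = ET.parse(ivy_file).getroot()
    found = {d.get("name"): d.get("rev") for d in root.iter("dependency")}
    return {name: (found.get(name), rev) for name, rev in expected.items()
            if found.get(name) != rev}

# An empty dict means every pinned dependency matches:
#   mismatches = check_pins("ivy/ivy.xml", EXPECTED)
```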

(5)nutch/ivy/ivysettings.xml

 <ivysettings>

  <!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<!--
see http://www.jayasoft.org/ivy/doc/configuration
-->
<!-- you can override this property to use mirrors
http://repo1.maven.org/maven2/
http://mirrors.dotsrc.org/maven2
http://ftp.ggi-project.org/pub/packages/maven2
http://mirrors.sunsite.dk/maven2
http://public.planetmirror.com/pub/maven2
http://ibiblio.lsu.edu/main/pub/packages/maven2
http://www.ibiblio.net/pub/packages/maven2
-->
<property name="oss.sonatype.org"
value="http://oss.sonatype.org/content/repositories/releases/"
override="false"/>
<property name="repo.maven.org"
value="http://repo1.maven.org/maven2/"
override="false"/>
<property name="snapshot.apache.org"
value="http://people.apache.org/repo/m2-snapshot-repository/"
override="false"/>
<property name="maven2.pattern"
value="[organisation]/[module]/[revision]/[module]-[revision]"/>
<property name="maven2.pattern.ext"
value="${maven2.pattern}.[ext]"/>
<!-- pull in the local repository -->
<include url="${ivy.default.conf.dir}/ivyconf-local.xml"/>
<settings defaultResolver="default"/>
<resolvers>
<ibiblio name="maven2"
root="${repo.maven.org}"
pattern="${maven2.pattern.ext}"
m2compatible="true"
/>
<ibiblio name="apache-snapshot"
root="${snapshot.apache.org}"
pattern="${maven2.pattern.ext}"
m2compatible="true"
/>
<ibiblio name="restlet"
root="http://maven.restlet.org"
pattern="${maven2.pattern.ext}"
m2compatible="true"
/>
<ibiblio name="sonatype"
root="${oss.sonatype.org}"
pattern="${maven2.pattern.ext}"
m2compatible="true"
/>

<chain name="default" dual="true">
<resolver ref="local"/>
<resolver ref="maven2"/>
<resolver ref="sonatype"/>
</chain>
<chain name="internal">
<resolver ref="local"/>
</chain>
<chain name="external">
<resolver ref="maven2"/>
<resolver ref="sonatype"/>
</chain>
<chain name="external-and-snapshots">
<resolver ref="maven2"/>
<resolver ref="apache-snapshot"/>
<resolver ref="sonatype"/>
</chain>
<chain name="restletchain">
<resolver ref="restlet"/>
</chain>
</resolvers>
<modules>
<!--
This forces a requirement for other nutch-artifacts to be built locally
rather than look for them online.
-->
<module organisation="org.apache.nutch" name=".*" resolver="internal"/>
<module organisation="org.restlet" name=".*" resolver="restletchain"/>
<module organisation="org.restlet.jse" name=".*" resolver="restletchain"/>
</modules>
</ivysettings>
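All of the `<ibiblio>` resolvers above share one layout pattern, `[organisation]/[module]/[revision]/[module]-[revision].[ext]`, and `m2compatible="true"` additionally turns the dots in the organisation into path separators. A minimal sketch of how a dependency from ivy.xml maps to a repository URL under that pattern (the function name is illustrative, not part of Ivy):

```python
def maven2_url(root, org, module, rev, ext="jar"):
    """Expand Ivy's maven2 pattern the way an m2compatible ibiblio resolver does:
    [organisation]/[module]/[revision]/[module]-[revision].[ext],
    with dots in the organisation (but not the module) replaced by '/'."""
    org_path = org.replace(".", "/")
    return f"{root.rstrip('/')}/{org_path}/{module}/{rev}/{module}-{rev}.{ext}"

# e.g. gora-hbase 0.3 resolved through the "maven2" resolver:
#   maven2_url("http://repo1.maven.org/maven2/", "org.apache.gora", "gora-hbase", "0.3")
#   -> "http://repo1.maven.org/maven2/org/apache/gora/gora-hbase/0.3/gora-hbase-0.3.jar"
```

This also shows why the restlet artifacts need their own resolver chain: their module names themselves contain dots (`org.restlet`), which stay literal in the path.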


  10. VB数据库经典实例总结(二)

    大家先看一张似图非图的图. 我们先称它为“过程”也许有不对的地方,在我学数据库到这个阶段.到这个刚刚接触.初生牛犊不怕虎的阶段对它的理解是这样的.所有的都是这个过程.只是在这中间掺杂了一些知识点(我们 ...