Scheduled incremental crawling with Nutch

Translated from:

http://wiki.apache.org/nutch/Crawl

Introduction

Note: the script does not use Nutch's crawl command directly (bin/nutch crawl or the "Crawl" class), so URL filtering does not rely on "conf/crawl-urlfilter.txt". Configure the filters in "regex-urlfilter.txt" instead.
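As a rough illustration of how rules in "regex-urlfilter.txt" behave, the sketch below emulates first-match-wins filtering in bash. Each rule line starts with + (accept) or - (reject) followed by a regular expression, and a URL that matches no rule is rejected. The url_allowed helper, the sample rules and the example.com domain are assumptions made for illustration; Nutch evaluates the file with Java regular expressions, so grep -E is only an approximation.

```shell
#!/bin/bash
# Hypothetical helper: emulate first-match-wins evaluation of +/- regex rules.
# Returns 0 if the URL is accepted, 1 if it is rejected or matches no rule.
url_allowed() {
  local url=$1 rules=$2 line
  while IFS= read -r line; do
    case $line in
      ''|'#'*) continue ;;   # skip blank lines and comments
      +*) if printf '%s\n' "$url" | grep -Eq -- "${line#+}"; then return 0; fi ;;
      -*) if printf '%s\n' "$url" | grep -Eq -- "${line#-}"; then return 1; fi ;;
    esac
  done <<< "$rules"
  return 1
}

# Sample rules (assumed for illustration, not taken from a real installation)
rules='# skip image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# accept anything under the placeholder domain example.com
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.'

for u in http://www.example.com/page.html http://www.example.com/logo.png http://other.org/; do
  if url_allowed "$u" "$rules"; then echo "accept $u"; else echo "reject $u"; fi
done
```

Rule order matters: the suffix rule comes first so that images on the accepted domain are still skipped.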

Steps

The script consists of roughly 8 steps:

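Inject URLs (add the seed URLs to the crawl DB)
Generate, Fetch, Parse, Update Loop (repeated: generate the list of URLs to fetch, fetch them, parse the fetched pages, update the DBs)
Merge Segments
Invert Links (build the link DB of incoming links from the fetched pages)
Index
Dedup (remove duplicates)
Merge Indexes
Load new indexes (restart Tomcat so that it loads the new index directory)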
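The control flow of step 2 can be sketched as a dry run. This is only an illustration under stated assumptions: generate_stub and fetch_stub are hypothetical stand-ins for the bin/nutch generate and fetch commands, and the stub is rigged so that the third round finds no URLs.

```shell
#!/bin/bash
# Dry-run sketch of the Generate/Fetch/Update loop.
# generate_stub and fetch_stub are hypothetical placeholders, not Nutch commands.
depth=3
generate_stub() { [ "$1" -lt 2 ]; }   # pretend the third round has nothing to fetch
fetch_stub() { true; }                # pretend every fetch succeeds

rounds=0
for ((i = 0; i < depth; i++)); do
  if ! generate_stub "$i"; then
    echo "stopping at round $((i + 1)): no more URLs to fetch"
    break                             # a failed generate ends the loop early
  fi
  if ! fetch_stub "$i"; then
    echo "fetch failed at round $((i + 1)); skipping the update"
    continue                          # a failed fetch skips the update for this round
  fi
  rounds=$((rounds + 1))
  echo "round $((i + 1)): generated, fetched, updated"
done
```

A failed generate stops the loop, while a failed fetch only skips that round's update, which mirrors how the script treats the two exit codes.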

Modes of Execution

The script can be executed in two modes:

Normal Mode
Safe Mode

Normal Mode

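Run with 'bin/runbot'. After the run, all temporary directories generated during the crawl are deleted.

Note: this means that if the crawl is interrupted for some reason and the crawl DB is left incomplete, there is no way to recover.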

Safe Mode

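Run with 'bin/runbot safe'. In safe mode the directories used during the run are not deleted; all temporary directories are backed up with the "BACKUP" prefix (for example crawl/BACKUPsegments). If something goes wrong, these backups can be used for recovery.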

Normal Mode vs. Safe Mode

Unless you can be sure that nothing will go wrong, we recommend running in safe mode.
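The difference between the two modes can be sketched with two small functions. This is a minimal illustration rather than part of runbot: retire_dir and restore_dir are hypothetical names, and the demo runs in a scratch directory created with mktemp instead of a real crawl directory.

```shell
#!/bin/bash
# retire_dir DIR BACKUP MODE: delete DIR in normal mode, move it to BACKUP in safe mode.
retire_dir() {
  local dir=$1 backup=$2 mode=$3
  if [ "$mode" = "safe" ]; then
    rm -rf "$backup"        # drop the backup left by the previous run
    mv "$dir" "$backup"
  else
    rm -rf "$dir"
  fi
}

# restore_dir BACKUP DIR: hypothetical recovery step, puts a backup back in place.
restore_dir() {
  local backup=$1 dir=$2
  [ -d "$backup" ] || return 1
  rm -rf "$dir"
  mv "$backup" "$dir"
}

# Demo in a scratch directory
work=$(mktemp -d)
mkdir -p "$work/crawl/segments"
touch "$work/crawl/segments/part-0"
retire_dir "$work/crawl/segments" "$work/crawl/BACKUPsegments" safe
ls "$work/crawl"
```

In safe mode the old data survives one more run under the BACKUP name, so a failed crawl can be rolled back with restore_dir; in normal mode there is nothing to roll back to.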

Tinkering

Set 'depth', 'threads', 'adddays' and 'topN' according to your needs.

If you do not want to set 'topN', comment it out or delete it.
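To see how an unset 'topN' changes the generate call, here is a small sketch. The build_generate_args function is a hypothetical helper for illustration; the paths and the -topN and -adddays options are the ones the script itself passes to bin/nutch generate.

```shell
#!/bin/bash
# Tunables, with the same values as in the script
depth=2
threads=5
adddays=5
topN=15   # comment this out to leave topN unset

# Hypothetical helper: print the arguments that would be passed to "bin/nutch generate".
build_generate_args() {
  local args=(crawl/crawldb crawl/segments)
  if [ -n "$topN" ]; then
    args+=(-topN "$topN")       # only added when topN has a value
  fi
  args+=(-adddays "$adddays")
  echo "${args[*]}"
}

build_generate_args
```

With topN=15 this prints "crawl/crawldb crawl/segments -topN 15 -adddays 5"; with topN unset the -topN option is dropped entirely.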

NUTCH_HOME

If you are not running the script as 'bin/runbot' from the Nutch directory, set 'NUTCH_HOME' in the script to your Nutch path:

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
fi

Note: if 'NUTCH_HOME' is already set in your environment, you can skip this.

CATALINA_HOME

'CATALINA_HOME' points to the Tomcat installation directory.

It must be set either in the script or in the environment, in the same way as 'NUTCH_HOME':

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
fi
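The same fallback logic for both variables can also be written with shell parameter expansion. The sketch below is an alternative formulation, not what runbot does verbatim; resolve_home is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: keep the current value if it is non-empty, else use the fallback.
resolve_home() {
  # $1: current value (possibly empty), $2: fallback path
  echo "${1:-$2}"
}

NUTCH_HOME=$(resolve_home "${NUTCH_HOME:-}" .)
CATALINA_HOME=$(resolve_home "${CATALINA_HOME:-}" /opt/apache-tomcat-6.0.10)
echo "runbot: NUTCH_HOME=$NUTCH_HOME CATALINA_HOME=$CATALINA_HOME"
```

For a one-off run the variables can also be supplied on the command line, for example NUTCH_HOME=/srv/nutch bin/runbot safe, where /srv/nutch is a placeholder path.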

Can it re-crawl?

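The author has used the script many times, but please test it first to see whether it suits your work. If re-crawling does not work well, please let us know.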
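To re-crawl on a schedule, one option is to run the script from cron. The sketch below only prints a candidate crontab line; cron_line is a hypothetical helper, and the Nutch path /opt/nutch, the hour and the log file are assumptions to adapt. Note that cron jobs usually do not inherit your login environment, so the defaults inside the script (or explicit assignments) decide 'NUTCH_HOME' and 'CATALINA_HOME'.

```shell
#!/bin/bash
# Hypothetical helper: print a crontab line that runs runbot in safe mode once a day.
# $1: Nutch directory, $2: hour of day (0-23), $3: log file
cron_line() {
  local nutch_home=$1 hour=$2 log=$3
  printf '0 %s * * * cd %s && bin/runbot safe >> %s 2>&1\n' "$hour" "$nutch_home" "$log"
}

cron_line /opt/nutch 2 /var/log/runbot.log
```

The printed line can be pasted into the crontab opened by crontab -e. Changing into the Nutch directory first lets the script's default NUTCH_HOME=. and its relative crawl/ paths resolve as expected.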

Script

#!/bin/bash
#
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=5
adddays=5
topN=15 # Comment out this statement if you don't want to set a topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments: decide which mode to run in
if [ "$1" = "safe" ]
then
  safe=yes
fi

# Check whether NUTCH_HOME is set
if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

# Check whether the Tomcat path (CATALINA_HOME) is set
if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

# Turn topN into a command-line option, or leave it empty if unset
if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8

echo "----- Inject (Step 1 of $steps) -----" # inject the seed URLs
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----" # run the crawl loop
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  # Segment names are timestamps, so the last one listed is the newest
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----" # merge the segments
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi

mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----" # build the link DB
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----" # build the indexes
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----" # remove duplicates
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----" # merge the indexes
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----" # restart Tomcat so it loads the new index directory
${CATALINA_HOME}/bin/shutdown.sh

if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi

mv $MVARGS crawl/NEWindex crawl/index

${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""