Nutch configuration: nutch-default.xml explained
/×××××××××××××××××××××××××××××××××××××××××/
Author:xxx0624
HomePage:http://www.cnblogs.com/xxx0624/
/×××××××××××××××××××××××××××××××××××××××××/
===============File===============
Setting 1:
<property>
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
Limits the size of content downloaded via the file protocol; the default is 65536 bytes. Content exceeding the limit is truncated.
Setting 2:
<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content will be saved during fetch.
And it is probably what we want to set most of time, since file:// URLs
are meant to be local and we can always use them directly at parsing
and indexing stages. Otherwise file contents will be saved.
!! NO IMPLEMENTED YET !!
</description>
</property>
If set to true, Nutch will not save file content during fetch (note that the description marks this as not implemented yet).
===============HTTP===============
Setting 1 (important!):
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version and set their values appropriately. </description>
</property>
This defines the 'User-Agent' request header that Nutch sends; the "agent" here is the crawler's identity, not an HTTP proxy.
It must be set to a non-empty value before crawling, together with the related http.agent.* properties!
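For instance, a minimal override in conf/nutch-site.xml might look like the following sketch (the agent name "MyNutchSpider" is a made-up placeholder; use a single word that identifies your organization, and consider filling in the other http.agent.* properties as well):
<property>
<name>http.agent.name</name>
<value>MyNutchSpider</value>
</property>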
Setting 2 (important!):
<property>
<name>http.robots.agents</name>
<value>*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
Many sites publish a robots.txt file to regulate crawlers.
With this set, Nutch matches the agent names listed here (in decreasing order of precedence) against each site's robots.txt rules; pages disallowed for the matching agent will not be fetched.
Setting 3:
<property>
<name>http.robots.403.allow</name>
<value>true</value>
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
</property>
Some servers return HTTP 403 when /robots.txt does not exist; with the default of true, Nutch still treats such sites as crawlable.
If this is set to false, such sites are treated as forbidden and will not be crawled.
Setting 4:
<property>
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
The default network timeout is 10000 ms (10 seconds).
Setting 5 (worth considering when tuning Nutch's crawl speed):
<property>
<name>http.max.delays</name>
<value>100</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attepts, it will give
up on the page for now.</description>
</property>
The maximum number of times a fetcher thread will wait for a page. Each time the thread finds the host busy, it waits fetcher.server.delay seconds; once the total number of waits exceeds http.max.delays, Nutch gives up on the page for now.
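As a worked example with the defaults shown here: a page on a persistently busy host is given up after at most http.max.delays × fetcher.server.delay = 100 × 5.0 = 500 seconds of accumulated waiting, so lowering either value makes the fetcher abandon congested hosts sooner.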
Setting 6:
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
Limits the size of page content downloaded over the HTTP protocol, at most 65536 bytes by default.
Content exceeding the limit is truncated.
Setting 7 (proxy settings):
<property>
<name>http.proxy.host</name>
<value></value>
<description>The proxy hostname. If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value></value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value></value>
<description>Username for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
NOTE: For NTLM authentication, do not prefix the username with the
domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
</description>
</property>
<property>
<name>http.proxy.password</name>
<value></value>
<description>Password for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
</description>
</property>
These set the proxy hostname, port number, proxy username, and proxy password, respectively.
All of these settings relate to the protocol-httpclient plugin.
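As a sketch, routing the crawl through a proxy at proxy.example.com:8080 (both values are hypothetical) could look like this in conf/nutch-site.xml; the username/password pair is only needed when the proxy demands authentication, in which case protocol-httpclient must also appear in plugin.includes:
<property>
<name>http.proxy.host</name>
<value>proxy.example.com</value>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
</property>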
===============FTP=================
None for now.
===============web db===============
(1) Settings used during the Fetch phase (only some are listed)
Setting 1:
<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>
Sets the interval at which pages are periodically re-fetched; the default is 30 days.
The unit is seconds.
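For example, to re-fetch pages weekly instead of every 30 days, one could override the value in conf/nutch-site.xml with 7 × 86400 = 604800 seconds:
<property>
<name>db.fetch.interval.default</name>
<value>604800</value>
</property>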
Setting 2:
<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a page
(90 days). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
This means that once db.fetch.interval.max has elapsed, every page in the database will definitely be re-fetched, regardless of its current status.
Setting 3:
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
The class named here implements the fetch schedule.
DefaultFetchSchedule simply adds the original fetch interval to the last fetch time, regardless of whether each page has changed.
Setting 4:
<property>
<name>db.fetch.schedule.adaptive.inc_rate</name>
<value>0.4</value>
<description>If a page is unmodified, its fetchInterval will be
increased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
If a re-fetched page turns out to be unmodified when the database is updated, its fetch interval is increased by this rate, i.e. new interval = old interval × (1 + 0.4). The rate must not exceed 0.5, or the algorithm becomes unstable.
Setting 5:
<property>
<name>db.fetch.schedule.adaptive.dec_rate</name>
<value>0.2</value>
<description>If a page is modified, its fetchInterval will be
decreased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
If a re-fetched page turns out to have changed when the database is updated, its fetch interval is decreased by this rate, i.e. new interval = old interval × (1 − 0.2). The rate must not exceed 0.5, or the algorithm becomes unstable.
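Note that these two adaptive rates only take effect once the schedule class is switched away from the default. A sketch of enabling the adaptive schedule in conf/nutch-site.xml (the class name is from the Nutch codebase):
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
With the default rates, an unmodified page whose interval is 2592000 s grows to 2592000 × 1.4 = 3628800 s on the next cycle, while a modified page shrinks to 2592000 × 0.8 = 2073600 s.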
Setting 6:
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>60.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
The minimum fetch interval for a page, in seconds.
<property>
<name>db.fetch.schedule.adaptive.max_interval</name>
<value>31536000.0</value>
<description>Maximum fetchInterval, in seconds (365 days).
NOTE: this is limited by db.fetch.interval.max. Pages with
fetchInterval larger than db.fetch.interval.max
will be fetched anyway.</description>
</property>
The maximum fetch interval for a page, in seconds; note that this is limited by db.fetch.interval.max — pages whose interval exceeds that value are fetched anyway.
(2) Generate settings
Setting 7:
<property>
<name>generate.max.count</name>
<value>-1</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
Caps the number of URLs in a single fetchlist; -1 means unlimited. URLs are counted according to generate.count.mode.
Setting 8:
<property>
<name>generate.count.mode</name>
<value>host</value>
<description>Determines how the URLs are counted for generator.max.count.
Default value is 'host' but can be 'domain'. Note that we do not count
per IP in the new version of the Generator.
</description>
</property>
Determines whether URLs are counted per host (the default) or per domain when applying generate.max.count, as in the sketch below.
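For example, to cap each fetchlist at 100 URLs per host (the cap of 100 is illustrative):
<property>
<name>generate.max.count</name>
<value>100</value>
</property>
<property>
<name>generate.count.mode</name>
<value>host</value>
</property>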
Setting 9:
<property>
<name>generate.update.crawldb</name>
<value>false</value>
<description>For highly-concurrent environments, where several
generate/fetch/update cycles may overlap, setting this to true ensures
that generate will create different fetchlists even without intervening
updatedb-s, at the cost of running an additional job to update CrawlDB.
If false, running generate twice without intervening
updatedb will generate identical fetchlists.</description>
</property>
In a highly concurrent environment, several generate/fetch/update cycles may overlap.
If set to true, generate produces distinct fetchlists even without an intervening updatedb, at the cost of running an extra job to update the CrawlDB.
If set to false, running generate twice without an updatedb in between produces identical fetchlists.
(3) Partitioner (URL distribution strategy)
Setting 10:
<property>
<name>partition.url.mode</name>
<value>byHost</value>
<description>Determines how to partition URLs. Default value is 'byHost',
also takes 'byDomain' or 'byIP'.
</description>
</property>
Partitions URLs by host (the default); byDomain and byIP are also accepted.
(4) Fetcher (download settings)
Setting 11:
<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
The delay, in seconds, between successive requests to the same server.
Setting 12 (important!):
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
10 fetcher threads by default.
Setting 13 (important! worth considering to speed up the Nutch crawler):
<property>
<name>fetcher.threads.per.queue</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.</description>
</property>
Sets how many threads may access the same queue at one time.
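A sketch of a more aggressive fetcher setup in conf/nutch-site.xml (the values 50 and 5 are illustrative, not recommendations — crawl politely). Note that, per nutch-default.xml, when fetcher.threads.per.queue is greater than 1 the in-queue politeness delay is governed by fetcher.server.min.delay rather than fetcher.server.delay:
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>5</value>
</property>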
Setting 14:
<property>
<name>fetcher.store.content</name>
<value>true</value>
<description>If true, fetcher will store content.</description>
</property>
If true, fetcher threads store the downloaded content.
Setting 15:
<property>
<name>fetcher.throughput.threshold.pages</name>
<value>-1</value>
<description>The threshold of minimum pages per second. If the fetcher downloads less
pages per second than the configured threshold, the fetcher stops, preventing slow queue's
from stalling the throughput. This threshold must be an integer. This can be useful when
fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
</description>
</property>
This sets a throughput floor for the fetcher: if fewer pages per second are downloaded than this threshold, the fetcher stops, preventing slow queues from stalling the throughput.
The default of -1 disables the check.
===============index===============
I don't fully understand the indexing part, so only some settings are listed...
Setting 1:
<property>
<name>indexer.max.title.length</name>
<value></value>
<description>The maximum number of characters of a title that are indexed. A value of -1 disables this check.
Used by index-basic.
</description>
</property>
Sets the maximum length of a title that can be indexed.
================plugin===============
Setting 1:
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
Sets where Nutch plugins are located.
Setting 2:
<property>
<name>plugin.auto-activation</name>
<value>true</value>
<description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automaticaly
activated if they are needed by some actived plugins.
</description>
</property>
Controls whether a plugin that is not activated by plugin.includes/plugin.excludes is activated automatically when activated plugins depend on it.
The default of true means auto-activation.
Setting 3:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
The list of plugin names to include (regular expressions supported).
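For example, following the note in the description, HTTPS crawling can be enabled by replacing protocol-http with protocol-httpclient while keeping the rest of the default value:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>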
Setting 4:
<property>
<name>plugin.excludes</name>
<value></value>
<description>Regular expression naming plugin directory names to exclude.
</description>
</property>
The list of plugin names to exclude (regular expressions supported).
===============parse===============
Setting 1:
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
Names the file that maps content types to their parsers.
Setting 2:
<property>
<name>htmlparsefilter.order</name>
<value></value>
<description>The order by which HTMLParse filters are applied.
If empty, all available HTMLParse filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order.
HTMLParse filter ordering MAY have an impact
on end result, as some filters could rely on the metadata generated by a previous filter.
</description>
</property>
Sets the order in which HTMLParse filters are applied. If empty, all filters enabled by plugin.includes/plugin.excludes are loaded in system-defined order.
Setting 3 (important!):
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
Names the file (on the classpath) of regular expressions used by the urlfilter-regex plugin to filter URLs.
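The file contains one rule per line: a regex prefixed with '+' accepts matching URLs, '-' rejects them, and the first matching rule wins. A sketch of a regex-urlfilter.txt restricted to one site (example.com is a placeholder):
# skip common binary and image suffixes
-\.(gif|jpg|png|ico|css|zip|gz|exe)$
# accept only pages under example.com
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.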
===============Solr & Elasticsearch================
Setting 1 (important!):
<property>
<name>solr.mapping.file</name>
<value>solrindex-mapping.xml</value>
<description>
Defines the name of the file that will be used in the mapping of internal
nutch field names to solr index fields as specified in the target Solr schema.
</description>
</property>
Sets the mapping between internal Nutch fields and Solr index fields.
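As a rough sketch of the file's shape (the real solrindex-mapping.xml ships with Nutch; the fields below are illustrative), each <field> maps an internal Nutch field to a field in the Solr schema:
<mapping>
<fields>
<field dest="title" source="title"/>
<field dest="content" source="content"/>
</fields>
<uniqueKey>id</uniqueKey>
</mapping>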
Setting 2:
<property>
<name>solr.commit.index</name>
<value>true</value>
<description>
When closing the indexer, trigger a commit to the Solr server.
</description>
</property>
When the indexer is closed, a commit is triggered on the Solr server.
Setting 3:
<property>
<name>elastic.index</name>
<value>index</value>
<description>
The name of the elasticsearch index. Will normally be autocreated if it
doesn't exist.
</description>
</property>
Sets the default name of the Elasticsearch index.
Setting 4:
<property>
<name>elastic.max.bulk.docs</name>
<value>500</value>
<description>
The number of docs in the batch that will trigger a flush to elasticsearch.
</description>
</property>
Sets how many documents per bulk batch trigger a flush to Elasticsearch.
==================store==================
Setting 1:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.memory.store.MemStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore
Default store. A DataStore implementation for RDBMS with a SQL interface.
SqlStore uses JDBC drivers to communicate with the DB. As explained in
ivy.xml, currently >= gora-core 0.3 is not backwards compatable with SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is
a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>
Specifies the storage backend, e.g. HBase, Avro, and so on.
Method 2: the storage backend can also be changed via the Gora configuration file, as sketched below.
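A sketch of switching the backend to HBase (assuming the gora-hbase dependency is available): either override the class in conf/nutch-site.xml,
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
or, per the second method, set the default store in conf/gora.properties (key name per Gora's conventions):
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore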
定于11月发布的Liferay IDE提供了一些让人期许的功能 1. code upgrade tools 这个工具将会帮助你把liferay 6.2的项目升级为7.0的项目.下面列举其主要功能 1. ...