[Nutch 2.2.1 Basics Tutorial, Part 3] Nutch 2.2.1 Configuration Files. Category: H3_NUTCH. 2014-08-18 16:33

nutch-site.xml

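Nutch 2.2.1 reads its properties from two configuration files: nutch-default.xml and nutch-site.xml.

The first holds the default properties that ship with Nutch and should normally be left unmodified.

To change a default, add a property with the same name to nutch-site.xml and set the new value there. Values in nutch-site.xml override those in nutch-default.xml.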
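The override rule described above can be illustrated with a small Python sketch. It only mimics the "site value wins over default value" behaviour and is not how Nutch loads its configuration internally; the two XML strings, their sample values, and the function names `read_properties` and `effective_config` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-ins for the two files.
DEFAULT_XML = """<configuration>
  <property><name>db.ignore.external.links</name><value>false</value></property>
  <property><name>fetcher.parse</name><value>false</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>db.ignore.external.links</name><value>true</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the XML text."""
    props = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let same-named site properties win."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
print(config["db.ignore.external.links"])  # true  (overridden by the site file)
print(config["fetcher.parse"])             # false (default kept)
```

Properties that appear only in the default file keep their default value, which is why nutch-site.xml needs to list only the properties you actually want to change.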

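1. db.ignore.external.links

If true, only pages on the originally injected hosts are crawled; outlinks that lead to external hosts are ignored.

The same effect can be achieved by adding filters to regex-urlfilter.txt, but a large number of filters (several thousand, say) severely degrades Nutch's performance.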

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
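To make the effect of this property concrete, here is a minimal Python sketch of the idea: outlinks whose host differs from the host of the page they were found on are dropped. This is an illustration only, not Nutch's actual Java implementation; the function name `filter_outlinks` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def filter_outlinks(page_url, outlinks, ignore_external_links):
    """Keep only same-host outlinks when ignore_external_links is True."""
    if not ignore_external_links:
        return list(outlinks)
    # urlparse(...).hostname is already lower-cased; it is None for odd URLs.
    page_host = urlparse(page_url).hostname or ""
    return [
        link for link in outlinks
        if (urlparse(link).hostname or "") == page_host
    ]


links = [
    "http://example.com/about",
    "http://other.example.org/page",
    "http://EXAMPLE.com/contact",
]
print(filter_outlinks("http://example.com/index.html", links, True))
```

A single host comparison per outlink is cheap, which is the reason this setting scales better than evaluating thousands of regular expressions from regex-urlfilter.txt against every URL.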

2. fetcher.parse

Can pages be parsed while they are being fetched? Yes, but this is not recommended.

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.</description>
</property>

Official explanation:

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher's reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.

3. db.max.outlinks.per.page

By default, Nutch processes at most 100 outlinks per page, so on link-heavy pages some links are never fetched. To change this, override this property.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

The official FAQ explains this as follows: http://wiki.apache.org/nutch/FAQ/

Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

The crawl tool has a default limitation of 100 outlinks of one page that are being fetched. To overcome this limitation change the db.max.outlinks.per.page property to a higher value or simply -1 (unlimited).

file: conf/nutch-default.xml

 <property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
 </property>
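The rule stated in the property description (a nonnegative value caps the number of outlinks, a negative value removes the cap) can be written down as a short Python sketch. It is only an illustration of the documented semantics, not Nutch's own code; the function name `limit_outlinks` is hypothetical.

```python
def limit_outlinks(outlinks, max_outlinks_per_page):
    """Apply the documented db.max.outlinks.per.page semantics to a list."""
    if max_outlinks_per_page >= 0:
        # Nonnegative: process at most this many outlinks (0 means none).
        return outlinks[:max_outlinks_per_page]
    # Negative (for example -1): process every outlink.
    return list(outlinks)


# 250 made-up outlinks found on one page.
found = ["http://example.com/page/%d" % i for i in range(250)]

print(len(limit_outlinks(found, 100)))  # 100
print(len(limit_outlinks(found, -1)))   # 250
```

Note that 0 is a valid nonnegative value and means that no outlinks are processed at all, so use -1 rather than 0 when you want no limit.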

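4. file.content.limit, http.content.limit, ftp.content.limit

By default, Nutch downloads only the first 65536 bytes of a page; anything beyond that is discarded.

On some large sites, however, the home page is far larger than 65536 bytes, and the first 65536 bytes may consist entirely of layout markup without a single hyperlink.

The defaults are therefore changed as follows: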

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
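The truncation rule shared by the three content limit properties can be sketched in Python as follows. This illustrates only the documented semantics (nonnegative limit truncates, negative limit disables truncation) and is not taken from Nutch's protocol plugins; the function name `apply_content_limit` and the sample page are assumptions for the example.

```python
def apply_content_limit(content, limit):
    """Truncate downloaded bytes to `limit` when it is nonnegative."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    # Negative limit, or content already short enough: keep everything.
    return content


# A made-up page: 70000 bytes of layout markup followed by one link.
page = b"x" * 70000 + b'<a href="/next">next</a>'

truncated = apply_content_limit(page, 65536)
print(len(truncated), b"<a href" in truncated)

full = apply_content_limit(page, -1)
print(len(full), b"<a href" in full)
```

With the 65536-byte limit the only link on this sample page is cut off, which is the situation the section describes; with -1 the link survives. Keep in mind that removing the limit also means very large files are downloaded in full.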
