Nutch网页抓取速度优化

Here are the things that could potentially slow down fetching

1) DNS setup

2) The number of crawlers you have, too many, too few.

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFS, very large html pages, again possibly an uneven distribution).

9) Others

Now how do we fix them

1) Have a DNS setup on each local crawling machine, if multiple crawling machines and a single centralized DNS it can act like a DOS attack on the DNS server slowing the entire system. We always did a two layer setup hitting first to the local DNS cache then to a large DNS cache like OpenDNS or Verizon.

2) This would be number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system, too few and other factors and the machine sites idle. You will need to play around with this setting for your setup.

3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for in and out bandwidth. A simple test, from a server inside the fetching network but not itself fetching, if it is very slow connecting to or downloading content when fetching is occurring, it is a good bet you are maxing out bandwidth. If you set http timeout as we describe later and are maxing your bandwidth, you will start seeing many http timeout errors.

4) Politeness along with uneven distribution of urls is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of urls from that site to fetch all other threads will sit idle while that one thread finishes. Some solutions, use fetcher.server.delay to shorten the time between page fetches and use fetcher.threads.per.host to increase the number of threads fetching for a single site (this would still be in the same map task though and hence the same JVM ChildTask process). If increasing this > 0 you could also set fetcher.server.min.delay to some value > 0 for politeness to min and max bound the process.

5) Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch.

6) Crawl-delay can be used and is obeyed by nutch in robots.txt. Most sites don't use this setting but a few (some malicious do). I have seen crawl-delays as high as 2 days in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds, default is 30. Even at 10 seconds if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds it is going to be slow. On the flip side, setting this to a low value will ignore and not fetch those pages.

7) Sometimes, manytimes websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, digg for instance, will bandwidth limit you on their side only allowing x connections per given time frame. So even if you only have say 50 pages from a single site (which I still think is to many). It may be waiting 10 seconds on each page. The ftp.timeout can also be set if fetching ftp content.

8) Lots of content means slower fetching. If downloading PDFs and other non-html documents this is especially true. To avoid non-html content you can use the url filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.

9) Other things that could be causing slow fetching:

Max the number of open sockets/files on a machine. You will start seeing IO errors or can't open socket errors.
Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing can also be a problem from your network provider.
Bad network cards. I have seen network cards flip once they reach a certain bandwidth point. This was more prevalent on, at the time, newer gigabit cards. Not usually my first thought but always a possibility. Use tcpdump and network monitoring tools on the single interface.

Nutch网页抓取速度优化的更多相关文章

Heritrix源码分析(三) 修改配置文件order.xml加快你的抓取速度(转）
本博客属原创文章,欢迎转载!转载请务必注明出处:http://guoyunsky.iteye.com/blog/629891 本博客已迁移到本人独立博客: http://www.yun5u ...
基于Casperjs的网页抓取技术【抓取豆瓣信息网络爬虫实战示例】
CasperJS is a navigation scripting & testing utility for the PhantomJS (WebKit) and SlimerJS (Ge ...
Python网络爬虫笔记（一）：网页抓取方式和LXML示例
(一) 三种网页抓取方法 1. 正则表达式: 模块使用C语言编写,速度快,但是很脆弱,可能网页更新后就不能用了. 2. Beautiful Soup 模块使用Python编写,速度慢. ...
PID控制器的应用：控制网络爬虫抓取速度
一.初识PID控制器冬天乡下人喜欢烤火取暖,常见的情形就是四人围着麻将桌,桌底放一盆碳火.有人觉得火不够大,那加点木炭吧,还不够,再加点.片刻之后,又觉得火太大,脚都快被烤熟了,那就取出一些木碳…… ...
实现织梦dedecms百度主动推送(实时)网页抓取
做百度推广的时候,如何让百度快速收录呢,下面提供了三种方式,今天我们主要讲的是第一种. 如何选择链接提交方式 1.主动推送:最为快速的提交方式,推荐您将站点当天新产出链接立即通过此方式推送给百度,以保 ...
分享一个c#t的网页抓取类
using System; using System.Collections.Generic; using System.Web; using System.Text; using System.Ne ...
java网页抓取
网页抓取就是,我们想要从别人的网站上得到我们想要的,也算是窃取了,有的网站就对这个网页抓取就做了限制,比如百度直接进入正题 //要抓取的网页地址 String urlStr = "http ...
网页抓取：PHP实现网页爬虫方式小结
来源:http://www.ido321.com/1158.html 抓取某一个网页中的内容,需要对DOM树进行解析,找到指定节点后,再抓取我们需要的内容,过程有点繁琐.LZ总结了几种常用的.易于实现 ...
Java实现网页抓取的一个Demo
这个小案例的话我是存放在我的github 上. 下面给出链接自己可以去看下,也可以直接下载源码.有具体的说明 <Java网页抓取>

随机推荐

js函数易犯的错误
1.定义一个js函数时不能写在另一个函数里面. 2.定义一个打开网页自动执行的函数,一定要注意加载的顺序.如果是不是自动执行的,而是等页面加载完后事件触发的,那写在任何地方都没问题. 错误的:
1.27eia原油
容器云平台使用体验：数人云Crane（续）
数人云在9月6日开通了容器管理面板Crane的试用活动,这是国内首个基于DockerSwarmKit的容器管理工具.它具有Docker原生编排功能,采用轻量化架构,帮助开发者快速搭建DevOps环境, ...
python 常见包中的不定参数
LeetcCode102 Binary Tree Level Order Traversal
Given a binary tree, return the level order traversal of its nodes' values. (ie, from left to right, ...
ELK2之ELK的语法学习
1.回顾 (1)es是什么? es是基于Apache Lucene的开源分布式(全文)搜索引擎,提供简单的RESTful API来隐藏Lucene的复杂性. es除了全文搜索引擎之外,还可以这样描述它 ...
用div漂浮快实现与表单无关的多文件上传功能。
我项目有这个需求,多文件上传,而且要及时显示到表单上,这样的话就不能与表单相关. 由于我对前端不熟,我就实现了这么一个功能,通过button触发一个div漂浮块,然后多文件上传,之后通过js把文件名显 ...
Maven command
mvn eclispe:clean mvn eclispe:eclispe mvn clean package mvn clean package -Dmaven.test.skip=true mvn ...
算法导论笔记：18B树
磁盘作为辅存,它的容量要比内存大得多,但是速度也要慢许多,下面就是磁盘的的结构图: 磁盘驱动器由一个或多个盘片组成,它们以固定的速度绕着主轴旋转,数据存储于盘片的表面,磁盘驱动器通过磁臂末尾的磁头来读 ...
vmware中centos、redhat桥接网络配置
第一步第二步第三步 centos: redhat:

Nutch网页抓取速度优化

Nutch网页抓取速度优化的更多相关文章

随机推荐

热门专题