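I. Crawl process overview
1. The Nutch crawl process
When a crawl job is run with the crawl command, the basic steps are:
(1) InjectorJob
Start of iteration 1
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of iteration 2
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of iteration 3
...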
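
The sequence above can be sketched as a small Python helper that only builds the command lines and does not run them. The function name, the seed directory `urls`, the crawl ID `334` and the Solr URL `http://localhost:8983/solr/` are illustrative assumptions; the subcommands and flags are the ones used in the step-by-step part of this article.

```python
from typing import List


def build_crawl_commands(seed_dir: str, crawl_id: str, solr_url: str,
                         iterations: int) -> List[List[str]]:
    """Return the bin/nutch invocations for one inject plus N iterations."""
    nutch = "bin/nutch"
    # Injection happens once, before the first iteration.
    commands = [[nutch, "inject", seed_dir, "-crawlId", crawl_id]]
    for _ in range(iterations):
        # Each iteration repeats steps (2) to (6).
        commands.append([nutch, "generate", "-crawlId", crawl_id])
        commands.append([nutch, "fetch", "-all", "-crawlId", crawl_id])
        commands.append([nutch, "parse", "-all", "-crawlId", crawl_id])
        commands.append([nutch, "updatedb", "-crawlId", crawl_id])
        commands.append([nutch, "solrindex", solr_url, "-all", "-crawlId", crawl_id])
    return commands


if __name__ == "__main__":
    for cmd in build_crawl_commands("urls", "334", "http://localhost:8983/solr/", 2):
        print(" ".join(cmd))
```

The sketch uses `-all` for simplicity. Judging from the crawl log, the crawl command seems to pass the generated batch ID to the fetch and parse steps instead.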

2. Crawl log

When crawling with the crawl command, the console output is as follows:

InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
-finishing thread FetcherThread45, activeThreads=46
-finishing thread FetcherThread44, activeThreads=45
-finishing thread FetcherThread43, activeThreads=44
-finishing thread FetcherThread42, activeThreads=43
-finishing thread FetcherThread41, activeThreads=42
-finishing thread FetcherThread40, activeThreads=41
-finishing thread FetcherThread39, activeThreads=40
-finishing thread FetcherThread38, activeThreads=39
-finishing thread FetcherThread37, activeThreads=38
-finishing thread FetcherThread36, activeThreads=37
-finishing thread FetcherThread35, activeThreads=36
-finishing thread FetcherThread34, activeThreads=35
-finishing thread FetcherThread33, activeThreads=34
-finishing thread FetcherThread32, activeThreads=33
-finishing thread FetcherThread31, activeThreads=32
-finishing thread FetcherThread30, activeThreads=31
-finishing thread FetcherThread29, activeThreads=30
-finishing thread FetcherThread48, activeThreads=29
-finishing thread FetcherThread27, activeThreads=29
-finishing thread FetcherThread26, activeThreads=28
-finishing thread FetcherThread25, activeThreads=27
-finishing thread FetcherThread24, activeThreads=26
-finishing thread FetcherThread23, activeThreads=25
-finishing thread FetcherThread22, activeThreads=24
-finishing thread FetcherThread21, activeThreads=23
-finishing thread FetcherThread20, activeThreads=22
-finishing thread FetcherThread19, activeThreads=21
-finishing thread FetcherThread18, activeThreads=20
-finishing thread FetcherThread17, activeThreads=19
-finishing thread FetcherThread16, activeThreads=18
-finishing thread FetcherThread15, activeThreads=17
-finishing thread FetcherThread14, activeThreads=16
-finishing thread FetcherThread13, activeThreads=15
-finishing thread FetcherThread12, activeThreads=14
-finishing thread FetcherThread11, activeThreads=13
-finishing thread FetcherThread10, activeThreads=12
-finishing thread FetcherThread9, activeThreads=11
-finishing thread FetcherThread8, activeThreads=10
-finishing thread FetcherThread7, activeThreads=9
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread49, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread28, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
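
The part of each batch ID before the dash looks like a Unix timestamp in seconds. This is an inference from the log, not something the log states. The Python sketch below converts it to local time, assuming CST here means UTC+8; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def batch_id_time(batch_id: str, utc_offset_hours: int = 8) -> datetime:
    """Interpret the part before '-' as epoch seconds (an assumption)."""
    seconds = int(batch_id.split("-", 1)[0])
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz=tz)


print(batch_id_time("1404787293-26339").strftime("%Y-%m-%d %H:%M:%S"))
# prints 2014-07-08 10:41:33
```

That value matches the "Iteration 1 of 5" line in the log, and the second batch ID (1404787338-30453) decodes to 10:42:18, the time of the "Iteration 2 of 5" line.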

II. Crawling step by step with individual commands

1. InjectorJob
This step injects the URLs in seed.txt into the crawl queue to initialize it.
(1) Basic command
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
The content of urls/seed.txt is:
http://stackoverflow.com/
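(2) Inspecting the injected URL
The step above creates a new table in HBase, named after the crawl ID (test_1_webpage for crawl ID test_1; the scan below uses 334_webpage, the table for crawl ID 334), and the information for each injected URL is written to it.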
hbase(main):002:0> scan '334_webpage'
ROW                              COLUMN+CELL                                                                               
 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00                                  
 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D                     
 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y                                       
 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0                                           
 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00                            
 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00                                   
1 row(s) in 0.3020 seconds
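
The row key in the scan above, com.stackoverflow:http/, is the seed URL with its host name reversed, followed by the protocol and the path. The Python sketch below reproduces that layout. It is a simplified illustration under that assumption, not a port of Nutch's own code, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Build a row key of the form reversed.host:protocol[:port]/path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # stackoverflow.com -> com.stackoverflow
    reversed_host = ".".join(reversed(host.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


print(reverse_url("http://stackoverflow.com/"))
# prints com.stackoverflow:http/
```

Keeping the reversed host at the front of the key means rows of the same domain sort next to each other in HBase.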
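(3) About the **_webpage table
Each job gets its own table named crawlId_webpage. Information about every URL, fetched or not, is stored in this table.
If a URL has not been fetched yet, its row holds little information. Once it has been fetched, the fetched data, such as the page content, is also stored in that row.
2. GeneratorJob
(1) Basic command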
[jediael@jediael local]$  bin/nutch generate -crawlId 334
GeneratorJob: starting at 2014-08-25 15:57:12
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
GeneratorJob: generated batch id: 1408953432-1171377744
(2) Command options
[root@jediael local]# bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
 -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE 
   -crawlId <id>  - the id to prefix the schemas to operate on, default: storage.crawl.id)"); 
   -noFilter      - do not activate the filter plugin to filter the url, default is true 
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true 
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id 
----------------------
Please set the params.
(3) Inspecting the database
hbase(main):003:0> scan '334_webpage' 
ROW                              COLUMN+CELL                                                                                
 com.stackoverflow:http/         column=f:bid, timestamp=1408953437910, value=1408953432-1171377744                         
 com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00                                  
 com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D                     
 com.stackoverflow:http/         column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744                    
 com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y                                       
 com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0                                           
 com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00                            
 com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00                                   
1 row(s) in 0.0490 seconds
This step adds two columns: f:bid and mk:_gnmrk_.
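
The binary cell values in the scans above can be decoded with a few lines of Python. The sketch below assumes that the HBase shell prints non-printable bytes as \xHH and everything else as plain ASCII, and that f:fi is a 4-byte integer (fetch interval), f:ts an 8-byte integer (fetch time in milliseconds) and s:s a 4-byte float (score), all big-endian. These field meanings are inferred from the names and decoded values; the helper name is hypothetical.

```python
import string
import struct


def parse_hbase_binary(text: str) -> bytes:
    """Turn HBase shell output such as \\x00'\\x8D\\x00 back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        chunk = text[i + 2:i + 4]
        is_escape = (text.startswith("\\x", i) and len(chunk) == 2
                     and all(c in string.hexdigits for c in chunk))
        if is_escape:
            out.append(int(chunk, 16))  # \xHH escape -> one byte
            i += 4
        else:
            out.append(ord(text[i]))    # printable ASCII character
            i += 1
    return bytes(out)


fetch_interval = struct.unpack(">i", parse_hbase_binary(r"\x00'\x8D\x00"))[0]
fetch_time_ms = struct.unpack(">q", parse_hbase_binary(r"\x00\x00\x01H\x0C&\x11\x8D"))[0]
score = struct.unpack(">f", parse_hbase_binary(r"?\x80\x00\x00"))[0]

print(fetch_interval)  # 2592000, i.e. 30 days in seconds
print(fetch_time_ms)   # 1408953094541
print(score)           # 1.0
```

The decoded fetch time is a few seconds before the cell timestamp 1408953100271 shown in the scan, which supports reading f:ts as epoch milliseconds.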
3. FetcherJob
(1) Basic command
[jediael@jediael local]$  bin/nutch fetch -all -crawlId 334
FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://stackoverflow.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
(2) Inspecting the database
See db1.txt.
This step adds or updates fields such as f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options and mk:_ftcmrk_.
4. ParserJob
(1) Basic command
[jediael@jediael local]$ bin/nutch parse  -all -crawlId 334
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: parsing all
Parsing http://stackoverflow.com/
ParserJob: success
(2) Command options
[root@jediael local]# bin/nutch parse 
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on, 
                    (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed
(3) Inspecting the database
See db_parse.txt.
Many columns like column=ol:http://stackoverflow.com/help are added, 115 in total in this example.
5. DbUpdaterJob
(1) Basic command
[jediael@jediael local]$ bin/nutch updatedb -crawlId 334
DbUpdaterJob: starting
DbUpdaterJob: done
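(2) Inspecting the database
See db_updatedb.txt.
The 115 column=ol:http... outlink columns above are resolved and 115 new rows are generated. Two of them are shown below as examples: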
com.stackoverflow:http/users/39 column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00                                  
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01                               
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09                     
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=mk:dist, timestamp=1408954979355, value=1                                           
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5                                  
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=s:s, timestamp=1408954979355, value=<\x0Ex5                                         
 44974/silviu-oncioiu                                                                                                       
 com.stackoverflow:http/users/39 column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00                                  
 74525/laosi                                                                                                                
 com.stackoverflow:http/users/39 column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01                               
 74525/laosi                                                                                                                
 com.stackoverflow:http/users/39 column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09                     
 74525/laosi                                                                                                                
 com.stackoverflow:http/users/39 column=mk:dist, timestamp=1408954979355, value=1                                           
 74525/laosi                                                                                                                
 com.stackoverflow:http/users/39 column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5                                  
 74525/laosi                                                                                                                
 com.stackoverflow:http/users/39 column=s:s, timestamp=1408954979355, value=<\x0Ex5                                         
 74525/laosi 
The data is now ready and waits for the next crawl iteration.
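
The s:s and f:st values of the new rows can be decoded in Python as well. The sketch assumes s:s is a big-endian 4-byte float and f:st a big-endian 4-byte integer; reading them as score and status is an assumption based on the column names.

```python
import struct

# s:s of a new row as printed by the HBase shell: <\x0Ex5 (bytes 3C 0E 78 35)
raw_score = b"<\x0ex5"
score = struct.unpack(">f", raw_score)[0]

# f:st of a new row: \x00\x00\x00\x01
status = struct.unpack(">i", b"\x00\x00\x00\x01")[0]

print(score, 1 / 115)
print(status)
```

The decoded score is very close to 1/115, which is consistent with the seed page's score of 1.0 being split evenly across its 115 outlinks, although the output above does not confirm that this is how the value was computed.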
6. SolrIndexerJob
(1) Basic command
[jediael@jediael local]$  bin/nutch solrindex http://****/solr/  -all -crawlId 334
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.
(2) Command options
[root@jediael local]# bin/nutch solrindex 
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
(3) Inspecting the database
No change.



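Copyright notice: this is the blogger's original article and may not be reposted without the blogger's permission.

[Nutch 2.2.1 Basic Tutorial, Part 6] Nutch 2.2.1 crawl process. Category: H3_NUTCH. 2014-08-15 21:39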
