[Heritrix Basics Tutorial, Part 4] Code walkthrough of the full flow for starting a crawl

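Once a job has been created, the next step is to run it. The full flow of a run is as follows:

1. Start the job in the web UI

2. index.jsp

Look at the source code behind that page; the Start link is generated like this: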

<a href='"+request.getContextPath()+"/console/action.jsp?action=start'>Start</a>

3. action.jsp

    String sAction = request.getParameter("action");
    if(sAction != null)
    {
        // Need to handle an action
        if(sAction.equalsIgnoreCase("start"))
        {
            // Tell handler to start crawl job
            handler.startCrawler();
        } else if(sAction.equalsIgnoreCase("stop")) {
            // Tell handler to stop crawl job
            handler.stopCrawler();
        } else if(sAction.equalsIgnoreCase("terminate")) {
            // Delete current job
            if(handler.getCurrentJob()!=null){
                handler.deleteJob(handler.getCurrentJob().getUID());
            }
        } else if(sAction.equalsIgnoreCase("pause")) {
            // Tell handler to pause crawl job
            handler.pauseJob();
        } else if(sAction.equalsIgnoreCase("resume")) {
            // Tell handler to resume crawl job
            handler.resumeJob();
        } else if(sAction.equalsIgnoreCase("checkpoint")) {
            if(handler.getCurrentJob() != null) {
                handler.checkpointJob();
            }
        }
    }
    response.sendRedirect(request.getContextPath() + "/index.jsp");
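The dispatch in action.jsp can be tried outside a servlet container with a small stand-in. The sketch below is only an illustration: `ActionDispatchSketch` and its `Handler` interface are hypothetical names that mirror some of the method names used above, not Heritrix classes, and the `terminate` and `checkpoint` branches are left out for brevity.

```java
import java.util.Locale;

// Hypothetical dispatcher: Handler only mirrors method names seen in action.jsp.
public class ActionDispatchSketch {

    /** Assumed subset of the handler's operations (not the Heritrix class). */
    public interface Handler {
        void startCrawler();
        void stopCrawler();
        void pauseJob();
        void resumeJob();
    }

    /** Forwards a recognised action; returns false for null or unknown names. */
    public static boolean dispatch(String action, Handler handler) {
        if (action == null) {
            return false;
        }
        // Lower-casing with Locale.ROOT gives a case-insensitive match for ASCII names.
        switch (action.toLowerCase(Locale.ROOT)) {
            case "start":  handler.startCrawler(); return true;
            case "stop":   handler.stopCrawler();  return true;
            case "pause":  handler.pauseJob();     return true;
            case "resume": handler.resumeJob();    return true;
            default:       return false;
        }
    }
}
```

As in the JSP, an unknown or missing action is simply ignored; the only difference is that this sketch reports that fact through its return value.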

4. CrawlJobHandler.java

(1)

    public void startCrawler() {
        running = true;
        if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
            // Ok, can just start the next job
            startNextJob();
        }
    }

(2)

    protected final void startNextJob() {
        synchronized (this) {
            if(startingNextJob != null) {
                try {
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    return;
                }
            }
            startingNextJob = new Thread(new Runnable() {
                public void run() {
                    startNextJobInternal();
                }
            }, "StartNextJob");
            startingNextJob.start();
        }
    }
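The method above waits for any previous starter thread to finish before spawning a new one, so job starts never overlap. A minimal sketch of that hand-off follows; `SerializedStarter`, `startNext` and `awaitStarted` are hypothetical names, and the work done on the thread is reduced to recording a job name. Unlike the original, which prints the stack trace and returns when interrupted, this sketch restores the interrupt flag and throws.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the handler; only the thread hand-off is modelled.
public class SerializedStarter {
    private final List<String> started =
            Collections.synchronizedList(new ArrayList<>());
    private Thread startingNext; // most recent starter thread, if any

    /** Waits for the previous starter thread, then runs the next start on a new thread. */
    public synchronized void startNext(final String jobName) {
        joinPrevious();
        startingNext = new Thread(() -> started.add(jobName), "StartNextJob");
        startingNext.start();
    }

    /** Waits for the last starter thread and returns the names in start order. */
    public synchronized List<String> awaitStarted() {
        joinPrevious();
        return new ArrayList<>(started);
    }

    private void joinPrevious() {
        if (startingNext == null) {
            return;
        }
        try {
            startingNext.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

Because each call joins the previous thread before creating the next, calling `startNext("job-a")` and then `startNext("job-b")` always records the names in that order.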

(3)

    protected void startNextJobInternal() {
        if (pendingCrawlJobs.size() == 0 || isCrawling()) {
            // No job ready or already crawling.
            return;
        }
        this.currentJob = (CrawlJob)pendingCrawlJobs.first();
        assert pendingCrawlJobs.contains(currentJob) :
            "pendingCrawlJobs is in an illegal state";
        pendingCrawlJobs.remove(currentJob);
        try {
            this.currentJob.setupForCrawlStart();
            // This is ugly but needed so I can clear the currentJob
            // reference in the crawlEnding and update the list of completed
            // jobs.  Also, crawlEnded can startup next job.
            this.currentJob.getController().addCrawlStatusListener(this);
            // now, actually start
            this.currentJob.getController().requestCrawlStart();
        } catch (InitializationException e) {
            loadJob(getStateJobFile(this.currentJob.getDirectory()));
            this.currentJob = null;
            startNextJobInternal(); // Load the next job if there is one.
        }
    }
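The control flow above boils down to: take the first pending job, try to set it up, and if setup fails move on to the next one. The sketch below models just that, under stated assumptions: `PendingQueueSketch` and its `Job` class are hypothetical, a boolean flag stands in for `setupForCrawlStart()` throwing `InitializationException`, and a loop replaces the recursive call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of the pending-queue logic; Job is not Heritrix's CrawlJob.
public class PendingQueueSketch {

    /** A job whose setup either succeeds or fails. */
    public static final class Job {
        final String name;
        final boolean setupFails;

        public Job(String name, boolean setupFails) {
            this.name = name;
            this.setupFails = setupFails;
        }
    }

    private final Deque<Job> pending = new ArrayDeque<>();
    private final List<String> skipped = new ArrayList<>();
    private Job current; // non-null while a job is "crawling"

    public void addPending(Job job) { pending.addLast(job); }

    public String currentName() { return current == null ? null : current.name; }

    public List<String> skipped() { return skipped; }

    /** Starts the first pending job that sets up cleanly; no-op if one is already running. */
    public void startNextInternal() {
        while (current == null && !pending.isEmpty()) {
            Job candidate = pending.pollFirst(); // same effect as first() followed by remove()
            if (candidate.setupFails) {
                skipped.add(candidate.name);     // setup failed: try the next pending job
                continue;
            }
            current = candidate;
        }
    }
}
```

With a failing job queued ahead of a healthy one, a single call to `startNextInternal()` skips the first and leaves the second as the current job; calling it again while that job is current changes nothing.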

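(4) requestCrawlStart() belongs to the controller returned by getController() above (CrawlController), not to CrawlJobHandler: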

    public void requestCrawlStart() {
        runProcessorInitialTasks();
        sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
        String jobState;
        state = RUNNING;
        jobState = CrawlJob.STATUS_RUNNING;
        sendCrawlStateChangeEvent(this.state, jobState);
        // A proper exit will change this value.
        this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
        Thread statLogger = new Thread(statistics);
        statLogger.setName("StatLogger");
        statLogger.start();
        frontier.start();
    }
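requestCrawlStart() announces two state changes in a fixed order, and anything registered through addCrawlStatusListener (such as the handler in step 3) hears both. The sketch below shows only that notification order. `StateEventSketch` and its `Listener` interface are hypothetical, and the string values are placeholders chosen for illustration rather than the real Heritrix constants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical controller; state and status strings are illustrative placeholders.
public class StateEventSketch {

    /** Assumed callback shape: controller state plus job status text. */
    public interface Listener {
        void stateChanged(String controllerState, String jobStatus);
    }

    private final List<Listener> listeners = new ArrayList<>();
    private String state = "PREPARING";

    public void addListener(Listener listener) { listeners.add(listener); }

    public String state() { return state; }

    /** Fires a "started" event first, then switches to running and fires again. */
    public void requestStart() {
        fire("STARTED", "Pending");
        state = "RUNNING";
        fire(state, "Running");
    }

    private void fire(String controllerState, String jobStatus) {
        for (Listener listener : listeners) {
            listener.stateChanged(controllerState, jobStatus);
        }
    }
}
```

A listener added before `requestStart()` therefore sees the started event before the running event, which is why the handler in step 3 registers itself before requesting the start.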
