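Today I took another look at the thread pool source code in Heritrix 3.1.0. Heritrix does not use the concurrency framework in Java's java.util.concurrent package; instead it builds its thread pool on the ThreadGroup class. (A thread group is tree-like: one group can contain multiple child groups and multiple threads. The data structure resembles the Composite pattern, except that the branch nodes and leaf nodes do not implement a common interface the way they would in Composite.)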
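
To make the tree structure concrete, here is a minimal sketch that uses only the standard ThreadGroup API. The class name `ThreadGroupTreeDemo` and its helper methods are made up for illustration and are not part of Heritrix.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadGroupTreeDemo {

    /** Collects the names of the live threads in a group and all of its subgroups. */
    static List<String> liveThreadNames(ThreadGroup group) {
        // activeCount() is only an estimate, so over-allocate the array.
        Thread[] slots = new Thread[group.activeCount() + 10];
        int n = group.enumerate(slots, true); // true = recurse into subgroups
        List<String> names = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            names.add(slots[i].getName());
        }
        return names;
    }

    /** Starts a thread in the given group that sleeps until it is interrupted. */
    static Thread startSleeper(ThreadGroup group, String name) {
        Thread t = new Thread(group, () -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // interrupted: fall through and let the thread end
            }
        }, name);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadGroup parent = new ThreadGroup("parent");       // branch node
        ThreadGroup child = new ThreadGroup(parent, "child");  // nested branch node
        Thread a = startSleeper(parent, "a");                  // leaf of parent
        Thread b = startSleeper(child, "b");                   // leaf of child

        System.out.println("parent sees: " + liveThreadNames(parent));
        System.out.println("child sees:  " + liveThreadNames(child));

        parent.interrupt(); // reaches every thread in the group and its subgroups
        a.join();
        b.join();
    }
}
```

The parent group sees both threads while the child group sees only its own, and a single `interrupt()` on the parent reaches the whole subtree.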

The key classes are ToePool and ToeThread in the org.archive.crawler.framework package. The former extends ThreadGroup and the latter extends Thread.

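ToeThread is the worker thread that carries out crawl tasks. Its constructor initializes the member variable CrawlController controller, which is used to obtain the Frontier object and the associated processor chains.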

    private CrawlController controller;
    private String coreName;
    private CrawlURI currentCuri;

    /**
     * Create a ToeThread
     *
     * @param g ToeThreadGroup
     * @param sn serial number
     */
    public ToeThread(ToePool g, int sn) {
        // TODO: add crawl name?
        super(g,"ToeThread #" + sn);
        coreName="ToeThread #" + sn + ": ";
        controller = g.getController();
        serialNumber = sn;
        setPriority(DEFAULT_PRIORITY);
        int outBufferSize = controller.getRecorderOutBufferBytes();
        int inBufferSize = controller.getRecorderInBufferBytes();
        httpRecorder = new Recorder(controller.getScratchDir().getFile(),
                "tt" + sn + "http", outBufferSize, inBufferSize);
        lastFinishTime = System.currentTimeMillis();
    }

    /** (non-Javadoc)
     * @see java.lang.Thread#run()
     */
    public void run() {
        String name = controller.getMetadata().getJobName();
        logger.fine(getName()+" started for order '"+name+"'");
        Recorder.setHttpRecorder(httpRecorder);

        try {
            while ( true ) {
                ArchiveUtils.continueCheck();

                setStep(Step.ABOUT_TO_GET_URI, null);

                CrawlURI curi = controller.getFrontier().next();

                synchronized(this) {
                    ArchiveUtils.continueCheck();
                    setCurrentCuri(curi);
                    currentCuri.setThreadNumber(this.serialNumber);
                    lastStartTime = System.currentTimeMillis();
                    currentCuri.setRecorder(httpRecorder);
                }

                try {
                    KeyedProperties.loadOverridesFrom(curi);

                    controller.getFetchChain().process(curi,this);

                    controller.getFrontier().beginDisposition(curi);

                    controller.getDispositionChain().process(curi,this);

                } catch (RuntimeExceptionWrapper e) {
                    // Workaround to get cause from BDB
                    if(e.getCause() == null) {
                        e.initCause(e.getCause());
                    }
                    recoverableProblem(e);
                } catch (AssertionError ae) {
                    // This risks leaving crawl in fatally inconsistent state,
                    // but is often reasonable for per-Processor assertion problems
                    recoverableProblem(ae);
                } catch (RuntimeException e) {
                    recoverableProblem(e);
                } catch (InterruptedException e) {
                    if(currentCuri!=null) {
                        recoverableProblem(e);
                        Thread.interrupted(); // clear interrupt status
                    } else {
                        throw e;
                    }
                } catch (StackOverflowError err) {
                    recoverableProblem(err);
                } catch (Error err) {
                    // OutOfMemory and any others
                    seriousError(err);
                } finally {
                    httpRecorder.endReplays();
                    KeyedProperties.clearOverridesFrom(curi);
                }

                setStep(Step.ABOUT_TO_RETURN_URI, null);
                ArchiveUtils.continueCheck();

                synchronized(this) {
                    controller.getFrontier().finished(currentCuri);
                    controller.getFrontier().endDisposition();
                    setCurrentCuri(null);
                }
                curi = null;

                setStep(Step.FINISHING_PROCESS, null);
                lastFinishTime = System.currentTimeMillis();
                if(shouldRetire) {
                    break; // from while(true)
                }
            }
        } catch (InterruptedException e) {
            if(currentCuri!=null){
                logger.log(Level.SEVERE,"Interrupt leaving unfinished CrawlURI "+getName()+" - job may hang",e);
            }
            // thread interrupted, ok to end
            logger.log(Level.FINE,this.getName()+ " ended with Interruption");
        } catch (Exception e) {
            // everything else (including interruption)
            logger.log(Level.SEVERE,"Fatal exception in "+getName(),e);
        } catch (OutOfMemoryError err) {
            seriousError(err);
        } finally {
            controller.getFrontier().endDisposition();
        }

        setCurrentCuri(null);
        // Do cleanup so that objects can be GC.
        this.httpRecorder.closeRecorders();
        this.httpRecorder = null;

        logger.fine(getName()+" finished for order '"+name+"'");
        setStep(Step.FINISHED, null);
        controller = null;
    }
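
The shape of the run() loop above (block for the next item, process it with per-item error recovery, then check a retire flag between items) can be reduced to a small standalone sketch. This is an illustration under assumptions, not Heritrix code: `RetiringWorkerDemo`, `Worker`, and the string queue standing in for the Frontier are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class RetiringWorkerDemo {

    /** Hypothetical worker thread; not a Heritrix class. */
    static class Worker extends Thread {
        private final BlockingQueue<String> queue;
        private final AtomicInteger processed;
        // Written by other threads, read by the worker, hence volatile.
        private volatile boolean shouldRetire = false;

        Worker(ThreadGroup group, int serial,
               BlockingQueue<String> queue, AtomicInteger processed) {
            super(group, "Worker #" + serial);
            this.queue = queue;
            this.processed = processed;
        }

        /** Asks the worker to exit once it finishes its current item. */
        void retire() {
            shouldRetire = true;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String item = queue.take(); // blocks until an item arrives
                    try {
                        if (item.isEmpty()) {
                            throw new IllegalArgumentException("empty item");
                        }
                        processed.incrementAndGet();
                    } catch (RuntimeException e) {
                        // Per-item failure: report it and keep the thread alive.
                        System.err.println(getName() + " skipped an item: " + e);
                    }
                    if (shouldRetire) {
                        break; // leave only between items, never mid-item
                    }
                }
            } catch (InterruptedException e) {
                // Interrupted while waiting for work: fine to end.
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(List.of("a", "b", "c"));
        AtomicInteger processed = new AtomicInteger();
        Worker worker = new Worker(new ThreadGroup("demo"), 1, queue, processed);

        worker.retire(); // flag is set before the first item is taken
        worker.start();
        worker.join();

        // prints "1 processed, 2 left"
        System.out.println(processed.get() + " processed, " + queue.size() + " left");
    }
}
```

Because the flag is only checked after an item completes, a retiring worker never abandons work halfway. A worker that is blocked waiting for the next item does not see the flag at all, which is why a pool also needs an interrupt path to wake waiting threads.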

ToePool is the thread group that manages the worker threads above: it creates them, enumerates the active ones, and interrupts or kills them.

    protected CrawlController controller;
    protected int nextSerialNumber = 1;
    protected int targetSize = 0;

    /**
     * Constructor. Creates a pool of ToeThreads.
     *
     * @param c A reference to the CrawlController for the current crawl.
     */
    public ToePool(AlertThreadGroup atg, CrawlController c) {
        // pass in the parent thread group
        super(atg, "ToeThreads");
        this.controller = c;
        setDaemon(true);
    }

    public void cleanup() {
        // force all Toes waiting on queues, etc to proceed
        Thread[] toes = getToes();
        for(Thread toe : toes) {
            if(toe!=null) {
                toe.interrupt();
            }
        }
        // this.controller = null;
    }

    /**
     * @return The number of ToeThreads that are not available (Approximation).
     */
    public int getActiveToeCount() {
        Thread[] toes = getToes();
        int count = 0;
        for (int i = 0; i < toes.length; i++) {
            if((toes[i] instanceof ToeThread) &&
                    ((ToeThread)toes[i]).isActive()) {
                count++;
            }
        }
        return count;
    }

    /**
     * @return The number of ToeThreads. This may include killed ToeThreads
     *         that were not replaced.
     */
    public int getToeCount() {
        Thread[] toes = getToes();
        int count = 0;
        for (int i = 0; i<toes.length; i++) {
            if((toes[i] instanceof ToeThread)) {
                count++;
            }
        }
        return count;
    }

    // get the array of active threads
    private Thread[] getToes() {
        Thread[] toes = new Thread[activeCount()+10];
        this.enumerate(toes);
        return toes;
    }

    /**
     * Change the number of ToeThreads.
     *
     * @param newsize The new number of ToeThreads.
     */
    public void setSize(int newsize)
    {
        targetSize = newsize;
        int difference = newsize - getToeCount();
        if (difference > 0) {
            // must create threads
            for(int i = 1; i <= difference; i++) {
                // start a thread
                startNewThread();
            }
        } else {
            // retire the surplus threads
            // must retire extra threads
            int retainedToes = targetSize;
            Thread[] toes = this.getToes();
            for (int i = 0; i < toes.length ; i++) {
                if(!(toes[i] instanceof ToeThread)) {
                    continue;
                }
                retainedToes--;
                if (retainedToes>=0) {
                    continue; // this toe is spared
                }
                // otherwise:
                ToeThread tt = (ToeThread)toes[i];
                tt.retire();
            }
        }
    }

    /**
     * Kills specified thread. Killed thread can be optionally replaced with a
     * new thread.
     *
     * <p><b>WARNING:</b> This operation should be used with great care. It may
     * destabilize the crawler.
     *
     * @param threadNumber Thread to kill
     * @param replace If true then a new thread will be created to take the
     *                killed threads place. Otherwise the total number of threads
     *                will decrease by one.
     */
    public void killThread(int threadNumber, boolean replace){

        Thread[] toes = getToes();
        for (int i = 0; i< toes.length; i++) {
            if(! (toes[i] instanceof ToeThread)) {
                continue;
            }
            ToeThread toe = (ToeThread) toes[i];
            if(toe.getSerialNumber()==threadNumber) {
                toe.kill();
            }
        }

        if(replace){
            // Create a new toe thread to take its place. Replace toe
            startNewThread();
        }
    }

    // synchronized, to prevent threads from being initialized concurrently
    private synchronized void startNewThread() {
        ToeThread newThread = new ToeThread(this, nextSerialNumber++);
        newThread.setPriority(DEFAULT_TOE_PRIORITY);
        newThread.start();
    }

    public void waitForAll() {
        while (true) try {
            if (isAllAlive(getToes())) {
                return;
            }
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    private static boolean isAllAlive(Thread[] threads) {
        for (Thread t: threads) {
            if ((t != null) && (!t.isAlive())) {
                return false;
            }
        }
        return true;
    }
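
The resizing logic in setSize() is easier to follow in isolation. Below is a minimal sketch of a resizable pool built on ThreadGroup in the same spirit: grow by starting threads, shrink by asking the surplus ones to retire. Everything here (`MiniPoolDemo`, `MiniPool`, `Worker`, `awaitSize`, the private lock) is an assumption made for illustration; the worker body is a short sleep standing in for real work.

```java
public class MiniPoolDemo {

    /** Hypothetical worker that idles in small steps until asked to retire. */
    static class Worker extends Thread {
        private volatile boolean shouldRetire = false;

        Worker(ThreadGroup group, int serial) {
            super(group, "Worker #" + serial);
        }

        void retire() {
            shouldRetire = true;
        }

        @Override
        public void run() {
            try {
                while (!shouldRetire) {
                    Thread.sleep(10); // stand-in for one unit of work
                }
            } catch (InterruptedException e) {
                // interrupted by cleanup(): end the thread
            }
        }
    }

    /** Hypothetical pool: the thread group itself is the registry of workers. */
    static class MiniPool extends ThreadGroup {
        private final Object lock = new Object();
        private int nextSerialNumber = 1;

        MiniPool(String name) {
            super(name);
        }

        /** Snapshot of live threads; trailing slots may be null. */
        private Thread[] getWorkers() {
            Thread[] slots = new Thread[activeCount() + 10];
            enumerate(slots);
            return slots;
        }

        /** Counts live workers; retiring workers count until they actually exit. */
        int getWorkerCount() {
            int count = 0;
            for (Thread t : getWorkers()) {
                if (t instanceof Worker) {
                    count++;
                }
            }
            return count;
        }

        void setSize(int newSize) {
            synchronized (lock) { // one resize at a time, so serial numbers stay unique
                int difference = newSize - getWorkerCount();
                if (difference > 0) {
                    for (int i = 0; i < difference; i++) {
                        new Worker(this, nextSerialNumber++).start();
                    }
                } else {
                    int retained = newSize;
                    for (Thread t : getWorkers()) {
                        if (!(t instanceof Worker)) {
                            continue;
                        }
                        if (retained-- > 0) {
                            continue; // this worker is spared
                        }
                        ((Worker) t).retire();
                    }
                }
            }
        }

        /** Interrupts every thread in the group. */
        void cleanup() {
            interrupt();
        }

        /** Polls until the worker count equals expected or the timeout passes. */
        boolean awaitSize(int expected, long timeoutMillis) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (getWorkerCount() != expected) {
                if (System.currentTimeMillis() > deadline) {
                    return false;
                }
                Thread.sleep(5);
            }
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MiniPool pool = new MiniPool("workers");
        pool.setSize(4);
        System.out.println("after grow: " + pool.getWorkerCount());
        pool.setSize(1);
        System.out.println("shrunk to 1: " + pool.awaitSize(1, 2000));
        pool.cleanup();
        System.out.println("all stopped: " + pool.awaitSize(0, 2000));
    }
}
```

Shrinking is asynchronous: retire() only sets a flag, so the count drops once each surplus worker reaches its next check. That matches the caveat in the getToeCount() Javadoc that the count may include threads that are already on their way out.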

Finally, the thread group is initialized, and the worker threads are managed, from the following methods of the CrawlController object.

    /**
     * Maximum number of threads processing URIs at the same time.
     */
    int maxToeThreads;
    public int getMaxToeThreads() {
        return maxToeThreads;
    }
    @Value("25")
    public void setMaxToeThreads(int maxToeThreads) {
        this.maxToeThreads = maxToeThreads;
        if(toePool!=null) {
            toePool.setSize(this.maxToeThreads);
        }
    }

    private transient ToePool toePool;

    /**
     * Called when the last toethread exits.
     */
    protected void completeStop() {
        LOGGER.fine("Entered complete stop.");

        statisticsTracker.getSnapshot(); // ???

        this.reserveMemory = null;
        if (this.toePool != null) {
            this.toePool.cleanup();
        }
        this.toePool = null;

        LOGGER.fine("Finished crawl.");

        try {
            appCtx.stop();
        } catch (RuntimeException re) {
            LOGGER.log(Level.SEVERE,re.getMessage(),re);
        }

        sendCrawlStateChangeEvent(State.FINISHED, this.sExit);

        // CrawlJob needs to be sure all beans have received FINISHED signal before teardown
        this.isStopComplete = true;
        appCtx.publishEvent(new StopCompleteEvent(this));
    }

    /**
     * Operator requested for crawl to stop.
     */
    public synchronized void requestCrawlStop() {
        if(state == State.STOPPING) {
            // second stop request; nudge the threads with interrupts
            getToePool().cleanup();
        }
        requestCrawlStop(CrawlStatus.ABORTED);
    }

    /**
     * @return Active toe thread count.
     */
    public int getActiveToeCount() {
        if (toePool == null) {
            return 0;
        }
        return toePool.getActiveToeCount();
    }

    protected void setupToePool() {
        toePool = new ToePool(alertThreadGroup,this);
        // TODO: make # of toes self-optimizing
        toePool.setSize(getMaxToeThreads());
        toePool.waitForAll();
    }

    /**
     * @return The number of ToeThreads
     *
     * @see ToePool#getToeCount()
     */
    public int getToeCount() {
        return this.toePool == null? 0: this.toePool.getToeCount();
    }

    /**
     * @return The ToePool
     */
    public ToePool getToePool() {
        return toePool;
    }

    /**
     * Kills a thread. For details see
     * {@link org.archive.crawler.framework.ToePool#killThread(int, boolean)
     * ToePool.killThread(int, boolean)}.
     * @param threadNumber Thread to kill.
     * @param replace Should thread be replaced.
     * @see org.archive.crawler.framework.ToePool#killThread(int, boolean)
     */
    public void killThread(int threadNumber, boolean replace){
        toePool.killThread(threadNumber, replace);
    }
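
For comparison only, here is roughly what the same lifecycle (set up the pool, resize it at runtime, shut it down) looks like with the java.util.concurrent framework that Heritrix chose not to use. This is not how Heritrix works; `ExecutorResizeDemo` and its helpers are hypothetical names, and the sketch assumes a pool whose core and maximum sizes are always kept equal and at least 1.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorResizeDemo {

    /** Creates a fixed-size pool and starts all of its threads up front. */
    static ThreadPoolExecutor newPool(int size) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.prestartAllCoreThreads();
        return pool;
    }

    /** Changes the number of threads, keeping core <= max at every step. */
    static void resize(ThreadPoolExecutor pool, int newSize) {
        if (newSize > pool.getMaximumPoolSize()) {
            // Growing: raise the ceiling first, then the core size.
            pool.setMaximumPoolSize(newSize);
            pool.setCorePoolSize(newSize);
            pool.prestartAllCoreThreads(); // start the new threads now, not lazily
        } else {
            // Shrinking: lower the core size first, then the ceiling.
            pool.setCorePoolSize(newSize);
            pool.setMaximumPoolSize(newSize);
        }
    }

    /** Polls until the pool has the expected number of threads or the timeout passes. */
    static boolean awaitPoolSize(ThreadPoolExecutor pool, int expected, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (pool.getPoolSize() != expected) {
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            Thread.sleep(5);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = newPool(2);
        System.out.println("initial threads: " + pool.getPoolSize());

        resize(pool, 4);
        System.out.println("after grow: " + pool.getPoolSize());

        resize(pool, 1);
        System.out.println("shrunk to 1: " + awaitPoolSize(pool, 1, 2000));

        pool.shutdown(); // idle threads exit; queued tasks would still run
        System.out.println("terminated: " + pool.awaitTermination(2, TimeUnit.SECONDS));
    }
}
```

As with ToePool.setSize(), shrinking takes effect only when surplus threads become idle, so the sketch polls for the new size instead of assuming it immediately.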

That should be clear enough.

---------------------------------------------------------------------------

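This Heritrix 3.1.0 source code analysis series is my own original work.

Email: chenying998179@163#com (replace # with .)

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/p/3213556.html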
