Spider剩下的CountableThreadPool

在上一篇的Spider中我们一定注意到了threadpool这个变量,这个变量是Spider中的线程池,具体代码

public class CountableThreadPool {

private int threadNum;

private AtomicInteger threadAlive = new AtomicInteger();

private ReentrantLock reentrantLock = new ReentrantLock();

private Condition condition = reentrantLock.newCondition();

public CountableThreadPool(int threadNum) {
this.threadNum = threadNum;
this.executorService = Executors.newFixedThreadPool(threadNum);
}

public CountableThreadPool(int threadNum, ExecutorService executorService) {
this.threadNum = threadNum;
this.executorService = executorService;
}

public void setExecutorService(ExecutorService executorService) {
this.executorService = executorService;
}

public int getThreadAlive() {
return threadAlive.get();
}

public int getThreadNum() {
return threadNum;
}

private ExecutorService executorService;

public void execute(final Runnable runnable) {

if (threadAlive.get() >= threadNum) {
try {
reentrantLock.lock();
while (threadAlive.get() >= threadNum) {
try {
condition.await();
} catch (InterruptedException e) {
}
}
} finally {
reentrantLock.unlock();
}
}
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
@Override
public void run() {
try {
runnable.run();
} finally {
try {
reentrantLock.lock();
threadAlive.decrementAndGet();
condition.signal();
} finally {
reentrantLock.unlock();
}
}
}
});
}

public boolean isShutdown() {
return executorService.isShutdown();
}

public void shutdown() {
executorService.shutdown();
}

}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
CountableThreadPool提供了设置Executor或者默认创建的方式,如果不是很懂Java的线程池先去补习一下~最主要的三个变量

private AtomicInteger threadAlive = new AtomicInteger();

private ReentrantLock reentrantLock = new ReentrantLock();

private Condition condition = reentrantLock.newCondition();
1
2
3
4
5
1
2
3
4
5
threadAlive表示目前正在执行的线程,reentrantLock是一个自旋锁,用于对条件变量操作的同步,condition用户唤醒阻塞线程的条件变量。

关键的方法:

public void execute(final Runnable runnable) {

if (threadAlive.get() >= threadNum) {
try {
reentrantLock.lock();
while (threadAlive.get() >= threadNum) {
try {
condition.await();
} catch (InterruptedException e) {
}
}
} finally {
reentrantLock.unlock();
}
}
threadAlive.incrementAndGet();
executorService.execute(new Runnable() {
@Override
public void run() {
try {
runnable.run();
} finally {
try {
reentrantLock.lock();
threadAlive.decrementAndGet();
condition.signal();
} finally {
reentrantLock.unlock();
}
}
}
});
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
使用threadAlive这个变量来控制目前的活动线程,如果超出定义的线程数就阻塞,为什么这样呢,因为我们创建的是固定大小的线程池,默认的newFixedThreadPool创建的最大线程数就是传入的参数,如果线程数量超过线程池中的数值,对于默认的操作就是抛异常了。可以看一下这篇博客:
http://uule.iteye.com/blog/1123185
http://blog.csdn.net/sd0902/article/details/8395677

Spider剩下的SpiderMonitor

先说一句

SpiderMonitor是负责监控Spider的运行状态的,建议仔细阅读官方文档
http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/monitor.html
http://my.oschina.net/xpbug/blog/221547
所以如果这部分对你没什么用,你可以跳过去,我就没用到~

开始吧

在Spider的代码中我们看到了这个

public void run() {
try {
processRequest(requestFinal);
onSuccess(requestFinal);
} catch (Exception e) {
onError(requestFinal);
logger.error("process request " + requestFinal + " error", e);
} finally {
pageCount.incrementAndGet();
signalNewUrl();
}
}
1
2
3
4
5
6
7
8
9
10
11
12
1
2
3
4
5
6
7
8
9
10
11
12
对于onSuccess(requestFinal)和onError(requestFinal)这个方法名,如果你看的多了一眼就知道这是个接口,那么回调在哪里?也就是SpiderMonitor的内部类

public class SpiderMonitor {

private static SpiderMonitor INSTANCE = new SpiderMonitor();

private AtomicBoolean started = new AtomicBoolean(false);

private Logger logger = LoggerFactory.getLogger(getClass());

private MBeanServer mbeanServer;

private String jmxServerName;

private List<SpiderStatusMXBean> spiderStatuses = new ArrayList<SpiderStatusMXBean>();

protected SpiderMonitor() {
jmxServerName = "WebMagic";
mbeanServer = ManagementFactory.getPlatformMBeanServer();
}

/**
* Register spider for monitor.
*
* @param spiders spiders
* @return this
*/
public synchronized SpiderMonitor register(Spider... spiders) throws JMException {
for (Spider spider : spiders) {
MonitorSpiderListener monitorSpiderListener = new MonitorSpiderListener();
if (spider.getSpiderListeners() == null) {
List<SpiderListener> spiderListeners = new ArrayList<SpiderListener>();
spiderListeners.add(monitorSpiderListener);
spider.setSpiderListeners(spiderListeners);
} else {
spider.getSpiderListeners().add(monitorSpiderListener);
}
SpiderStatusMXBean spiderStatusMBean = getSpiderStatusMBean(spider, monitorSpiderListener);
registerMBean(spiderStatusMBean);
spiderStatuses.add(spiderStatusMBean);
}
return this;
}

protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) {
return new SpiderStatus(spider, monitorSpiderListener);
}

public static SpiderMonitor instance() {
return INSTANCE;
}

public class MonitorSpiderListener implements SpiderListener {

private final AtomicInteger successCount = new AtomicInteger(0);

private final AtomicInteger errorCount = new AtomicInteger(0);

private List<String> errorUrls = Collections.synchronizedList(new ArrayList<String>());

@Override
public void onSuccess(Request request) {
successCount.incrementAndGet();
}

@Override
public void onError(Request request) {
errorUrls.add(request.getUrl());
errorCount.incrementAndGet();
}

public AtomicInteger getSuccessCount() {
return successCount;
}

public AtomicInteger getErrorCount() {
return errorCount;
}

public List<String> getErrorUrls() {
return errorUrls;
}
}

protected void registerMBean(SpiderStatusMXBean spiderStatus) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException {
ObjectName objName = new ObjectName(jmxServerName + ":name=" + spiderStatus.getName());
mbeanServer.registerMBean(spiderStatus, objName);
}

}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
我们看到了在onSuccess和onError做了一些记录,主要是为了监控,如果你希望在爬虫成功或者失败实现一些自己方法也可以实现这个接口

public interface SpiderListener {

public void onSuccess(Request request);

public void onError(Request request);
}
1
2
3
4
5
6
1
2
3
4
5
6
如果你对于Java的接口回调不是很懂那么推荐你看看《Head First设计模式》第一章,策略模式。

写在后面

这篇博客还是很简单的,主要完善了Spider的细小模块,后面将会介绍Spider的四大组件,如果喜欢多多支持~

CountableThreadPool的更多相关文章

  1. 【转】WebMagic-总体流程源码分析

    转自:http://m.blog.csdn.net/article/details?id=51943601 写在前面 前一段时间开发[知了]用到了很多技术(可以看我前面的博文http://blog.c ...

  2. HttpClient 专题

    HttpClient is a HTTP/1.1 compliant HTTP agent implementation based on HttpCore. It also provides reu ...

  3. Java 多线程爬虫及分布式爬虫架构探索

    这是 Java 爬虫系列博文的第五篇,在上一篇 Java 爬虫服务器被屏蔽,不要慌,咱们换一台服务器 中,我们简单的聊反爬虫策略和反反爬虫方法,主要针对的是 IP 被封及其对应办法.前面几篇文章我们把 ...

  4. Java 多线程爬虫及分布式爬虫架构

    这是 Java 爬虫系列博文的第五篇,在上一篇 Java 爬虫服务器被屏蔽,不要慌,咱们换一台服务器 中,我们简单的聊反爬虫策略和反反爬虫方法,主要针对的是 IP 被封及其对应办法.前面几篇文章我们把 ...

  5. webmagic源码浅析

    webmagic简介 webmagic可以说是中国传播度最广的Java爬虫框架,https://github.com/code4craft/webmagic,阅读相关源码,获益良多.阅读作者博客[代码 ...

随机推荐

  1. koa上传excel文件并解析

    1.中间键使用 koa-body npm install koa-body --save const koaBody = require('koa-body'); app.use(koaBody({ ...

  2. 【To Read】

    Maximal Rectangle Given a 2D binary matrix filled with 0's and 1's, find the largest rectangle conta ...

  3. Mac下搭建python开发环境

    目录 1. 安装brew 2. 安装 mysql 3. 安装 pycharm 4. 安装python3.6 5. 安装virtualenvwrapper 6. 虚拟环境下安装mysqlclient 1 ...

  4. oracle水线的定义

    1.水线定义了表的数据在一个BLOCK中所达到的最高的位置. 2.当有新的记录插入,水线增高 3.当删除记录时,水线不回落 4.减少查询量

  5. display: none;、visibility: hidden、opacity=0区别总结

    display: none; 1.浏览器不会生成属性为display: none;的元素. 2.display: none;不占据空间(毕竟都不熏染啦),所以动态改变此属性时会引起重排. 3.disp ...

  6. MaxCompute助力小影短视频走向全球化

    数字时代,中国已经成为世界互联网的中心,小影(海外版称作为VivaVideo,后简称VivaVideo)作为国内首批短视频出海企业,借助统一的云计算平台快速实现全球业务的线上部署,已经让每一行代码都获 ...

  7. QT语言翻译

    QT中多语言的实现方式: 1.代码中tr运用 2.使用工具生成ts文件 3.翻译ts文件 4.生成qm文件 5.程序加载 以下内容程序加载时放入即可. QString appPath = QCoreA ...

  8. HZOJ 简单的期望

    性质:一个数分解质因数后2的次数=二进制下末尾连续0的个数. 乘2比较好考虑,比较恶心的是+1.一个$k*2^0$的数+1后可能会出现很多情况.但是k这个数表示不出来. 但是加的操作最多有200次,也 ...

  9. linux centos 一键安装环境

    phpStudy for Linux 支持Apache/Nginx/Tengine/Lighttpd, 支持php5.2/5.3/5.4/5.5切换 已经在centos-6.5,debian-7.4. ...

  10. PHP5.2 汉字json_encode

    //对汉字编码 private function url_encode($str) { if(is_array($str)) { foreach($str as $key=>$value) { ...