python实现并发爬虫

在进行单个爬虫抓取的时候，我们不可能按照一次抓取一个url的方式进行网页抓取，这样效率低，也浪费了cpu的资源。目前python上面进行并发抓取的实现方式主要有以下几种：进程，线程，协程。进程不在的讨论范围之内，一般来说，进程是用来开启多个spider，比如我们开启了4进程，同时派发4个spider进行网络抓取，每个spider同时抓取4个url。

所以，我们今天讨论的是，在单个爬虫的情况下，尽可能的在同一个时间并发抓取，并且抓取的效率要高。

一.顺序抓取

顺序抓取是最最常见的抓取方式，一般初学爬虫的朋友就是利用这种方式，下面是一个测试代码，顺序抓取8个url，我们可以来测试一下抓取完成需要多少时间：

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',

   'Accept-Language': 'zh-CN,zh;q=0.8',

   'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',

        'https://www.zhihu.com/topic/19804387/newest',

        'http://blog.csdn.net/yueguanghaidao/article/details/24281751',

        'https://my.oschina.net/visualgui823/blog/36987',

        'http://blog.chinaunix.net/uid-9162199-id-4738168.html',

        'http://www.tuicool.com/articles/u67Bz26',

        'http://rfyiamcool.blog.51cto.com/1030776/1538367/',

        'http://itindex.net/detail/26512-flask-tornado-gevent']                               

#url为随机获取的一批url                                                                               

def func():

    """

    顺序抓取

    """

    import requests

    import time

    urls = URLS

    headers = HEADERS

    headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537" \

                            ".36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"

    print(u'顺序抓取')

    starttime= time.time()

    for url in urls:

        try:

            r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)

        except:

            pass

        else:

            print(r.status_code, r.url)

    endtime=time.time()

    print(endtime-starttime)                                                                  

func()

我们直接采用内建的time.time()来计时，较为粗略，但可以反映大概的情况。下面是顺序抓取的结果计时：

可以从图片中看到，显示的顺序与urls的顺序是一模一样的，总共耗时为7.763269901275635秒，一共8个url，平均抓取一个大概需要0.97秒。总体来看，还可以接受。

二.多线程抓取

线程是python内的一种较为不错的并发方式，我们也给出相应的代码，并且为每个url创建了一个线程，一共8线程并发抓取，下面的代码：

下面是我们运行8线程的测试代码：

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',

   'Accept-Language': 'zh-CN,zh;q=0.8',

   'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',

        'https://www.zhihu.com/topic/19804387/newest',

        'http://blog.csdn.net/yueguanghaidao/article/details/24281751',

        'https://my.oschina.net/visualgui823/blog/36987',

        'http://blog.chinaunix.net/uid-9162199-id-4738168.html',

        'http://www.tuicool.com/articles/u67Bz26',

        'http://rfyiamcool.blog.51cto.com/1030776/1538367/',

        'http://itindex.net/detail/26512-flask-tornado-gevent']                                            

def thread():

    from threading import Thread

    import requests

    import time

    urls = URLS

    headers = HEADERS

    headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+" \

                            "(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"

    def get(url):

        try:

            r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)

        except:

            pass

        else:

            print(r.status_code, r.url)                                                                    

    print(u'多线程抓取')

    ts = [Thread(target=get, args=(url,)) for url in urls]

    starttime= time.time()

    for t in ts:

        t.start()

    for t in ts:

        t.join()

    endtime=time.time()

    print(endtime-starttime)

thread()

多线程抓住的时间如下：

可以看到相较于顺序抓取，8线程的抓取效率明显上升了3倍多，全部完成只消耗了2.154秒。可以看到显示的结果已经不是urls的顺序了，说明每个url各自完成的时间都是不一样的。线程就是在一个进程中不断的切换，让每个线程各自运行一会，这对于网络io来说，性能是非常高的。但是线程之间的切换是挺浪费资源的。

三.gevent并发抓取

gevent是一种轻量级的协程，可用它来代替线程，而且，他是在一个线程中运行，机器资源的损耗比线程低很多。如果遇到了网络io阻塞，会马上切换到另一个程序中去运行，不断的轮询，来降低抓取的时间
下面是测试代码：

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',

   'Accept-Language': 'zh-CN,zh;q=0.8',

   'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',

        'https://www.zhihu.com/topic/19804387/newest',

        'http://blog.csdn.net/yueguanghaidao/article/details/24281751',

        'https://my.oschina.net/visualgui823/blog/36987',

        'http://blog.chinaunix.net/uid-9162199-id-4738168.html',

        'http://www.tuicool.com/articles/u67Bz26',

        'http://rfyiamcool.blog.51cto.com/1030776/1538367/',

        'http://itindex.net/detail/26512-flask-tornado-gevent']

def main():

    """

    gevent并发抓取

    """

    import requests

    import gevent

    import time

    headers = HEADERS

    headers['user-agent'] = "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+" \

                            "(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36"

    urls = URLS

    def get(url):

        try:

            r = requests.get(url, allow_redirects=False, timeout=2.0, headers=headers)

        except:

            pass

        else:

            print(r.status_code, r.url)

    print(u'基于gevent的并发抓取')

    starttime= time.time()

    g = [gevent.spawn(get, url) for url in urls]

    gevent.joinall(g)

    endtime=time.time()

    print(endtime - starttime)

main()

协程的抓取时间如下：

正常情况下，gevent的并发抓取与多线程的消耗时间差不了多少，但是可能是我网络的原因，或者机器的性能的原因，时间有点长......,请各位小主在自己电脑进行跑一下看运行时间

四.基于tornado的coroutine并发抓取

tornado中的coroutine是python中真正意义上的协程，与python3中的asyncio几乎是完全一样的，而且两者之间的future是可以相互转换的，tornado中有与asyncio相兼容的接口。
下面是利用tornado中的coroutine进行并发抓取的代码：

利用coroutine编写并发略显复杂，但这是推荐的写法，如果你使用的是python3，强烈建议你使用coroutine来编写并发抓取。

下面是测试代码：

HEADERS = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',

   'Accept-Language': 'zh-CN,zh;q=0.8',

   'Accept-Encoding': 'gzip, deflate',}

URLS = ['http://www.cnblogs.com/moodlxs/p/3248890.html',

        'https://www.zhihu.com/topic/19804387/newest',

        'http://blog.csdn.net/yueguanghaidao/article/details/24281751',

        'https://my.oschina.net/visualgui823/blog/36987',

        'http://blog.chinaunix.net/uid-9162199-id-4738168.html',

        'http://www.tuicool.com/articles/u67Bz26',

        'http://rfyiamcool.blog.51cto.com/1030776/1538367/',

        'http://itindex.net/detail/26512-flask-tornado-gevent']

import time

from tornado.gen import coroutine

from tornado.ioloop import IOLoop

from tornado.httpclient import AsyncHTTPClient, HTTPError

from tornado.httpclient import HTTPRequest

#urls与前面相同

class MyClass(object):

    def __init__(self):

        #AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")

        self.http = AsyncHTTPClient()

    @coroutine

    def get(self, url):

        #tornado会自动在请求首部带上host首部

        request = HTTPRequest(url=url,

                            method='GET',

                            headers=HEADERS,

                            connect_timeout=2.0,

                            request_timeout=2.0,

                            follow_redirects=False,

                            max_redirects=False,

                            user_agent="Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+\

                            (KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36",)

        yield self.http.fetch(request, callback=self.find, raise_error=False)

    def find(self, response):

        if response.error:

            print(response.error)

        print(response.code, response.effective_url, response.request_time)

class Download(object):

    def __init__(self):

        self.a = MyClass()

        self.urls = URLS

    @coroutine

    def d(self):

        print(u'基于tornado的并发抓取')

        starttime = time.time()

        yield [self.a.get(url) for url in self.urls]

        endtime=time.time()

        print(endtime-starttime)

if __name__ == '__main__':

    dd = Download()

    loop = IOLoop.current()

    loop.run_sync(dd.d)

抓取的时间如下：

可以看到总共花费了128087秒，而这所花费的时间恰恰就是最后一个url抓取所需要的时间，tornado中自带了查看每个请求的相应时间。我们可以从图中看到，最后一个url抓取总共花了1.28087秒，相较于其他时间大大的增加，这也是导致我们消耗时间过长的原因。那可以推断出，前面的并发抓取，也在这个url上花费了较多的时间。

总结：
以上测试其实非常的不严谨，因为我们选取的url的数量太少了，完全不能反映每一种抓取方式的优劣。如果有一万个不同的url同时抓取，那么记下总抓取时间，是可以得出一个较为客观的结果的。
并且，已经有人测试过，多线程抓取的效率是远不如gevent的。所以，如果你使用的是python2，那么我推荐你使用gevent进行并发抓取；如果你使用的是python3，我推荐你使用tornado的http客户端结合coroutine进行并发抓取。从上面的结果来看，tornado的coroutine是高于gevent的轻量级的协程的。但具体结果怎样，我没测试过。