In terms of libraries, I consider the essentials for web crawling to be urllib, requests, re, BeautifulSoup, and concurrent.futures. This post summarizes how to use the concurrent.futures library.
I suggest first reading my previous post, "Does Python really need multithreading?", which will help in understanding concurrent.futures.

1. Introduction to the concurrent.futures library

  The Python standard library provides the threading and multiprocessing modules for asynchronous multithreading and multiprocessing. Since version 3.2, the standard library has also offered the concurrent.futures module, which implements thread pools and process pools as a high-level abstraction over threading and multiprocessing, making life considerably easier for Python programmers.

  The concurrent.futures module provides two classes: ThreadPoolExecutor and ProcessPoolExecutor.

(1) First, look at the inheritance hierarchy and key attributes of the two classes

  from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

  print('ThreadPoolExecutor MRO:', ThreadPoolExecutor.__mro__)
  print('ThreadPoolExecutor attributes:', [attr for attr in dir(ThreadPoolExecutor) if not attr.startswith('_')])
  print('ProcessPoolExecutor MRO:', ProcessPoolExecutor.__mro__)
  print('ProcessPoolExecutor attributes:', [attr for attr in dir(ProcessPoolExecutor) if not attr.startswith('_')])

  Both inherit from the futures._base.Executor class and expose three important methods: map, submit, and shutdown. Seen this way, the API is quite simple.

(2) Next, look at the implementation of the futures._base.Executor base class

  class Executor(object):
      """This is an abstract base class for concrete asynchronous executors."""

      def submit(self, fn, *args, **kwargs):
          """Submits a callable to be executed with the given arguments.

          Schedules the callable to be executed as fn(*args, **kwargs) and returns
          a Future instance representing the execution of the callable.

          Returns:
              A Future representing the given call.
          """
          raise NotImplementedError()

      def map(self, fn, *iterables, timeout=None, chunksize=1):
          """Returns an iterator equivalent to map(fn, iter).

          Args:
              fn: A callable that will take as many arguments as there are
                  passed iterables.
              timeout: The maximum number of seconds to wait. If None, then there
                  is no limit on the wait time.
              chunksize: The size of the chunks the iterable will be broken into
                  before being passed to a child process. This argument is only
                  used by ProcessPoolExecutor; it is ignored by
                  ThreadPoolExecutor.

          Returns:
              An iterator equivalent to: map(func, *iterables) but the calls may
              be evaluated out-of-order.

          Raises:
              TimeoutError: If the entire result iterator could not be generated
                  before the given timeout.
              Exception: If fn(*args) raises for any values.
          """
          if timeout is not None:
              end_time = timeout + time.time()

          fs = [self.submit(fn, *args) for args in zip(*iterables)]

          # Yield must be hidden in closure so that the futures are submitted
          # before the first iterator value is required.
          def result_iterator():
              try:
                  # reverse to keep finishing order
                  fs.reverse()
                  while fs:
                      # Careful not to keep a reference to the popped future
                      if timeout is None:
                          yield fs.pop().result()
                      else:
                          yield fs.pop().result(end_time - time.time())
              finally:
                  for future in fs:
                      future.cancel()
          return result_iterator()

      def shutdown(self, wait=True):
          """Clean-up the resources associated with the Executor.

          It is safe to call this method several times. Otherwise, no other
          methods can be called after this one.

          Args:
              wait: If True then shutdown will not return until all running
                  futures have finished executing and the resources used by the
                  executor have been reclaimed.
          """
          pass

      def __enter__(self):
          return self

      def __exit__(self, exc_type, exc_val, exc_tb):
          self.shutdown(wait=True)
          return False

  It provides the map, submit, and shutdown methods plus context-manager (with) support. Each of these is described below.

2. The map function

  Signature: def map(self, fn, *iterables, timeout=None, chunksize=1)

  This map behaves like Python's built-in map, except that it pulls arguments from the iterables and then executes fn asynchronously in the pool; timeout limits how long to wait for results.

  Understanding the chunksize parameter (from the docstring):

  The size of the chunks the iterable will be broken into
  before being passed to a child process. This argument is only
  used by ProcessPoolExecutor; it is ignored by ThreadPoolExecutor.

  Example:

  from concurrent.futures import ThreadPoolExecutor
  import time
  import requests

  def download(url):
      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                 'Connection': 'keep-alive',
                 'Host': 'example.webscraping.com'}
      response = requests.get(url, headers=headers)
      return response.status_code

  if __name__ == '__main__':
      urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                 'http://example.webscraping.com/places/default/view/Aland-Islands-2']

      pool = ThreadPoolExecutor(max_workers=2)
      start = time.time()
      result = list(pool.map(download, urllist))
      end = time.time()
      print('status_code:', result)
      print('with multithreading--timestamp:{:.3f}'.format(end - start))

3. The submit function

  Signature: def submit(self, fn, *args, **kwargs)

  fn: the function to execute asynchronously

  args, kwargs: the arguments passed to fn

  Note: the Future objects and the as_completed function used in the example below are explained in later sections.

  from concurrent.futures import ThreadPoolExecutor, as_completed
  import time
  import requests

  def download(url):
      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                 'Connection': 'keep-alive',
                 'Host': 'example.webscraping.com'}
      response = requests.get(url, headers=headers)
      return response.status_code

  if __name__ == '__main__':
      urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                 'http://example.webscraping.com/places/default/view/Aland-Islands-2']

      start = time.time()
      pool = ThreadPoolExecutor(max_workers=2)
      futures = [pool.submit(download, url) for url in urllist]
      for future in futures:
          print('running: %s, done: %s' % (future.running(), future.done()))
      print('#### divider ####')
      for future in as_completed(futures):
          print('running: %s, done: %s' % (future.running(), future.done()))
          print(future.result())
      end = time.time()
      print('with multithreading--timestamp:{:.3f}'.format(end - start))

  Output (screenshot not preserved in this copy)
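One behavior of submit worth knowing before relying on it: if the submitted function raises, the exception is stored in the Future and re-raised only when result() is called. A minimal sketch (the fetch function here is made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Simulated download that rejects non-HTTP URLs
    if not url.startswith('http'):
        raise ValueError('bad url: %s' % url)
    return 200

with ThreadPoolExecutor(max_workers=2) as pool:
    ok = pool.submit(fetch, 'http://example.com')
    bad = pool.submit(fetch, 'ftp://example.com')

print(ok.result())     # 200
try:
    bad.result()       # re-raises the worker's ValueError here
except ValueError as e:
    print('caught:', e)
```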

4. The shutdown function

  Signature: def shutdown(self, wait=True)

  This function releases the system resources held by the executor once the asynchronous work is done.

  Since the _base.Executor class implements the context-manager protocol, with shutdown wrapped inside __exit__, using a with block means you don't have to release resources yourself:

  with ProcessPoolExecutor(max_workers=2) as pool:
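A small sketch (with a made-up worker function) confirming what the with form guarantees: on leaving the block, __exit__ calls shutdown(wait=True), so every submitted task has finished by then:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work(n):
    time.sleep(0.1)
    return n * 2

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(work, n) for n in range(4)]

# shutdown(wait=True) has already run here, so every future is done
print(all(f.done() for f in futures))   # True
print([f.result() for f in futures])    # [0, 2, 4, 6]
```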

5. The Future class

  submit returns a Future object, and the Future class provides methods for tracking the state of the task:

  future.running(): whether the task is currently executing

  future.done(): whether the task has finished

  future.result(): the return value of the executed function

  futures = [pool.submit(download, url) for url in urllist]
  for future in futures:
      print('running: %s, done: %s' % (future.running(), future.done()))
  print('#### divider ####')
  for future in as_completed(futures):
      print('running: %s, done: %s' % (future.running(), future.done()))
      print(future.result())
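These state methods can also be seen in a tiny self-contained sketch (the slow worker function is made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow():
    time.sleep(0.2)
    return 'finished'

pool = ThreadPoolExecutor(max_workers=1)
f = pool.submit(slow)
print(f.done())      # False: the task is still sleeping
print(f.result())    # blocks until the task returns 'finished'
print(f.done())      # True
pool.shutdown()
```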

  as_completed takes two parameters: an iterable of futures and a timeout.

  With the default timeout=None, it blocks until tasks complete and yields each future as it finishes; the returned iterator is implemented with yield.

  With timeout > 0, it waits at most timeout seconds; if some tasks are still unfinished when the time is up, iteration stops and a TimeoutError is raised.
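The difference from iterating over the futures list directly is that as_completed yields futures in the order they finish, not the order they were submitted. A sketch with staggered, illustrative sleep times:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def task(delay):
    time.sleep(delay)
    return delay

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submitted in the order 0.3, 0.1, 0.2, but yielded by completion time
    futures = [pool.submit(task, d) for d in (0.3, 0.1, 0.2)]
    finished_order = [f.result() for f in as_completed(futures)]

print(finished_order)  # [0.1, 0.2, 0.3]
```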

6. Callback functions

  The Future class provides add_done_callback for attaching a custom callback:

  def add_done_callback(self, fn):
      """Attaches a callable that will be called when the future finishes.

      Args:
          fn: A callable that will be called with this future as its only
              argument when the future completes or is cancelled. The callable
              will always be called by a thread in the same process in which
              it was added. If the future has already completed or been
              cancelled then the callable will be called immediately. These
              callables are called in the order that they were added.
      """
      with self._condition:
          if self._state not in [CANCELLED, CANCELLED_AND_NOTIFIED, FINISHED]:
              self._done_callbacks.append(fn)
              return
      fn(self)

  Example:

  from concurrent.futures import ThreadPoolExecutor, as_completed
  import time
  import requests

  def download(url):
      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                 'Connection': 'keep-alive',
                 'Host': 'example.webscraping.com'}
      response = requests.get(url, headers=headers)
      return response.status_code

  def callback(future):
      print(future.result())

  if __name__ == '__main__':
      urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                 'http://example.webscraping.com/places/default/view/Aland-Islands-2',
                 'http://example.webscraping.com/places/default/view/Albania-3',
                 'http://example.webscraping.com/places/default/view/Algeria-4',
                 'http://example.webscraping.com/places/default/view/American-Samoa-5']

      start = time.time()
      with ThreadPoolExecutor(max_workers=2) as pool:
          futures = [pool.submit(download, url) for url in urllist]
          for future in futures:
              print('running: %s, done: %s' % (future.running(), future.done()))
          print('#### divider ####')
          for future in as_completed(futures):
              future.add_done_callback(callback)
              print('running: %s, done: %s' % (future.running(), future.done()))
      end = time.time()
      print('with multithreading--timestamp:{:.3f}'.format(end - start))
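As the docstring above notes, a callback attached to an already-completed future is invoked immediately in the caller's thread, which is exactly what happens when the callback is attached inside the as_completed loop. A minimal sketch of both cases (the callback and worker are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def on_done(future):
    # Called with the future as its only argument
    results.append(future.result())

with ThreadPoolExecutor(max_workers=1) as pool:
    f = pool.submit(pow, 2, 10)
    f.add_done_callback(on_done)   # fires when the task completes

# The pool is shut down, so f is done; attaching another callback
# to the already-finished future invokes it immediately
f.add_done_callback(on_done)
print(results)  # [1024, 1024]
```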

7. The wait function

  Signature: def wait(fs, timeout=None, return_when=ALL_COMPLETED)

  def wait(fs, timeout=None, return_when=ALL_COMPLETED):
      """Wait for the futures in the given sequence to complete.

      Args:
          fs: The sequence of Futures (possibly created by different Executors) to
              wait upon.
          timeout: The maximum number of seconds to wait. If None, then there
              is no limit on the wait time.
          return_when: Indicates when this function should return. The options
              are:

              FIRST_COMPLETED - Return when any future finishes or is
                                cancelled.
              FIRST_EXCEPTION - Return when any future finishes by raising an
                                exception. If no future raises an exception
                                then it is equivalent to ALL_COMPLETED.
              ALL_COMPLETED - Return when all futures finish or are cancelled.

      Returns:
          A named 2-tuple of sets. The first set, named 'done', contains the
          futures that completed (is finished or cancelled) before the wait
          completed. The second set, named 'not_done', contains uncompleted
          futures.
      """
      with _AcquireFutures(fs):
          done = set(f for f in fs
                     if f._state in [CANCELLED_AND_NOTIFIED, FINISHED])
          not_done = set(fs) - done

          if (return_when == FIRST_COMPLETED) and done:
              return DoneAndNotDoneFutures(done, not_done)
          elif (return_when == FIRST_EXCEPTION) and done:
              if any(f for f in done
                     if not f.cancelled() and f.exception() is not None):
                  return DoneAndNotDoneFutures(done, not_done)

          if len(done) == len(fs):
              return DoneAndNotDoneFutures(done, not_done)

          waiter = _create_and_install_waiters(fs, return_when)

      waiter.event.wait(timeout)
      for f in fs:
          with f._condition:
              f._waiters.remove(waiter)

      done.update(waiter.finished_futures)
      return DoneAndNotDoneFutures(done, set(fs) - done)

  wait returns a named 2-tuple containing two sets: one of completed futures (finished or cancelled), and one of uncompleted futures.

  It takes three parameters; the third deserves particular attention:

  FIRST_COMPLETED: return when any future finishes or is cancelled.

  FIRST_EXCEPTION: return when any future finishes by raising an exception; if no future raises an exception, this is equivalent to ALL_COMPLETED.

  ALL_COMPLETED: return when all futures finish or are cancelled.

  Example:
  from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
  import time
  import requests

  def download(url):
      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                 'Connection': 'keep-alive',
                 'Host': 'example.webscraping.com'}
      response = requests.get(url, headers=headers)
      return response.status_code

  if __name__ == '__main__':
      urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                 'http://example.webscraping.com/places/default/view/Aland-Islands-2',
                 'http://example.webscraping.com/places/default/view/Albania-3',
                 'http://example.webscraping.com/places/default/view/Algeria-4',
                 'http://example.webscraping.com/places/default/view/American-Samoa-5']

      start = time.time()
      with ThreadPoolExecutor(max_workers=2) as pool:
          futures = [pool.submit(download, url) for url in urllist]
          for future in futures:
              print('running: %s, done: %s' % (future.running(), future.done()))
          print('#### divider ####')
          completed, uncompleted = wait(futures, return_when=FIRST_COMPLETED)
          for cp in completed:
              print('running: %s, done: %s' % (cp.running(), cp.done()))
              print(cp.result())
      end = time.time()
      print('with multithreading--timestamp:{:.3f}'.format(end - start))

  Output (screenshot not preserved in this copy): with FIRST_COMPLETED, only one completed future was returned.
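For comparison, a sketch of return_when=FIRST_EXCEPTION: wait returns as soon as some future raises, leaving slower tasks in the not_done set. The worker function and timings here are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_EXCEPTION
import time

def task(delay, fail=False):
    time.sleep(delay)
    if fail:
        raise RuntimeError('task failed')
    return delay

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(task, 0.1, True),   # raises after 0.1 s
               pool.submit(task, 0.5)]         # still running at that point
    done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
    print(len(done), len(not_done))  # 1 1: wait returned on the exception
    print(isinstance(next(iter(done)).exception(), RuntimeError))  # True
```

Note that the exception is not raised by wait itself; it stays stored in the future and surfaces only via result() or exception().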
