I. Introduction to pycurl

The pycurl module provides a Python interface to the libcurl library. libcurl is a free, open-source, client-side URL transfer library supporting FTP, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP and more. libcurl also supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, and so on. Like the urllib module, pycurl can be used to fetch objects identified by a URL. pycurl exposes most of the functions libcurl provides, which gives it the following properties:

Fast: libcurl itself is fast, and pycurl, being a thin wrapper over libcurl, is fast too.

Supports many protocols, SSL, authentication and proxy options. pycurl supports most of libcurl's callbacks.

multi and share interface support.

Can be integrated with the application's I/O loop.

II. pycurl usage examples

1. Installing pycurl

On CentOS 6, install it with pip install pycurl.

ipython is convenient for interactive experimentation.

2. Fetching a URL response

import pycurl
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO

buffer = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEFUNCTION, buffer.write)
c.perform()
c.close()

body = buffer.getvalue()
print(body)

  

pycurl itself does not store the response; you must set up a buffer and have pycurl write the result into it.

To get debug output, set:

c.setopt(c.VERBOSE, True)

  

 

This is equivalent to curl -v.

3. Inspecting response headers

In practice we usually want to decode the response body using the character encoding declared by the server.

import pycurl
import re
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO

headers = {}

def header_function(header_line):
    # HTTP standard specifies that headers are encoded in iso-8859-1.
    # On Python 2, decoding step can be skipped.
    # On Python 3, decoding step is required.
    header_line = header_line.decode('iso-8859-1')

    # Header lines include the first status line (HTTP/1.x ...).
    # We are going to ignore all lines that don't have a colon in them.
    # This will botch headers that are split on multiple lines...
    if ':' not in header_line:
        return

    # Break the header line into header name and value.
    name, value = header_line.split(':', 1)

    # Remove whitespace that may be present.
    # Header lines include the trailing newline, and there may be whitespace
    # around the colon.
    name = name.strip()
    value = value.strip()

    # Header names are case insensitive.
    # Lowercase name here.
    name = name.lower()

    # Now we can actually record the header name and value.
    headers[name] = value

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io')
c.setopt(c.WRITEFUNCTION, buffer.write)
# Set our header function.
c.setopt(c.HEADERFUNCTION, header_function)
c.perform()
c.close()

# Figure out what encoding was sent with the response, if any.
# Check against lowercased header name.
encoding = None
if 'content-type' in headers:
    content_type = headers['content-type'].lower()
    match = re.search(r'charset=(\S+)', content_type)
    if match:
        encoding = match.group(1)
        print('Decoding using %s' % encoding)

if encoding is None:
    # Default encoding for HTML is iso-8859-1.
    # Other content types may have different default encodings,
    # or in case of binary data, may have no encoding at all.
    encoding = 'iso-8859-1'
    print('Assuming encoding is %s' % encoding)

body = buffer.getvalue()
# Decode using the encoding we figured out.
print(body.decode(encoding))
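The charset-extraction step used above is plain standard-library code and can be exercised on its own, without any network access. The helper below (parse_charset is an illustrative name, not part of pycurl) is a sketch of that logic:

```python
import re

def parse_charset(content_type):
    """Extract the charset from a Content-Type header value, if present."""
    match = re.search(r'charset=(\S+)', content_type.lower())
    return match.group(1) if match else None

print(parse_charset('text/html; charset=UTF-8'))   # utf-8
print(parse_charset('application/octet-stream'))   # None
```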

  

 
 

4. Writing the response to a file

import pycurl

with open('out.html', 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://pycurl.io/')
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()

  

 

The important detail here is opening the file in binary mode: the response can then be written to it as bytes, with no encoding or decoding involved.

5. Following redirects

libcurl, and therefore pycurl, does not follow redirects by default.

import pycurl

c = pycurl.Curl()
# Redirects to https://www.python.org/.
c.setopt(c.URL, 'http://www.python.org/')
# Follow redirects.
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()

  

 

6. Inspecting the response

import pycurl
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.python.org/')
c.setopt(c.WRITEFUNCTION, buffer.write)
c.perform()

# Last used URL
print('Effective_url: %s' % c.getinfo(c.EFFECTIVE_URL))
# HTTP response code
print('Response_code: %d' % c.getinfo(c.RESPONSE_CODE))
# Total time of previous transfer
print('Total_time: %f' % c.getinfo(c.TOTAL_TIME))
# Time from start until name resolving completed
print('Namelookup_time: %f' % c.getinfo(c.NAMELOOKUP_TIME))
# Time from start until the connection to the remote host (or proxy) completed
print('Connect_time: %f' % c.getinfo(c.CONNECT_TIME))
# Time from start until SSL/SSH handshake completed
print('SSL/SSH_time: %f' % c.getinfo(c.APPCONNECT_TIME))
# Time from start until just before the transfer begins
print('Pretransfer_time: %f' % c.getinfo(c.PRETRANSFER_TIME))
# Time from start until the first byte is received
print('Starttransfer_time: %f' % c.getinfo(c.STARTTRANSFER_TIME))
# Time taken for all redirect steps before the final transfer
print('Redirect_time: %f' % c.getinfo(c.REDIRECT_TIME))
# Total number of redirects that were followed
print('Redirect_count: %d' % c.getinfo(c.REDIRECT_COUNT))
# URL a redirect would take you to, had you enabled redirects
print('Redirect_url: %s' % c.getinfo(c.REDIRECT_URL))
# Number of bytes uploaded
print('Size_upload: %d' % c.getinfo(c.SIZE_UPLOAD))
# Average upload speed
print('Speed_upload: %f' % c.getinfo(c.SPEED_UPLOAD))
# Number of bytes downloaded
print('Size_download: %d' % c.getinfo(c.SIZE_DOWNLOAD))
# Average download speed
print('Speed_download: %f' % c.getinfo(c.SPEED_DOWNLOAD))

# getinfo() must be called before close().
c.close()

  

# python response_info.py
Effective_url: http://www.python.org/
Response_code: 301
Total_time: 0.105395
Namelookup_time: 0.051208
Connect_time: 0.078317
SSL/SSH_time: 0.000000
Pretransfer_time: 0.078322
Starttransfer_time: 0.105297
Redirect_time: 0.000000
Redirect_count: 0
Redirect_url: https://www.python.org/
Size_upload: 0
Speed_upload: 0.000000
Size_download: 0
Speed_download: 0.000000

  

 
 
 
 

7. Sending form data

Form data is sent with the POSTFIELDS option.

import pycurl
try:
    # Python 3
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode

c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/tests/testpostvars.php')

post_data = {'field': 'value'}
# Form data must be provided already urlencoded.
postfields = urlencode(post_data)
# Sets request method to POST,
# Content-Type header to application/x-www-form-urlencoded
# and data to send in request body.
c.setopt(c.POSTFIELDS, postfields)

c.perform()
c.close()
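The urlencode step itself is standard library and can be checked without performing a request (Python 3 shown; on Python 2 the import is from urllib import urlencode):

```python
from urllib.parse import urlencode

# This is the exact body pycurl sends for POSTFIELDS above;
# note how spaces are escaped for application/x-www-form-urlencoded.
print(urlencode({'field': 'value'}))      # field=value
print(urlencode({'a': '1', 'b': 'x y'}))  # a=1&b=x+y
```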

  

 

8. Uploading files

File uploads use the HTTPPOST option; to upload a physical file, use FORM_FILE.

import pycurl

c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/tests/testfileupload.php')

c.setopt(c.HTTPPOST, [
    ('fileupload', (
        # upload the contents of this file
        c.FORM_FILE, __file__,
    )),
])

c.perform()
c.close()

  

 

To set a different file name and content type for the uploaded file:

import pycurl

c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/tests/testfileupload.php')

c.setopt(c.HTTPPOST, [
    ('fileupload', (
        # upload the contents of this file
        c.FORM_FILE, __file__,
        # specify a different file name for the upload
        c.FORM_FILENAME, 'helloworld.py',
        # specify a different content type
        c.FORM_CONTENTTYPE, 'application/x-python',
    )),
])

c.perform()
c.close()

  

 
 

If the file data is already in memory, use FORM_BUFFER/FORM_BUFFERPTR:

import pycurl

c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/tests/testfileupload.php')

c.setopt(c.HTTPPOST, [
    ('fileupload', (
        c.FORM_BUFFER, 'readme.txt',
        c.FORM_BUFFERPTR, 'This is a fancy readme file',
    )),
])

c.perform()
c.close()

  

 

9. Working with the FTP protocol

import pycurl

c = pycurl.Curl()
c.setopt(c.URL, 'ftp://ftp.sunet.se/')
c.setopt(c.FTP_USE_EPSV, 1)
c.setopt(c.QUOTE, ['cwd pub', 'type i'])
c.perform()
c.close()

  

 

10. Sharing data

import sys
import threading
import pycurl

print >>sys.stderr, 'Testing', pycurl.version

class Test(threading.Thread):

    def __init__(self, share):
        threading.Thread.__init__(self)
        self.curl = pycurl.Curl()
        self.curl.setopt(pycurl.URL, 'http://curl.haxx.se')
        self.curl.setopt(pycurl.SHARE, share)

    def run(self):
        self.curl.perform()
        self.curl.close()

s = pycurl.CurlShare()
s.setopt(pycurl.SH_SHARE, pycurl.LOCK_DATA_COOKIE)
s.setopt(pycurl.SH_SHARE, pycurl.LOCK_DATA_DNS)

t1 = Test(s)
t2 = Test(s)

t1.start()
t2.start()
del s

  

 

11. Using the multi interface

libcurl's easy interface is a synchronous, efficient, easy-to-learn interface for file transfers. The multi interface is an asynchronous interface that can drive multiple simultaneous transfers, in a single thread or across several.

The multi interface adds these capabilities over the easy interface:

Offers a pull interface: the application using libcurl decides where and when to ask libcurl to receive or send data.

Performs multiple simultaneous transfers in the same thread without making the application more complicated.

Makes it much easier for the application to wait for activity on its own file descriptors and libcurl's file descriptors at the same time.

Enables event-based handling that scales transfers to thousands of parallel connections.

 

Example 1

import pycurl

m = pycurl.CurlMulti()
m.handles = []
c1 = pycurl.Curl()
c2 = pycurl.Curl()
c1.setopt(c1.URL, 'http://curl.haxx.se')
c2.setopt(c2.URL, 'http://cnn.com')
c2.setopt(c2.FOLLOWLOCATION, 1)
m.add_handle(c1)
m.add_handle(c2)
m.handles.append(c1)
m.handles.append(c2)

num_handles = len(m.handles)
while num_handles:
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    m.select(1.0)

m.remove_handle(c2)
m.remove_handle(c1)
del m.handles
m.close()
c1.close()
c2.close()

  

 

 
 

Example 2

import os, sys
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
import pycurl

urls = (
    "http://curl.haxx.se",
    "http://www.python.org",
    "http://pycurl.sourceforge.net",
    "http://pycurl.sourceforge.net/tests/403_FORBIDDEN",  # that actually exists ;-)
    "http://pycurl.sourceforge.net/tests/404_NOT_FOUND",
)

# Read list of URIs from file specified on commandline
try:
    urls = open(sys.argv[1], "rb").readlines()
except IndexError:
    # No file was specified
    pass

# init
m = pycurl.CurlMulti()
m.handles = []
for url in urls:
    c = pycurl.Curl()
    # save info in standard Python attributes
    c.url = url.strip()
    c.body = StringIO()
    c.http_code = -1
    m.handles.append(c)
    # pycurl API calls
    c.setopt(c.URL, c.url)
    c.setopt(c.WRITEFUNCTION, c.body.write)
    c.setopt(c.FOLLOWLOCATION, True)
    m.add_handle(c)

# get data
num_handles = len(m.handles)
while num_handles:
    while 1:
        ret, num_handles = m.perform()
        print ret, num_handles
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.)
    m.select(1.0)

# close handles
for c in m.handles:
    # save info in standard Python attributes
    c.http_code = c.getinfo(c.HTTP_CODE)
    # pycurl API calls
    m.remove_handle(c)
    c.close()
m.close()

# print result
for c in m.handles:
    data = c.body.getvalue()
    if 0:
        print "**********", c.url, "**********"
        print data
    else:
        print "%-53s http_code %3d, %6d bytes" % (c.url, c.http_code, len(data))

  

 
 

Example 3

import os, sys
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
import pycurl

urls = (
    "http://curl.haxx.se",
    "http://www.python.org",
    "http://pycurl.sourceforge.net",
    "http://pycurl.sourceforge.net/THIS_HANDLE_IS_CLOSED",
)

# init
m = pycurl.CurlMulti()
m.handles = []
for url in urls:
    c = pycurl.Curl()
    # save info in standard Python attributes
    c.url = url
    c.body = StringIO()
    c.http_code = -1
    c.debug = 0
    m.handles.append(c)
    # pycurl API calls
    c.setopt(c.URL, c.url)
    c.setopt(c.WRITEFUNCTION, c.body.write)
    c.setopt(c.FOLLOWLOCATION, True)
    m.add_handle(c)

# debug - close a handle
if 1:
    c = m.handles[3]
    c.debug = 1
    c.close()

# get data
num_handles = len(m.handles)
while num_handles:
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.)
    m.select(1.0)

# close handles
for c in m.handles:
    # save info in standard Python attributes
    try:
        c.http_code = c.getinfo(c.HTTP_CODE)
    except pycurl.error:
        # handle already closed - see debug above
        assert c.debug
        c.http_code = -1
    # pycurl API calls
    if 0:
        m.remove_handle(c)
        c.close()
    elif 0:
        # in the C API this is the wrong calling order, but pycurl
        # handles this automatically
        c.close()
        m.remove_handle(c)
    else:
        # actually, remove_handle is called automatically on close
        c.close()
m.close()

# print result
for c in m.handles:
    data = c.body.getvalue()
    if 0:
        print "**********", c.url, "**********"
    else:
        print "%-53s http_code %3d, %6d bytes" % (c.url, c.http_code, len(data))

  

 
 

The multi interface can be used to cut the time needed to visit a large number of URLs.

Suppose a file contains many URLs, and a script must visit each one and check whether it returns code 200.

The file contains 87 URLs in total.

Approach 1: visit each URL in sequence with a Python for loop

import os, sys
import pycurl
from StringIO import StringIO

try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1], 'rb').readlines()
    #print urls
except IndexError:
    print "Usage: %s <file with urls to check>" % sys.argv[0]
    raise SystemExit

class Curl:
    def __init__(self, url):
        self.url = url
        self.body = StringIO()
        self.http_code = 0

        self._curl = pycurl.Curl()
        self._curl.setopt(pycurl.URL, self.url)
        self._curl.setopt(pycurl.WRITEFUNCTION, self.body.write)
        self._curl.setopt(pycurl.FOLLOWLOCATION, True)
        self._curl.setopt(pycurl.NOSIGNAL, 1)

    def perform(self):
        self._curl.perform()

    def close(self):
        self.http_code = self._curl.getinfo(pycurl.HTTP_CODE)
        self._curl.close()

for url in urls:
    url = url.strip()
    if not url or url[0] == '#':
        continue
    c = Curl(url)
    c.perform()
    c.close()
    print url, c.http_code

  

real    2m46.134s
user    0m0.134s
sys     0m0.185s

  

 
 
 
 

Approach 2: use pycurl's CurlMulti interface

import os, sys
from StringIO import StringIO
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass

# need a given txt file that contains urls
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1], 'rb').readlines()
    #print urls
except IndexError:
    print "Usage: %s <file with urls to check>" % sys.argv[0]
    raise SystemExit

class Curl:
    def __init__(self, url):
        self.url = url
        self.body = StringIO()
        self.http_code = 0

        self._curl = pycurl.Curl()
        self._curl.setopt(pycurl.URL, self.url)
        self._curl.setopt(pycurl.FOLLOWLOCATION, True)
        self._curl.setopt(pycurl.WRITEFUNCTION, self.body.write)
        self._curl.setopt(pycurl.NOSIGNAL, 1)
        self._curl.debug = 0

    def perform(self):
        self._curl.perform()

    def close(self):
        try:
            self.http_code = self._curl.getinfo(pycurl.HTTP_CODE)
        except pycurl.error:
            assert self._curl.debug
            self.http_code = 0
        self._curl.close()

def print_result(items):
    for c in items:
        data = c.body.getvalue()
        if 0:
            print "***************", c.url, "******************"
            print data
        else:
            print "%-60s %3d %6d" % (c.url, c.http_code, len(data))

def test_multi():
    handles = []
    m = pycurl.CurlMulti()
    for url in urls:
        url = url.strip()
        if not url or url[0] == '#':
            continue
        c = Curl(url)
        m.add_handle(c._curl)
        handles.append(c)

    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

    while num_handles:
        m.select(5.0)
        while 1:
            ret, num_handles = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break

    for c in handles:
        c.close()
    m.close()

    print_result(handles)

if 1:
    test_multi()
real    2m46.049s
user    0m0.082s
sys     0m0.132s

In the examples the pycurl author gives, the CurlMulti interface is the fastest way to process several URLs. With a large number of URLs, however, it was not fast at all here, and some URLs did not even yield a correct response code.

Approach 3: use Python's threading module

Because of the GIL (global interpreter lock), Python's threading module cannot take full advantage of multiple threads: even on a multi-core server, only one thread actually executes at any given moment while the others sit locked. So threading is a poor fit for CPU-bound tasks; in fact, the more threads, the slower it gets. For I/O-bound or network-bound tasks, however, the threading module still helps.
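A small illustration of this point, with time.sleep standing in for a blocking network call (no pycurl involved): sleeping, like waiting on a socket, releases the GIL, so the waits overlap across threads.

```python
import threading
import time

def fake_request():
    time.sleep(0.2)  # stands in for a blocking network call

# Sequential: the waits add up (5 x 0.2s ~= 1s).
start = time.time()
for _ in range(5):
    fake_request()
sequential = time.time() - start

# Threaded: the waits overlap (~0.2s total).
start = time.time()
threads = [threading.Thread(target=fake_request) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print('sequential: %.2fs, threaded: %.2fs' % (sequential, threaded))
```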

import os, sys, time
import threading
import Queue

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass

# need a given txt file that contains urls
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1], 'rb').readlines()
    #print urls
except IndexError:
    print "Usage: %s <file with urls to check>" % sys.argv[0]
    raise SystemExit

class Curl:
    def __init__(self, url):
        self.url = url
        self.body = StringIO()
        self.http_code = 0

        self._curl = pycurl.Curl()
        self._curl.setopt(pycurl.URL, self.url)
        self._curl.setopt(pycurl.FOLLOWLOCATION, True)
        self._curl.setopt(pycurl.CONNECTTIMEOUT, 15)
        self._curl.setopt(pycurl.TIMEOUT, 15)
        self._curl.setopt(pycurl.WRITEFUNCTION, self.body.write)
        self._curl.setopt(pycurl.NOSIGNAL, 1)
        self._curl.debug = 0

    def perform(self):
        self._curl.perform()

    def close(self):
        try:
            self.http_code = self._curl.getinfo(pycurl.HTTP_CODE)
        except pycurl.error:
            assert self._curl.debug
            self.http_code = 0
        self._curl.close()

queue = Queue.Queue()
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    queue.put(url)

assert queue.queue, "no urls are given"
num_urls = len(queue.queue)
#num_conn = min(num_conn, num_urls)
num_conn = num_urls
#assert 1 <= num_conn <= 1000, "invalid number of concurrent connections"

class WorkerThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while 1:
            try:
                url = self.queue.get_nowait()
            except Queue.Empty:
                raise SystemExit
            c = Curl(url)
            c.perform()
            c.close()
            print "http_url:" + url + "\t" + "http_code:" + str(c.http_code)

# start a bunch of threads
threads = []
for dummy in range(num_conn):
    t = WorkerThread(queue)
    t.start()
    threads.append(t)

# wait for all threads to finish
for thread in threads:
    thread.join()
real    0m10.500s
user    0m0.149s
sys     0m0.196s

The elapsed time is clearly much shorter than with either of the previous two approaches.

So when a large batch of URLs has to be processed with pycurl, combine it with the threading module.
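On Python 3, the same one-worker-per-URL pattern can be written more compactly with concurrent.futures. A minimal sketch, assuming the caller supplies a fetch(url) callable (for example one wrapping the Curl class above; check_urls is an illustrative name):

```python
from concurrent.futures import ThreadPoolExecutor

def check_urls(urls, fetch, max_workers=32):
    """Apply fetch(url) to every URL using a thread pool; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))

# With a pycurl-based fetch this would map URLs to HTTP status codes;
# here a stub (len) shows only the fan-out mechanics.
results = check_urls(['http://a/', 'http://b/'], fetch=len)
print(results)
```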

References:

http://pycurl.sourceforge.net/

http://pycurl.io/docs/latest/index.html

https://curl.haxx.se/libcurl/

https://curl.haxx.se/libcurl/c/libcurl-multi.html
