1: Basic usage

  import mechanize

  # response = mechanize.urlopen("http://www.hao123.com/")
  request = mechanize.Request("http://www.hao123.com/")
  response = mechanize.urlopen(request)
  print(response.geturl())   # final URL, after any redirects
  print(response.info())     # response headers
  # print(response.read())   # response body

2: mechanize.urlretrieve

  >>> import mechanize
  >>> help(mechanize.urlretrieve)
  Help on function urlretrieve in module mechanize._opener:

  urlretrieve(url, filename=None, reporthook=None, data=None, timeout=<object object>)

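  • filename: the local path to save to. If it is omitted, mechanize writes the data to a temporary file.
  • reporthook: a callback that is invoked once when the connection to the server is established and again after each block of data has been transferred. It can be used to display download progress.
  • data: the data to POST to the server.
  • timeout: the timeout in seconds. The default value is a sentinel object meaning that the global default socket timeout applies.

urlretrieve returns a (filename, headers) tuple: filename is the local path the data was saved to, and headers holds the server's response headers.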

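The callback has the signature reporthook(block_read, block_size, total_size). block_read is the number of blocks read so far, block_size is the size of each block in bytes, and total_size is the total size of the resource in bytes as given by the Content-Length header (-1 if the server does not send one). The callback can be used to display download progress.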

A simple example

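  import mechanize

  def cbk(block_read, block_size, total_size):
      # print the three values passed to the callback
      print("%d %d %d" % (block_read, block_size, total_size))

  url = 'http://www.hao123.com/'
  local = 'd:/hao.html'
  mechanize.urlretrieve(url, local, cbk)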
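
The three callback arguments can be combined into a percentage. The following is a minimal sketch rather than anything shipped with mechanize: `progress_percent`, `show_progress` and `download` are hypothetical helper names, and the code assumes that total_size is the Content-Length value, or -1 when the server does not send one.

```python
import sys


def progress_percent(block_read, block_size, total_size):
    """Return download progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        # no usable Content-Length, so a percentage cannot be computed
        return None
    done = block_read * block_size
    # the last block is usually only partly filled, so cap the result at 100
    return min(100.0, done * 100.0 / total_size)


def show_progress(block_read, block_size, total_size):
    """reporthook-compatible callback that writes one progress line per call."""
    percent = progress_percent(block_read, block_size, total_size)
    if percent is None:
        sys.stdout.write("read about %d bytes\n" % (block_read * block_size))
    else:
        sys.stdout.write("%.1f%%\n" % percent)


def download(url, local):
    """Fetch url into the file local, reporting progress on stdout."""
    import mechanize  # imported here so the helpers above work without it

    return mechanize.urlretrieve(url, local, show_progress)
```

Calling download('http://www.hao123.com/', 'hao.html') would then print one line per block. The percentage is kept in a separate function so that it can be checked without touching the network.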

3: Logging in through a form

  import os
  import mechanize

  br = mechanize.Browser()
  br.set_handle_robots(False)      # do not observe robots.txt
  br.open("http://www.zhaopin.com/")
  br.select_form(nr=0)             # select the first form on the page
  br['loginname'] = '**'           # register an account of your own and fill in its name and password
  br['password'] = '**'
  r = br.submit()

  # save the page returned after submitting the form next to this script
  path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'login.html')
  print(path)
  with open(path, "wb") as h:
      h.write(r.read())
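
Selecting the form with nr=0 only works while the login form happens to be the first one on the page. select_form also accepts a predicate, so the form can instead be chosen by the controls it contains. A minimal sketch, assuming the login form has controls named loginname and password as in the example above; `has_controls` and `open_login_form` are hypothetical helpers, not part of mechanize.

```python
def has_controls(*names):
    """Build a predicate that accepts a form containing every named control."""
    required = set(names)

    def predicate(form):
        # a mechanize form exposes its controls as a list; each one has a .name
        present = set(control.name for control in form.controls)
        return required <= present

    return predicate


def open_login_form(url):
    """Open url and select the first form that has both login fields."""
    import mechanize  # imported here so has_controls can be used without it

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    # raises mechanize.FormNotFoundError if no form matches the predicate
    br.select_form(predicate=has_controls("loginname", "password"))
    return br
```

The browser returned by open_login_form can be filled in and submitted exactly as in the listing above.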

4: Browser

Once you have read through the help() output below, you know practically everything there is to know about Browser.

  Help on class Browser in module mechanize._mechanize:

  class Browser(mechanize._useragent.UserAgentBase)
  |  Browser-like class with support for history, forms and links.
  |
  |  BrowserStateError is raised whenever the browser is in the wrong state to
  |  complete the requested operation - e.g., when .back() is called when the
  |  browser history is empty, or when .follow_link() is called when the current
  |  response does not contain HTML data.
  |
  |  Public attributes:
  |
  |  request: current request (mechanize.Request)
  |  form: currently selected form (see .select_form())
  |
  |  Method resolution order:
  |      Browser
  |      mechanize._useragent.UserAgentBase
  |      mechanize._opener.OpenerDirector
  |      mechanize._urllib2_fork.OpenerDirector
  |
  |  Methods defined here:
  |
  |  __getattr__(self, name)
  |
  |  __init__(self, factory=None, history=None, request_class=None)
  |      Only named arguments should be passed to this constructor.
  |
  |      factory: object implementing the mechanize.Factory interface.
  |      history: object implementing the mechanize.History interface. Note
  |        this interface is still experimental and may change in future.
  |      request_class: Request class to use. Defaults to mechanize.Request
  |
  |      The Factory and History objects passed in are 'owned' by the Browser,
  |      so they should not be shared across Browsers. In particular,
  |      factory.set_response() should not be called except by the owning
  |      Browser itself.
  |
  |      Note that the supplied factory's request_class is overridden by this
  |      constructor, to ensure only one Request class is used.
  |
  |  __str__(self)
  |
  |  back(self, n=1)
  |      Go back n steps in history, and return response object.
  |
  |      n: go back this number of steps (default 1 step)
  |
  |  clear_history(self)
  |
  |  click(self, *args, **kwds)
  |      See mechanize.HTMLForm.click for documentation.
  |
  |  click_link(self, link=None, **kwds)
  |      Find a link and return a Request object for it.
  |
  |      Arguments are as for .find_link(), except that a link may be supplied
  |      as the first argument.
  |
  |  close(self)
  |
  |  encoding(self)
  |
  |  find_link(self, **kwds)
  |      Find a link in current page.
  |
  |      Links are returned as mechanize.Link objects.
  |
  |      # Return third link that .search()-matches the regexp "python"
  |      # (by ".search()-matches", I mean that the regular expression method
  |      # .search() is used, rather than .match()).
  |      find_link(text_regex=re.compile("python"), nr=2)
  |
  |      # Return first http link in the current page that points to somewhere
  |      # on python.org whose link text (after tags have been removed) is
  |      # exactly "monty python".
  |      find_link(text="monty python",
  |                url_regex=re.compile("http.*python.org"))
  |
  |      # Return first link with exactly three HTML attributes.
  |      find_link(predicate=lambda link: len(link.attrs) == 3)
  |
  |      Links include anchors (<a>), image maps (<area>), and frames (<frame>,
  |      <iframe>).
  |
  |      All arguments must be passed by keyword, not position. Zero or more
  |      arguments may be supplied. In order to find a link, all arguments
  |      supplied must match.
  |
  |      If a matching link is not found, mechanize.LinkNotFoundError is raised.
  |
  |      text: link text between link tags: e.g. <a href="blah">this bit</a> (as
  |        returned by pullparser.get_compressed_text(), ie. without tags but
  |        with opening tags "textified" as per the pullparser docs) must compare
  |        equal to this argument, if supplied
  |      text_regex: link text between tag (as defined above) must match the
  |        regular expression object or regular expression string passed as this
  |        argument, if supplied
  |      name, name_regex: as for text and text_regex, but matched against the
  |        name HTML attribute of the link tag
  |      url, url_regex: as for text and text_regex, but matched against the
  |        URL of the link tag (note this matches against Link.url, which is a
  |        relative or absolute URL according to how it was written in the HTML)
  |      tag: element name of opening tag, e.g. "a"
  |      predicate: a function taking a Link object as its single argument,
  |        returning a boolean result, indicating whether the links
  |      nr: matches the nth link that matches all other criteria (default 0)
  |
  |  follow_link(self, link=None, **kwds)
  |      Find a link and .open() it.
  |
  |      Arguments are as for .click_link().
  |
  |      Return value is same as for Browser.open().
  |
  |  forms(self)
  |      Return iterable over forms.
  |
  |      The returned form objects implement the mechanize.HTMLForm interface.
  |
  |  geturl(self)
  |      Get URL of current document.
  |
  |  global_form(self)
  |      Return the global form object, or None if the factory implementation
  |      did not supply one.
  |
  |      The "global" form object contains all controls that are not descendants
  |      of any FORM element.
  |
  |      The returned form object implements the mechanize.HTMLForm interface.
  |
  |      This is a separate method since the global form is not regarded as part
  |      of the sequence of forms in the document -- mostly for
  |      backwards-compatibility.
  |
  |  links(self, **kwds)
  |      Return iterable over links (mechanize.Link objects).
  |
  |  open(self, url, data=None, timeout=<object object>)
  |
  |  open_local_file(self, filename)
  |
  |  open_novisit(self, url, data=None, timeout=<object object>)
  |      Open a URL without visiting it.
  |
  |      Browser state (including request, response, history, forms and links)
  |      is left unchanged by calling this function.
  |
  |      The interface is the same as for .open().
  |
  |      This is useful for things like fetching images.
  |
  |      See also .retrieve().
  |
  |  reload(self)
  |      Reload current document, and return response object.
  |
  |  response(self)
  |      Return a copy of the current response.
  |
  |      The returned object has the same interface as the object returned by
  |      .open() (or mechanize.urlopen()).
  |
  |  select_form(self, name=None, predicate=None, nr=None)
  |      Select an HTML form for input.
  |
  |      This is a bit like giving a form the "input focus" in a browser.
  |
  |      If a form is selected, the Browser object supports the HTMLForm
  |      interface, so you can call methods like .set_value(), .set(), and
  |      .click().
  |
  |      Another way to select a form is to assign to the .form attribute. The
  |      form assigned should be one of the objects returned by the .forms()
  |      method.
  |
  |      At least one of the name, predicate and nr arguments must be supplied.
  |      If no matching form is found, mechanize.FormNotFoundError is raised.
  |
  |      If name is specified, then the form must have the indicated name.
  |
  |      If predicate is specified, then the form must match that function. The
  |      predicate function is passed the HTMLForm as its single argument, and
  |      should return a boolean value indicating whether the form matched.
  |
  |      nr, if supplied, is the sequence number of the form (where 0 is the
  |      first). Note that control 0 is the first form matching all the other
  |      arguments (if supplied); it is not necessarily the first control in the
  |      form. The "global form" (consisting of all form controls not contained
  |      in any FORM element) is considered not to be part of this sequence and
  |      to have no name, so will not be matched unless both name and nr are
  |      None.
  |
  |  set_cookie(self, cookie_string)
  |      Request to set a cookie.
  |
  |      Note that it is NOT necessary to call this method under ordinary
  |      circumstances: cookie handling is normally entirely automatic. The
  |      intended use case is rather to simulate the setting of a cookie by
  |      client script in a web page (e.g. JavaScript). In that case, use of
  |      this method is necessary because mechanize currently does not support
  |      JavaScript, VBScript, etc.
  |
  |      The cookie is added in the same way as if it had arrived with the
  |      current response, as a result of the current request. This means that,
  |      for example, if it is not appropriate to set the cookie based on the
  |      current request, no cookie will be set.
  |
  |      The cookie will be returned automatically with subsequent responses
  |      made by the Browser instance whenever that's appropriate.
  |
  |      cookie_string should be a valid value of the Set-Cookie header.
  |
  |      For example:
  |
  |      browser.set_cookie(
  |          "sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")
  |
  |      Currently, this method does not allow for adding RFC 2986 cookies.
  |      This limitation will be lifted if anybody requests it.
  |
  |  set_handle_referer(self, handle)
  |      Set whether to add Referer header to each request.
  |
  |  set_response(self, response)
  |      Replace current response with (a copy of) response.
  |
  |      response may be None.
  |
  |      This is intended mostly for HTML-preprocessing.
  |
  |  submit(self, *args, **kwds)
  |      Submit current form.
  |
  |      Arguments are as for mechanize.HTMLForm.click().
  |
  |      Return value is same as for Browser.open().
  |
  |  title(self)
  |      Return title, or None if there is no title element in the document.
  |
  |      Treatment of any tag children of attempts to follow Firefox and IE
  |      (currently, tags are preserved).
  |
  |  viewing_html(self)
  |      Return whether the current response contains HTML data.
  |
  |  visit_response(self, response, request=None)
  |      Visit the response, as if it had been .open()ed.
  |
  |      Unlike .set_response(), this updates history rather than replacing the
  |      current response.
  |
  |  ----------------------------------------------------------------------
  |  Data and other attributes defined here:
  |
  |  default_features = ['_redirect', '_cookies', '_refresh', '_equiv', '_b...
  |
  |  handler_classes = {'_basicauth': <class mechanize._urllib2_fork.HTTPBa...
  |
  |  ----------------------------------------------------------------------
  |  Methods inherited from mechanize._useragent.UserAgentBase:
  |
  |  add_client_certificate(self, url, key_file, cert_file)
  |      Add an SSL client certificate, for HTTPS client auth.
  |
  |      key_file and cert_file must be filenames of the key and certificate
  |      files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS
  |      12) file to PEM format:
  |
  |      openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem
  |      openssl pkcs12 -nocerts -in cert.p12 -out key.pem
  |
  |
  |      Note that client certificate password input is very inflexible ATM. At
  |      the moment this seems to be console only, which is presumably the
  |      default behaviour of libopenssl. In future mechanize may support
  |      third-party libraries that (I assume) allow more options here.
  |
  |  add_password(self, url, user, password, realm=None)
  |
  |  add_proxy_password(self, user, password, hostport=None, realm=None)
  |
  |  set_client_cert_manager(self, cert_manager)
  |      Set a mechanize.HTTPClientCertMgr, or None.
  |
  |  set_cookiejar(self, cookiejar)
  |      Set a mechanize.CookieJar, or None.
  |
  |  set_debug_http(self, handle)
  |      Print HTTP headers to sys.stdout.
  |
  |  set_debug_redirects(self, handle)
  |      Log information about HTTP redirects (including refreshes).
  |
  |      Logging is performed using module logging. The logger name is
  |      "mechanize.http_redirects". To actually print some debug output,
  |      eg:
  |
  |      import sys, logging
  |      logger = logging.getLogger("mechanize.http_redirects")
  |      logger.addHandler(logging.StreamHandler(sys.stdout))
  |      logger.setLevel(logging.INFO)
  |
  |      Other logger names relevant to this module:
  |
  |      "mechanize.http_responses"
  |      "mechanize.cookies"
  |
  |      To turn on everything:
  |
  |      import sys, logging
  |      logger = logging.getLogger("mechanize")
  |      logger.addHandler(logging.StreamHandler(sys.stdout))
  |      logger.setLevel(logging.INFO)
  |
  |  set_debug_responses(self, handle)
  |      Log HTTP response bodies.
  |
  |      See docstring for .set_debug_redirects() for details of logging.
  |
  |      Response objects may be .seek()able if this is set (currently returned
  |      responses are, raised HTTPError exception responses are not).
  |
  |  set_handle_equiv(self, handle, head_parser_class=None)
  |      Set whether to treat HTML http-equiv headers like HTTP headers.
  |
  |      Response objects may be .seek()able if this is set (currently returned
  |      responses are, raised HTTPError exception responses are not).
  |
  |  set_handle_gzip(self, handle)
  |      Handle gzip transfer encoding.
  |
  |  set_handle_redirect(self, handle)
  |      Set whether to handle HTTP 30x redirections.
  |
  |  set_handle_refresh(self, handle, max_time=None, honor_time=True)
  |      Set whether to handle HTTP Refresh headers.
  |
  |  set_handle_robots(self, handle)
  |      Set whether to observe rules from robots.txt.
  |
  |  set_handled_schemes(self, schemes)
  |      Set sequence of URL scheme (protocol) strings.
  |
  |      For example: ua.set_handled_schemes(["http", "ftp"])
  |
  |      If this fails (with ValueError) because you've passed an unknown
  |      scheme, the set of handled schemes will not be changed.
  |
  |  set_password_manager(self, password_manager)
  |      Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.
  |
  |  set_proxies(self, proxies=None, proxy_bypass=None)
  |      Configure proxy settings.
  |
  |      proxies: dictionary mapping URL scheme to proxy specification. None
  |        means use the default system-specific settings.
  |      proxy_bypass: function taking hostname, returning whether proxy should
  |        be used. None means use the default system-specific settings.
  |
  |      The default is to try to obtain proxy settings from the system (see the
  |      documentation for urllib.urlopen for information about the
  |      system-specific methods used -- note that's urllib, not urllib2).
  |
  |      To avoid all use of proxies, pass an empty proxies dict.
  |
  |      >>> ua = UserAgentBase()
  |      >>> def proxy_bypass(hostname):
  |      ...     return hostname == "noproxy.com"
  |      >>> ua.set_proxies(
  |      ...     {"http": "joe:password@myproxy.example.com:3128",
  |      ...      "ftp": "proxy.example.com"},
  |      ...     proxy_bypass)
  |
  |  set_proxy_password_manager(self, password_manager)
  |      Set a mechanize.HTTPProxyPasswordMgr, or None.
  |
  |  ----------------------------------------------------------------------
  |  Data and other attributes inherited from mechanize._useragent.UserAgentBase:
  |
  |  default_others = ['_unknown', '_http_error', '_http_default_error']
  |
  |  default_schemes = ['http', 'ftp', 'file', 'https']
  |
  |  ----------------------------------------------------------------------
  |  Methods inherited from mechanize._opener.OpenerDirector:
  |
  |  add_handler(self, handler)
  |
  |  error(self, proto, *args)
  |
  |  retrieve(self, fullurl, filename=None, reporthook=None, data=None, timeout=<object object>, open=<built-in function open>)
  |      Returns (filename, headers).
  |
  |      For remote objects, the default filename will refer to a temporary
  |      file. Temporary files are removed when the OpenerDirector.close()
  |      method is called.
  |
  |      For file: URLs, at present the returned filename is None. This may
  |      change in future.
  |
  |      If the actual number of bytes read is less than indicated by the
  |      Content-Length header, raises ContentTooShortError (a URLError
  |      subclass). The exception's .result attribute contains the (filename,
  |      headers) that would have been returned.
  |
  |  ----------------------------------------------------------------------
  |  Data and other attributes inherited from mechanize._opener.OpenerDirector:
  |
  |  BLOCK_SIZE = 8192
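
As a short illustration of the links() and find_link() entries in the help output, links can be filtered with a predicate function. This is a sketch under assumptions: `points_to_host` and `list_links_to` are hypothetical helpers, and the predicate relies on mechanize Link objects exposing an absolute_url attribute (the link URL resolved against the page's base URL).

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def points_to_host(host):
    """Build a links()/find_link() predicate matching links to the given host."""
    def predicate(link):
        # compare only the host part, so paths that merely mention it do not match
        return urlsplit(link.absolute_url).hostname == host

    return predicate


def list_links_to(url, host):
    """Open url and return the absolute URLs of all links that point at host."""
    import mechanize  # imported here so points_to_host can be used without it

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    matching = br.links(predicate=points_to_host(host))
    return [link.absolute_url for link in matching]
```

The same predicate can be passed to br.find_link() or br.follow_link() to fetch the first matching link instead of listing them all.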
