Python mechanize study notes
1: Basic usage
- import mechanize
- # response = mechanize.urlopen("http://www.hao123.com/")
- request = mechanize.Request("http://www.hao123.com/")
- response = mechanize.urlopen(request)
- print(response.geturl())
- print(response.info())
- # print(response.read())
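A real script usually needs to cope with failed requests as well. The following is a minimal sketch, assuming Python 3 and a recent mechanize release; `summarize` and `fetch` are hypothetical helper names, and the 10 second timeout is an arbitrary choice.

```python
def summarize(response):
    """Collect the final URL, content type and body size of a response."""
    body = response.read()
    return {
        "url": response.geturl(),  # final URL, after any redirects
        "content_type": response.info().get("Content-Type"),
        "size": len(body),
    }


def fetch(url, timeout=10.0):
    """Open url and return its summary, or None if the request fails."""
    import mechanize  # imported lazily: summarize() does not need it

    try:
        response = mechanize.urlopen(mechanize.Request(url), timeout=timeout)
    except mechanize.HTTPError as err:
        # The server answered, but with an error status such as 404.
        print("HTTP error %s for %s" % (err.code, url))
        return None
    except mechanize.URLError as err:
        # No usable answer at all, e.g. DNS failure or refused connection.
        print("could not reach %s: %s" % (url, err.reason))
        return None
    try:
        return summarize(response)
    finally:
        response.close()
```

`mechanize.HTTPError` is caught before `mechanize.URLError` so that an error status from the server is reported separately from a failure to connect.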
2: mechanize.urlretrieve
- >>> import mechanize
- >>> help(mechanize.urlretrieve)
- Help on function urlretrieve in module mechanize._opener:
- urlretrieve(url, filename=None, reporthook=None, data=None, timeout=<object object>)
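- The filename parameter is the local path to save to (if it is omitted, the data is saved to a temporary file).
- The reporthook parameter is a callback function. It is called once when the connection to the server is established and again after each block of data has been transferred, so it can be used to display download progress.
- The data parameter is the data to POST to the server. The function returns a (filename, headers) tuple: filename is the local path the data was saved to, and headers holds the server's response headers.
- The timeout parameter sets the timeout for the request. The `<object object>` default in the signature is a placeholder meaning that no explicit timeout was given.
The callback has the form reporthook(block_read, block_size, total_size): block_read is the number of blocks read so far, block_size is the size of each block in bytes, and total_size is the total size of the resource being downloaded, in bytes. These three values are enough for the callback to display progress.
A simple example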
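- import mechanize
- def cbk(block_read, block_size, total_size):
-     # Called on connect and again after each block has been read.
-     print(block_read, block_size, total_size)
- url = 'http://www.hao123.com/'
- local = 'd:/hao.html'
- mechanize.urlretrieve(url, local, cbk)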
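The three reporthook arguments can be turned into a percentage. This is a minimal sketch, assuming Python 3 and assuming that total_size is not positive (for example -1) when the server does not report a size; `percent_done`, `show_progress` and `download` are illustrative names, not part of mechanize.

```python
import sys


def percent_done(block_read, block_size, total_size):
    """Return download progress as a float in [0, 100], or None if unknown."""
    if total_size <= 0:
        # Assumption: a non-positive total means the size was not reported.
        return None
    # The last block is usually partial, so cap the estimate at 100%.
    return min(100.0, block_read * block_size * 100.0 / total_size)


def show_progress(block_read, block_size, total_size):
    """reporthook-compatible callback that prints a single updating line."""
    pct = percent_done(block_read, block_size, total_size)
    if pct is None:
        sys.stdout.write("\r%d bytes read" % (block_read * block_size))
    else:
        sys.stdout.write("\r%5.1f%%" % pct)
    sys.stdout.flush()


def download(url, local_path):
    """Download url to local_path, reporting progress as it goes."""
    import mechanize  # imported lazily: the helpers above do not need it
    return mechanize.urlretrieve(url, local_path, show_progress)
```

Calling `download('http://www.hao123.com/', 'hao.html')` would then print a running percentage instead of the raw numbers.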
3: Logging in through a form
- import os
- import mechanize
- br = mechanize.Browser()
- br.set_handle_robots(False)  # do not observe robots.txt rules
- br.open("http://www.zhaopin.com/")
- br.select_form(nr=0)  # select the first form on the page
- br['loginname'] = '**'  # register an account and use your own username and password
- br['password'] = '**'
- r = br.submit()
- path = os.path.join(os.path.dirname(__file__), 'login.html')
- print(path)
- with open(path, "wb") as h:  # save the response body for inspection
-     h.write(r.read())
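Selecting the form with nr=0 stops working as soon as the page puts another form ahead of the login form. select_form() also accepts a predicate (see the help text in section 4), so the form can be chosen by the controls it contains. This is a minimal sketch, assuming the login form has controls named loginname and password as in the code above; `has_controls` and `login` are hypothetical helper names.

```python
def has_controls(*names):
    """Build a predicate that accepts a form containing all the named controls."""
    wanted = set(names)

    def predicate(form):
        # HTMLForm exposes its controls as a list; each control has a .name.
        present = {control.name for control in form.controls}
        return wanted <= present

    return predicate


def login(url, username, password):
    """Open url, fill in the login form and return the response to submit()."""
    import mechanize  # imported lazily: has_controls() does not need it

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    # Raises mechanize.FormNotFoundError if no form matches the predicate.
    br.select_form(predicate=has_controls("loginname", "password"))
    br["loginname"] = username
    br["password"] = password
    return br.submit()
```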
4: Browser
Read through the help() output below and you will have pretty much mastered the Browser class.
- Help on class Browser in module mechanize._mechanize:
- class Browser(mechanize._useragent.UserAgentBase)
- | Browser-like class with support for history, forms and links.
- |
- | BrowserStateError is raised whenever the browser is in the wrong state to
- | complete the requested operation - e.g., when .back() is called when the
- | browser history is empty, or when .follow_link() is called when the current
- | response does not contain HTML data.
- |
- | Public attributes:
- |
- | request: current request (mechanize.Request)
- | form: currently selected form (see .select_form())
- |
- | Method resolution order:
- | Browser
- | mechanize._useragent.UserAgentBase
- | mechanize._opener.OpenerDirector
- | mechanize._urllib2_fork.OpenerDirector
- |
- | Methods defined here:
- |
- | __getattr__(self, name)
- |
- | __init__(self, factory=None, history=None, request_class=None)
- | Only named arguments should be passed to this constructor.
- |
- | factory: object implementing the mechanize.Factory interface.
- | history: object implementing the mechanize.History interface. Note
- | this interface is still experimental and may change in future.
- | request_class: Request class to use. Defaults to mechanize.Request
- |
- | The Factory and History objects passed in are 'owned' by the Browser,
- | so they should not be shared across Browsers. In particular,
- | factory.set_response() should not be called except by the owning
- | Browser itself.
- |
- | Note that the supplied factory's request_class is overridden by this
- | constructor, to ensure only one Request class is used.
- |
- | __str__(self)
- |
- | back(self, n=1)
- | Go back n steps in history, and return response object.
- |
- | n: go back this number of steps (default 1 step)
- |
- | clear_history(self)
- |
- | click(self, *args, **kwds)
- | See mechanize.HTMLForm.click for documentation.
- |
- | click_link(self, link=None, **kwds)
- | Find a link and return a Request object for it.
- |
- | Arguments are as for .find_link(), except that a link may be supplied
- | as the first argument.
- |
- | close(self)
- |
- | encoding(self)
- |
- | find_link(self, **kwds)
- | Find a link in current page.
- |
- | Links are returned as mechanize.Link objects.
- |
- | # Return third link that .search()-matches the regexp "python"
- | # (by ".search()-matches", I mean that the regular expression method
- | # .search() is used, rather than .match()).
- | find_link(text_regex=re.compile("python"), nr=2)
- |
- | # Return first http link in the current page that points to somewhere
- | # on python.org whose link text (after tags have been removed) is
- | # exactly "monty python".
- | find_link(text="monty python",
- | url_regex=re.compile("http.*python.org"))
- |
- | # Return first link with exactly three HTML attributes.
- | find_link(predicate=lambda link: len(link.attrs) == 3)
- |
- | Links include anchors (<a>), image maps (<area>), and frames (<frame>,
- | <iframe>).
- |
- | All arguments must be passed by keyword, not position. Zero or more
- | arguments may be supplied. In order to find a link, all arguments
- | supplied must match.
- |
- | If a matching link is not found, mechanize.LinkNotFoundError is raised.
- |
- | text: link text between link tags: e.g. <a href="blah">this bit</a> (as
- | returned by pullparser.get_compressed_text(), ie. without tags but
- | with opening tags "textified" as per the pullparser docs) must compare
- | equal to this argument, if supplied
- | text_regex: link text between tag (as defined above) must match the
- | regular expression object or regular expression string passed as this
- | argument, if supplied
- | name, name_regex: as for text and text_regex, but matched against the
- | name HTML attribute of the link tag
- | url, url_regex: as for text and text_regex, but matched against the
- | URL of the link tag (note this matches against Link.url, which is a
- | relative or absolute URL according to how it was written in the HTML)
- | tag: element name of opening tag, e.g. "a"
- | predicate: a function taking a Link object as its single argument,
- | returning a boolean result, indicating whether the links
- | nr: matches the nth link that matches all other criteria (default 0)
- |
- | follow_link(self, link=None, **kwds)
- | Find a link and .open() it.
- |
- | Arguments are as for .click_link().
- |
- | Return value is same as for Browser.open().
- |
- | forms(self)
- | Return iterable over forms.
- |
- | The returned form objects implement the mechanize.HTMLForm interface.
- |
- | geturl(self)
- | Get URL of current document.
- |
- | global_form(self)
- | Return the global form object, or None if the factory implementation
- | did not supply one.
- |
- | The "global" form object contains all controls that are not descendants
- | of any FORM element.
- |
- | The returned form object implements the mechanize.HTMLForm interface.
- |
- | This is a separate method since the global form is not regarded as part
- | of the sequence of forms in the document -- mostly for
- | backwards-compatibility.
- |
- | links(self, **kwds)
- | Return iterable over links (mechanize.Link objects).
- |
- | open(self, url, data=None, timeout=<object object>)
- |
- | open_local_file(self, filename)
- |
- | open_novisit(self, url, data=None, timeout=<object object>)
- | Open a URL without visiting it.
- |
- | Browser state (including request, response, history, forms and links)
- | is left unchanged by calling this function.
- |
- | The interface is the same as for .open().
- |
- | This is useful for things like fetching images.
- |
- | See also .retrieve().
- |
- | reload(self)
- | Reload current document, and return response object.
- |
- | response(self)
- | Return a copy of the current response.
- |
- | The returned object has the same interface as the object returned by
- | .open() (or mechanize.urlopen()).
- |
- | select_form(self, name=None, predicate=None, nr=None)
- | Select an HTML form for input.
- |
- | This is a bit like giving a form the "input focus" in a browser.
- |
- | If a form is selected, the Browser object supports the HTMLForm
- | interface, so you can call methods like .set_value(), .set(), and
- | .click().
- |
- | Another way to select a form is to assign to the .form attribute. The
- | form assigned should be one of the objects returned by the .forms()
- | method.
- |
- | At least one of the name, predicate and nr arguments must be supplied.
- | If no matching form is found, mechanize.FormNotFoundError is raised.
- |
- | If name is specified, then the form must have the indicated name.
- |
- | If predicate is specified, then the form must match that function. The
- | predicate function is passed the HTMLForm as its single argument, and
- | should return a boolean value indicating whether the form matched.
- |
- | nr, if supplied, is the sequence number of the form (where 0 is the
- | first). Note that control 0 is the first form matching all the other
- | arguments (if supplied); it is not necessarily the first control in the
- | form. The "global form" (consisting of all form controls not contained
- | in any FORM element) is considered not to be part of this sequence and
- | to have no name, so will not be matched unless both name and nr are
- | None.
- |
- | set_cookie(self, cookie_string)
- | Request to set a cookie.
- |
- | Note that it is NOT necessary to call this method under ordinary
- | circumstances: cookie handling is normally entirely automatic. The
- | intended use case is rather to simulate the setting of a cookie by
- | client script in a web page (e.g. JavaScript). In that case, use of
- | this method is necessary because mechanize currently does not support
- | JavaScript, VBScript, etc.
- |
- | The cookie is added in the same way as if it had arrived with the
- | current response, as a result of the current request. This means that,
- | for example, if it is not appropriate to set the cookie based on the
- | current request, no cookie will be set.
- |
- | The cookie will be returned automatically with subsequent responses
- | made by the Browser instance whenever that's appropriate.
- |
- | cookie_string should be a valid value of the Set-Cookie header.
- |
- | For example:
- |
- | browser.set_cookie(
- | "sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")
- |
- | Currently, this method does not allow for adding RFC 2986 cookies.
- | This limitation will be lifted if anybody requests it.
- |
- | set_handle_referer(self, handle)
- | Set whether to add Referer header to each request.
- |
- | set_response(self, response)
- | Replace current response with (a copy of) response.
- |
- | response may be None.
- |
- | This is intended mostly for HTML-preprocessing.
- |
- | submit(self, *args, **kwds)
- | Submit current form.
- |
- | Arguments are as for mechanize.HTMLForm.click().
- |
- | Return value is same as for Browser.open().
- |
- | title(self)
- | Return title, or None if there is no title element in the document.
- |
- | Treatment of any tag children of <title> attempts to follow Firefox and IE
- | (currently, tags are preserved).
- |
- | viewing_html(self)
- | Return whether the current response contains HTML data.
- |
- | visit_response(self, response, request=None)
- | Visit the response, as if it had been .open()ed.
- |
- | Unlike .set_response(), this updates history rather than replacing the
- | current response.
- |
- | ----------------------------------------------------------------------
- | Data and other attributes defined here:
- |
- | default_features = ['_redirect', '_cookies', '_refresh', '_equiv', '_b...
- |
- | handler_classes = {'_basicauth': <class mechanize._urllib2_fork.HTTPBa...
- |
- | ----------------------------------------------------------------------
- | Methods inherited from mechanize._useragent.UserAgentBase:
- |
- | add_client_certificate(self, url, key_file, cert_file)
- | Add an SSL client certificate, for HTTPS client auth.
- |
- | key_file and cert_file must be filenames of the key and certificate
- | files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS
- | 12) file to PEM format:
- |
- | openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem
- | openssl pkcs12 -nocerts -in cert.p12 -out key.pem
- |
- |
- | Note that client certificate password input is very inflexible ATM. At
- | the moment this seems to be console only, which is presumably the
- | default behaviour of libopenssl. In future mechanize may support
- | third-party libraries that (I assume) allow more options here.
- |
- | add_password(self, url, user, password, realm=None)
- |
- | add_proxy_password(self, user, password, hostport=None, realm=None)
- |
- | set_client_cert_manager(self, cert_manager)
- | Set a mechanize.HTTPClientCertMgr, or None.
- |
- | set_cookiejar(self, cookiejar)
- | Set a mechanize.CookieJar, or None.
- |
- | set_debug_http(self, handle)
- | Print HTTP headers to sys.stdout.
- |
- | set_debug_redirects(self, handle)
- | Log information about HTTP redirects (including refreshes).
- |
- | Logging is performed using module logging. The logger name is
- | "mechanize.http_redirects". To actually print some debug output,
- | eg:
- |
- | import sys, logging
- | logger = logging.getLogger("mechanize.http_redirects")
- | logger.addHandler(logging.StreamHandler(sys.stdout))
- | logger.setLevel(logging.INFO)
- |
- | Other logger names relevant to this module:
- |
- | "mechanize.http_responses"
- | "mechanize.cookies"
- |
- | To turn on everything:
- |
- | import sys, logging
- | logger = logging.getLogger("mechanize")
- | logger.addHandler(logging.StreamHandler(sys.stdout))
- | logger.setLevel(logging.INFO)
- |
- | set_debug_responses(self, handle)
- | Log HTTP response bodies.
- |
- | See docstring for .set_debug_redirects() for details of logging.
- |
- | Response objects may be .seek()able if this is set (currently returned
- | responses are, raised HTTPError exception responses are not).
- |
- | set_handle_equiv(self, handle, head_parser_class=None)
- | Set whether to treat HTML http-equiv headers like HTTP headers.
- |
- | Response objects may be .seek()able if this is set (currently returned
- | responses are, raised HTTPError exception responses are not).
- |
- | set_handle_gzip(self, handle)
- | Handle gzip transfer encoding.
- |
- | set_handle_redirect(self, handle)
- | Set whether to handle HTTP 30x redirections.
- |
- | set_handle_refresh(self, handle, max_time=None, honor_time=True)
- | Set whether to handle HTTP Refresh headers.
- |
- | set_handle_robots(self, handle)
- | Set whether to observe rules from robots.txt.
- |
- | set_handled_schemes(self, schemes)
- | Set sequence of URL scheme (protocol) strings.
- |
- | For example: ua.set_handled_schemes(["http", "ftp"])
- |
- | If this fails (with ValueError) because you've passed an unknown
- | scheme, the set of handled schemes will not be changed.
- |
- | set_password_manager(self, password_manager)
- | Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.
- |
- | set_proxies(self, proxies=None, proxy_bypass=None)
- | Configure proxy settings.
- |
- | proxies: dictionary mapping URL scheme to proxy specification. None
- | means use the default system-specific settings.
- | proxy_bypass: function taking hostname, returning whether proxy should
- | be used. None means use the default system-specific settings.
- |
- | The default is to try to obtain proxy settings from the system (see the
- | documentation for urllib.urlopen for information about the
- | system-specific methods used -- note that's urllib, not urllib2).
- |
- | To avoid all use of proxies, pass an empty proxies dict.
- |
- | >>> ua = UserAgentBase()
- | >>> def proxy_bypass(hostname):
- | ... return hostname == "noproxy.com"
- | >>> ua.set_proxies(
- | ... {"http": "joe:password@myproxy.example.com:3128",
- | ... "ftp": "proxy.example.com"},
- | ... proxy_bypass)
- |
- | set_proxy_password_manager(self, password_manager)
- | Set a mechanize.HTTPProxyPasswordMgr, or None.
- |
- | ----------------------------------------------------------------------
- | Data and other attributes inherited from mechanize._useragent.UserAgentBase:
- |
- | default_others = ['_unknown', '_http_error', '_http_default_error']
- |
- | default_schemes = ['http', 'ftp', 'file', 'https']
- |
- | ----------------------------------------------------------------------
- | Methods inherited from mechanize._opener.OpenerDirector:
- |
- | add_handler(self, handler)
- |
- | error(self, proto, *args)
- |
- | retrieve(self, fullurl, filename=None, reporthook=None, data=None, timeout=<object object>, open=<built-in function open>)
- | Returns (filename, headers).
- |
- | For remote objects, the default filename will refer to a temporary
- | file. Temporary files are removed when the OpenerDirector.close()
- | method is called.
- |
- | For file: URLs, at present the returned filename is None. This may
- | change in future.
- |
- | If the actual number of bytes read is less than indicated by the
- | Content-Length header, raises ContentTooShortError (a URLError
- | subclass). The exception's .result attribute contains the (filename,
- | headers) that would have been returned.
- |
- | ----------------------------------------------------------------------
- | Data and other attributes inherited from mechanize._opener.OpenerDirector:
- |
- | BLOCK_SIZE = 8192
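The find_link() and follow_link() methods documented above accept a predicate, which is handy when a regular expression on the URL is not precise enough. This is a minimal sketch, assuming Python 3; `on_host` and `follow_first_link_on_host` are hypothetical helper names, and the predicate relies on the absolute_url attribute of mechanize.Link objects.

```python
from urllib.parse import urlparse


def on_host(host):
    """Build a find_link() predicate matching links that point at host."""
    def predicate(link):
        # Link.absolute_url is the link's URL resolved against the page URL.
        return urlparse(link.absolute_url).hostname == host
    return predicate


def follow_first_link_on_host(page_url, host):
    """Open page_url and follow the first link to host; None if there is none."""
    import mechanize  # imported lazily: on_host() does not need it

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(page_url)
    if not br.viewing_html():
        return None  # follow_link() would raise BrowserStateError here
    try:
        # nr=0 picks the first link that satisfies the predicate.
        return br.follow_link(predicate=on_host(host), nr=0)
    except mechanize.LinkNotFoundError:
        return None
```

Comparing parsed hostnames avoids the false matches that a pattern such as "http.*python.org" can produce when the host name merely appears somewhere in the path or query string.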