Python module: HTMLParser (using it largely comes down to being fluent with class construction)

# -*- coding: utf-8 -*-

# Python 2.7

#xiaodeng

# Python module: HTMLParser (using it largely comes down to being fluent with class construction)

import HTMLParser

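# tag is the HTML tag name; attrs is a list of (attribute name, value) tuples.

# HTMLParser automatically lowercases the tag name and the attribute names (attribute values keep their case).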
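
To make the shape of `tag` and `attrs` concrete, here is a minimal sketch. The subclass name `TagLogger`, its `seen` list, and the sample markup are made up for illustration; the `try`/`except` import is an assumption added so the snippet also runs on Python 3, where the same class lives in `html.parser`.

```python
try:
    from HTMLParser import HTMLParser      # Python 2.7, as used in this post
except ImportError:
    from html.parser import HTMLParser     # Python 3 location of the same class


class TagLogger(HTMLParser):
    """Hypothetical subclass that records every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag is a lowercased string; attrs is a list of (name, value) tuples
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<A HREF="Index.HTML" Class="Nav">home</A>')
logger.close()
print(logger.seen)
# [('a', [('href', 'Index.HTML'), ('class', 'Nav')])]
```

The tag name and the attribute names come back lowercased, while attribute values such as `Index.HTML` keep their original case.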

'''

>>> help(HTMLParser)

Help on module HTMLParser:

CLASSES

    exceptions.Exception(exceptions.BaseException)

        HTMLParseError

    markupbase.ParserBase

        HTMLParser

    class HTMLParser(markupbase.ParserBase)

     |  Find tags and other markup and call handler functions.

     |

     |  Usage:

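     |      p = HTMLParser()  # create the parser

     |      p.feed(data)  # feed() may be called repeatedly, so the HTML string need not be passed in all at once; it can be supplied piece by piece

                        # feed() hands text to the parser; the text is processed as far as it consists of complete elements, and incomplete data is buffered until more is fed or close() is called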

     |      ...

     |      p.close()

     |

     |  Methods defined here:

     |

     |  __init__(self)

     |      Initialize and reset this instance.

     |

     |  check_for_whole_start_tag(self, i)

     |      # Internal -- check to see if we have a complete starttag; return end

     |      # or -1 if incomplete.

     |

     |  clear_cdata_mode(self)

     |

     |  close(self)

     |      Handle any buffered data.

     |

     |  error(self, message)

     |

     |  feed(self, data)            # supply data to the parser

     |      Feed data to the parser.

     |

     |      Call this as often as you want, with as little or as much text

     |      as you want (may include '\n').

     |

     |  get_starttag_text(self)

     |      Return full source of start tag: '<...>'.

     |

     |  goahead(self, end)

     |      # Internal -- handle data as far as reasonable.  May leave state

     |      # and data to be processed by a subsequent call.  If 'end' is

     |      # true, force handling all data as if followed by EOF marker.

     |

     |  handle_charref(self, name)              # handles numeric character references starting with &#, i.e. characters written as their code point (e.g. &#62;)

     |      # Overridable -- handle character reference

     |

     |  handle_comment(self, data)              # handles comments: receives the text inside <!--comment-->

     |      # Overridable -- handle comment

     |

     |  handle_data(self, data)                 # handles text data, i.e. the text between <xx> and </xx>

     |      # Overridable -- handle data

     |

     |  handle_decl(self, decl)                 # handles declarations starting with <!, e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

     |                                          # (the document type declaration)

     |      # Overridable -- handle declaration

     |

     |  handle_endtag(self, tag)                # handles end tags such as </xx>

     |      # Overridable -- handle end tag

     |

     |  handle_entityref(self, name)            # handles named entity references, which start with & (e.g. &gt;)

     |      # Overridable -- handle entity reference

     |

     |  handle_pi(self, data)                   # handles processing instructions of the form <?instruction>

     |      # Overridable -- handle processing instruction

     |

     |  handle_startendtag(self, tag, attrs)    # handles combined start+end tags, i.e. self-closing tags such as <xx/>

     |      # Overridable -- finish processing of start+end tag: <tag.../>

     |

     |  handle_starttag(self, tag, attrs)       # handles start tags such as <xx>

     |      # Overridable -- handle start tag

     |

     |  parse_bogus_comment(self, i, report=1)

     |      # Internal -- parse bogus comment, return length or -1 if not terminated

     |      # see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state

     |

     |  parse_endtag(self, i)

     |      # Internal -- parse endtag, return end or -1 if incomplete

     |

     |  parse_html_declaration(self, i)

     |      # Internal -- parse html declarations, return length or -1 if not terminated

     |      # See w3.org/TR/html5/tokenization.html#markup-declaration-open-state

     |      # See also parse_declaration in _markupbase

     |

     |  parse_pi(self, i)

     |      # Internal -- parse processing instr, return end or -1 if not terminated

     |

     |  parse_starttag(self, i)

     |      # Internal -- handle starttag, return end or -1 if not terminated

     |

     |  reset(self)

     |      Reset this instance.  Loses all unprocessed data.

     |

     |  set_cdata_mode(self, elem)

     |

     |  unescape(self, s)

     |

     |  unknown_decl(self, data)

     |

     |  ----------------------------------------------------------------------

     |  Data and other attributes defined here:

     |

     |  CDATA_CONTENT_ELEMENTS = ('script', 'style')

     |

     |  entitydefs = None

     |

     |  ----------------------------------------------------------------------

     |  Methods inherited from markupbase.ParserBase:

     |

     |  getpos(self)

     |      Return current line number and offset.

     |

     |  parse_comment(self, i, report=1)

     |      # Internal -- parse comment, return length or -1 if not terminated

     |

     |  parse_declaration(self, i)

     |      # Internal -- parse declaration (for use by subclasses).

     |

     |  parse_marked_section(self, i, report=1)

     |      # Internal -- parse a marked section

     |      # Override this to handle MS-word extension syntax <![if word]>content<![endif]>

     |

     |  updatepos(self, i, j)

     |      # Internal -- update line number and offset.  This should be

     |      # called for each piece of data exactly once, in order -- in other

     |      # words the concatenation of all the input strings to this

     |      # function should be exactly the entire input.

>>>

'''
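
The usage pattern in the help text above (create a parser, call `feed()` as many times as needed, then `close()`) and the overridable `handle_*` methods can be sketched roughly as follows. `EventRecorder`, its `events` list, and the sample chunks are hypothetical names chosen for illustration, and the fallback import is an assumption so the sketch also runs on Python 3 (`html.parser`). The chunks are deliberately cut in the middle of a start tag and in the middle of a comment to show the buffering behaviour.

```python
try:
    from HTMLParser import HTMLParser      # Python 2.7, as used in this post
except ImportError:
    from html.parser import HTMLParser     # Python 3 location of the same class


class EventRecorder(HTMLParser):
    """Hypothetical subclass that logs the handler calls it receives."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_comment(self, data):
        self.events.append(('comment', data))


recorder = EventRecorder()

# First chunk ends inside a start tag: nothing is complete, so it is buffered.
recorder.feed('<p cla')
events_after_first_chunk = list(recorder.events)

# Second chunk completes the <p> element but ends inside a comment.
recorder.feed('ss="x">hello</p><!-- no')

# Third chunk completes the comment; close() flushes anything still buffered.
recorder.feed('te -->')
recorder.close()

print(events_after_first_chunk)
print(recorder.events)
```

No handler fires after the first `feed()` call, and the start tag and the comment are reported only once their closing `>` and `-->` have arrived. Text passed to `handle_data` may itself arrive in several pieces if a chunk boundary falls inside it, so code that needs the full text usually concatenates the pieces.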
