http://html5lib.readthedocs.org/en/latest/

Overview

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
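As a minimal sketch of what that default tree looks like, the snippet below parses the same fragment and looks up the paragraph. It assumes html5lib's default behaviour of placing HTML elements in the XHTML namespace; the XHTML constant and the variable names are illustrative, not part of the library.

```python
import html5lib

# Assumption: by default html5lib namespaces HTML elements with the XHTML URI.
XHTML = "{http://www.w3.org/1999/xhtml}"

# The parser supplies the missing <html>, <head> and <body> elements.
document = html5lib.parse("<p>Hello World!")

# Standard ElementTree lookups work, as long as the namespace is included.
paragraph = document.find(".//%sp" % XHTML)
text = paragraph.text
```

If the namespace prefix gets in the way, passing namespaceHTMLElements=False to html5lib.parse should yield plain tag names instead.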

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
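The same keyword selects the minidom format, which needs no third-party package. The following is a small sketch, assuming the treebuilder name "dom" (the name used with getTreeBuilder elsewhere in this document); the variable names are illustrative only.

```python
import html5lib

# "dom" selects the xml.dom.minidom tree format.
dom_document = html5lib.parse("<p>Hello World!", treebuilder="dom")

# The result is queried with the usual DOM API.
paragraph = dom_document.getElementsByTagName("p")[0]
text = paragraph.firstChild.data
```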

When using html5lib with urllib2 (Python 2), the charset from the HTTP headers should be passed into html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib
with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), the charset from the HTTP headers should be passed into html5lib as follows:

from urllib.request import urlopen
import html5lib
with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
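To see what get_content_charset() hands to the parser without making a network request, the sketch below builds the headers by hand. It relies on the Python 3 response headers object being a subclass of email.message.Message; the header values are made up for illustration.

```python
from email.message import Message

# Hypothetical response headers, standing in for f.info().
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# The charset parameter is extracted and lower-cased.
charset = headers.get_content_charset()

# With no charset parameter the method returns None,
# leaving html5lib to detect the encoding by itself.
bare_headers = Message()
bare_headers["Content-Type"] = "text/html"
missing = bare_headers.get_content_charset()
```

Later html5lib releases appear to take this value under the keyword transport_encoding rather than encoding, so it is worth checking the documentation for the installed version.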

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
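The difference between the default and strict modes can be sketched as follows. The example assumes that the exception class is html5lib.html5parser.ParseError and that a non-strict parser collects its errors in an errors attribute; the input fragment is chosen because it lacks a doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

fragment = "<p>Hello World!"  # no doctype, so parsing reports an error

# Default mode: errors are recorded and parsing continues.
lenient = html5lib.HTMLParser()
lenient.parse(fragment)
error_count = len(lenient.errors)

# Strict mode: the first parse error is raised as an exception.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(fragment)
    raised = False
except ParseError:
    raised = True
```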

When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

More documentation is available at http://html5lib.readthedocs.org/.

Installation

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

$ pip install html5lib

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
  • genshi has a treewalker (but not a builder);
  • charade can be used as a fallback when the character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
  • ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.
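A quick way to see which of these optional packages are present in the current environment is sketched below. It uses only the standard library and does not import the packages themselves; the OPTIONAL list and the available dictionary are illustrative names.

```python
import importlib.util

# Top-level module names of the optional dependencies listed above.
OPTIONAL = ["datrie", "lxml", "genshi", "charade", "chardet"]

# find_spec returns None when a top-level module cannot be located.
available = {
    name: importlib.util.find_spec(name) is not None
    for name in OPTIONAL
}

for name, present in sorted(available.items()):
    print("%-8s %s" % (name, "installed" if present else "missing"))
```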

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.

Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.
