http://html5lib.readthedocs.org/en/latest/

Overview

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib

with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib

document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
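
As a minimal sketch of working with the default result, the snippet below looks up the paragraph in the parsed tree. It assumes the default settings, under which html5lib places HTML elements in the XHTML namespace, so the lookup has to be namespace-aware. The names XHTML_NS, namespaces and paragraph_text are illustrative only.

```python
import html5lib

# Namespace that html5lib assigns to HTML elements by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# With the default treebuilder, parse() returns the root <html> element.
document = html5lib.parse("<p>Hello World!")

# Map a short prefix to the namespace so the path expression stays readable.
namespaces = {"h": XHTML_NS}
paragraph = document.find(".//h:p", namespaces)
paragraph_text = paragraph.text

print(document.tag)
print(paragraph_text)
```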

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib

with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When used with urllib2 (Python 2), the charset from HTTP should be passed into html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When used with urllib.request (Python 3), the charset from HTTP should be passed into html5lib as follows:

from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
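
In the Python 3 example, f.info().get_content_charset() reads the charset parameter from the response's Content-Type header. The sketch below reproduces that lookup with the standard library alone, so it can be tried without a network request. The helper name charset_from_content_type is hypothetical and is not part of html5lib or urllib.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None.

    Hypothetical helper: it builds a bare email Message, the base class of
    the object returned by f.info() on Python 3, and asks it for the charset.
    """
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return message.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
print(with_charset, without_charset)
```

When the header carries no charset, the result is None and the parser has to determine the encoding itself.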

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib

with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
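
The difference between the default and strict modes can be sketched as follows. The example assumes that a fragment without a doctype counts as a parse error, and that the exception class is html5lib.html5parser.ParseError. The variable names are illustrative.

```python
import html5lib
from html5lib.html5parser import ParseError

# No doctype, so the parser reports at least one parse error.
SNIPPET = "<p>Hello World!"

# Default mode: errors are collected on the parser and parsing continues.
lenient_parser = html5lib.HTMLParser()
lenient_parser.parse(SNIPPET)
error_count = len(lenient_parser.errors)

# Strict mode: the first parse error is raised as an exception.
strict_parser = html5lib.HTMLParser(strict=True)
try:
    strict_parser.parse(SNIPPET)
    raised = False
except ParseError as exc:
    raised = True
    print("strict parse failed:", exc)

print("errors recorded in default mode:", error_count)
```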

When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
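
A short sketch of reading the resulting minidom document, using only standard xml.dom.minidom calls. The variable names paragraph and paragraph_text are illustrative.

```python
import html5lib

# Build a parser that produces an xml.dom.minidom document.
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# The usual DOM accessors apply to the result.
paragraph = minidom_document.getElementsByTagName("p")[0]
paragraph_text = paragraph.firstChild.data

print(minidom_document.documentElement.tagName)
print(paragraph_text)
```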

More documentation is available at http://html5lib.readthedocs.org/.

Installation

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

$ pip install html5lib

Optional Dependencies

The following third-party libraries may be used for additional functionality:

- datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
- lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- genshi has a treewalker (but not builder);
- charade can be used as a fallback when character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
- ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.
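
To see which of these optional libraries are present in a given environment, a check along the following lines may help. It is only a sketch for Python 3. The function available_optional_modules and the OPTIONAL_MODULES tuple are hypothetical names, not part of html5lib, and ordereddict is left out because it only matters on Python 2.6.

```python
from importlib.util import find_spec

# Import names of the optional libraries listed above.
OPTIONAL_MODULES = ("datrie", "lxml", "genshi", "charade", "chardet")


def available_optional_modules(names=OPTIONAL_MODULES):
    """Map each module name to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in names}


availability = available_optional_modules()
for name, present in sorted(availability.items()):
    print(name, "installed" if present else "missing")
```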

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.

Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.

