HTMLParser in Python

As the name suggests, HTMLParser is used to parse HTML files. Python 2 ships two classes with that name: one defined in the htmllib module (htmllib.HTMLParser) and one defined in the HTMLParser module (HTMLParser.HTMLParser). Let's look at them separately.

htmllib.HTMLParser

htmllib has been deprecated since Python 2.6 and was removed in Python 3, but it is still worth knowing how it works. This parser is not directly concerned with I/O: it must be provided with input in string form via a method, and it makes calls to methods of a "formatter" object in order to produce output. As a result, instantiating it looks like this:

>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))

This is annoying. All you want to do is parse an HTML file, but you first have to learn about unrelated things such as formatters and I/O streams.

HTMLParser.HTMLParser

In Python 3 this module is renamed to html.parser. It serves the same purpose as htmllib.HTMLParser, but you do not need to import helper modules such as formatter and cStringIO. For more information, see:

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser

Here is a brief introduction to the module, starting with a sample. Notice that no formatter class or string I/O class is needed: you subclass HTMLParser and override the handler methods you care about.

>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print "Encountered a start tag:", tag
...     def handle_endtag(self, tag):
...         print "Encountered an end tag :", tag
...     def handle_data(self, data):
...         print "Encountered some data  :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
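On Python 3 the same class is imported from html.parser. The following is a minimal sketch of the sample above ported to Python 3. The class name EventLogger and its events list are illustrative names, not part of the library, and the callbacks are recorded in a list rather than printed directly so the result can be inspected afterwards.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser callbacks as (kind, value) tuples (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventLogger()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
parser.close()  # flush anything still buffered

for kind, value in parser.events:
    print(kind, value)
```

The events come out in the same order as the Python 2 transcript: start tags, data, and end tags in document order.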

One more difference: htmllib.HTMLParser provides the following two methods.

HTMLParser.anchor_bgn(href, name, type)

This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()

This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
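To make the described behaviour concrete, here is a rough emulation written against Python 3's html.parser. It is only a sketch under assumptions: the class FootnoteAnchors, its text list, and the "[n]" marker format are chosen for illustration and are not htmllib's actual implementation.

```python
from html.parser import HTMLParser


class FootnoteAnchors(HTMLParser):
    """Hypothetical emulation of the anchor_bgn()/anchor_end() behaviour."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []     # hrefs in document order
        self.text = []           # text chunks, including footnote markers
        self._in_anchor = False  # True between <a href=...> and </a>

    def handle_starttag(self, tag, attrs):
        # Plays the role of anchor_bgn(): remember the hyperlink.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.anchorlist.append(href)
                self._in_anchor = True

    def handle_endtag(self, tag):
        # Plays the role of anchor_end(): emit a marker that indexes anchorlist.
        if tag == "a" and self._in_anchor:
            self.text.append("[%d]" % len(self.anchorlist))
            self._in_anchor = False

    def handle_data(self, data):
        self.text.append(data)


p = FootnoteAnchors()
p.feed('See <a href="http://example.com/">this page</a> '
       'and <a href="/duty/">that one</a>.')
p.close()
print("".join(p.text))
print(p.anchorlist)
```

Each marker is a 1-based index into anchorlist, so the text and the list of links can be read side by side.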

With these two methods, htmllib.HTMLParser can easily retrieve the URL links from an HTML file. For example:

>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks(path):
...     # The constructor requires a formatter; its output is discarded here.
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     with open(path) as f:
...         parser.feed(f.read())
...     parser.close()
...     return parser.anchorlist
...
>>> parseAndGetLinks('/tmp/a.ttt')

['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

HTMLParser.HTMLParser does not have these two methods, but that does not matter: we can define our own.

>>> from HTMLParser import HTMLParser
>>> class myHtmlParser(HTMLParser):
...     def __init__(self):
...         HTMLParser.__init__(self)
...         self.anchorlist = []
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a' or tag == 'A':
...             for t in attrs:
...                 if t[0] == 'href' or t[0] == 'HREF':
...                     self.anchorlist.append(t[1])
...
>>> file = '/tmp/a.ttt'
>>> parser = myHtmlParser()
>>> parser.feed(open(file).read())
>>> parser.anchorlist

 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

 >>>

Let's look at the second example more closely.

Lines 3 to 5 override the __init__ method. The point of the override is to add a new attribute, anchorlist, to the instance.

Lines 6 to 10 override the handle_starttag method. It first checks the tag name; if the tag is an anchor, it loops over the tag's attributes, finds href, and appends its value to anchorlist. HTMLParser passes tag and attribute names to the handler already converted to lower case, so the comparisons against 'A' and 'HREF' are harmless but never match.

That is all it takes.
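For readers on Python 3, the same idea can be sketched with html.parser. The class name LinkCollector and the inline sample string are assumptions made for illustration; the sample deliberately mixes upper-case and lower-case markup to show that names reach the handler in lower case.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without href and href attributes without a value.
            if name == "href" and value is not None:
                self.anchorlist.append(value)


sample = ('<p><A HREF="http://example.com/">one</A> '
          '<a href="/duty/">two</a> '
          '<a name="top">no link</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.anchorlist)
```

To parse a file instead of a string, read its contents inside a with block and pass the text to feed().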
