1. Basic Usage of Python lxml
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="" id="link1">Elsie</a>,
<a class="sister" href="" id="link2">Lacie</a> and
<a class="sister" href="" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
from lxml import etree, cssselect
from cssselect import GenericTranslator, SelectorError

parser = etree.HTMLParser(remove_blank_text=True)
document = etree.fromstring(html_doc, parser)

# Use a compiled CSS selector
sel = cssselect.CSSSelector('p a')
results_sel_href = [e.get('href') for e in sel(document)]  # href attribute of each <a> tag
results_sel_text = [e.text for e in sel(document)]  # text between <a> and </a>
print(results_sel_text)

# Use the element's cssselect() method
results_css = [e.get('href') for e in document.cssselect('p a')]
print(results_css)

# Use XPath: translate the CSS selector into an XPath expression first
try:
    expression = GenericTranslator().css_to_xpath('p a')
except SelectorError:
    print('Invalid selector.')
else:
    # For this document, document.xpath('//a') selects the same elements
    results_xpath = [e.get('href') for e in document.xpath(expression)]
    print(results_xpath)
# cleaning up html
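# 1. Without creating a Cleaner instance
# (since lxml 5.2 this module requires the separate lxml_html_clean package)
from lxml.html.clean import Cleaner, clean_html
html_after_clean = clean_html(html_doc)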
# <div>
# The Dormouse's story
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </div>
# 2. With a configured Cleaner instance
cleaner = Cleaner(style=True, links=True, add_nofollow=True, page_structure=False, safe_attrs_only=False)
html_with_cleaner = cleaner.clean_html(html_doc)
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
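The selector queries from the start of this section can also be tried without the cssselect dependency. The following is a minimal sketch, not the code above: it assumes well-formed markup, uses the ElementPath expression './/p//a' as a stand-in for the CSS selector 'p a', and falls back to the standard library's xml.etree.ElementTree when lxml is not installed. The SAMPLE string and its href values (#elsie, #lacie, #tillie) are placeholders invented for illustration.

```python
# Prefer lxml; fall back to the standard library, which offers the same
# ElementTree-style calls used below (fromstring, findall, get, text).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Trimmed, well-formed copy of the sample document.
# The href values are placeholders invented for this sketch.
SAMPLE = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a class="sister" href="#elsie" id="link1">Elsie</a>,
<a class="sister" href="#lacie" id="link2">Lacie</a> and
<a class="sister" href="#tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

document = etree.fromstring(SAMPLE.strip())

# './/p//a' plays the role of the CSS selector 'p a':
# every <a> that sits anywhere below a <p>.
links = document.findall('.//p//a')

hrefs = [a.get('href') for a in links]  # attribute lookup
texts = [a.text for a in links]         # text between <a> and </a>

print(hrefs)
print(texts)
```

With lxml installed, document.xpath('//p//a') selects the same elements; findall is used here only because both libraries support it.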
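2. Practical Use of Python lxml
This is the category link for Chinese-language songs on NetEase Cloud Music: 华语&limit=35&offset=0. Opening the Elements panel in Chrome DevTools (F12) and inspecting the page source, we find that the playlists on each page are rendered inside an iframe, and that each entry's information forms one li tag, containing the playlist image, among other fields.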
<ul class="m-cvrlst f-cb" id="m-pl-container">
<li>
<div class="u-cover u-cover-1">
<img class="j-flag" src="" />
<a title="【说唱】留住你一面,画在我心间" href="/playlist?id=832790627" class="msk"></a>
<div class="bottom">
<a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="832790627" data-res-action="play"></a>
<span class="icon-headset"></span>
<span class="nb">1615</span>
</div>
</div>
<p class="dec">
<a title="【说唱】留住你一面,画在我心间" href="/playlist?id=832790627" class="tit f-thide s-fc0">【说唱】留住你一面,画在我心间</a>
</p>
<p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a> <sup class="u-icn u-icn-84 "></sup></p>
</li>
<li>
<div class="u-cover u-cover-1">
<img class="j-flag" src="" />
<a title="鞋子好看|国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="msk"></a>
<div class="bottom">
<a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="721462105" data-res-action="play"></a>
<span class="icon-headset"></span>
<span class="nb">77652</span>
</div>
</div>
<p class="dec">
<a title="鞋子好看|国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="tit f-thide s-fc0">鞋子好看|国产自赏摇滚噪音流行</a>
</p>
<p><span class="s-fc4">by</span> <a title="原创君" href="/user/home?id=201586" class="nm nm-icn f-thide s-fc3">原创君</a> <sup class="u-icn u-icn-1 "></sup></p>
</li>
</ul>
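First, instantiate an etree.HTMLParser object to do some light processing of the HTML source. Then create a cssselect.CSSSelector object that finds every li element (an _Element object) under the unordered list ul, and iterate over those _Element objects with sel(document), using the find method on each one:
find(self, path, namespaces=None) Finds the first matching subelement, by tag name or path. (For the detailed lxml.etree / lxml.cssselect API, see the official site.)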
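from lxml import etree, cssselect

html_doc = '''the HTML source extracted above'''
parser = etree.HTMLParser(remove_blank_text=True)
document = etree.fromstring(html_doc, parser)

# Select every <li> that is a direct child of the playlist container
sel = cssselect.CSSSelector('#m-pl-container > li')
for e in sel(document):
    img = e.find('.//div/img')  # cover image
    img_url = img.attrib['src']
    a_msk = e.find(".//div/a[@class='msk']")  # link carrying the playlist URL and title
    # NOTE: href is site-relative (e.g. /playlist?id=...), so the site's host must precede it; the host is not given here
    musicList_url = 'http:/%s' % a_msk.attrib['href']
    musicList_name = a_msk.attrib['title']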
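Because find returns None when nothing matches, a loop like the one above can fail on an entry that lacks an image or link. The following is a minimal sketch of a more defensive variant, run on a trimmed, well-formed copy of the markup. BASE_URL, the extract_playlists helper, and the titles and image path in SNIPPET are hypothetical placeholders, not values from the real site. It builds absolute URLs with urllib.parse.urljoin and falls back to xml.etree.ElementTree when lxml is unavailable.

```python
from urllib.parse import urljoin

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical placeholder; substitute the real site root.
BASE_URL = 'https://music.example.com'

# Trimmed stand-in for the playlist markup; titles and src are invented.
SNIPPET = """
<ul class="m-cvrlst f-cb" id="m-pl-container">
  <li>
    <div class="u-cover u-cover-1">
      <img class="j-flag" src="/cover/1.jpg" />
      <a title="Playlist A" href="/playlist?id=832790627" class="msk"></a>
    </div>
  </li>
  <li>
    <div class="u-cover u-cover-1">
      <a title="No cover" href="/playlist?id=721462105" class="msk"></a>
    </div>
  </li>
</ul>
"""


def extract_playlists(markup, base_url=BASE_URL):
    """Return one dict per <li>, skipping entries without a 'msk' link."""
    root = etree.fromstring(markup.strip())
    playlists = []
    # The root is the <ul>, so 'li' matches its direct children only,
    # like the CSS selector '#m-pl-container > li'.
    for li in root.findall('li'):
        img = li.find('.//div/img')
        a_msk = li.find(".//div/a[@class='msk']")
        if a_msk is None:  # find() returns None when nothing matches
            continue
        playlists.append({
            'name': a_msk.get('title'),
            'url': urljoin(base_url, a_msk.get('href')),
            # Compare with None explicitly: an element without children is falsy.
            'img_url': img.get('src') if img is not None else None,
        })
    return playlists


playlists = extract_playlists(SNIPPET)
for playlist in playlists:
    print(playlist)
```

Using get instead of attrib[...] means a missing attribute yields None rather than raising KeyError, which suits scraped pages whose markup can vary between entries.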
