爬虫之pyquery库
官方文档:https://pyquery.readthedocs.io/en/latest/
PyQuery是一个强大又灵活的网页解析库。如果你觉得正则写起来太麻烦、BeautifulSoup语法太难记,而你熟悉jQury的语法,那么PyQuery就是你的绝佳选择。
一、开始
字符串初始化:
from pyquery import PyQuery as pq
d = pq("<html>哈哈哈</html>") # 现在d就相当于jQuery的$
print(d("html"))
URL初始化:
from pyquery import PyQuery as pq
d = pq(url="https://www.baidu.com")
print(d("head"))
文件初始化:
from pyquery import PyQuery as pq
d = pq(filename='demo.html') # filename指定文件路径
print(d("head"))
二、基本CSS选择器
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
print(d("#container .list li"))
三、查找元素
子元素
d("css选择器").find("li")
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
print(type(items)) # <class 'pyquery.pyquery.PyQuery'>
li = items.find("li")
print(type(li)) # <class 'pyquery.pyquery.PyQuery'>
print(li)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
父元素
d("css选择器").parent(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents()
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
d(".list").parents()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents(".wrap")
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
d(".list").parents(".wrap")
兄弟元素
d("css选择器").siblings(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings())
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings(".active"))
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
四、遍历
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li").items()
print(type(li)) # <class 'generator'>
for i in li:
print(i)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
五、获取信息
获取属性
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.attr("href"))
print(a.attr.href)
获取文本
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.text())
"""
third item
"""
获取html
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
print(li.html())
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
"""
六、DOM操作
addClass()、removeClass()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.removeClass("active")
print(li)
li.addClass("active")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
"""
attr()、css()
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.css("font-size", "14px")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
"""
remove()
html = """
<div class="wrap">
Hello, World.
<p>This is a paragraph.</p>
</div>
""" from pyquery import PyQuery as pq
d = pq(html)
wrap = d(".wrap")
print(wrap.text())
"""
Hello, World.
This is a paragraph.
"""
wrap.find("p").remove()
print(wrap.text()) # Hello, World.
其他DOM方法
https://pyquery.readthedocs.io/en/latest/api.html
七、伪类选择器
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li:first-child")
print(li) # <li class="item-0">first item</li>
li = d("li:last-child")
print(li) # <li class="item-0"><a href="link5.html">fifth item</a></li>
li = d("li:nth-child(2)")
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
li = d("li:gt(2)") # 从0开始计数,索引大于2
print(li)
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
li = d("li:nth-child(2n)") # 获取偶数顺序的元素(从1开始)
print(li)
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
li = d("li:contains(second)") # 根据文本匹配,匹配文本包含second的标签
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
更多选择器:http://www.w3school.com.cn/cssref/css_selectors.asp
爬虫之pyquery库的更多相关文章
- Python爬虫之pyquery库的基本使用
# 字符串初始化 html = ''' <div> <ul> <li class = "item-0">first item</li> ...
- python爬虫从入门到放弃(七)之 PyQuery库的使用
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- 爬虫常用库之pyquery 库
pyquery库是jQuery的Python实现,可以用于解析HTML网页内容,我个人写过的一些抓取网页数据的脚本就是用它来解析html获取数据的.他的官方文档地址是:http://packages. ...
- Python爬虫-- PyQuery库
PyQuery库 PyQuery库也是一个非常强大又灵活的网页解析库,PyQuery 是 Python 仿照 jQuery 的严格实现.语法与 jQuery 几乎完全相同,所以不用再去费心去记一些奇怪 ...
- PYTHON 爬虫笔记六:PyQuery库基础用法
知识点一:PyQuery库详解及其基本使用 初始化 字符串初始化 html = ''' <div> <ul> <li class="item-0"&g ...
- 第四节:Web爬虫之pyquery解析库
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- python之爬虫(九)PyQuery库的使用
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- Python3 网络爬虫(请求库的安装)
Python3 网络爬虫(请求库的安装) 爬虫可以简单分为几步:抓取页面,分析页面和存储数据 在页面爬取的过程中我们需要模拟浏览器向服务器发送请求,所以需要用到一些python库来实现HTTP的请求操 ...
- 爬虫之PyQuery的base了解
爬虫之PyQuery的base了解 pyquery库是jQuery的Python实现,能够以jQuery的语法来操作解析 HTML 文档,易用性和解析速度都很好,和它差不多的还有BeautifulSo ...
随机推荐
- POJ 3650:The Seven Percent Solution
The Seven Percent Solution Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 7684 Accep ...
- exynos 4412 时钟配置
/** ****************************************************************************** * @author Maox ...
- 两道NOIP里的DP题目~
拦截导弹 来源:NOIP1999(提高组) 第一题 某国为了防御敌国的导弹袭击,发展出一种导弹拦截系统.但是这种导弹拦截系统有一个缺陷:虽然它的第一发炮弹能够到达任意的高度,但是以后每一发炮弹都 ...
- Mac os x下配置 Android ndk 开发环境
1.阅读下面之前,请确保你android sdk的开发环境已经搭建好,ADT也最好是目前最新的. 2.到http://developer.android.com/tools/sdk/ndk/index ...
- 51Nod 1450 闯关游戏 —— 期望DP
题目:http://www.51nod.com/onlineJudge/questionCode.html#!problemId=1450 期望DP: INF 表示这种情况不行,转移时把不行的概率也转 ...
- bzoj3196 二逼平衡树——线段树套平衡树
题目:https://www.lydsy.com/JudgeOnline/problem.php?id=3196 人生中第一棵树套树! 写了一个晚上,成功卡时 9000ms+ 过了! 很要注意数组的大 ...
- ubuntu/linuxmint更换163源及常用软件PPA源记录
一.163源使用说明 (具体见:http://mirrors.163.com/,不推荐linuxmint更换源!!!) 以Wily(15.10)为例, 编辑/etc/apt/sources.list文 ...
- 17年day3
/* 嗯,又一天. 时日无多了,还能蹦哒几天? 上午依旧考试,日常挂T1,读错题.还是好困. 兔子说明天晚上要请我们吃水饺~~~~去年就没这待遇. 下午打开邮箱一看,咦?嗯. 昨晚做噩梦NOIP考了状 ...
- Pycharm初始创建项目和环境搭建(解决aconda库文件引入不全等问题)
1.新建工程 1.选择新建一个Pure Python项目,新建项目路径可以在Location处选择. 2.Project Interpreter部分是选择新建项目所依赖的python库,第一个选项会在 ...
- CSS3 核心知识面试题
一种常见利用伪类清除浮动的代码 .clearfix:after { content:"."; //这里利用到了content属性 display:block; height:; v ...