官方文档:https://pyquery.readthedocs.io/en/latest/

PyQuery是一个强大又灵活的网页解析库。如果你觉得正则写起来太麻烦、BeautifulSoup语法太难记,而你熟悉jQury的语法,那么PyQuery就是你的绝佳选择。

一、开始

字符串初始化:

from pyquery import PyQuery as pq
d = pq("<html>哈哈哈</html>") # 现在d就相当于jQuery的$
print(d("html"))

URL初始化:

from pyquery import PyQuery as pq
d = pq(url="https://www.baidu.com")
print(d("head"))

文件初始化:

from pyquery import PyQuery as pq
d = pq(filename='demo.html') # filename指定文件路径
print(d("head"))

二、基本CSS选择器

html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
print(d("#container .list li"))

三、查找元素

子元素

d("css选择器").find("li")
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
print(type(items)) # <class 'pyquery.pyquery.PyQuery'>
li = items.find("li")
print(type(li)) # <class 'pyquery.pyquery.PyQuery'>
print(li)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""

父元素

d("css选择器").parent(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents()
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""

d(".list").parents()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
items = d(".list")
parents = items.parents(".wrap")
print(parents)
"""
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""

d(".list").parents(".wrap")

兄弟元素

d("css选择器").siblings(<css选择器(可无)>)
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings())
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".list .item-0.active")
print(li.siblings(".active"))
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""

四、遍历

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li").items()
print(type(li)) # <class 'generator'>
for i in li:
print(i)
"""
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""

五、获取信息

获取属性

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.attr("href"))
print(a.attr.href)

获取文本

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
a = d(".item-0.active a")
print(a.text())
"""
third item
"""

获取html

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
print(li.html())
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
"""

六、DOM操作

addClass()、removeClass()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.removeClass("active")
print(li)
li.addClass("active")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
"""

attr()、css()

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d(".item-0.active")
print(li)
li.attr("name", "link")
print(li)
li.css("font-size", "14px")
print(li)
"""
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
"""

remove()

html = """
<div class="wrap">
Hello, World.
<p>This is a paragraph.</p>
</div>
""" from pyquery import PyQuery as pq
d = pq(html)
wrap = d(".wrap")
print(wrap.text())
"""
Hello, World.
This is a paragraph.
"""
wrap.find("p").remove()
print(wrap.text()) # Hello, World.

其他DOM方法

https://pyquery.readthedocs.io/en/latest/api.html

七、伪类选择器

html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
from pyquery import PyQuery as pq
d = pq(html)
li = d("li:first-child")
print(li) # <li class="item-0">first item</li>
li = d("li:last-child")
print(li) # <li class="item-0"><a href="link5.html">fifth item</a></li>
li = d("li:nth-child(2)")
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>
li = d("li:gt(2)") # 从0开始计数,索引大于2
print(li)
"""
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
"""
li = d("li:nth-child(2n)") # 获取偶数顺序的元素(从1开始)
print(li)
"""
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
"""
li = d("li:contains(second)") # 根据文本匹配,匹配文本包含second的标签
print(li) # <li class="item-1"><a href="link2.html">second item</a></li>

更多选择器:http://www.w3school.com.cn/cssref/css_selectors.asp

爬虫之pyquery库的更多相关文章

  1. Python爬虫之pyquery库的基本使用

    # 字符串初始化 html = ''' <div> <ul> <li class = "item-0">first item</li> ...

  2. python爬虫从入门到放弃(七)之 PyQuery库的使用

    PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...

  3. 爬虫常用库之pyquery 库

    pyquery库是jQuery的Python实现,可以用于解析HTML网页内容,我个人写过的一些抓取网页数据的脚本就是用它来解析html获取数据的.他的官方文档地址是:http://packages. ...

  4. Python爬虫-- PyQuery库

    PyQuery库 PyQuery库也是一个非常强大又灵活的网页解析库,PyQuery 是 Python 仿照 jQuery 的严格实现.语法与 jQuery 几乎完全相同,所以不用再去费心去记一些奇怪 ...

  5. PYTHON 爬虫笔记六:PyQuery库基础用法

    知识点一:PyQuery库详解及其基本使用 初始化 字符串初始化 html = ''' <div> <ul> <li class="item-0"&g ...

  6. 第四节:Web爬虫之pyquery解析库

    PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...

  7. python之爬虫(九)PyQuery库的使用

    PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...

  8. Python3 网络爬虫(请求库的安装)

    Python3 网络爬虫(请求库的安装) 爬虫可以简单分为几步:抓取页面,分析页面和存储数据 在页面爬取的过程中我们需要模拟浏览器向服务器发送请求,所以需要用到一些python库来实现HTTP的请求操 ...

  9. 爬虫之PyQuery的base了解

    爬虫之PyQuery的base了解 pyquery库是jQuery的Python实现,能够以jQuery的语法来操作解析 HTML 文档,易用性和解析速度都很好,和它差不多的还有BeautifulSo ...

随机推荐

  1. POJ 3650:The Seven Percent Solution

    The Seven Percent Solution Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 7684   Accep ...

  2. exynos 4412 时钟配置

    /** ****************************************************************************** * @author    Maox ...

  3. 两道NOIP里的DP题目~

    拦截导弹    来源:NOIP1999(提高组) 第一题 某国为了防御敌国的导弹袭击,发展出一种导弹拦截系统.但是这种导弹拦截系统有一个缺陷:虽然它的第一发炮弹能够到达任意的高度,但是以后每一发炮弹都 ...

  4. Mac os x下配置 Android ndk 开发环境

    1.阅读下面之前,请确保你android sdk的开发环境已经搭建好,ADT也最好是目前最新的. 2.到http://developer.android.com/tools/sdk/ndk/index ...

  5. 51Nod 1450 闯关游戏 —— 期望DP

    题目:http://www.51nod.com/onlineJudge/questionCode.html#!problemId=1450 期望DP: INF 表示这种情况不行,转移时把不行的概率也转 ...

  6. bzoj3196 二逼平衡树——线段树套平衡树

    题目:https://www.lydsy.com/JudgeOnline/problem.php?id=3196 人生中第一棵树套树! 写了一个晚上,成功卡时 9000ms+ 过了! 很要注意数组的大 ...

  7. ubuntu/linuxmint更换163源及常用软件PPA源记录

    一.163源使用说明 (具体见:http://mirrors.163.com/,不推荐linuxmint更换源!!!) 以Wily(15.10)为例, 编辑/etc/apt/sources.list文 ...

  8. 17年day3

    /* 嗯,又一天. 时日无多了,还能蹦哒几天? 上午依旧考试,日常挂T1,读错题.还是好困. 兔子说明天晚上要请我们吃水饺~~~~去年就没这待遇. 下午打开邮箱一看,咦?嗯. 昨晚做噩梦NOIP考了状 ...

  9. Pycharm初始创建项目和环境搭建(解决aconda库文件引入不全等问题)

    1.新建工程 1.选择新建一个Pure Python项目,新建项目路径可以在Location处选择. 2.Project Interpreter部分是选择新建项目所依赖的python库,第一个选项会在 ...

  10. CSS3 核心知识面试题

    一种常见利用伪类清除浮动的代码 .clearfix:after { content:"."; //这里利用到了content属性 display:block; height:; v ...