Python Crawler Notes 6: Basic Usage of the PyQuery Library
Topic 1: PyQuery in Detail and Its Basic Usage
Initialization
Initializing from a string
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# Selectors are ordinary CSS selectors: prefix an id with "#" and a class with "."
print(doc('li'))
Result:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Initializing from a URL
from pyquery import PyQuery as pq
doc1 = pq(url="http://www.baidu.com")
# If the <title> prints as garbled characters, the response encoding was probably misdetected
print(doc1("head"))
Result:
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>
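When a page loaded by URL prints a garbled `<title>`, a likely cause is that UTF-8 bytes were decoded with the wrong codec. The standard-library sketch below reproduces the effect on the Baidu title and then reverses it. It does not touch the network, and the variable names are only illustrative.

```python
# Illustration only: how UTF-8 text turns into mojibake, and how to undo it.
title = "百度一下,你就知道"

raw = title.encode("utf-8")       # the bytes a UTF-8 page would send
garbled = raw.decode("latin-1")   # decoding with the wrong codec garbles the text
repaired = garbled.encode("latin-1").decode("utf-8")  # reverse the wrong step

print(repr(garbled))
print(repaired)
```

In practice it is usually simpler to fetch the page yourself, set the response encoding explicitly, and pass the decoded text to `pq(...)`.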
Initializing from a file
from pyquery import PyQuery as pq
doc2 = pq(filename="demo.html")  # demo.html is any HTML file you have saved locally
print(doc2('li'))
Basic CSS selectors
Example
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# Note the spaces: a space means the right-hand selector is nested inside the left-hand one
print(doc("#container .list li"))
Result:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Finding elements
Child elements
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(".list")  # first select the ul element
print(type(items))
print(items)
# find() also takes a CSS selector; it matches any li nested inside, at any depth
lis = items.find('li')
print(type(lis))
print(lis)
# children() returns only the direct children
lis2 = items.children()
print(type(lis2))
print(lis2)
lis3 = items.children('.active')
print(lis3)
Result:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
Parent element
# Parent element
html = '''
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(".list")  # first select the ul element
# Every element has exactly one direct parent
container = items.parent()
print(type(container))
print(container)
Result:
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
Another method: parents()
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(".list")  # first select the ul element
# parents() returns all ancestor nodes
parents = items.parents()
print(parents)  # prints the two enclosing divs
print(type(parents))
Result:
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
<class 'pyquery.pyquery.PyQuery'>
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(".list")  # first select the ul element
# Pass a selector to parents() to filter the ancestors
parents1 = items.parents(".wrap")
print(parents1)  # after filtering, only one div remains
Result:
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
Sibling elements
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# ".list" selects class="list"; the space moves inside it; ".item-0.active" (no space) requires both classes on the same element
li = doc('.list .item-0.active')
print(li)
print(li.siblings())  # all sibling elements
Result:
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Another approach: filtering the siblings
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')  # same selection as in the previous example
# Pass a selector to siblings() to filter the result
print(li.siblings('.active'))
Result:
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
Traversal
Single element
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(".item-0.active")
print(li)
Result:
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
Multiple elements: iterating with items()
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()  # for multiple elements, items() returns a generator to iterate over
print(type(lis))
for li in lis:
    print(li)
Result:
<class 'generator'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Getting information
Getting attributes
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# Select the element whose class has both item-0 and active, then the a tag inside it; note the space
a = doc(".item-0.active a")
print(a)
print(a.attr("href"))
print(a.attr.href)  # same result as above
Result:
<a href="link3.html"><span class="boid">third item</span></a>
link3.html
link3.html
Getting text
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc(".item-0.active a")
print(a)
print(a.text())  # the text enclosed by the selected element
Result:
<a href="link3.html"><span class="boid">third item</span></a>
third item
Getting HTML
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc(".item-0.active")
print(a)
print(a.html())  # the markup inside the selected element
Result:
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<a href="link3.html"><span class="boid">third item</span></a>
DOM manipulation
addClass, removeClass
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(".item-0.active")
print(li)
li.removeClass("active")  # remove the active class
print(li)
li.addClass("active")  # add the active class back
print(li)
Result:
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
attr, css
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(".item-0.active")
print(li)
li.attr("name", "link")  # sets the attribute; an existing value is overwritten
print(li)
li.css("font-size", "14px")  # adds a style attribute
print(li)
Result:
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="boid">third item</span></a></li>
remove
html1 = '''
<div class="wrap">
Hello,World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html1)
wrap = doc(".wrap")
print(wrap.text())
wrap.find('p').remove()  # remove the p element from wrap
print(wrap.text())
Result:
Hello,World
This is a paragraph.
Hello,World
Other DOM operations
- Other DOM methods: http://pythonhosted.org/pyquery/
Pseudo-class selectors
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="boid">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc("li:first-child")  # the first child
print(li)
li1 = doc('li:last-child')  # the last child
print(li1)
li2 = doc('li:nth-child(2)')  # by position: the second child (positions start at 1)
print(li2)
li3 = doc("li:gt(2)")  # index greater than 2 (indexes start at 0)
print(li3)
li4 = doc("li:nth-child(2n)")  # even positions
print(li4)
li5 = doc("li:contains(second)")  # text contains "second"
print(li5)
Result:
<li class="item-0">first item</li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
More CSS selectors are listed at: http://www.w3school.com.cn/css/index.asp
Official documentation