常见的爬虫分析库(4)-爬虫之PyQuery
PyQuery 是 Python 仿照 jQuery 的严格实现。语法与 jQuery 几乎完全相同。
官方文档:http://pyquery.readthedocs.io/
安装
1
|
pip install pyquery |
初始化
字符串初始化
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
html = ''' <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) print (doc( 'li' )) |
URL初始化
1
2
3
|
from pyquery import PyQuery as pq doc = pq(url = 'http://www.baidu.com' ) print (doc( 'head' )) |
文件初始化
1
2
3
|
from pyquery import PyQuery as pq doc = pq(filename = 'demo.html' ) print (doc( 'li' )) |
基本CSS选择器
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
html = ''' <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) print (doc( '#container .list li' )) |
查找元素
子元素
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = ''' <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) items = doc( '.list' ) print ( type (items)) print (items) lis = items.find( 'li' ) print ( type (lis)) print (lis) |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = ''' <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) items = doc( '.list' ) lis = items.children() print ( type (lis)) print (lis) |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
html = ''' <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) items = doc( '.list' ) lis = items.children( '.active' ) print (lis) |
父元素
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = ''' <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> ''' from pyquery import PyQuery as pq doc = pq(html) items = doc( '.list' ) container = items.parent() print ( type (container)) print (container) |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) items = doc( '.list' ) parents = items.parents() print ( type (parents)) print (parents) |
1
2
|
parent = items.parents( '.wrap' ) print (parent) |
兄弟元素
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.list .item-0.active' ) print (li.siblings()) |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.list .item-0.active' ) print (li.siblings( '.active' )) |
遍历
单个元素
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.item-0.active' ) print (li) |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) lis = doc( 'li' ).items() print ( type (lis)) for li in lis: print (li) |
获取信息
获取属性
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) a = doc( '.item-0.active a' ) print (a) print (a.attr( 'href' )) print (a.attr.href) |
获取文本
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) a = doc( '.item-0.active a' ) print (a) print (a.text()) |
获取HTML
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.item-0.active' ) print (li) print (li.html()) |
DOM操作
addClass、removeClass
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.item-0.active' ) print (li) li.removeClass( 'active' ) print (li) li.addClass( 'active' ) print (li) |
attr、css
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( '.item-0.active' ) print (li) li.attr( 'name' , 'link' ) print (li) li.css( 'font-size' , '14px' ) print (li) |
remove
1
2
3
4
5
6
7
8
9
10
11
12
|
html = ''' <div class="wrap"> Hello, World <p>This is a paragraph.</p> </div> ''' from pyquery import PyQuery as pq doc = pq(html) wrap = doc( '.wrap' ) print (wrap.text()) wrap.find( 'p' ).remove() print (wrap.text()) |
其他DOM方法 http://pyquery.readthedocs.io/en/latest/api.html
伪类选择器
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
html = ''' <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> ''' from pyquery import PyQuery as pq doc = pq(html) li = doc( 'li:first-child' ) print (li) li = doc( 'li:last-child' ) print (li) li = doc( 'li:nth-child(2)' ) print (li) li = doc( 'li:gt(2)' ) print (li) li = doc( 'li:nth-child(2n)' ) print (li) li = doc( 'li:contains(second)' ) print (li) |
常见的爬虫分析库(4)-爬虫之PyQuery的更多相关文章
- 常见的爬虫分析库(3)-Python正则表达式与re模块
在线正则表达式测试 http://tool.oschina.net/regex/ 常见匹配模式 模式 描述 \w 匹配字母数字及下划线 \W 匹配非字母数字下划线 \s 匹配任意空白字符,等价于 [\ ...
- 常见的爬虫分析库(2)-xpath语法
xpath简介 1.xpath使用路径表达式在xml和html中进行导航 2.xpath包含标准函数库 3.xpath是一个w3c的标准 xpath节点关系 1.父节点 2.子节点 3.同胞节点 4. ...
- 常见的爬虫分析库(1)-Python3中Urllib库基本使用
原文来自:https://www.cnblogs.com/0bug/p/8893677.html 什么是Urllib? Python内置的HTTP请求库 urllib.request ...
- 爬虫requests库 之爬虫贴吧
首先要观察爬虫的URL规律,爬取一个贴吧所有页的数据,观察点击下一页时URL是如何变化的. 思路: 定义一个类,初始化方法什么都不用管 定义一个run方法,用来实现主要逻辑 3 class Tieba ...
- 2.爬虫 urlib库讲解 异常处理、URL解析、分析Robots协议
1.异常处理 URLError类来自urllib库的error模块,它继承自OSError类,是error异常模块的基类,由request模块产生的异常都可以通过这个类来处理. from urllib ...
- Python爬虫Urllib库的基本使用
Python爬虫Urllib库的基本使用 深入理解urllib.urllib2及requests 请访问: http://www.mamicode.com/info-detail-1224080.h ...
- Python爬虫—requests库get和post方法使用
目录 Python爬虫-requests库get和post方法使用 1. 安装requests库 2.requests.get()方法使用 3.requests.post()方法使用-构造formda ...
- [python爬虫]Requests-BeautifulSoup-Re库方案--Requests库介绍
[根据北京理工大学嵩天老师“Python网络爬虫与信息提取”慕课课程编写 文章中部分图片来自老师PPT 慕课链接:https://www.icourse163.org/learn/BIT-10018 ...
- Python爬虫与数据分析之爬虫技能:urlib库、xpath选择器、正则表达式
专栏目录: Python爬虫与数据分析之python教学视频.python源码分享,python Python爬虫与数据分析之基础教程:Python的语法.字典.元组.列表 Python爬虫与数据分析 ...
随机推荐
- Windows防火墙配置(允许某个网段和部分IP访问某个端口)
1.win+R 2.gpedit.msc 3.计算机配置+Windows设置+安全设置+IP安全策略,在本地计算机 4.创建IP安全策略 5.配置IP筛选器列表.筛选器操作 6.分配 192.168. ...
- Keepalived详解(二):Keepalived安装与配置【转】
一.Keepalived安装与配置: 1.Keepalived的安装过程: Keepalived的安装非常简单,本实例以源码安装讲解: Keepalived的官方网址:http://www.keepa ...
- Session、LocalStorage、SessionStorage、Cache-Ctrol比较
1.Session Session是什么? 服务器通过 Set-Cookie给用户一个sessionIdsessionId对应 服务器 内的一小块内存每次用户访问服务器的时候,服务器就听过Sessio ...
- hibernate框架学习之Session管理
Session对象的生命周期 lHibernate中数据库连接最终包装成Session对象,使用Session对象可以对数据库进行操作. lSession对象获取方式: •加载所有配置信息得到Conf ...
- 1.ROS启动小乌龟
启动turtlesim 在三个不同的终端中分别执行如下三个指令 roscore rosrun turtlesim turtlesim_node rosrun turtlesim turtle_ ...
- WebService生成XML文档时出错。不应是类型XXXX。使用XmlInclude或SoapInclude属性静态指定非已知的类型。
情况是SingleRoom和DoubleRoom是Room类的子类.在WebService中有一个方法是返回Room类. public Room Get(int roomId) { return Ro ...
- Windows服务没有及时响应启动或控制请求1053
参考链接: 解决“指定的服务已经标记为删除”问题 服务没有及时响应启动或控制请求 1053 关闭服务后,重新启动windows服务报错:"服务没有及时响应启动或控制请求 1053" ...
- 前端 ----jQuery操作表单
05-使用jQuery操作input的value值 表单控件是我们的重中之重,因为一旦牵扯到数据交互,离不开form表单的使用,比如用户的注册登录功能等 那么通过上节知识点我们了解到,我们在使用j ...
- 【原创】数据库基础之Mysql(3)mysql删除历史binlog
mysql开启binlog后会在/var/lib/mysql下创建binlog文件,如果手工删除,则下次mysql启动会报错: mysqld: File './master-bin.000001' n ...
- 06 元祖 字典 集合set
元组 定义: ti=() print(ti,type(ti)) 参数:for可以循环的对象(可迭代对象) t2=tuple(") # ('1', '2', '3') <class 't ...