Python爬虫之PyQuery使用(六)
Python爬虫之PyQuery使用
PyQuery简介
pyquery能够通过选择器精确定位 DOM 树中的目标并进行操作。pyquery相当于jQuery的python实现,可以用于解析HTML网页等。它的语法与jQuery几乎完全相同,对于使用过jQuery的人来说很熟悉,也很好上手。
初始化
有 4 种方法可以进行初始化:
可以通过传入 字符串、lxml、文件 或者 url 来使用PyQuery
from pyquery import PyQuery as pq
from lxml import etree d = pq("<html></html>")#传入字符串
d = pq(etree.fromstring("<html></html>"))#传入lxml
d = pq(url='http://baidu.com/') #传入url
d = pq(filename=path_to_html_file) #传入文件
基本CSS选择器
html='''
<html>
<body>
<ul class="mh-col">
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-1"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
歌手高空拍MV坠亡
</a>
<span class="mh-ico-up">
</span>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-2"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
佛祖朱龙广金婚
</a>
<span class="mh-ico-down">
</span>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-3"}' href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
常见租房陷阱
</a>
<span class="mh-ico-up">
</span>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-4"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
靳东回应发错诗词
</a>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-5"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
煤老板们的影视江湖
</a>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-6"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
1024程序员节
</a>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-7"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
美的合并小天鹅
</a>
</li>
<li class="g-ellipsis">
<a class="g-a-noline" data-md='{"b":"list","p":"1-8"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
京昆高速4车相撞
</a>
</li> </ul>
</body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
# 获取所有a标签
print(doc('body .mh-col li a'))
注意:
类名用.
id用#
标签用标签名
另外选择的是具有层级关系,从左到右,不是直接的父子的关系。
运行结果如下:
<a class="g-a-noline" data-md="{"b":"list","p":"1-1"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
歌手高空拍MV坠亡
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-2"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
佛祖朱龙广金婚
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-3"}" href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
常见租房陷阱
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-4"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
靳东回应发错诗词
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-5"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
煤老板们的影视江湖
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-6"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
1024程序员节
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-7"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
美的合并小天鹅
</a>
<a class="g-a-noline" data-md="{"b":"list","p":"1-8"}" href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
京昆高速4车相撞
</a>
操作
html='''
<html>
<body>
<ul class="mh-col">
<li class="g-ellipsis1">
<a class="g-a-noline1" data-md='{"b":"list","p":"1-1"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E6%AD%8C%E6%89%8B%E9%AB%98%E7%A9%BA%E6%8B%8DMV%E5%9D%A0%E4%BA%A1&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
歌手高空拍MV坠亡
</a>
<span class="mh-ico-up">
</span>
</li>
<li class="g-ellipsis2">
<a class="g-a-noline2" data-md='{"b":"list","p":"1-2"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BD%9B%E7%A5%96%E6%9C%B1%E9%BE%99%E5%B9%BF%E9%87%91%E5%A9%9A&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D&fr=hao_360so_history_b" target="_blank">
佛祖朱龙广金婚
</a>
<span class="mh-ico-down">
</span>
</li>
<li class="g-ellipsis3">
<a class="g-a-noline3" data-md='{"b":"list","p":"1-3"}' href="http://www.so.com/link?url=http%3A%2F%2Fbaike.so.com%2Fzt%2Fzufangfangpian.html%3Fsrc%3Dreci&q=%E5%91%A8%E6%9D%B0%E4%BC%A6%E7%9A%84%E6%AD%8C&ts=1540365398&t=c6890eee0e669832ca96d5223582d6e" target="_blank">
常见租房陷阱
</a>
<span class="mh-ico-up">
</span>
</li>
<li class="g-ellipsis4">
<a class="g-a-noline4" data-md='{"b":"list","p":"1-4"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E9%9D%B3%E4%B8%9C%E5%9B%9E%E5%BA%94%E5%8F%91%E9%94%99%E8%AF%97%E8%AF%8D&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
靳东回应发错诗词
</a>
</li>
<li class="g-ellipsis5">
<a class="g-a-noline5" data-md='{"b":"list","p":"1-5"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%85%A4%E8%80%81%E6%9D%BF%E4%BB%AC%E7%9A%84%E5%BD%B1%E8%A7%86%E6%B1%9F%E6%B9%96&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
煤老板们的影视江湖
</a>
</li>
<li class="g-ellipsis6">
<a class="g-a-noline6" data-md='{"b":"list","p":"1-6"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=1024%E7%A8%8B%E5%BA%8F%E5%91%98%E8%8A%82&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
1024程序员节
</a>
</li>
<li class="g-ellipsis7">
<a class="g-a-noline7" data-md='{"b":"list","p":"1-7"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E7%BE%8E%E7%9A%84%E5%90%88%E5%B9%B6%E5%B0%8F%E5%A4%A9%E9%B9%85&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
美的合并小天鹅
</a>
</li>
<li class="g-ellipsis8">
<a class="g-a-noline8" data-md='{"b":"list","p":"1-8"}' href="https://www.so.com/s?ie=utf-8&src=know_side_nlp_sohot&q=%E4%BA%AC%E6%98%86%E9%AB%98%E9%80%9F4%E8%BD%A6%E7%9B%B8%E6%92%9E&ob_ext=%7B%22rsv_cq%22%3A%22%5Cu5468%5Cu6770%5Cu4f26%5Cu7684%5Cu6b4c%22%2C%22rt%22%3A%22%5Cu5b9e%5Cu65f6%5Cu70ed%5Cu641c%22%2C%22rclken%22%3Anull%7D" target="_blank">
<p>新闻</p>
京昆高速4车相撞
</a>
</li> </ul>
</body>
</html>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.mh-col') #.find():查找嵌套元素
alist = items.find('li a')
print(alist)
#查找所有子元素
alist2 = items.children()
print(alist2) #查找指定的子元素
alist3 = items.children('.g-ellipsis1')
print(alist2)
#查找父元素
#注意:一个元素只有一个父元素
body = items.parent()
print(body) #查找祖先元素
content = items.parents()
print(content) #查找兄弟元素
li = doc('.mh-col .g-ellipsis1')
print(li.siblings()) #遍历 单个元素
#遍历所有的a标签
alist =doc('.mh-col li a').items()
for a in alist: print(a)
获取信息
获取属性
a =doc('.mh-col li .g-a-noline8')
print(a.attr['href'])
print(a.attr.href) 获取文本
a =doc('.mh-col li .g-a-noline8')
print(a.text()) 获取HTML
a =doc('.mh-col li .g-a-noline8')
print(a.html())
简单的DOM操作
#addClass、removerClass
#修改类名
a =doc('.mh-col li .g-a-noline8')
print(a)
a.removeClass('g-a-noline8')
print(a)
a.addClass('g-a-noline8')
print(a) #attr、css
#修改属性和样式
a =doc('.mh-col li .g-a-noline8')
print(a)
a.attr('name','link')
print(a)
a.css('font-size','14px')
print(a)
#remove
#删除标签
li = doc('.mh-col .g-ellipsis8')
print(li)
li.find('a').remove()
print(li)
更多的DOM操作:https://pyquery.readthedocs.io/en/latest/api.html
Python爬虫之PyQuery使用(六)的更多相关文章
- [Python爬虫] 之二十六:Selenium +phantomjs 利用 pyquery抓取智能电视网站图片信息
一.介绍 本例子用Selenium +phantomjs爬取智能电视网站(http://www.tvhome.com/news/)的资讯信息,输入给定关键字抓取图片信息. 给定关键字:数字:融合:电视 ...
- python爬虫神器PyQuery的使用方法
你是否觉得 XPath 的用法多少有点晦涩难记呢? 你是否觉得 BeautifulSoup 的语法多少有些悭吝难懂呢? 你是否甚至还在苦苦研究正则表达式却因为少些了一个点而抓狂呢? 你是否已经有了一些 ...
- Python爬虫小白入门(六)爬取披头士乐队历年专辑封面-网易云音乐
一.前言 前文说过我的设计师小伙伴的设计需求,他想做一个披头士乐队历年专辑的瀑布图. 通过搜索,发现网易云音乐上有比较全的历年专辑信息加配图,图片质量还可以,虽然有大有小. 我的例子怎么都是爬取图片? ...
- python爬虫之PyQuery的基本使用
PyQuery库也是一个非常强大又灵活的网页解析库,如果你有前端开发经验的,都应该接触过jQuery,那么PyQuery就是你非常绝佳的选择,PyQuery 是 Python 仿照 jQuery 的严 ...
- python爬虫之pyquery学习
相关内容: pyquery的介绍 pyquery的使用 安装模块 导入模块 解析对象初始化 css选择器 在选定元素之后的元素再选取 元素的文本.属性等内容的获取 pyquery执行DOM操作.css ...
- 【Python爬虫】PyQuery解析库
PyQuery解析库 阅读目录 初始化 基本CSS选择器 查找元素 遍历 获取信息 DOM操作 伪类选择器 PyQuery 是 Python 仿照 jQuery 的严格实现.语法与 jQuery 几乎 ...
- 爬取网易云音乐评论!python 爬虫入门实战(六)selenium 入门!
说到爬虫,第一时间可能就会想到网易云音乐的评论.网易云音乐评论里藏了许多宝藏,那么让我们一起学习如何用 python 挖宝藏吧! 既然是宝藏,肯定是用要用钥匙加密的.打开 Chrome 分析 Head ...
- Python爬虫之pyquery库的基本使用
# 字符串初始化 html = ''' <div> <ul> <li class = "item-0">first item</li> ...
- python爬虫知识点总结(六)BeautifulSoup库详解
官方学习文档:https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ 一.什么时BeautifulSoup? 答:灵活又方便的网页解析库,处 ...
随机推荐
- D - Maximizing Advertising
题目链接:https://cn.vjudge.net/contest/250168#problem/D 题目大意:给你一些点的坐标,这些点属于两个帮派,让你将这些点分进两个不能重叠的矩形中,问你最多两 ...
- mongodb系列~mongodb数据迁移
一 简介:今天来聊聊mongo的数据迁移二 迁移 1 具体迁移命令 nohup mongodump --port --db dbname --collection tablename --qu ...
- mysql 案例 ~ 分析执行完的大事务
一 简介:今天咱们来聊聊如何定位以及执行完的大事务 二 目的:通过分析binlog脚本来定位执行的大事务 三 分析脚本 mysqlbinlog --base64-output=decode-rows ...
- drozer工具的安装与使用:之二使用篇
如果英文好的同学可以直接查看官方文档 官方文档连接:https://labs.mwrinfosecurity.com/assets/BlogFiles/mwri-drozer-user-guide ...
- 编写灵活、稳定、高质量的 css代码的规范
语法 用两个空格来代替制表符(tab) -- 这是唯一能保证在所有环境下获得一致展现的方法. 为选择器分组时,将单独的选择器单独放在一行. 为了代码的易读性,在每个声明块的左花括号前添加一个空格. 声 ...
- js给<img>的src赋值
用js原生方法:document.getElementById("imageId").src = "xxxx.jpg";用Jquery方法:$("#i ...
- 修改weblogic的端口
两种方法可以修改,第一种方法是后台管理界面修改,第二种是配置文件修改,下面分别介绍: 1.后台修改 (1)进入weblogic登陆界面:(默认端口是7001) (2)登陆之后点击服务器----然后管理 ...
- SpringBoot整合MyBatis(注解版)
详情可以参考Mybatis官方文档 http://www.mybatis.org/spring-boot-starter/mybatis-spring-boot-autoconfigure/ (1). ...
- D3开发中的资料整理
D3开发台阶比较高,需要对html,css,js非常熟练,还要对SVG非常熟悉,SVG不会就不要开发D3了,下面给大家推荐一本资料,为大家未来的开发提供便利. 这个框架产品不支持ie8,是这个产品的特 ...
- gamma校正
1 gamma校正背景 在电视和图形监视器中,显像管发生的电子束及其生成的图像亮度并不是随显像管的输入电压线性变化,电子流与输入电压相比是按照指数曲线变化的,输入电压的指数要大于电子束的指数.这说明暗 ...