Python Web Scraping: Basic Usage of the pyquery Library
```python
# Initializing from a string
html = '''
<div>
  <ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li'))
```
```python
# Initializing from a URL (pyquery downloads the page itself)
from pyquery import PyQuery as pq

doc = pq(url="http://www.baidu.com")
print(doc("head"))
```
```python
# Initializing from a local file
from pyquery import PyQuery as pq

doc = pq(filename="demo.html")
print(doc('li'))
```
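If the file's encoding matters, an equivalent approach is to read the file yourself and pass the string in. A sketch, reusing the example file name `demo.html` from above:

```python
from pyquery import PyQuery as pq

with open("demo.html", encoding="utf-8") as f:  # demo.html is the example file above
    doc = pq(f.read())
print(doc('li'))
```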
```python
# Basic CSS selectors
html = '''
<div id="container">
  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
# Note: an id is prefixed with "#" and a class with "."
print(doc('#container .list li'))
```
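Because pyquery accepts standard CSS syntax, other selector forms work the same way; a short sketch reusing the `html` string defined above:

```python
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li.item-0'))             # tag combined with a class
print(doc('a[href="link2.html"]'))  # attribute selector
```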
```python
# Finding elements
# Child elements
html = '''
<div id="container">
  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
# find() searches all descendants
lis = items.find('li')
print(type(lis))
print(lis)
# children() returns only the direct children
lis = items.children()
print(type(lis))
print(lis)
# children() also accepts a selector to filter the result
lis = items.children('.active')
print(lis)
```
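The difference matters once the markup nests: `find()` walks the whole subtree, while `children()` stops at the direct children. A sketch reusing the `items` selection from above:

```python
print(items.find('span'))      # matches the nested <span class="bold">
print(items.children('span'))  # empty: the <span> is not a direct child of the <ul>
```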
```python
# Parent element
html = '''
<div id="container">
  <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
# parent() returns the direct parent element
container = items.parent()
print(type(container))
print(container)
```
```python
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
# parents() returns all ancestors
parents = items.parents()
print(type(parents))
print(parents)
# A selector narrows the result to the matching ancestors only
parents = items.parents('.wrap')
print(parents)
```
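A related helper is `closest()`, which walks upward from the current selection and returns the nearest matching ancestor; a sketch reusing `items` from above:

```python
print(items.closest('#container'))  # the nearest ancestor with id "container"
```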
```python
# Sibling elements
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
# Note: there is no space between .item-0 and .active, so both classes
# must be on the same element
li = doc('.list .item-0.active')
print(li.siblings())
print(li.siblings('.active'))
```
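The result of `siblings()` is an ordinary PyQuery object, so it can be queried further or iterated; a sketch reusing `li` from above:

```python
for sib in li.siblings().items():
    print(sib.attr('class'))  # class attribute of each sibling <li>
```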
```python
# Iterating
# A single element
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
```
```python
# Multiple elements: items() returns a generator of PyQuery objects
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)
```
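Each iteration yields a PyQuery object, so the usual query methods are available per item; a short sketch:

```python
for li in doc('li').items():
    print(li.text(), li.attr('class'))  # text and class of every <li>
```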
```python
# Getting information
# Getting attributes
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.item-0.active a')
print(a)
# Two ways to read an attribute
print(a.attr('href'))
print(a.attr.href)
# Getting the text
print(a.text())
```
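Combining `items()` with `attr()` and `text()` gives the typical extraction loop; a sketch that collects every link in the snippet above:

```python
for a in doc('a').items():
    print(a.attr('href'), '->', a.text())
```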
```python
# Getting the HTML
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
# html() returns the markup inside the <li> tag
print(li.html())
```
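One detail worth noting (as a hedged aside): when the selection matches several elements, `html()` should return only the inner markup of the first one, whereas `text()` joins the text of all of them, mirroring jQuery's behaviour:

```python
lis = doc('li')
print(lis.html())  # inner HTML of the first <li> only
print(lis.text())  # text of every matched <li>, joined together
```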
```python
# DOM manipulation
# addClass / removeClass (the snake_case aliases are used below)
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.remove_class('active')
print(li)
li.add_class('active')
print(li)
```
```python
# attr and css
# attr() with two arguments sets an attribute, css() sets an inline style
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)
```
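With a single argument `attr()` reads the value instead of setting it, and an attribute can be dropped again with `remove_attr()`; a sketch continuing from the `li` object above:

```python
print(li.attr('name'))  # 'link', set above
li.remove_attr('name')
print(li)
```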
```python
# remove
html = '''
<div class="wrap">
  Hello, World
  <p>This is a paragraph</p>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
# Removing the <p> node leaves only the bare text of the wrapper
wrap.find('p').remove()
print(wrap.text())
```
```python
# Pseudo-class selectors
html = '''
<div class="wrap">
  <div id="container">
    <ul class="list">
      <li class="item-0">first item</li>
      <li class="item-1"><a href="link2.html">second item</a></li>
      <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
      <li class="item-1 active"><a href="link4.html">fourth item</a></li>
      <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
  </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
# The first element
li = doc('li:first-child')
print(li)
# The last element
li = doc('li:last-child')
print(li)
# The second element
li = doc('li:nth-child(2)')
print(li)
# All elements after index 2 (indices start at 0)
li = doc('li:gt(2)')
print(li)
# Elements in even positions (the 2nd, 4th, ...)
li = doc('li:nth-child(2n)')
print(li)
# Elements whose text contains "second"
li = doc('li:contains(second)')
print(li)
```
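These pseudo-classes combine freely with everything shown earlier; for example, a filtered selection can still be iterated with `items()` (a short sketch):

```python
for li in doc('li:nth-child(2n)').items():
    print(li.text())  # text of the 2nd and 4th <li>
```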