Python Web Scraping (10): Using PyQuery

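Introduction

pyquery lets you run jQuery-style queries on XML documents, with an API kept as close to jQuery's as possible. It uses lxml under the hood, which makes XML and HTML processing fast.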

Initialization

There are four ways to initialize a PyQuery object.

(1) From a string

from pyquery import PyQuery as pq

doc = pq("<html></html>")

You can pass HTML source directly as the argument to pq. The resulting doc then plays the role of the $ function in jQuery.

(2) lxml.etree

from lxml import etree

from pyquery import PyQuery as pq

doc = pq(etree.fromstring("<html></html>"))

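You can also parse the markup with lxml first and pass the resulting element to pq. Note that etree.fromstring is a strict XML parser and raises an error on malformed markup. If your HTML is incomplete or sloppy, use lxml's HTML parser instead (for example etree.HTML), which repairs it into a complete, cleanly structured tree.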

(3) From a URL

from pyquery import PyQuery as pq

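doc = pq(url='http://www.baidu.com')

This works as if you had requested the page yourself, much like fetching the link with urllib2 (urllib.request in Python 3), and parses the HTML that comes back. Older pyquery releases also accepted the URL as a bare positional string, but since pyquery 2.0 a bare URL string is no longer fetched, so the url keyword is the safe form.

(4) From a file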

from pyquery import PyQuery as pq

doc = pq(filename='hello.html')

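You can pass the path of a local file directly.

Quick Tour

Let's take a local file as the example. We pass in a file named hello.html with the following content: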

<div>

    <ul>

         <li class="item-0">first item</li>

         <li class="item-1"><a href="link2.html">second item</a></li>

         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

         <li class="item-1 active"><a href="link4.html">fourth item</a></li>

         <li class="item-0"><a href="link5.html">fifth item</a></li>

     </ul>

 </div>

Then write the following program:

from pyquery import PyQuery as pq

doc = pq(filename='hello.html')

print(doc.html())

print(type(doc))

li = doc('li')

print(type(li))

print(li.text())

Output

    <ul>

         <li class="item-0">first item</li>

         <li class="item-1"><a href="link2.html">second item</a></li>

         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

         <li class="item-1 active"><a href="link4.html">fourth item</a></li>

         <li class="item-0"><a href="link5.html">fifth item</a></li>

     </ul>

<class 'pyquery.pyquery.PyQuery'>

<class 'pyquery.pyquery.PyQuery'>

first item second item third item fourth item fifth item

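Attribute Operations

You can work with PyQuery using essentially the same syntax as jQuery. Calling attr with one argument reads an attribute; calling it with two arguments sets the attribute and returns the PyQuery object, which is why the last two print calls below show the whole element.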

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')

print(p.attr("id"))

print(p.attr("id", "plop"))

print(p.attr("id", "hello"))

Output

hello

<p id="plop" class="hello"/>

<p id="hello" class="hello"/>

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')

print(p.addClass('beauty'))

print(p.removeClass('hello'))

print(p.css('font-size', '16px'))

print(p.css({'background-color': 'yellow'}))

Output

<p id="hello" class="hello beauty"/>

<p id="hello" class="beauty"/>

<p id="hello" class="beauty" style="font-size: 16px"/>

<p id="hello" class="beauty" style="font-size: 16px; background-color: yellow"/>

DOM Operations

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')

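# append adds content at the end of p, prepend at the start
print(p.append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>'))

print(p.prepend('Oh yes!'))

d = pq('<div class="wrap"><div id="test"><a href="http://cuiqingcai.com">Germy</a></div></div>')

# insert p as the first child of the #test element inside d
p.prependTo(d('#test'))

print(p)

print(d)

# empty removes all children of d but keeps d itself
d.empty()

print(d)

Output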

<p id="hello" class="hello"> check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>

<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>

<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>

<div class="wrap"><div id="test"><p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p><a href="http://cuiqingcai.com">Germy</a></div></div>

<div class="wrap"/>

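Traversal

To iterate, use the items() method, which yields each matched element as a PyQuery object, or pass a callback such as a lambda to each(). each() returns the original selection, which is why the last print below shows every li element.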

from pyquery import PyQuery as pq

doc = pq(filename='hello.html')

lis = doc('li')

for li in lis.items():

    print(li.html())

print(lis.each(lambda e: e))

Output

first item

<a href="link2.html">second item</a>

<a href="link3.html"><span class="bold">third item</span></a>

<a href="link4.html">fourth item</a>

<a href="link5.html">fifth item</a>

<li class="item-0">first item</li>

 <li class="item-1"><a href="link2.html">second item</a></li>

 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

 <li class="item-1 active"><a href="link4.html">fourth item</a></li>

 <li class="item-0"><a href="link5.html">fifth item</a></li>

Web Requests

PyQuery can also fetch pages itself, and it turns the downloaded HTML into a PyQuery object.

from pyquery import PyQuery as pq

print(pq(url='http://cuiqingcai.com/', headers={'user-agent': 'pyquery'}))

print(pq(url='http://httpbin.org/post', data={'foo': 'bar'}, method='post', verify=True))
