Four selectors for parsing scraped data in Python: XPath, Beautiful Soup, pyquery, and re

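This post collects the data-parsing techniques used after a page has been fetched, as a quick reference and to keep them from being confused with one another.

The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).

Several sample HTML snippets are used throughout, so that the same tasks can be compared across the four approaches.

Before parsing, the HTML string (the html variable defined in each example) must be converted into the corresponding object. Each library does this as follows:

XPath: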

In [7]: from lxml import etree

In [8]: text = etree.HTML(html)

BeautifulSoup:

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(html, 'lxml')

PyQuery:

In [10]: from pyquery import PyQuery as pq

In [11]: doc = pq(html)

re: no object is needed; after import re, the patterns are matched directly against the HTML string.
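The conversions above can be collected into a small standalone script. This is only a minimal sketch, not the article's own code: the short sample_html string is made up for illustration, and only lxml and BeautifulSoup are shown (the pyquery object is built the same way with pq(html)).

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical miniature page, used only for this illustration.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# lxml: etree.HTML() parses the string and returns the root <html> element.
tree = etree.HTML(sample_html)

# BeautifulSoup: the second argument selects the parser; 'lxml' matches the article.
soup = BeautifulSoup(sample_html, "lxml")

# Both objects can now be queried for the same piece of data.
xpath_title = tree.xpath("//title/text()")[0]
soup_title = soup.title.string

print(xpath_title)
print(soup_title)
```

Whenever the html variable is reassigned, as it is in each example, the objects have to be rebuilt before querying.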

Example 1

html = '''

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

</body>

</html>

'''

Next, the sample HTML is parsed with each of the methods in turn.

Match the title text:

XPath:

In [16]: text.xpath('//title/text()')[0]

Out[16]: "The Dormouse's story"

BeautifulSoup:

In [18]: soup.title.string

Out[18]: "The Dormouse's story"

PyQuery:

In [20]: doc('title').text()

Out[20]: "The Dormouse's story"

re:

In [11]: re.findall(r'<title>(.*?)</title></head>', html)[0]

Out[11]: "The Dormouse's story"
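Every call above indexes the result with [0], which raises IndexError when nothing matches. For the re case, a more defensive variant might look like the sketch below. The helper name extract_title and the two test strings are hypothetical and do not come from the original post.

```python
import re


def extract_title(page):
    """Return the text inside <title>...</title>, or None if there is no title."""
    # re.S lets '.' span newlines in case the title is wrapped across lines.
    match = re.search(r"<title>(.*?)</title>", page, re.S)
    return match.group(1).strip() if match else None


with_title = "<html><head><title>The Dormouse's story</title></head></html>"
without_title = "<html><head></head><body></body></html>"

print(extract_title(with_title))
print(extract_title(without_title))
```

re.search stops at the first match and returns None on failure, so no list has to be built and indexed.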

Match the href attribute of the third a tag:

XPath: (recommended)

In [36]: text.xpath('//a[@id="link3"]/@href')[0]

Out[36]: 'http://example.com/tillie'

BeautifulSoup:

In [27]: soup.find_all(attrs={'id':'link3'})
Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
Out[33]: 'http://example.com/tillie'

PyQuery: (recommended)

In [45]: doc("#link3").attr.href

Out[45]: 'http://example.com/tillie'

re:

In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]

Out[46]: 'http://example.com/tillie'
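The BeautifulSoup lookup can also be written more compactly. The sketch below assumes a shortened copy of the example HTML and uses the standard-library 'html.parser' backend instead of 'lxml', so that it needs only bs4. Both choices are made for this illustration and are not the article's setup.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
'''

# 'html.parser' is the standard-library parser; the article uses 'lxml' instead.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None); an id is unique, so one lookup is enough.
link = soup.find("a", id="link3")
href = link["href"] if link is not None else None

# get() returns None instead of raising KeyError when the attribute is absent.
missing = link.get("title") if link is not None else None

print(href)
print(missing)
```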

Match all of the text inside the p tag:

XPath:

In [48]: text.xpath('string(//p[@class="story"])').strip()

Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'

In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))

Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'

BeautifulSoup:

In [89]: ' '.join(soup.find('p', class_='story').get_text().split())

Out[89]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'

PyQuery:

In [99]: doc('.story').text()

Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'

re: not recommended here, too cumbersome

In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="sister" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]

Out[101]:

('Once upon a time there were three little sisters; and their names were\n',

 'Elsie',

 ',\n',

 'Lacie',

 ' and\n',

 'Tillie',

 '\nand they lived at the bottom of a well.')
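For the XPath variant, the newline cleanup can also be done inside the expression with the standard XPath 1.0 function normalize-space(). The sketch below is not from the original post and uses only the story paragraph from Example 1.

```python
from lxml import etree

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

tree = etree.HTML(html)

# string() takes the text of the first matching node; normalize-space() then trims
# both ends and collapses every run of whitespace (including newlines) to one space.
story = tree.xpath('normalize-space(string(//p[@class="story"]))')

# The same result in plain Python: split on any whitespace, then re-join.
story_py = " ".join(tree.xpath('string(//p[@class="story"])').split())

print(story)
```

Splitting on any whitespace is slightly more robust than split('\n'), because it also handles tabs, indentation, and Windows line endings.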

Example 2

html = '''

<div>

<ul>

<li class="item-0">first item</li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

</ul>

</div>

'''

Match the text "second item":

XPath:

In [14]: text.xpath('//li[2]/a/text()')[0]

Out[14]: 'second item'

BeautifulSoup:

In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string

Out[23]: 'second item'

PyQuery:

In [34]: doc('.item-1>a')[0].text

Out[34]: 'second item'

re:

In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a></li>', html)[0]

Out[35]: 'second item'

Match the href attribute of the link in the fifth li tag:

XPath:

In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]

Out[36]: 'link5.html'

BeautifulSoup:

In [52]:  soup.find_all(attrs={'class': 'item-0'})

Out[52]:

[<li class="item-0">first item</li>,

 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>,

 <li class="item-0"><a href="link5.html">fifth item</a></li>]

In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']

Out[53]: 'link5.html'

PyQuery:

In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]

Out[75]: 'link5.html'

re:

In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]

Out[95]: 'link5.html'
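The XPath call above finds link5.html at index 0, while the BeautifulSoup and pyquery calls need the last match. The reason is that @class="item-0" is an exact string comparison, whereas BeautifulSoup and CSS selectors match individual class tokens and therefore also pick up class="item-0 active". The sketch below, which is not part of the original post, shows both behaviours in XPath along with a position-based alternative.

```python
from lxml import etree

html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''

tree = etree.HTML(html)

# Exact string comparison: class="item-0 active" does NOT equal "item-0".
exact = tree.xpath('//li[@class="item-0"]/a/@href')

# Token match, equivalent to the CSS selector li.item-0: pad the attribute with
# spaces and look for " item-0 " as a whole word.
token = tree.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " item-0 ")]/a/@href'
)

# Position-based alternative that does not depend on class names at all.
fifth = tree.xpath('//li[5]/a/@href')

print(exact)
print(token)
print(fifth)
```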

Example 3

<li><span class="label">房屋用途</span>普通住宅</li>

Extract the label 房屋用途 ("house usage") and the value 普通住宅 ("ordinary residence") separately:

XPath:

In [47]: text.xpath('//li/span/text()')[0]

Out[47]: '房屋用途'

In [49]: text.xpath('//li/text()')[0]

Out[49]: '普通住宅'

BeautifulSoup:

In [65]: soup.span.string

Out[65]: '房屋用途'

In [69]: soup.li.contents[1] # contents returns the list of direct child nodes

Out[69]: '普通住宅'

PyQuery:

In [70]: doc('li span').text()

Out[70]: '房屋用途'

In [75]: doc('li .label')[0].tail

Out[75]: '普通住宅'

re: omitted
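For completeness, one possible re version for this snippet is sketched below. The pattern is an assumption written for this exact markup, not something given in the original post, and it would need adjusting if the attributes or nesting changed.

```python
import re

html = '<li><span class="label">房屋用途</span>普通住宅</li>'

# Group 1 is the label inside <span>; group 2 is the text that follows the span
# (its "tail" text) up to the closing </li>.
pattern = re.compile(r'<span class="label">(.*?)</span>(.*?)</li>')
match = pattern.search(html)

# Fall back to (None, None) instead of raising when the markup does not match.
label, value = match.groups() if match else (None, None)

print(label)
print(value)
```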

Example 4

<div class="unitPrice">

    <span class="unitPriceValue">26667<i>元/平米</i></span>

</div>

Extract 26667 and 元/平米 (yuan per square meter) separately:

XPath:

In [81]: text.xpath('//div[@class="unitPrice"]/span/text()')[0]

Out[81]: '26667'

In [82]: text.xpath('//div[@class="unitPrice"]/span/i/text()')[0]

Out[82]: '元/平米'

BeautifulSoup:

In [97]: [i for i in soup.find('div', class_="unitPrice").strings]

Out[97]: ['\n', '26667', '元/平米', '\n']

In [98]: [i for i in soup.find('div', class_="unitPrice").strings][1]

Out[98]: '26667'

In [99]: [i for i in soup.find('div', class_="unitPrice").strings][2]

Out[99]: '元/平米'

PyQuery:

In [109]: doc('.unitPrice .unitPriceValue')[0].text

Out[109]: '26667'

In [110]: doc('.unitPrice .unitPriceValue i')[0].text

Out[110]: '元/平米'
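The original gives no re version for this example. A hedged sketch of one is shown below. The pattern assumes the price is a plain run of digits placed directly before the i tag, as in the snippet above, and the conversion to int is an extra step added for illustration.

```python
import re

html = '''
<div class="unitPrice">
    <span class="unitPriceValue">26667<i>元/平米</i></span>
</div>
'''

# (\d+) captures the digits directly inside the span; (.*?) captures the unit in <i>.
match = re.search(r'<span class="unitPriceValue">(\d+)<i>(.*?)</i>', html)

if match:
    price = int(match.group(1))  # convert the numeric text to an integer
    unit = match.group(2)
else:
    price, unit = None, None

print(price, unit)
```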
