Python 3 Web Scraping, Part 1

Part One:

I. Reliable network connections:

Libraries used:

Python standard library: urllib

Third-party Python library: BeautifulSoup

Install: pip3 install beautifulsoup4

Import: import bs4

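 cat scrapetest2.py

#!/usr/local/bin/python3
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return the first <h1> in the page body, or None if it cannot be fetched or found.
    try:
        html = urlopen(url)
    except HTTPError:
        # The server answered with an error status (404, 500, ...).
        return None
    try:
        # No parser is named here, which is what triggers the UserWarning in the output.
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError:
        # bsObj.body is None when the page has no <body> tag.
        return None
    return title

x = 'http://pythonscraping.com/pages/page1.html'
title = getTitle(x)
if title is None:
    print('Title could not be found.')
else:
    print(title)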

####### Output #######

python3 scrapetest2.py

/usr/local/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 21 of the file scrapetest2.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

<h1>An Interesting Title</h1>
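The warning above goes away once a parser is named explicitly. A minimal sketch, assuming an inline HTML string (SAMPLE_HTML and get_title are hypothetical names) in place of the live page, so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so no network access is needed.
SAMPLE_HTML = "<html><body><h1>An Interesting Title</h1></body></html>"

def get_title(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    # Naming the parser ("html.parser" ships with Python) avoids the UserWarning.
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body is None when the document has no <body> tag.
        return None

title = get_title(SAMPLE_HTML)
missing = get_title("<p>no body here</p>")
print(title)
print(missing)
```

Naming the parser also keeps the behavior the same on systems where a different parser such as lxml happens to be installed.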

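II. Advanced HTML parsing

.get_text() strips every tag (hyperlinks, paragraphs, and any other tags) from the HTML it is called on and returns a string containing only the text.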
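A minimal sketch of what .get_text() does, assuming a made-up snippet (the markup and the name snippet are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the tags and text are invented for illustration.
snippet = '<p>Hello, <a href="/x">linked <b>bold</b></a> world.</p>'
soup = BeautifulSoup(snippet, "html.parser")

# All tags are dropped; only the text nodes are joined together.
plain = soup.get_text()
print(plain)
```

Because the tags are gone afterwards, it is usually best to call .get_text() last, after the tags of interest have been located.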

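 cat bs41.py

#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# Name the parser explicitly to avoid the "No parser was explicitly specified" warning.
bsObj = BeautifulSoup(html, 'html.parser')

# Collect every <span class="green"> and print its text without the tags.
nameList = bsObj.findAll('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())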

#################

Output: the text of every green span

Anna

Pavlovna Scherer

Empress Marya

Fedorovna

Prince Vasili Kuragin

Anna Pavlovna

St. Petersburg

the prince

Anna Pavlovna

Anna Pavlovna

the prince

the prince

the prince

Prince Vasili

Anna Pavlovna

Anna Pavlovna

the prince

Wintzingerode

King of Prussia

le Vicomte de Mortemart

Montmorencys

Rohans

Abbe Morio

the Emperor

the prince

Prince Vasili

Dowager Empress Marya Fedorovna

the baron

Anna Pavlovna

the Empress

the Empress

Anna Pavlovna's

Her Majesty

Baron

Funke

The prince

Anna

Pavlovna

the Empress

The prince

Anatole

the prince

The prince

Anna

Pavlovna

Anna Pavlovna

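BeautifulSoup's find() and findAll()

Purpose: filter an HTML page by the attributes of its tags to find a group of tags or a single tag.

findAll(tag, attributes, recursive, text, limit, keywords)

findAll = find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

find(tag, attributes, recursive, text, keywords)

find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

Parameters:

tag: the name of a single tag, or a Python list (or, as here, a set) of tag names; findAll({'h1', 'h2', 'h3', 'h4', 'h5', 'h6',})

attributes: a dictionary of a tag's attributes and their values; when a value is a collection, any of its members matches: .findAll("span", {"class":{"green", "red"}})

recursive: a boolean, True by default. When True, the search covers all children of the tag and the children's children, at any depth; when False, only the direct children of the tag being searched are examined.

text: matches on a tag's text content rather than on its attributes. On the War and Peace page used above:

nameList = bsObj.findAll(text='the prince')

print(len(nameList))

The result is 7.

limit: an integer x; only the first x matches are returned, in the order they appear on the page.

keywords: keyword arguments that select tags with a given attribute value.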
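The parameters above can be tried out on a small document. This is a sketch under assumptions: the markup is invented, find_all is used as the current spelling of findAll, and string= is used as the current name of the text= argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for this example.
doc = """
<div id="text">
  <h1>Title</h1>
  <p><span class="green">Anna</span> met <span class="red">the prince</span>.</p>
  <p><span class="green">the prince</span> left.</p>
  <h2>Subtitle</h2>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# tag: a list of names matches any of them.
headings = soup.find_all(["h1", "h2"])

# attributes: a dict of attribute names and values.
green = soup.find_all("span", {"class": "green"})

# recursive=False: only direct children of the tag being searched.
div = soup.find("div")
direct_spans = div.find_all("span", recursive=False)

# string (formerly text): match on text content; returns strings, not tags.
princes = soup.find_all(string="the prince")

# limit: stop after the first N matches in document order.
first_span = soup.find_all("span", limit=1)

print(len(headings), len(green), len(direct_spans), len(princes), len(first_span))
```

The spans sit inside the p tags, so the non-recursive search on the div finds none of them.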

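Keyword arguments:

allText = bsObj.findAll(id='text')

print(allText[0].get_text())

######

The following two lines are equivalent:

bsObj.findAll(id='text')

bsObj.findAll("", {"id" : "text"})

######

class is a reserved word in Python, so append an underscore to it:

bsObj.findAll(class_='green')

Alternatively, pass class as a quoted key in the attributes dictionary:

bsObj.findAll("", {"class":"green"})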
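A quick check that the keyword form and the dictionary form select the same tags. This is a sketch on invented markup; it passes the dictionary through the attrs= keyword of find_all, which is an assumption of this example rather than the exact call shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
doc = '<div id="text"><span class="green">Anna</span><span class="red">Vasili</span></div>'
soup = BeautifulSoup(doc, "html.parser")

# Keyword argument vs. attributes dictionary for id.
by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})

# class needs the trailing underscore as a keyword, but not as a dict key.
green_keyword = soup.find_all(class_="green")
green_dict = soup.find_all(attrs={"class": "green"})

print(by_keyword == by_dict, green_keyword == green_dict)
```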

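Two kinds of object in the BeautifulSoup library have appeared so far:

1. BeautifulSoup objects

2. Tag objects

Returned singly or in a list by find() and findAll(), or reached by accessing a child tag directly: bsObj.div.h1

The other two are:

3. NavigableString objects

Represent the text inside a tag rather than the tag itself.

4. Comment objects

Used to find the comments in an HTML document.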
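The four object types can be seen side by side in a short sketch. The markup is invented, and the lambda filter on Comment is one common way to collect comments, not necessarily the approach the notes had in mind.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing one tag with text and one HTML comment.
doc = "<div><h1>Title</h1><!-- a hidden note --></div>"
soup = BeautifulSoup(doc, "html.parser")

h1 = soup.div.h1      # Tag, reached by walking down from the BeautifulSoup object
text = h1.string      # NavigableString holding the text inside <h1>

# Comment objects are strings too, so filter the document's strings by type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(type(soup).__name__, type(h1).__name__, type(text).__name__)
print(comments)
```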

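Navigating trees:

Navigating the tree means finding tags by their position in the document.

1. Handling children and other descendants:

A child is exactly one level below its parent, whereas descendants are the tags at every level below a parent.

To get a tag's children, use the .children attribute.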

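######## Children ########

# cat  beau.py

#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
# Name the parser explicitly to avoid the "No parser was explicitly specified" warning.
bsObj = BeautifulSoup(html, 'html.parser')

# .children yields only the nodes one level below the table.
for child in bsObj.find("table", {'id': 'giftList'}).children:
    print(child)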

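######## Descendants ########

# cat  beau.py

#!/usr/local/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
# Name the parser explicitly to avoid the "No parser was explicitly specified" warning.
bsObj = BeautifulSoup(html, 'html.parser')

# .descendants yields every node below the table, at any depth.
for child in bsObj.find("table", {'id': 'giftList'}).descendants:
    print(child)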
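The difference between the two loops is easier to see on a tiny table. A minimal sketch, assuming invented markup that reuses the giftList id and keeps only Tag nodes so that text nodes do not clutter the comparison:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table, written on one line so there are no whitespace nodes.
doc = ('<table id="giftList"><tr><td>Item</td><td>Cost</td></tr>'
       '<tr><td>Vase</td><td>$15</td></tr></table>')
soup = BeautifulSoup(doc, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Children: only the rows directly under the table.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# Descendants: the rows and the cells inside them, in document order.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```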

2. Handling sibling tags:
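A minimal sketch of sibling navigation, assuming an invented table and Beautiful Soup's .next_siblings attribute; with html.parser no tbody is inserted, so the rows are direct children of the table and therefore siblings of each other.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two data rows, no whitespace nodes.
doc = ('<table id="giftList"><tr><th>Item</th></tr>'
       '<tr><td>Vase</td></tr><tr><td>Lamp</td></tr></table>')
soup = BeautifulSoup(doc, "html.parser")

# Start from the first row; a tag is never its own sibling,
# so .next_siblings skips the header row and yields the rows after it.
header_row = soup.find("table", {"id": "giftList"}).tr
data_rows = [row.get_text() for row in header_row.next_siblings]

print(data_rows)
```

Starting from the header row is a convenient way to skip it and keep only the data rows.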
