I. The Beautiful Soup Library

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")

Basic elements: Tag, Name, Attributes, NavigableString, Comment
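
A minimal sketch of how the five basic elements show up in practice. The HTML snippet and the variable names are made up for this illustration; only the bs4 classes and attributes come from the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, made up for this illustration.
html = '<p class="title" id="intro"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                    # Tag: the first <p> element
tag_name = tag.name             # Name: the tag's name as a string
tag_attrs = tag.attrs           # Attributes: a dict (class is stored as a list)
text_node = tag.b.string        # NavigableString: the text inside <b>
comment_node = tag.contents[1]  # Comment: the second child of <p>

print(type(tag).__name__, tag_name, tag_attrs)
print(type(text_node).__name__, repr(str(text_node)))
print(type(comment_node).__name__, repr(str(comment_node)))
```

Note that Comment is a subclass of NavigableString, so checking the type is the reliable way to tell a comment from ordinary text.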

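.contents: list of child nodes; all direct children of <tag> are stored in a list

.children: iterator over the child nodes

.descendants: iterator over all descendant nodes

.parent: the node's parent tag

.parents: iterator over the node's ancestor tags

.next_sibling(s): the next sibling node in HTML text order (.next_siblings iterates over all following siblings)

.previous_sibling(s): the previous sibling node in HTML text order (.previous_siblings iterates over all preceding siblings)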
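
The traversal attributes above can be tried on a trimmed copy of the demo page. This is only a sketch: the shortened paragraph text and the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the demo page (text shortened, no line breaks between tags).
demo = ('<html><head><title>This is a python demo page</title></head>'
        '<body><p class="title"><b>The demo python introduces several '
        'python courses.</b></p>'
        '<p class="course">Courses: '
        '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
        ' and '
        '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
        '</p></body></html>')
soup = BeautifulSoup(demo, "html.parser")

# Downward: .contents is a list, .children and .descendants are iterators.
body_children = [child.name for child in soup.body.children]
body_descendants = [d.name for d in soup.body.descendants if d.name]

# Upward: .parent is a single tag, .parents walks up to the document root.
title_parent = soup.title.parent.name
ancestors_of_a = [p.name for p in soup.a.parents]

# Sideways: siblings can be plain strings, not only tags.
first_link = soup.a
next_node = first_link.next_sibling
later_nodes = list(first_link.next_siblings)

print(body_children, body_descendants)
print(title_parent, ancestors_of_a)
print(repr(next_node), len(later_nodes))
```

Because sibling and child nodes may be NavigableString objects, code that expects tags usually needs a type check before reading .name or attributes.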

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

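II. Organizing and Extracting Information

1. Three forms of information markup:

XML: tags in angle brackets

JSON: typed key-value pairs

YAML: untyped key-value pairs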
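
A small sketch of the difference between the three forms, using only the standard library. The record (name and age) is hypothetical, and the YAML text is shown as a string for comparison only, since parsing it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, written in each of the three markup forms.
xml_text = '<person><name>Tian</name><age>30</age></person>'
json_text = '{"name": "Tian", "age": 30}'
yaml_text = 'name: Tian\nage: 30\n'   # for comparison only, not parsed here

person_xml = ET.fromstring(xml_text)
xml_age = person_xml.find('age').text   # XML element text is always a string

person_json = json.loads(json_text)
json_age = person_json['age']           # JSON carries the type: an int

print(type(xml_age).__name__, type(json_age).__name__)
```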

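3. General methods of information extraction

(1) Fully parse the markup format, then extract the key information

(2) Ignore the markup format and search for the key information directly

(3) Hybrid method: combine parsing and searching

Example: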

>>>  import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
demo
SyntaxError: unexpected indent
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html,parser")
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    soup = BeautifulSoup(demo,"html,parser")
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\__init__.py", line 196, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html,parser. Do you need to install a parser library?
>>> yes
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    yes
NameError: name 'yes' is not defined
>>> soup = BeautifulSoup(demo,"html.parser")
>>> from link in soup.find_all('a')
SyntaxError: invalid syntax
>>> for link in soup.find_all('a')
SyntaxError: invalid syntax
>>> for link in soup.find_all('a'):
        print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
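
For comparison, here is a rough sketch of method (2) and method (3) applied to the same two links. The shortened HTML string and the regular expression are assumptions made for this example; the regex is deliberately simple and would not survive arbitrary HTML.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the course paragraph from the demo page.
demo = ('<p class="course">Courses: '
        '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
        '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>')

# Method (2): ignore the markup and search the raw text directly.
urls_by_search = re.findall(r'href="(.*?)"', demo)

# Method (3): parse the markup to locate <a> tags, then read one attribute.
soup = BeautifulSoup(demo, "html.parser")
urls_by_parse = [link.get('href') for link in soup.find_all('a')]

print(urls_by_search == urls_by_parse)
```

Direct search is fast but blind to structure; the parsed version keeps working if attribute order or quoting style changes.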

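4. Methods for searching HTML content with the bs4 library

<>.find_all(name,attrs,recursive,string,**kwargs)

Returns a list containing the search results

name: search string matched against tag names (a list of names, True, or a compiled regular expression also works)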

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
        print(tag.name)

html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
        print(tag.name)

body
b
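
The regular expression is applied as a search, which is why 'b' matches both body and b. A short sketch, on a made-up page, of how anchoring the pattern narrows the match:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal page, made up for this illustration.
html = '<html><head><title>T</title></head><body><p><b>x</b></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

contains_b = [t.name for t in soup.find_all(re.compile('b'))]     # any name containing "b"
exactly_b = [t.name for t in soup.find_all(re.compile('^b$'))]    # only the <b> tag
contains_t = [t.name for t in soup.find_all(re.compile('t'))]     # names containing "t"

print(contains_b, exactly_b, contains_t)
```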

attrs: search string matched against tag attribute values; a specific attribute can be named as a keyword argument

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
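
A sketch of equivalent ways to filter by attribute, assuming a shortened copy of the course paragraph. The variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="course">'
        '<a class="py1" id="link1">Basic Python</a> and '
        '<a class="py2" id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

by_positional = soup.find_all('a', 'py1')             # second positional argument matches the CSS class
by_class_kw = soup.find_all('a', class_='py1')        # class_ avoids the reserved word "class"
by_attrs_dict = soup.find_all(attrs={'id': 'link2'})  # dict form works for any attribute name
by_regex = soup.find_all(id=re.compile('^link'))      # regex against the attribute value

print(len(by_positional), len(by_class_kw), len(by_attrs_dict), len(by_regex))
```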

recursive: whether to search all descendants; defaults to True

>>> soup.find_all('a',recursive=False)
[]
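
The empty list appears because, with recursive=False, only direct children of the node being searched are checked, and the <a> tags sit several levels below the document root. A sketch on a shortened, made-up page:

```python
from bs4 import BeautifulSoup

html = ('<html><body><p class="course">'
        '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

top_level = soup.find_all('a', recursive=False)       # <a> is not a direct child of the document
from_body = soup.body.find_all('a', recursive=False)  # <a> is a grandchild of <body>
from_p = soup.p.find_all('a', recursive=False)        # both links are direct children of <p>

print(len(top_level), len(from_body), len(from_p))
```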

string: search string matched against the text between <>...</>

>>> soup.find_all(string = 'Basic Python')
['Basic Python']
>>> soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

>>> soup(string = 'Basic Python')
['Basic Python']
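
Calling soup(...) is shorthand for soup.find_all(...), and the same shortcut works on any Tag. A small sketch with a hypothetical table row:

```python
from bs4 import BeautifulSoup

# Hypothetical table row, made up for this illustration.
html = '<table><tr><td>1</td><td>Example University</td><td>95.0</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

cells_long = soup.find_all('td')
cells_short = soup('td')             # same call, written as soup(...)
row = soup.tr
row_cells = row('td')                # any Tag can be called the same way
cell_text = [str(td.string) for td in row_cells]

print(cells_long == cells_short, cell_text)
```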

Extension methods:

<>.find()  <>.find_parents()  <>.find_parent()  <>.find_next_sibling()  <>.find_next_siblings()  <>.find_previous_sibling()  <>.find_previous_siblings()
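
These methods take the same arguments as find_all but search in a different direction, and the singular forms return one result or None. A sketch on a shortened, made-up page:

```python
from bs4 import BeautifulSoup

html = ('<body><p class="title"><b>Title</b></p>'
        '<p class="course"><a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p></body>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find('a')                        # first match only
enclosing_p = first_link.find_parent('p')          # nearest <p> ancestor
named_ancestors = [t.name for t in first_link.find_parents(['p', 'body'])]
second_link = first_link.find_next_sibling('a')    # next <a> at the same level
title_p = enclosing_p.find_previous_sibling('p')   # previous <p> at the same level
missing = soup.find('table')                       # None when nothing matches

print(enclosing_p['class'], named_ancestors, second_link['id'], title_p['class'], missing)
```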

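III. A Targeted Crawler for Chinese University Rankings

Technical approach: requests + bs4

Feasibility: check the site's robots protocol (robots.txt)

Step 1: fetch the page content, getHTMLText()

Step 2: extract the data into a suitable data structure, fillUnivList()

Step 3: use the data structure to display the results, printUnivList()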

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    # Download the page; return "" if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    # Collect [rank, name, score] from each row of the table body.
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # filter out non-tag children such as newline strings
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    #tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header row: rank, university name, total score
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "总分"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()
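
The row-extraction step can be tried offline. The sketch below reuses the logic of fillUnivList on a hypothetical two-row table (the column layout of rank, name, region, score is an assumption) and adds a guard for the case where no tbody is found, for example when the download returned "".

```python
import bs4
from bs4 import BeautifulSoup

def fill_univ_list(ulist, html):
    """Same extraction logic as fillUnivList, with a guard for a missing <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:                        # nothing to parse, leave ulist unchanged
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip whitespace strings between rows
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

# Hypothetical table: rank, name, region, score.
sample = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>Region X</td><td>95.0</td></tr>
<tr><td>2</td><td>University B</td><td>Region Y</td><td>90.5</td></tr>
</tbody></table>
"""
rows = []
fill_univ_list(rows, sample)
print(rows)

empty_rows = []
fill_univ_list(empty_rows, "")   # simulates a failed download
print(empty_rows)
```

The isinstance check matters here because the line breaks between the rows are children of tbody too, and they are strings rather than tags.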

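After optimization (the university-name column is padded with the full-width space chr(12288) so that Chinese text lines up):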

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    # Download the page; return "" if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    # Collect [rank, name, score] from each row of the table body.
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # filter out non-tag children such as newline strings
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # Field 1 uses argument 3, chr(12288), as its fill character.
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header row: rank, university name, total score
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()
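
A standalone sketch of what the fill character changes, using only str.format. The sample name 清华大学 (Tsinghua University) is just illustrative data.

```python
fullwidth_space = chr(12288)   # U+3000, the full-width (ideographic) space

name = "清华大学"   # Tsinghua University, sample data only

# Default fill: ASCII spaces, which are narrower than Chinese characters on screen.
default_fill = "{:^10}".format(name)

# Custom fill: the fill character is passed in through a nested replacement field.
wide_fill = "{0:{1}^10}".format(name, fullwidth_space)

print(repr(default_fill))
print(repr(wide_fill))
```

Both strings are 10 characters long, but only the second pads with a character as wide as the text itself, so the columns stay aligned when names have different lengths.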
