Using BeautifulSoup in Python

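Basic usage of the BS4 library:

1. It works best with the lxml parser. Install: pip install lxml

2. It works best with the Requests library. Install: pip install requests

3. Install bs4: pip install bs4

4. Typing pip directly does nothing (on Windows)? Fix: Environment Variables -> System variables -> Path -> New: C:\Python27\Scripts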
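After the install steps above, a quick check like the following can confirm that Python actually sees the three packages. This is only a sketch: the helper name check_installed is hypothetical, and it tests importability only, not versions.

```python
import importlib


def check_installed(modules=("bs4", "lxml", "requests")):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(name, "installed" if ok else "missing")
```

Any package reported as missing can be installed with the matching pip command from the list.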

Example: get a website's title

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "https://www.baidu.com"
response = requests.get(url)
# Pass the raw bytes so BeautifulSoup can detect the page encoding itself
soup = BeautifulSoup(response.content, 'lxml')
print(soup.title.text)
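The request above can hang or return an error page, and a page may have no title at all. A slightly more defensive variant might look like the sketch below. It is not part of the original example: the function names extract_title and fetch_title, the 10-second timeout, and the use of the built-in 'html.parser' (so the snippet runs without lxml) are all assumptions.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped <title> text of an HTML document, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url, timeout=10):
    """Download url and return its title; raises requests.HTTPError on 4xx/5xx."""
    import requests  # imported here so extract_title works without requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_title(response.content)


# Offline demonstration on an inline document.
print(extract_title("<html><head><title> Demo page </title></head></html>"))
```

Keeping the parsing step in its own function makes it possible to test it on a string without touching the network.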

Identifying tags

Example 1:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# BeautifulSoup has a built-in method for pretty-printed output
print(soup.prettify())

# Text of the <title> tag
print(soup.title.string)

# Name of the <title> tag's parent node
print(soup.title.parent.name)

# The first <p> tag
print(soup.p)

# The class attribute of the first <p> tag (returned as a list)
print(soup.p["class"])

# The first <a> tag
print(soup.a)

# Find all <a> tags
print(soup.find_all('a'))

# Find the element with id='link3'
print(soup.find(id='link3'))
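Because find_all returns a list of Tag objects, the links can be processed one at a time. The following is a short sketch using a trimmed copy of the markup above; the built-in 'html.parser' is used here as an assumption so the snippet does not need lxml.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect (id, href, text) for every <a> tag.
links = [(a["id"], a["href"], a.get_text()) for a in soup.find_all("a")]
for link in links:
    print(link)

# Multi-valued attributes such as class come back as a list.
print(soup.a["class"])
```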

Example 2:

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# .contents returns the direct children of the <p> tag (tags and text strings) as a list
print(soup.p.contents)
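Note that .contents holds text nodes as well as tags. When only the child tags are wanted, the items can be filtered by type. The sketch below uses shortened markup and the built-in 'html.parser'; both are assumptions made to keep the snippet small and free of the lxml dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p class="story">Once upon a time <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and more.</p>'
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children: strings and tags, in document order.
print(len(soup.p.contents))

# Keep only the child tags, skipping the NavigableString items.
child_tags = [node for node in soup.p.contents if isinstance(node, Tag)]
print([tag["id"] for tag in child_tags])
```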

find_all example:

# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

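# Sample markup (assumed: structured to match the tags, ids and classes queried below)
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''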

soup = BeautifulSoup(html, 'lxml')

# Find all <ul> tags
print(soup.find_all('ul'))

# Call find_all again on each result to get its <li> tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# Find elements whose id is list-1
print(soup.find_all(attrs={'id': 'list-1'}))

# Find elements whose class is element
print(soup.find_all(attrs={'class': 'element'}))

# Find all text nodes that are exactly 'Foo'
print(soup.find_all(text='Foo'))
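The attrs dictionary used above also has a keyword form, and find_all accepts a limit argument to cap the number of results. A small sketch follows; the sample markup is an assumption shaped to match the ids and classes used above, and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, shaped to match the ids and classes used above.
html = '''
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list" id="list-2">
  <li class="element">Foo</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Keyword form: id=... and class_=... (class is a reserved word in Python).
first_list = soup.find_all(id="list-1")
items = soup.find_all("li", class_="element")
print(len(first_list), len(items))

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]
print(first_two)
```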

CSS selector example:

# -*- coding:utf-8 -*-

from bs4 import BeautifulSoup

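# Sample markup (assumed: structured to match the selectors used below)
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''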

soup = BeautifulSoup(html, 'lxml')

# Elements with class panel-heading inside an element with class panel
print(soup.select('.panel .panel-heading'))

# <li> tags nested inside <ul> tags
print(soup.select('ul li'))

# Elements with class element inside the element whose id is list-2
print(soup.select('#list-2 .element'))

# Use get_text() to get the text content
for li in soup.select('li'):
    print(li.get_text())

# Attributes can be read with tag[name] or tag.attrs[name]
for ul in soup.select('ul'):
    print(ul['id'])
    # print(ul.attrs['id'])
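Two related calls are worth knowing alongside select: select_one returns only the first match, and tag.get(name) reads an attribute without raising KeyError when it is missing, which ul['id'] would do. The sketch below uses assumed sample markup in which the second list deliberately has no id, and 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the second <ul> deliberately has no id.
html = '''
<ul id="list-1"><li class="element">Foo</li></ul>
<ul><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
first_item = soup.select_one("#list-1 .element")
print(first_item.get_text())

# tag.get("id") returns None for a missing attribute instead of raising KeyError.
ids = [ul.get("id") for ul in soup.select("ul")]
print(ids)
```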
