scrapy--BeautifulSoup

BeautifulSoup官方文档:https://beautifulsoup.readthedocs.io/zh_CN/latest/#id8

太繁琐的,精简了一些自己用的到的。

1.index.html

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

2..prettify()--标准的缩进格式输出

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# <html>

#  <head>

#   <title>

#    The Dormouse's story

#   </title>

#  </head>

#  <body>

#   <p class="title">

#    <b>

#     The Dormouse's story

#    </b>

#   </p>

#   <p class="story">

#    Once upon a time there were three little sisters; and their names were

#    <a class="sister" href="http://example.com/elsie" id="link1">

#     Elsie

#    </a>

#    ,

#    <a class="sister" href="http://example.com/lacie" id="link2">

#     Lacie

#    </a>

#    and

#    <a class="sister" href="http://example.com/tillie" id="link2">

#     Tillie

#    </a>

#    ; and they lived at the bottom of a well.

#   </p>

#   <p class="story">

#    ...

#   </p>

#  </body>

# </html>

3.选择标签,属性

soup.title

# <title>The Dormouse's story</title>

soup.title.name

# u'title'

soup.title.string

# u'The Dormouse's story'

soup.title.parent.name

# u'head'

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# u'title'

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):

    print(link.get('href'))

    # http://example.com/elsie

    # http://example.com/lacie

    # http://example.com/tillie

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...

#Tag

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

    tag = soup.b

    type(tag)

    # <class 'bs4.element.Tag'>

#Name

    tag.name

    # u'b'

    tag.name = "blockquote"

    tag

    # <blockquote class="boldest">Extremely bold</blockquote>

#Attributes

    tag['class']

    # u'boldest'

    tag.attrs

    # {u'class': u'boldest'}

    tag['class'] = 'verybold'

    tag['id'] = 1

    tag

    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']

    del tag['id']

    tag

    # <blockquote>Extremely bold</blockquote>

    tag['class']

    # KeyError: 'class'

    print(tag.get('class'))

    # None

2.find_all

1）name 参数#A.传字符串soup.find_all('b')
# [<b>The Dormouse's story</b>]

print soup.find_all('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.find_all('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<
a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#B.传正则表达式

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b

#C.传列表

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#D.传 True

for tag in soup.find_all(True):

    print(tag.name)

# html

# head

# title

# body

# p

# b

# p

# a

# a

#E.传方法

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,

#  <p class="story">Once upon a time there were...</p>,

#  <p class="story">...</p>]

2）keyword 参数

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]

soup.find_all("a", class_="sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]

3）text 参数

soup.find_all(text="Elsie")

# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))

[u"The Dormouse's story", u"The Dormouse's story"]

4）limit 参数

soup.find_all("a", limit=2)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5）recursive 参数

#调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False

soup.html.find_all("title")

# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)

# []

3.CSS选择器

（1）通过标签名查找

print soup.select('title')

#[<title>The Dormouse's story</title>]

print soup.select('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('b')

#[<b>The Dormouse's story</b>]

（2）通过类名查找

print soup.select('.sister')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

（3）通过 id 名查找

print soup.select('#link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

（4）组合查找

print soup.select('p #link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

print soup.select("head > title")

#[<title>The Dormouse's story</title>]

（5）属性查找

print soup.select('a[class="sister"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

print soup.select('p a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

soup = BeautifulSoup(html, 'lxml')

print type(soup.select('title'))

print soup.select('title')[0].get_text()

for title in soup.select('title'):

    print title.get_text()

scrapy--BeautifulSoup的更多相关文章

python scrapy,beautifulsoup,regex,sgmparser,request,connection
In [2]: import requests In [3]: s = requests.Session() In [4]: s.headers 如果你是爬虫相关的业务?抓取的网站还各种各样, ...
python大规模爬取京东
python大规模爬取京东主要工具 scrapy BeautifulSoup requests 分析步骤打开京东首页,输入裤子将会看到页面跳转到了这里,这就是我们要分析的起点我们可以看到这个页面 ...
python web开发小结
书籍 <python基础教程> <流畅的python> web框架 flask django tornado ORM sqlalchemy orator 消息队列 celery ...
爬虫四大金刚：requests，selenium，BeautifulSoup，Scrapy
一.简介爬虫 1.什么是爬虫 #1.什么是互联网? 互联网是由网络设备(网线,路由器,交换机,防火墙等等)和一台台计算机连接而成,像一张网一样. #2.互联网建立的目的? 互联网的核心价值在于数据的共 ...
scrapy vs requests+beautifulsoup
两种爬虫模式比较: 1.requests和beautifulsoup都是库,scrapy是框架. 2.scrapy框架中可以加入requests和beautifulsoup. 3.scrapy基于tw ...
Scrapy框架爬虫初探——中关村在线手机参数数据爬取
关于Scrapy如何安装部署的文章已经相当多了,但是网上实战的例子还不是很多,近来正好在学习该爬虫框架,就简单写了个Spider Demo来实践.作为硬件数码控,我选择了经常光顾的中关村在线的手机页面 ...
网络爬虫：使用Scrapy框架编写一个抓取书籍信息的爬虫服务
上周学习了BeautifulSoup的基础知识并用它完成了一个网络爬虫( 使用Beautiful Soup编写一个爬虫系列随笔汇总 ), BeautifulSoup是一个非常流行的Python网 ...
Scrapy爬取自己的博客内容
python中常用的写爬虫的库有urllib2.requests,对于大多数比较简单的场景或者以学习为目的,可以用这两个库实现.这里有一篇我之前写过的用urllib2+BeautifulSoup做的一 ...
scrapy爬虫笔记(三)------写入源文件的爬取
开始爬取网页:(2)写入源文件的爬取为了使代码易于修改,更清晰高效的爬取网页,我们将代码写入源文件进行爬取. 主要分为以下几个步骤: 一.使用scrapy创建爬虫框架: 二.修改并编写源代码,确定我 ...
scrapy爬虫笔记(二)------交互式爬取
开始网页爬取:(1)交互式爬取首先,我们使用scrapy建立起爬虫的框架.在命令行中输入 scrapy shell “url” 如:scrapy shell “http://www.baidu.co ...

随机推荐

BNU 27847——Cellphone Typing——————【字典树】
Cellphone Typing Time Limit: 5000ms Memory Limit: 131072KB This problem will be judged on UVA. Origi ...
intellijidea课程 intellijidea神器使用技巧2-1 无处不在的跳转
idea快捷键(基于windows平台) 1 书签跳转 Ctrl alt [ ] ==> 项目之间的跳转 Ctrl shift E ==> 文件之间的跳转(最近编辑的文件) Ctrl ...
spring boot Configuration Annotation Proessor not found in classpath
出现spring boot Configuration Annotation Proessor not found in classpath的提示是在用了@ConfigurationPropertie ...
一本通 1260：【例9.4】拦截导弹(Noip1999)
拦截导弹(Noip1999) 经典dp题目,这个做法并非最优解,详细参考洛谷导弹拦截,想想200分的做法. #include <iostream> #include <cstdio& ...
[转]C# 单例模式
最近在学设计模式,学到创建型模式的时候,碰到单例模式(或叫单件模式),现在整理一下笔记. 在<Design Patterns:Elements of Resuable Object-Orient ...
初学：react-native 轮播图
参考资料:http://reactscript.com/react-native-card-carousel-component/ import React, {Component} from 're ...
洛谷 P1266 速度限制
题目描述在这个繁忙的社会中,我们往往不再去选择最短的道路,而是选择最快的路线.开车时每条道路的限速成为最关键的问题.不幸的是,有一些限速的标志丢失了,因此你无法得知应该开多快.一种可以辩解的解决方案 ...
增量数据同步中间件DataLink分享（已开源）
项目介绍名称: DataLink['deitə liŋk]译意: 数据链路,数据(自动)传输器语言: 纯java开发(JDK1.8+)定位: 满足各种异构数据源之间的实时增量同步,一个分布式.可扩展 ...
CRUD全栈式编程架构之更精简的设计
精简的程度 ViewModel精简服务精简控制器精简 Index.cshmtl精简 AddOrEdit.cshtml精简效果:最精简的情况下,只需要写Entity这一个数据库实体然后加上一些简单 ...
ie6下按钮下边框消失不显示的问题
最近网站做改版,又发现一个ie6奇葩的问题,就一个很普通带边框的按钮,但在ie6中下边框不显示,ie7没有测试不知道是不是也不显示,其他浏览器正常代码和预览效果如下: <style> b ...

scrapy--BeautifulSoup

scrapy--BeautifulSoup的更多相关文章

随机推荐

热门专题