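Topic 1: Regular expressions explained, with basic usage

What is a regular expression

A regular expression is a logical formula for operating on strings: a set of predefined special characters, and combinations of them, forms a "rule string" that expresses a filtering logic to apply to other strings.

(Regular expressions are not unique to Python; in Python they are implemented by the re module.)

Sites for testing regular expressions
1. Testing tool: an online regular expression tester
2. Tutorial: the Runoob (菜鸟教程) guide to the re module

Basic reference table

(The original was a screenshot and is not legible. See the regular expression metacharacter page of the Runoob tutorial instead.)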

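Using the re library in detail

Using re.match()
re.match tries to match a pattern at the start of the string; if the pattern does not match there, match() returns None. The basic call is:

re.match(pattern, string, flags=0)

The most basic usage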

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
        #41
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)  # a raw string keeps \s, \d and \w intact
print(result)
        #<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
print(result.span())  # span() returns the (start, end) range of the match
        #(0, 41)
print(result.group())  # group() returns the matched text
        #Hello 123 4567 World_This is a Regex Demo

Generic matching ('.*' matches whatever lies in between, but the start of the pattern must still be specified)

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result1 = re.match(r'^Hello.*Demo$', content)  # '.*' covers everything between Hello and Demo
print(result1)
        #<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
print(result1.group())
        #Hello 1234567 World_This is a Regex Demo
print(result1.span())
        #(0, 40)

Capturing a target

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
print(result)
        #<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
print(result.group(1))  # the text captured by the first pair of parentheses; group(2) would be the second pair, and so on
        #1234567
print(result.span())
        #(0, 40)
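To make the "group(2) and so on" remark concrete, here is a small sketch that captures two targets from the same sample string. The second group and the named-group variant are additions for illustration, not part of the original example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Two capture groups: the digits, then the word that follows them.
result = re.match(r'^Hello\s(\d+)\s(\w+)\s.*Demo$', content)
number, word = result.group(1), result.group(2)
print(number, word)       # 1234567 World_This
print(result.groups())    # ('1234567', 'World_This')

# Named groups read better than positional indexes once there are several.
named = re.match(r'^Hello\s(?P<number>\d+)\s(?P<word>\w+)', content)
print(named.group('number'), named.groupdict())
```

Groups are numbered by the position of their opening parenthesis, counting from 1; group(0), or group() with no argument, is the whole match.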

Greedy matching

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
        #<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
print(result.group(1))  # prints 7: the greedy '.*' swallowed every digit except the single one that \d+ needs
        #7

Non-greedy matching (a '?' after the quantifier makes it non-greedy, so it matches as few characters as possible)

import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
        #<re.Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
print(result.group(1))
        #1234567
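The two modes are easier to compare side by side. The sketch below reuses the sample string; the last pattern is an extra illustration of a common pitfall and is not from the original notes.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

greedy = re.match(r'^He.*(\d+).*Demo$', content)
lazy = re.match(r'^He.*?(\d+).*Demo$', content)
print(greedy.group(1))    # 7
print(lazy.group(1))      # 1234567

# Pitfall: a lazy quantifier at the very end of a pattern captures nothing,
# because "as few characters as possible" is zero characters.
tail = re.match(r'^Hello\s(.*?)', content)
print(repr(tail.group(1)))    # ''
```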

Match modes (the newline problem)

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
result1 = re.match(r'^He.*?(\d+).*?Demo$', content)  # without re.S, '.' does not match the newline
print(result1)
        #None
result2 = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)  # with re.S, '.' also matches newlines
print(result2.group(1))
        #1234567
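As a quick check of what re.S changes, the sketch below runs the same pattern three ways. The inline (?s) form is an addition here; it is the in-pattern spelling of the same flag.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo'
pattern = r'^He.*?(\d+).*?Demo$'

without_flag = re.match(pattern, content)
with_flag = re.match(pattern, content, re.S)
inline_flag = re.match(r'(?s)' + pattern, content)   # (?s) switches on re.S from inside the pattern

print(without_flag)            # None
print(with_flag.group(1))      # 1234567
print(inline_flag.group(1))    # 1234567
```

re.S is short for re.DOTALL. It only changes what '.' matches and does not affect '^' or '$'.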

Escaping

import re

content = 'price is $5.00'
result = re.match('price is $5.00', content)  # '$' and '.' are special characters in a pattern
print(result)
        #None
result1 = re.match(r'price is \$5\.00', content)  # a backslash escapes a special character
print(result1)
        #<re.Match object; span=(0, 14), match='price is $5.00'>
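When the literal text is not known in advance, escaping by hand is error-prone. A minimal sketch using the standard re.escape helper, which the original notes do not cover:

```python
import re

content = 'price is $5.00'
literal = 'price is $5.00'

# re.escape adds the backslashes needed to match the text literally.
escaped = re.escape(literal)
result = re.match(escaped, content)
print(result.group())    # price is $5.00

# An unescaped '.' matches any character, so it can accept unintended input.
loose_ok = bool(re.match(r'price is \$5.00', 'price is $5x00'))
strict_ok = bool(re.match(escaped, 'price is $5x00'))
print(loose_ok, strict_ok)    # True False
```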

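Summary: prefer generic matching with '.*', use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the text contains newlines.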

Using re.search() (scans the whole string and returns the first successful match)

Comparing re.search() with re.match()

import re

content = 'Extra stings Hello 123232 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)
        #None: the match fails right at the start of the string
result1 = re.search(r'Hello.*?(\d+).*?Demo', content)  # re.search does not care how the string begins; it finds the pattern wherever it first occurs
print(result1)
        #<re.Match object; span=(13, 52), match='Hello 123232 World_This is a Regex Demo'>
print(result1.group(1))
        #123232

Summary: in general, if search can do the job, use it instead of match.
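The difference comes down to where the pattern is allowed to start and end. The sketch below runs the same pattern through match, search and fullmatch; re.fullmatch is an addition that the original notes do not cover.

```python
import re

content = 'Extra stings Hello 123232 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

at_start = re.match(pattern, content)         # must match at position 0
anywhere = re.search(pattern, content)        # may match at any position
whole = re.fullmatch(pattern, content)        # must match the entire string
anchored = re.search('^' + pattern, content)  # '^' makes search behave like match here

print(at_start)           # None
print(anywhere.span())    # (13, 52)
print(whole)              # None
print(anchored)           # None
```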

Matching drills

Sample data

html = '''<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">
    经典老歌列表
</p>
<ul id="list"class="list-group">
    <li data-view="2">一路上有你</li>
    <li data-view="7">
        <a href="/2.mp3"singer="任贤齐">沧海一声笑</a>
    </li>
    <li data-view="4"class="active">
        <a href="/3.mp3"singer="齐秦">往事随风</a>
    </li>
    <li data-view="6"><a href="/4.mp3"singer="begoud">光辉岁月</a></li>
    <li data-view="5"><a href="/5.mp3"singer="陈慧琳">记事本</a><li>
    <li data-view="5">
        <a href="/6.mp3"singer="邓丽君">但愿人长久</a>
    </li>
</ul>
</div>'''

import re

# html is the sample string defined above
result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))
            #齐秦 往事随风

Drill 1

import re

# html is the sample string defined above
# without the word 'active' in the pattern, the first <li> that contains a singer attribute wins
result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))
            #任贤齐 沧海一声笑

Drill 2

import re

# html is the sample string defined above
# without re.S, '.' cannot cross a line break, so only an item written on a single line can match
result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))
            #begoud 光辉岁月

Drill 3

re.findall() (searches the string and returns every match as a list)

Basic usage:

import re

# html is the sample string defined above
result = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(result)  # a list with one tuple of captured groups per match
print(type(result))
for item in result:
    print(item)
    print(item[0], item[1], item[2])

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'begoud', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
<class 'list'>
('/2.mp3', '任贤齐', '沧海一声笑')
/2.mp3 任贤齐 沧海一声笑
('/3.mp3', '齐秦', '往事随风')
/3.mp3 齐秦 往事随风
('/4.mp3', 'begoud', '光辉岁月')
/4.mp3 begoud 光辉岁月
('/5.mp3', '陈慧琳', '记事本')
/5.mp3 陈慧琳 记事本
('/6.mp3', '邓丽君', '但愿人长久')
/6.mp3 邓丽君 但愿人长久

Output
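Indexing tuples by position gets hard to read as the number of groups grows. As an alternative sketch, re.finditer combined with named groups yields one dictionary per song. The shortened sample and the group names href, singer and title are choices made for this illustration.

```python
import re

html = '''<ul id="list"class="list-group">
    <li data-view="7">
        <a href="/2.mp3"singer="任贤齐">沧海一声笑</a>
    </li>
    <li data-view="6"><a href="/4.mp3"singer="begoud">光辉岁月</a></li>
</ul>'''

pattern = re.compile(
    r'<li.*?href="(?P<href>.*?)".*?singer="(?P<singer>.*?)">(?P<title>.*?)</a>',
    re.S,
)

# finditer yields match objects lazily; groupdict() maps group names to captured text.
songs = [m.groupdict() for m in pattern.finditer(html)]
for song in songs:
    print(song['href'], song['singer'], song['title'])
```

findall is still the shortest option when only the captured strings are needed; finditer is useful when the match positions or named fields matter.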

Handling line breaks and optional tags: in "(<a.*?>)?" the parentheses form a group, and the "?" means the a tag may or may not be present

import re

# html is the sample string defined above
result = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(result)
for item in result:
    print(item[1])

[('', '一路上有你', ''), ('<a href="/2.mp3"singer="任贤齐">', '沧海一声笑', '</a>'), ('<a href="/3.mp3"singer="齐秦">', '往事随风', '</a>'), ('<a href="/4.mp3"singer="begoud">', '光辉岁月', '</a>'), ('<a href="/5.mp3"singer="陈慧琳">记事本</a><li>\n    <li data-view="5">\n        <a href="/6.mp3"singer="邓丽君">', '但愿人长久', '</a>')]
一路上有你
沧海一声笑
往事随风
光辉岁月
但愿人长久

Note that 记事本 is missing from the titles. In the sample data its item ends with <li> instead of </li>, so with re.S the optional (<a.*?>)? group stretches across that item and the match ends up capturing 但愿人长久 instead.

Output
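The last tuple in the output above swallows a whole list item, because '.*?' combined with re.S is free to run across tags. One possible way to limit that, sketched here on a shortened copy of the sample, is to use negated character classes such as [^>]* that cannot leave the current tag. The stricter pattern is an illustration, not the original author's solution.

```python
import re

html = '''<ul>
    <li data-view="2">一路上有你</li>
    <li data-view="5"><a href="/5.mp3"singer="陈慧琳">记事本</a><li>
    <li data-view="5">
        <a href="/6.mp3"singer="邓丽君">但愿人长久</a>
    </li>
</ul>'''

# Original pattern: the malformed <li> makes the optional group span two items.
loose = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
loose_titles = [item[1] for item in loose]
print(loose_titles)    # ['一路上有你', '但愿人长久']

# [^>]* cannot run past the end of a tag, and the pattern no longer
# depends on a closing </li>, so the malformed item is still found.
strict_titles = re.findall(r'<li[^>]*>\s*(?:<a[^>]*>)?(\w+)', html)
print(strict_titles)   # ['一路上有你', '记事本', '但愿人长久']
```

For anything beyond small, predictable snippets, an HTML parser is usually a safer tool than a regular expression.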

re.sub() (replaces every match in the string and returns the resulting string)

Form: re.sub(pattern, replacement, original string)

Deleting the matched content

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'\d+', '', content)  # replace the digits with an empty string
print(content)
        #Extra stings Hello  World_This is a Regex Demo Extra stings

Replacing the matched content

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'\d+', 'Replacement', content)
print(content)
        #Extra stings Hello Replacement World_This is a Regex Demo Extra stings

Appending to the matched content

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'(\d+)', r'\1 45545', content)  # \1 refers back to the text captured by group 1
print(content)
        #Extra stings Hello 1234567 45545 World_This is a Regex Demo Extra stings
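The replacement does not have to be a fixed string. As a further sketch (not part of the original notes), re.sub also accepts a function that computes the replacement for each match, a count argument limits the number of replacements, and re.subn reports how many replacements were made.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'

# A function replacement receives each match object and returns the new text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), content)
print(doubled)    # Extra stings Hello 2469134 World_This is a Regex Demo Extra stings

# count=1 replaces only the first occurrence.
first_only = re.sub(r'stings', 'strings', content, count=1)
print(first_only)

# subn returns the new string together with the number of replacements.
fixed, replaced = re.subn(r'stings', 'strings', content)
print(replaced)   # 2
```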

Practice

import re

html = '''<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">
    经典老歌列表
</p>
<ul id="list"class="list-group">
    <li data-view="2">一路上有你</li>
    <li data-view="7">
        <a href="/2.mp3"singer="任贤齐">沧海一声笑</a>
    </li>
    <li data-view="4"class="active">
        <a href="/3.mp3"singer="齐秦">往事随风</a>
    </li>
    <li data-view="6"><a href="/4.mp3"singer="begoud">光辉岁月</a></li>
    <li data-view="5"><a href="/5.mp3"singer="陈慧琳">记事本</a></li>
    <li data-view="5">
        <a href="/6.mp3"singer="邓丽君">但愿人长久</a>
    </li>
</ul>
</div>'''
html = re.sub('<a.*?>|</a>', '', html)  # remove the opening and closing a tags
print(html)
result = re.findall('<li.*?>(.*?)</li>', html, re.S)
print(result)
for item in result:
    print(item.strip())  # strip() removes the surrounding newlines and spaces

re.sub practice

<div id="songs-list">
<h2 class="title">经典老歌</h2>
<p class="introduction">
    经典老歌列表
</p>
<ul id="list"class="list-group">
    <li data-view="2">一路上有你</li>
    <li data-view="7">
        沧海一声笑
    </li>
    <li data-view="4"class="active">
        往事随风
    </li>
    <li data-view="6">光辉岁月</li>
    <li data-view="5">记事本</li>
    <li data-view="5">
        但愿人长久
    </li>
</ul>
</div>
['一路上有你', '\n        沧海一声笑\n    ', '\n        往事随风\n    ', '光辉岁月', '记事本', '\n        但愿人长久\n    ']
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久

Output

re.compile() (compiles a regular expression string into a pattern object so that it can be reused)

Basic usage

import re

content = '''hello 1234545 World_This
is a Regex Demo'''
pattern = re.compile('hello.*Demo', re.S)
result = re.match(pattern, content)  # equivalent to pattern.match(content)
print(result)
        #<re.Match object; span=(0, 40), match='hello 1234545 World_This\nis a Regex Demo'>
result1 = re.match('hello.*Demo', content, re.S)
print(result1)
        #<re.Match object; span=(0, 40), match='hello 1234545 World_This\nis a Regex Demo'>
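A compiled pattern carries its flags with it and exposes the same operations as methods, which is where the reuse pays off. A short sketch; the capture group is added here for illustration.

```python
import re

content = '''hello 1234545 World_This
is a Regex Demo'''

pattern = re.compile(r'hello.*?(\d+).*Demo', re.S)

# The compiled object offers the same operations as the module-level functions.
first_number = pattern.match(content).group(1)
found_span = pattern.search(content).span()
all_numbers = pattern.findall(content)
replaced = pattern.sub('replaced', content)

print(first_number)    # 1234545
print(found_span)      # (0, 40)
print(all_numbers)     # ['1234545']
print(replaced)        # replaced
```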

Hands-on practice

Task: scrape the detail-page link, title, author and publication date of every book on the Douban Books home page.

Basic approach:

import re
import requests

content = requests.get('https://book.douban.com/').text
#print(content)
pattern = re.compile('<li.*?"cover">.*?href="(.*?)" title="(.*?)".*?"more-meta".*?"author">(.*?)</span>.*?"year">(.*?)</span>.*?</li>', re.S)
result = re.findall(pattern, content)
print(result)

[('https://book.douban.com/subject/30274766/?icn=index-editionrecommend', '潦草', '\n                    贾行家\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30228612/?icn=index-editionrecommend', '游泳回家', '\n                    [英]德博拉·利维\n                  ', '\n                    2018-8-1\n                  '), ('https://book.douban.com/subject/30280804/?icn=index-editionrecommend', '薛兆丰经济学讲义', '\n                    薛兆丰\n                  ', '\n                    2018-7-1\n                  '), ('https://book.douban.com/subject/30185326/?icn=index-editionrecommend', '给孩子的未来脑计划', '\n                    魏坤琳\n                  ', '\n                    2018-4\n                  '), ('https://book.douban.com/subject/30288807/?icn=index-editionrecommend', '加密与解密（第4版）', '\n                    段钢\n                  ', '\n                    2018-9-1\n                  '), ('https://book.douban.com/subject/27176955/?icn=index-latestbook-subject', '罗特小说集2', '\n                    [奥] 约瑟夫·罗特&nbsp;/&nbsp;刘炜 主编\n                  ', '\n                    2018-6\n                  '), ('https://book.douban.com/subject/30222403/?icn=index-latestbook-subject', '明治天皇', '\n                    (美) 唐纳德·基恩 (Donald Keene)\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30193776/?icn=index-latestbook-subject', '西游八十一案', '\n                    陈渐\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30274068/?icn=index-latestbook-subject', '经济学的思维方式', '\n                    托马斯·索维尔\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30246163/?icn=index-latestbook-subject', '默读.2', '\n                    Priest\n                  ', '\n                    2018-6\n                  '), 
('https://book.douban.com/subject/30199434/?icn=index-latestbook-subject', '原生家庭', '\n                    （美）苏珊·福沃德博士&nbsp;/&nbsp;（美）克雷格·巴克\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/28170953/?icn=index-latestbook-subject', '荣耀', '\n                    [美]纳博科夫\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30167361/?icn=index-latestbook-subject', '柏林1：石之城', '\n                    [美] 贾森·卢茨\n                  ', '\n                    2018-9\n                  '), ('https://book.douban.com/subject/30229646/?icn=index-latestbook-subject', '阿波罗', '\n                    [英] 扎克·斯科特\n                  ', '\n                    2018-7-1\n                  '), ('https://book.douban.com/subject/27197821/?icn=index-latestbook-subject', '洞穴', '\n                    [葡] 若泽·萨拉马戈\n                  ', '\n                    2018-6\n                  '), ('https://book.douban.com/subject/27661637/?icn=index-latestbook-subject', '放牧人生', '\n                    [英]詹姆斯·里班克斯\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30217911/?icn=index-latestbook-subject', '诗人继续沉默', '\n                    [以色列] 亚伯拉罕·耶霍舒亚\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30252127/?icn=index-latestbook-subject', '今天也要好好地过', '\n                    [日] 吉竹伸介\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30243869/?icn=index-latestbook-subject', '冷山', '\n                    [美] 查尔斯·弗雷泽\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30198886/?icn=index-latestbook-subject', '艾略特·厄威特的巴黎', '\n                    [美] 艾略特·厄威特&nbsp;/&nbsp;Elliott Erwitt\n                  ', '\n                    2018-6\n                  '), 
('https://book.douban.com/subject/30203733/?icn=index-latestbook-subject', '阳光劫匪友情测试', '\n                    [日] 伊坂幸太郎\n                  ', '\n                    2018-8-1\n                  '), ('https://book.douban.com/subject/26877230/?icn=index-latestbook-subject', '《英国史》（全三卷）', '\n                    [英]西蒙·沙玛\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30175383/?icn=index-latestbook-subject', '犯罪者的七不规范', '\n                    张舟\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30259504/?icn=index-latestbook-subject', '伟大的海', '\n                    [英]大卫‧阿布拉菲亚\n                  ', '\n                    2018-7-1\n                  '), ('https://book.douban.com/subject/30194496/?icn=index-latestbook-subject', '朋友之间', '\n                    [以]阿摩司·奥兹\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30280610/?icn=index-latestbook-subject', '天长地久', '\n                    龙应台\n                  ', '\n                    2018-8-1\n                  '), ('https://book.douban.com/subject/30280340/?icn=index-latestbook-subject', '格林童话', '\n                    格林兄弟\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30235060/?icn=index-latestbook-subject', '情感勒索', '\n                    [美] 苏珊·福沃德&nbsp;/&nbsp;唐娜·弗雷泽\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30238143/?icn=index-latestbook-subject', '奥尔拉', '\n                    [法] 纪尧姆·索雷尔 编绘\n                  ', '\n                    2018-9\n                  '), ('https://book.douban.com/subject/30247531/?icn=index-latestbook-subject', '听音乐（全彩插图第11版）', '\n                    [美] 罗杰·凯密恩&nbsp;/&nbsp;Roger Kamien\n                  ', '\n                    2018-7\n                  '), 
('https://book.douban.com/subject/27598730/?icn=index-latestbook-subject', '突然死亡', '\n                    [墨]阿尔瓦罗·恩里克\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30231921/?icn=index-latestbook-subject', '中国古代的谣言与谶语', '\n                    栾保群\n                  ', '\n                    2018-7-1\n                  '), ('https://book.douban.com/subject/30254431/?icn=index-latestbook-subject', '被猜死的人', '\n                    田耳\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30187740/?icn=index-latestbook-subject', '李霖灿读画四十年', '\n                    李霖灿\n                  ', '\n                    2018-6\n                  '), ('https://book.douban.com/subject/30218856/?icn=index-latestbook-subject', '房客', '\n                    [英] 萨拉·沃特斯\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/27045307/?icn=index-latestbook-subject', '唐物的文化史', '\n                    [日] 河添房江\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30258687/?icn=index-latestbook-subject', '战略级天使', '\n                    白伯欢\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30207028/?icn=index-latestbook-subject', '中国烟草史', '\n                    [美]班凯乐\n                  ', '\n                    2018-7\n                  '), ('https://book.douban.com/subject/30237869/?icn=index-latestbook-subject', '爱情故事的两个版本', '\n                    [塞尔维亚]雅丝米娜·米哈伊洛维奇&nbsp;/&nbsp;[塞尔维亚] 米洛拉德·帕维奇\n                  ', '\n                    2018-8\n                  '), ('https://book.douban.com/subject/30200827/?icn=index-latestbook-subject', '乐队女孩', '\n                    [美]金·戈登\n                  ', '\n                    2018-7\n                  ')]

Output

As the output shows, the captured fields carry a lot of surrounding spaces and newlines, which makes the result messy. There are two ways to clean it up:

Method 1: trimming whitespace with strip()

import re
import requests

content = requests.get('https://book.douban.com/').text
#print(content)
pattern = re.compile('<li.*?"cover">.*?href="(.*?)" title="(.*?)".*?"more-meta".*?"author">(.*?)</span>.*?"year">(.*?)</span>.*?</li>', re.S)
result = re.findall(pattern, content)
for i in result:
    print(i[0].strip(), i[1].strip(), i[2].strip(), i[3].strip())

https://book.douban.com/subject/30274766/?icn=index-editionrecommend 潦草 贾行家 2018-8

https://book.douban.com/subject/30228612/?icn=index-editionrecommend 游泳回家 [英]德博拉·利维 2018-8-1

https://book.douban.com/subject/30280804/?icn=index-editionrecommend 薛兆丰经济学讲义 薛兆丰 2018-7-1

https://book.douban.com/subject/30185326/?icn=index-editionrecommend 给孩子的未来脑计划 魏坤琳 2018-4

https://book.douban.com/subject/30288807/?icn=index-editionrecommend 加密与解密（第4版） 段钢 2018-9-1

https://book.douban.com/subject/30203733/?icn=index-latestbook-subject 阳光劫匪友情测试 [日] 伊坂幸太郎 2018-8-1

https://book.douban.com/subject/30198886/?icn=index-latestbook-subject 艾略特·厄威特的巴黎 [美] 艾略特·厄威特&nbsp;/&nbsp;Elliott Erwitt 2018-6

https://book.douban.com/subject/30246163/?icn=index-latestbook-subject 默读.2 Priest 2018-6

https://book.douban.com/subject/30280610/?icn=index-latestbook-subject 天长地久 龙应台 2018-8-1

https://book.douban.com/subject/30167361/?icn=index-latestbook-subject 柏林1：石之城 [美] 贾森·卢茨 2018-9

https://book.douban.com/subject/26877230/?icn=index-latestbook-subject 《英国史》（全三卷） [英]西蒙·沙玛 2018-7

https://book.douban.com/subject/28170953/?icn=index-latestbook-subject 荣耀 [美]纳博科夫 2018-7

https://book.douban.com/subject/30274068/?icn=index-latestbook-subject 经济学的思维方式 托马斯·索维尔 2018-8

https://book.douban.com/subject/30238143/?icn=index-latestbook-subject 奥尔拉 [法] 纪尧姆·索雷尔 编绘 2018-9

https://book.douban.com/subject/30229646/?icn=index-latestbook-subject 阿波罗 [英] 扎克·斯科特 2018-7-1

https://book.douban.com/subject/27176955/?icn=index-latestbook-subject 罗特小说集2 [奥] 约瑟夫·罗特&nbsp;/&nbsp;刘炜 主编 2018-6

https://book.douban.com/subject/30231921/?icn=index-latestbook-subject 中国古代的谣言与谶语 栾保群 2018-7-1

https://book.douban.com/subject/27598730/?icn=index-latestbook-subject 突然死亡 [墨]阿尔瓦罗·恩里克 2018-7

https://book.douban.com/subject/27197821/?icn=index-latestbook-subject 洞穴 [葡] 若泽·萨拉马戈 2018-6

https://book.douban.com/subject/30259504/?icn=index-latestbook-subject 伟大的海 [英]大卫‧阿布拉菲亚 2018-7-1

https://book.douban.com/subject/30243869/?icn=index-latestbook-subject 冷山 [美] 查尔斯·弗雷泽 2018-8

https://book.douban.com/subject/30252127/?icn=index-latestbook-subject 今天也要好好地过 [日] 吉竹伸介 2018-8

https://book.douban.com/subject/30180831/?icn=index-latestbook-subject 哀歌 [日] 远藤周作 2018-6

https://book.douban.com/subject/27191001/?icn=index-latestbook-subject 东洋的近世 [日]宫崎市定 著&nbsp;/&nbsp;[日]砺波护 编 2018-7-20

https://book.douban.com/subject/30280340/?icn=index-latestbook-subject 格林童话 格林兄弟 2018-8

https://book.douban.com/subject/30222403/?icn=index-latestbook-subject 明治天皇 (美) 唐纳德·基恩 (Donald Keene) 2018-7

https://book.douban.com/subject/30218856/?icn=index-latestbook-subject 房客 [英] 萨拉·沃特斯 2018-7

https://book.douban.com/subject/27045307/?icn=index-latestbook-subject 唐物的文化史 [日] 河添房江 2018-7

https://book.douban.com/subject/30212811/?icn=index-latestbook-subject 夜班经理 [英]约翰·勒卡雷 2018-8

https://book.douban.com/subject/30271484/?icn=index-latestbook-subject 深蓝的故事 深蓝 2018-7

https://book.douban.com/subject/30258687/?icn=index-latestbook-subject 战略级天使 白伯欢 2018-7

https://book.douban.com/subject/30247531/?icn=index-latestbook-subject 听音乐（全彩插图第11版） [美] 罗杰·凯密恩&nbsp;/&nbsp;Roger Kamien 2018-7

https://book.douban.com/subject/30194496/?icn=index-latestbook-subject 朋友之间 [以]阿摩司·奥兹 2018-7

https://book.douban.com/subject/30235060/?icn=index-latestbook-subject 情感勒索 [美] 苏珊·福沃德&nbsp;/&nbsp;唐娜·弗雷泽 2018-7

https://book.douban.com/subject/30193776/?icn=index-latestbook-subject 西游八十一案 陈渐 2018-8

https://book.douban.com/subject/30187740/?icn=index-latestbook-subject 李霖灿读画四十年 李霖灿 2018-6

https://book.douban.com/subject/30254431/?icn=index-latestbook-subject 被猜死的人 田耳 2018-8

https://book.douban.com/subject/30200827/?icn=index-latestbook-subject 乐队女孩 [美]金·戈登 2018-7

https://book.douban.com/subject/30175383/?icn=index-latestbook-subject 犯罪者的七不规范 张舟 2018-7

https://book.douban.com/subject/30199434/?icn=index-latestbook-subject 原生家庭 （美）苏珊·福沃德博士&nbsp;/&nbsp;（美）克雷格·巴克 2018-8

Output

Method 2: removing whitespace with re.sub()

import re
import requests

content = requests.get('https://book.douban.com/').text
pattern = re.compile('<li.*?"cover">.*?href="(.*?)" title="(.*?)".*?"more-meta".*?"author">(.*?)</span>.*?"year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
#print(results)
for item in results:
    url, name, author, date = item
    author = re.sub(r'\s', '', author)  # removes every whitespace character, including \n and the spaces inside names
    date = re.sub(r'\s', '', date)
    print(url, name, author, date)

https://book.douban.com/subject/30274766/?icn=index-editionrecommend 潦草 贾行家 2018-8

https://book.douban.com/subject/30228612/?icn=index-editionrecommend 游泳回家 [英]德博拉·利维 2018-8-1

https://book.douban.com/subject/30280804/?icn=index-editionrecommend 薛兆丰经济学讲义 薛兆丰 2018-7-1

https://book.douban.com/subject/30185326/?icn=index-editionrecommend 给孩子的未来脑计划 魏坤琳 2018-4

https://book.douban.com/subject/30288807/?icn=index-editionrecommend 加密与解密（第4版） 段钢 2018-9-1

https://book.douban.com/subject/27598730/?icn=index-latestbook-subject 突然死亡 [墨]阿尔瓦罗·恩里克 2018-7

https://book.douban.com/subject/30229646/?icn=index-latestbook-subject 阿波罗 [英]扎克·斯科特 2018-7-1

https://book.douban.com/subject/30194496/?icn=index-latestbook-subject 朋友之间 [以]阿摩司·奥兹 2018-7

https://book.douban.com/subject/30280610/?icn=index-latestbook-subject 天长地久 龙应台 2018-8-1

https://book.douban.com/subject/27197821/?icn=index-latestbook-subject 洞穴 [葡]若泽·萨拉马戈 2018-6

https://book.douban.com/subject/30231921/?icn=index-latestbook-subject 中国古代的谣言与谶语 栾保群 2018-7-1

https://book.douban.com/subject/30280340/?icn=index-latestbook-subject 格林童话 格林兄弟 2018-8

https://book.douban.com/subject/30222403/?icn=index-latestbook-subject 明治天皇 (美)唐纳德·基恩(DonaldKeene) 2018-7

https://book.douban.com/subject/30193776/?icn=index-latestbook-subject 西游八十一案 陈渐 2018-8

https://book.douban.com/subject/30259504/?icn=index-latestbook-subject 伟大的海 [英]大卫‧阿布拉菲亚 2018-7-1

https://book.douban.com/subject/28170953/?icn=index-latestbook-subject 荣耀 [美]纳博科夫 2018-7

https://book.douban.com/subject/30207028/?icn=index-latestbook-subject 中国烟草史 [美]班凯乐 2018-7

https://book.douban.com/subject/30212811/?icn=index-latestbook-subject 夜班经理 [英]约翰·勒卡雷 2018-8

https://book.douban.com/subject/30200827/?icn=index-latestbook-subject 乐队女孩 [美]金·戈登 2018-7

https://book.douban.com/subject/30167361/?icn=index-latestbook-subject 柏林1：石之城 [美]贾森·卢茨 2018-9

https://book.douban.com/subject/30198886/?icn=index-latestbook-subject 艾略特·厄威特的巴黎 [美]艾略特·厄威特&nbsp;/&nbsp;ElliottErwitt 2018-6

https://book.douban.com/subject/27176955/?icn=index-latestbook-subject 罗特小说集2 [奥]约瑟夫·罗特&nbsp;/&nbsp;刘炜主编 2018-6

https://book.douban.com/subject/30247531/?icn=index-latestbook-subject 听音乐（全彩插图第11版） [美]罗杰·凯密恩&nbsp;/&nbsp;RogerKamien 2018-7

https://book.douban.com/subject/30217911/?icn=index-latestbook-subject 诗人继续沉默 [以色列]亚伯拉罕·耶霍舒亚 2018-7

https://book.douban.com/subject/27045307/?icn=index-latestbook-subject 唐物的文化史 [日]河添房江 2018-7

https://book.douban.com/subject/30258687/?icn=index-latestbook-subject 战略级天使 白伯欢 2018-7

https://book.douban.com/subject/30252127/?icn=index-latestbook-subject 今天也要好好地过 [日]吉竹伸介 2018-8

https://book.douban.com/subject/30180831/?icn=index-latestbook-subject 哀歌 [日]远藤周作 2018-6

https://book.douban.com/subject/30274068/?icn=index-latestbook-subject 经济学的思维方式 托马斯·索维尔 2018-8

https://book.douban.com/subject/30243869/?icn=index-latestbook-subject 冷山 [美]查尔斯·弗雷泽 2018-8

https://book.douban.com/subject/30254431/?icn=index-latestbook-subject 被猜死的人 田耳 2018-8

https://book.douban.com/subject/27191001/?icn=index-latestbook-subject 东洋的近世 [日]宫崎市定著&nbsp;/&nbsp;[日]砺波护编 2018-7-20

https://book.douban.com/subject/30203733/?icn=index-latestbook-subject 阳光劫匪友情测试 [日]伊坂幸太郎 2018-8-1

https://book.douban.com/subject/30271484/?icn=index-latestbook-subject 深蓝的故事 深蓝 2018-7

https://book.douban.com/subject/30246163/?icn=index-latestbook-subject 默读.2 Priest 2018-6

https://book.douban.com/subject/27661637/?icn=index-latestbook-subject 放牧人生 [英]詹姆斯·里班克斯 2018-7

https://book.douban.com/subject/30237869/?icn=index-latestbook-subject 爱情故事的两个版本 [塞尔维亚]雅丝米娜·米哈伊洛维奇&nbsp;/&nbsp;[塞尔维亚]米洛拉德·帕维奇 2018-8

https://book.douban.com/subject/30187740/?icn=index-latestbook-subject 李霖灿读画四十年 李霖灿 2018-6

https://book.douban.com/subject/30218856/?icn=index-latestbook-subject 房客 [英]萨拉·沃特斯 2018-7

https://book.douban.com/subject/26877230/?icn=index-latestbook-subject 《英国史》（全三卷） [英]西蒙·沙玛 2018-7

Output
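The two cleanup methods can be combined into one step that also decodes the leftover &nbsp; entities. The sketch below runs offline on a made-up fragment: the SAMPLE markup and the example.com URL are hypothetical and only shaped so that the pattern from above matches. A live page may be structured differently and may need extra request headers.

```python
import html
import re

# Hypothetical markup, shaped only so that the pattern from the article matches it.
SAMPLE = '''
<li class="">
  <div class="cover">
    <a href="https://example.com/subject/1/" title="Sample Book">cover</a>
  </div>
  <div class="more-meta">
    <span class="author">
      Author A&nbsp;/&nbsp;Author B
    </span>
    <span class="year">
      2018-8
    </span>
  </div>
</li>
'''

PATTERN = re.compile(
    r'<li.*?"cover">.*?href="(.*?)" title="(.*?)".*?"more-meta"'
    r'.*?"author">(.*?)</span>.*?"year">(.*?)</span>.*?</li>',
    re.S,
)


def clean(text):
    """Decode HTML entities, then collapse every run of whitespace to one space."""
    return ' '.join(html.unescape(text).split())


books = [tuple(clean(field) for field in match) for match in PATTERN.findall(SAMPLE)]
print(books)
```

Unlike re.sub(r'\s', '', ...), this keeps a single space between words, so a name such as "Donald Keene" is not glued together.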

