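Regular Expressions: A Python Implementation

1. Overview

Regular Expression, abbreviated regex, regexp, RE, and so on.

Regular expressions are an extremely important tool for text processing. They let you search and replace within strings according to a rule.

Both shell programming and high-level programming languages support regular expressions.

2. Categories

BRE: basic regular expressions, supported by grep, sed, vi, and similar tools; vim adds extensions.

ERE: extended regular expressions, supported by egrep (grep -E) and sed -r.

PCRE: Perl-compatible regular expressions. Python's re module follows this style, and the regex flavor of almost every high-level language is a dialect or variant of it.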

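3. Basic syntax

1) Metacharacters

Code	Description	Example
.	Matches any single character except a newline	.
[abc]	Character set; occupies exactly one character position and matches any one of the listed characters	[abc] matches the 'a' in plain
[^abc]	Negated character set; occupies one character position and matches any one character not in the set	[^abc] matches 'p', 'l', 'i', or 'n' in plain
[a-z]	Character range; also a set that occupies one character position and matches any one character in the range	[a-d] matches the 'a' in plain
[^a-z]	Negated character range; occupies one character position and matches any one character outside the range	[^a-d] matches 'p', 'l', 'i', or 'n' in plain
\b	Matches a word boundary	\ba finds words that start with a; a\b finds words that end with a
\B	Matches a position that is not a word boundary	t\B matches a t that does not end its word, as in write; \Bb matches a b that does not start its word, as in able
\d	Matches one digit, [0-9]	\d
\D	Matches one non-digit, [^0-9]
\w	Matches one word character, [a-zA-Z0-9_]; for str patterns in Python 3 this also includes Chinese characters	\w
\W	Matches any character that \w does not match
\s	Matches one whitespace character, including newline, tab, and space: [ \f\r\n\t\v]
\S	Matches one non-whitespace character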
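The table rows above can be checked directly with re.findall. The following is a minimal sketch; the sample strings are made up for illustration.

```python
import re

text = "plain text, 42 apples\tand_more"

# A character set occupies one position and matches any listed character.
set_hits = re.findall(r"[abc]", "plain")
# A negated set matches any character that is NOT listed.
negated_hits = re.findall(r"[^abc]", "plain")
# \d matches a single digit.
digits = re.findall(r"\d", text)
# \b anchors at a word boundary: words that start with 'a'.
a_words = re.findall(r"\ba\w*", text)
# \w covers letters, digits, underscore and, for str patterns, CJK characters.
word_chars = re.findall(r"\w", "a_中")

print(set_hits)      # ['a']
print(negated_hits)  # ['p', 'l', 'i', 'n']
print(digits)        # ['4', '2']
print(a_words)       # ['apples', 'and_more']
print(word_chars)    # ['a', '_', '中']
```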

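2) Escaping

Any symbol that has a special meaning in a regular expression must be escaped with \ when you want its literal meaning. The backslash itself is written as \\. After escaping, \r stands for a carriage return and \n for a newline.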
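A short sketch of escaping in Python follows. It assumes raw string literals (r"...") for patterns, which keeps the backslashes intact, and uses re.escape from the standard library to escape a literal automatically. The sample strings are invented.

```python
import re

price = "cost: $3.50 (approx.)"

# '$' and '.' are metacharacters, so escape them to match them literally.
literal = re.search(r"\$3\.50", price).group()

# re.escape builds the escaped pattern for an arbitrary literal string.
escaped_pattern = re.escape("(approx.)")
auto = re.search(escaped_pattern, price).group()

# A literal backslash is written as \\ inside the pattern.
backslashes = re.findall(r"\\", r"C:\temp\logs")

print(literal)           # $3.50
print(auto)              # (approx.)
print(len(backslashes))  # 2
```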

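3) Repetition

Code	Description	Example
*	The preceding expression repeats zero or more times	e\w* matches e followed by any number of word characters, possibly none
+	The preceding expression repeats at least once	e\w+ matches e followed by at least one word character
?	The preceding expression repeats zero or one time	e\w? matches e followed by at most one word character
{n}	Repeats exactly n times	e\w{1} matches e followed by exactly one word character
{n,}	Repeats at least n times	e\w{1,} is equivalent to e\w+; e\w{0,} to e\w*; e\w{0,1} to e\w?
{n,m}	Repeats n to m times	e\w{1,10} matches e followed by at least 1 and at most 10 word characters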
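To see how the quantifiers differ, the sketch below runs each one over the same made-up word list. The \b anchors are added here so that each result is a whole word.

```python
import re

words = "e ea eat earth"

star = re.findall(r"\be\w*\b", words)        # zero or more word characters
plus = re.findall(r"\be\w+\b", words)        # at least one
optional = re.findall(r"\be\w?\b", words)    # at most one
exactly2 = re.findall(r"\be\w{2}\b", words)  # exactly two
range2_4 = re.findall(r"\be\w{2,4}\b", words)  # two to four

print(star)      # ['e', 'ea', 'eat', 'earth']
print(plus)      # ['ea', 'eat', 'earth']
print(optional)  # ['e', 'ea']
print(exactly2)  # ['eat']
print(range2_4)  # ['eat', 'earth']
```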

4) Basic exercises:

(1) Match a mobile phone number (11 digits):

\d{11}

(2) Match a Chinese landline number:

\d{3,4}-\d{7,8}
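A quick way to try these two exercises is with fullmatch, so that extra digits are rejected. This sketch assumes an 11-digit mobile number; all sample numbers are invented.

```python
import re

mobile = re.compile(r"\d{11}")
landline = re.compile(r"\d{3,4}-\d{7,8}")

# fullmatch succeeds only when the entire string fits the pattern.
mobile_ok = bool(mobile.fullmatch("13812345678"))
mobile_too_long = bool(mobile.fullmatch("138123456789"))
landline_ok = bool(landline.fullmatch("010-12345678"))
landline_short_code = bool(landline.fullmatch("01-12345678"))

print(mobile_ok, mobile_too_long)        # True False
print(landline_ok, landline_short_code)  # True False
```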

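5) Alternation, grouping, and assertions

Code	Description	Example
x|y	Matches x or y	For "wood took foot food", use w|food or (w|f)ood
Capturing
(pattern)	Parentheses mark a subexpression, also called a group. Captured groups are numbered automatically, starting from 1. Parentheses can also change precedence
\number	Matches the text captured by the group with that number	(very) \1 matches "very very", but the captured group is just "very"
(?:pattern)	Groups without capturing; use it when you only need to change precedence	(?:w|f)ood; 'industr(?:y|ies)' is equivalent to 'industry|industries'
(?<name>exp) (?'name'exp)	Captures a group that can be accessed by name. Python requires the syntax (?P<name>exp)
Zero-width assertions
(?=exp)	Positive lookahead: asserts that exp appears immediately to the right of the match	f(?=oo) matches an f that is followed by oo
(?<=exp)	Positive lookbehind: asserts that exp appears immediately to the left of the match	(?<=f)ood and (?<=t)ook match ood and ook; the ood must be preceded by f and the ook by t
Negative zero-width assertions
(?!exp)	Negative lookahead: asserts that exp does not appear immediately to the right	\d{3}(?!\d) matches three digits that are not followed by another digit
(?<!exp)	Negative lookbehind: asserts that exp does not appear immediately to the left	(?<!f)ood matches an ood whose left neighbor is not f
Comments
(?#comment)	Inline comment

Assertions do not take up a group number. An assertion works like a condition: the match must satisfy it, but the asserted text is not part of the match.

Grouping and capturing mean the same thing here.

When writing regular expressions, prefer a simple expression over a complex one whenever the simple one does the job.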
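The grouping and assertion forms above behave as follows in Python's re module. This is a small sketch on sample strings; the URL is a placeholder.

```python
import re

text = "wood took foot food"

# Non-capturing group: changes precedence only.
alternation = re.findall(r"(?:w|f)ood", text)
# Lookbehind: 'ood' only when preceded by 'f'; the 'f' is not part of the match.
lookbehind = re.findall(r"(?<=f)ood", text)
# Lookahead: 'f' only when followed by 'oo'.
lookahead = re.findall(r"f(?=oo)", text)
# Negative lookahead: three digits not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "123 4567")

# Backreference: \1 repeats whatever group 1 captured.
m = re.search(r"(very) \1", "very very happy")
# Named group, using Python's (?P<name>...) syntax.
named = re.search(r"(?P<scheme>\w+)://", "ftp://example.com")

print(alternation)             # ['wood', 'food']
print(lookbehind)              # ['ood']
print(lookahead)               # ['f', 'f']
print(three_digits)            # ['123', '567']
print(m.group(0), m.group(1))  # very very very
print(named.group("scheme"))   # ftp
```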

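6) Greedy and non-greedy

Code	Description	Example
*?	Matches any number of times, but as few as possible	*? takes as few as possible, possibly none
+?	Matches at least once, but as few as possible	+? takes at least one
??	Matches zero or one time, as few as possible	?? takes as few as possible, at least zero
{n,}?	Matches at least n times, as few as possible
{n,m}?	Matches at least n and at most m times, as few as possible
	very very very happy	Compare v.*y with v.*?y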
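The difference between greedy and non-greedy matching is easiest to see side by side. A minimal sketch on the sample sentence, using the patterns v.*y and v.*?y:

```python
import re

text = "very very very happy"

# Greedy: .* runs to the end, then backs up to the LAST 'y'.
greedy = re.search(r"v.*y", text).group()
# Non-greedy: .*? stops at the FIRST 'y' it can reach.
lazy = re.search(r"v.*?y", text).group()
# With findall, the non-greedy form yields each short match in turn.
all_lazy = re.findall(r"v.*?y", text)

print(greedy)    # very very very happy
print(lazy)      # very
print(all_lazy)  # ['very', 'very', 'very']
```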

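7) Engine options:

Code	Description	In Python
IgnoreCase	Ignores case when matching	re.I, re.IGNORECASE
Singleline	Single-line mode: . matches every character, including \n	re.S, re.DOTALL
Multiline	Multi-line mode: ^ matches at the start of each line, $ at the end of each line	re.M, re.MULTILINE
IgnorePatternWhitespace	Ignores whitespace in the pattern; escape a whitespace character to match it literally; # starts a comment	re.X, re.VERBOSE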

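8) Summary:

Single-line mode:

. matches every character, including the newline.

^ matches the start of the whole string, and $ matches the end of the whole string.

Multi-line mode:

. matches any character except the newline.

^ matches the start of each line, and $ matches the end of each line. The start of a line is the character right after a \n; the end of a line is the character right before a \n.

You can think of single-line mode as seeing through the newlines: the whole text is one long string with a single line, so ^ is the start of the whole string and $ is its end.

In multi-line mode, watch out for invisible line-ending characters. A \r\n ending breaks an e$ test, because e$ only matches an e that sits directly before \n (or at the very end of the string).

* repeats any number of times; use *? to keep the repetition as short as possible.

Matching is greedy by default, that is, it matches the longest string it can.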
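The mode differences summarized above can be verified on a short multi-line string. The sketch below uses re.M and re.S from the standard re module; the sample text is invented.

```python
import re

s = "bottle\nbag\nbig\napple"

# Default: ^ and $ refer to the whole string.
start_default = re.findall(r"^b\w+", s)
end_default = re.findall(r"e$", s)

# re.M: ^ and $ also match at every line start and line end.
start_multi = re.findall(r"^b\w+", s, re.M)
end_multi = re.findall(r"e$", s, re.M)

# re.S: '.' also matches the newline.
dot_default = re.search(r"bag.big", s)
dot_single = re.search(r"bag.big", s, re.S).group()

# With \r\n endings, the \r sits between 'e' and '\n', so e$ fails.
crlf = re.findall(r"e$", "bottle\r\napple\r\n", re.M)

print(start_default, start_multi)  # ['bottle'] ['bottle', 'bag', 'big']
print(end_default, end_multi)      # ['e'] ['e', 'e']
print(dot_default, repr(dot_single))  # None 'bag\nbig'
print(crlf)                        # []
```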

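9) Exercises:

Match any number from 0 to 9999:

^([1-9]\d\d?\d?|\d)(?!\d)

Match valid IP addresses:

(\d{1,3}\.){3}\d{1,3}

This pattern only checks the dotted shape, so test it against samples like these:

192.168.1.150

0.0.0.0

255.255.255.255

17.16.52.100

172.16.0.100

400.400.999.888

001.022.003.000

257.257.255.256

To verify that an address is actually valid, use Python's socket module.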
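One way to combine the shape check with the socket module is sketched below. The text only says to use socket; the choice of socket.inet_pton and the helper name is_valid_ipv4 are assumptions made for this example. Handling of leading zeros such as 001.022.003.000 can vary by platform, so it is not asserted here.

```python
import re
import socket

# Shape check only: four dot-separated groups of 1 to 3 digits.
SHAPE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

def is_valid_ipv4(text):
    """Hypothetical helper: regex for the shape, socket for the value range."""
    if not SHAPE.fullmatch(text):
        return False
    try:
        socket.inet_pton(socket.AF_INET, text)
    except OSError:
        # Raised when the text is not a valid IPv4 address, e.g. 400.400.999.888.
        return False
    return True

for candidate in ["192.168.1.150", "0.0.0.0", "255.255.255.255",
                  "400.400.999.888", "257.257.255.256"]:
    print(candidate, is_valid_ipv4(candidate))
```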

Select the links that contain ftp and whose file type is gz or xz.

.*ftp.*/([^/]*\.(?:gz|xz))
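As a rough check of this pattern, the sketch below applies it to a few placeholder links (the hosts and file names are invented) and collects the captured file name from group 1.

```python
import re

pattern = re.compile(r".*ftp.*/([^/]*\.(?:gz|xz))")

# Hypothetical sample links, for illustration only.
links = [
    "ftp://ftp.example.org/pub/tool-1.0.tar.gz",
    "http://example.org/index.html",
    "ftp://ftp.example.org/pub/notes.txt",
    "ftp://mirror.example.org/linux/kernel.tar.xz",
]

# Keep only the links that match, and pull out the file name.
names = [m.group(1) for m in map(pattern.match, links) if m]
print(names)  # ['tool-1.0.tar.gz', 'kernel.tar.xz']
```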

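4. Regular expressions in Python

1) Constants: re.M (re.MULTILINE) is multi-line mode; re.S (re.DOTALL) is single-line mode; re.I (re.IGNORECASE) ignores case; re.X (re.VERBOSE) ignores whitespace in the pattern. Combine them with the bitwise OR operator |.

2) Compiling:

re.compile(pattern, flags=0)

Sets the flags and compiles the pattern. Returns a regular expression object, regex.

pattern is the regular expression string and flags are the options. A regular expression has to be compiled before it is used. To improve efficiency, the compiled result is cached, so the next use of the same pattern does not need to compile it again.

The other functions in re call the same compile step internally, for the same reason: speed.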
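A small sketch of compiling with combined flags follows. It shows that a compiled regex object and the module-level function give the same result for the same pattern and flags; the sample text is invented.

```python
import re

# Combine flags with | : multi-line plus ignore-case.
regex = re.compile(r"^b\w+", re.M | re.I)
text = "Bottle\nbag\nApple"

compiled_result = regex.findall(text)
module_result = re.findall(r"^b\w+", text, re.M | re.I)

print(compiled_result)                    # ['Bottle', 'bag']
print(compiled_result == module_result)   # True
# The compiled object remembers its pattern and flags.
print(regex.pattern, bool(regex.flags & re.M), bool(regex.flags & re.I))
```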

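3) re.match(pattern, string, flags=0) makes a single match attempt anchored at the start of the string, as if the pattern had a ^ in front of it. It matches only once.

regex.match(string[, pos[, endpos]]): a compiled pattern can set the start and end positions, much like matching against a slice. It returns a match object. Here regex = re.compile(...).

match requires the string to begin with the pattern at the given index.

4) re.search(pattern, string, flags=0) scans the whole string, with no restriction on where the match starts. It returns the first match it finds, or None if nothing matches. It only finds the first one.

regex.search() can take a start position.

5) re.fullmatch(pattern, string, flags=0) matches the entire string.

regex.fullmatch(string) succeeds only when the whole string matches the regular expression.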

import re

s = '''bottle\nbag\nbig\napple'''
# Print every character with its index, eight pairs per row.
for i, c in enumerate(s, 1):
    print((i-1, c), end='\n' if i % 8 == 0 else ' ')
print()

print('--match--')
result = re.match('b', s)          # match is anchored at index 0
print(1, result)
result = re.match('a', s)
print(2, result)
result = re.match('^a', s, re.M)   # re.M does not let match skip to a later line
print(3, result)
result = re.match('^a', s, re.S)
print(4, result)
regex = re.compile('a')
result = regex.match(s)
print(5, result)
result = regex.match(s, 15)        # a compiled pattern can start at index 15
print(6, result)

print('--search--')
result = re.search('a', s)
print(7, result)
regex = re.compile('b')
result = regex.search(s, 1)
print(8, result)
regex = re.compile('^b', re.M)
result = regex.search(s)
print(8.5, result)
result = regex.search(s, 8)
print(9, result)

print('--fullmatch--')
result = re.fullmatch('bag', s)
print(10, result)
regex = re.compile('bag')
result = regex.fullmatch(s)
print(11, result)
result = regex.fullmatch(s, 7)     # index 7 through the end of s is more than 'bag'
print(12, result)
result = regex.fullmatch(s, 7)
print(13, result)

(0, 'b') (1, 'o') (2, 't') (3, 't') (4, 'l') (5, 'e') (6, '\n') (7, 'b')

(8, 'a') (9, 'g') (10, '\n') (11, 'b') (12, 'i') (13, 'g') (14, '\n') (15, 'a')

(16, 'p') (17, 'p') (18, 'l') (19, 'e')

--match--

1 <_sre.SRE_Match object; span=(0, 1), match='b'>

2 None

3 None

4 None

5 None

6 <_sre.SRE_Match object; span=(15, 16), match='a'>

--search--

7 <_sre.SRE_Match object; span=(8, 9), match='a'>

8 <_sre.SRE_Match object; span=(7, 8), match='b'>

8.5 <_sre.SRE_Match object; span=(0, 1), match='b'>

9 <_sre.SRE_Match object; span=(11, 12), match='b'>

--fullmatch--

10 None

11 None

12 None

13 None
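The fullmatch calls in the output all return None because the searched range is always longer than 'bag'. As a complementary sketch, passing both pos and endpos to a compiled pattern narrows the range to exactly the word:

```python
import re

s = "bottle\nbag\nbig\napple"
regex = re.compile("bag")

# From index 7 to the end the text is 'bag\nbig\napple', so fullmatch fails.
to_end = regex.fullmatch(s, 7)
# Limiting the range to [7, 10) leaves exactly 'bag'.
exact = regex.fullmatch(s, 7, 10)
# match with pos starts the anchored attempt at index 7.
anchored = regex.match(s, 7)

print(to_end)                       # None
print(exact.span(), exact.group())  # (7, 10) bag
print(anchored.span())              # (7, 10)
```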

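6) Searching the whole text:

re.findall(pattern, string, flags=0) searches the whole string for every match and returns the matches as a list.

regex.findall(string[, pos[, endpos]])

re.finditer(pattern, string, flags=0) returns an iterator over the matches; every item it yields is a match object.

regex.finditer(string[, pos[, endpos]])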

import re

s = '''bottle\nbag\nbig\napple'''
# Print every character with its index, eight pairs per row.
for i, c in enumerate(s, 1):
    print((i-1, c), end='\n' if i % 8 == 0 else ' ')
print()

print('--findall--')
result = re.findall('b', s)
print(1, result)
regex = re.compile('^b')           # without re.M, ^ is only the string start
result = regex.findall(s)
print(2, result)
regex = re.compile('^b', re.M)     # with re.M, ^ matches at every line start
result = regex.findall(s, 7)
print(3, result)
regex = re.compile('^b', re.S)
result = regex.findall(s)
print(4, result)
regex = re.compile('^b', re.M)
result = regex.findall(s, 7, 10)   # search only within s[7:10]
print(5, result)
print('--finder--')
result = regex.finditer(s)         # an iterator of match objects
print(1, type(result))
print(2, next(result))
print(3, next(result))

(0, 'b') (1, 'o') (2, 't') (3, 't') (4, 'l') (5, 'e') (6, '\n') (7, 'b')

(8, 'a') (9, 'g') (10, '\n') (11, 'b') (12, 'i') (13, 'g') (14, '\n') (15, 'a')

(16, 'p') (17, 'p') (18, 'l') (19, 'e')

--findall--

1 ['b', 'b', 'b']

2 ['b']

3 ['b', 'b']

4 ['b']

5 ['b']

--finder--

1 <class 'callable_iterator'>

2 <_sre.SRE_Match object; span=(0, 1), match='b'>

3 <_sre.SRE_Match object; span=(7, 8), match='b'>
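In practice, finditer is usually consumed with a for loop rather than repeated next calls. A minimal sketch that collects the position and text of each match:

```python
import re

s = "bottle\nbag\nbig\napple"
regex = re.compile(r"^b\w+", re.M)

# Each item yielded by finditer is a match object with span and text.
found = [(m.start(), m.end(), m.group()) for m in regex.finditer(s)]

for start, end, word in found:
    print(start, end, word)
# 0 6 bottle
# 7 10 bag
# 11 14 big
```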

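5. Match and replace:

re.sub(pattern, replacement, string, count=0, flags=0) replaces matches.

regex.sub(replacement, string, count=0) replaces matches.

The pattern is matched against string, and each match is replaced with replacement, which can be a string, bytes, or a function.

re.subn(pattern, replacement, string, count=0, flags=0) returns a 2-tuple that also gives the number of replacements.

regex.subn(replacement, string, count=0)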

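import re

s = '''bottle\nbag\nbig\napple'''

regex = re.compile(r'b\wg')
result = regex.sub('magedu', s)      # replace every match
print(1, result)
result = regex.sub('magedu', s, 1)   # replace only the first match
print((2, result))

regex = re.compile(r'\s+')
result = regex.subn('\t', s)         # returns (new_string, number_of_replacements)
print(3, result)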

1 bottle

magedu

magedu

apple

(2, 'bottle\nmagedu\nbig\napple')

3 ('bottle\tbag\tbig\tapple', 3)
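Since the replacement may also be a function, the sketch below passes one to sub and subn. The function receives each match object and returns the replacement text. The helper name shout is made up for this example, and the backreference form \2 \1 is shown as a second common variant.

```python
import re

s = "bottle\nbag\nbig\napple"

def shout(match):
    """Hypothetical helper: return the matched text in upper case."""
    return match.group().upper()

# Function replacement: called once per match.
result = re.sub(r"b\wg", shout, s)
# String replacement with backreferences to captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
# subn additionally reports how many replacements were made.
text, count = re.subn(r"b\wg", shout, s)

print(repr(result))  # 'bottle\nBAG\nBIG\napple'
print(swapped)       # world hello
print(count)         # 2
```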

6. Splitting strings:

re.split(pattern, string, maxsplit=0, flags=0)

re.split splits a string at every match of the pattern.

import re

s = '''01  bottle
02 bag
03        big1
100    able'''

# Prefix a space so the leading '01' is also treated as a separator.
x = re.split(r'\s+\d+\s+', ' ' + s)
print(x)

['', 'bottle', 'bag', 'big1', 'able']
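Two further behaviors of re.split are worth a quick sketch: maxsplit limits the number of splits, and a capturing group in the pattern keeps the captured separator text in the result. The sample line is invented.

```python
import re

line = "01  bottle 02 bag 03 big1"

# maxsplit=2: split at most twice, the rest stays in one piece.
parts = re.split(r"\s+", line, maxsplit=2)
# A capturing group keeps the captured numbers in the result list.
with_numbers = re.split(r"\s+(\d+)\s+", " " + line)

print(parts)         # ['01', 'bottle', '02 bag 03 big1']
print(with_numbers)  # ['', '01', 'bottle', '02', 'bag', '03', 'big1']
```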

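7. Groups:

Data captured by a parenthesized pattern is stored in groups.

match and search both return a match object. findall returns a list of strings (or a list of tuples when the pattern has more than one group). finditer yields match objects one at a time.

If the pattern uses groups and a match is found, the groups are available on the match object.

1) group(N) returns the corresponding group: 1 to N are the groups, and 0 is the whole matched string.

2) With named groups, group('name') retrieves a group by name.

3) groups() returns all groups.

4) groupdict() returns all named groups.

matcher.group()

matcher.groups() returns a tuple.

matcher.groupdict() returns a dictionary.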

import re

s = '''bottle\nbag\nbig\napple'''
# Print every character with its index, eight pairs per row.
for i, c in enumerate(s, 1):
    print((i-1, c), end='\n' if i % 8 == 0 else ' ')
print()

regex = re.compile(r'(b\w+)')
result = regex.match(s)
print(type(result))
print(1, 'match', result.groups())
result = regex.search(s, 1)
print(2, 'search', result.groups())

# One numbered group followed by two named groups.
regex = re.compile(r'(b\w+)\n(?P<name2>b\w+)\n(?P<name3>b\w+)')
result = regex.match(s)
print(3, 'match', result)
print(4, result.group(3), result.group(2), result.group(1))
print(5, result.group(0).encode())
print(6, result.group('name2'), result.group('name3'))
print(6, result.groups())
print(7, result.groupdict())       # only named groups appear here

# With several groups, findall returns a list of tuples.
result = regex.findall(s)
for x in result:
    print(8, type(x), x)

regex = re.compile(r'(?P<head>b\w+)')
result = regex.finditer(s)
for x in result:
    print(9, type(x), x, x.group(), x.group('head'))

(0, 'b') (1, 'o') (2, 't') (3, 't') (4, 'l') (5, 'e') (6, '\n') (7, 'b')

(8, 'a') (9, 'g') (10, '\n') (11, 'b') (12, 'i') (13, 'g') (14, '\n') (15, 'a')

(16, 'p') (17, 'p') (18, 'l') (19, 'e')

1 match ('bottle',)

2 search ('bag',)

3 match <_sre.SRE_Match object; span=(0, 14), match='bottle\nbag\nbig'>

4 big bag bottle

5 b'bottle\nbag\nbig'

6 bag big

6 ('bottle', 'bag', 'big')

7 {'name2': 'bag', 'name3': 'big'}

8 <class 'tuple'> ('bottle', 'bag', 'big')

9 <class '_sre.SRE_Match'> <_sre.SRE_Match object; span=(0, 6), match='bottle'> bottle bottle

9 <class '_sre.SRE_Match'> <_sre.SRE_Match object; span=(7, 10), match='bag'> bag bag

9 <class '_sre.SRE_Match'> <_sre.SRE_Match object; span=(11, 14), match='big'> big big
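How findall reports groups depends on how many capturing groups the pattern has. This can be surprising, so here is a short sketch on the same sample string; the group names first and second are arbitrary.

```python
import re

s = "bottle\nbag\nbig\napple"

# No groups: findall returns the whole matches.
whole = re.findall(r"b\w+", s)
# One capturing group: findall returns only that group's text.
one_group = re.findall(r"b(\w+)", s)
# Several groups: findall returns a tuple per match.
two_groups = re.findall(r"(b)(\w+)", s)
# Named groups are available as a dict on the match object.
named = re.search(r"(?P<first>b\w+)\n(?P<second>b\w+)", s).groupdict()

print(whole)       # ['bottle', 'bag', 'big']
print(one_group)   # ['ottle', 'ag', 'ig']
print(two_groups)  # [('b', 'ottle'), ('b', 'ag'), ('b', 'ig')]
print(named)       # {'first': 'bottle', 'second': 'bag'}
```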

8. Exercises:

1) Validate an email address.

\w+[-.\w]*@[\w-]+(\.[\w-]+)+

2) Extract HTML content:

<[^<>]+>(.*)<[^<>]+>

3) Extract a URL.

(\w+)://([^\s]+)
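The three exercise patterns can be tried out as follows. This is only a sketch: the addresses, markup, and URL are invented samples, and patterns this simple are not complete validators.

```python
import re

email = re.compile(r"\w+[-.\w]*@[\w-]+(\.[\w-]+)+")
html = re.compile(r"<[^<>]+>(.*)<[^<>]+>")
url = re.compile(r"(\w+)://([^\s]+)")

email_ok = bool(email.fullmatch("test.user-1@mail.example.com"))
email_bad = bool(email.fullmatch("no-at-sign.example.com"))
# Group 1 holds the text between the opening and closing tags.
inner = html.search("<a href='x'>link text</a>").group(1)
# Group 1 is the scheme, group 2 the rest up to the next whitespace.
scheme, rest = url.search("see https://example.com/a?b=1 for details").groups()

print(email_ok, email_bad)  # True False
print(inner)                # link text
print(scheme, rest)         # https example.com/a?b=1
```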

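4) ID card number validation

Validating an ID card number properly requires a checksum formula, and the strictest check is real-name verification.

\d{17}[0-9xX]|\d{15}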
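The text does not give the formula. The sketch below assumes the commonly documented check-digit rule for 18-digit numbers (weights and check codes as published in GB 11643-1999); treat the constants and the helper name id18_checksum_ok as assumptions rather than the author's method. The sample number is a widely circulated illustrative value, not a real person's ID.

```python
import re

ID18 = re.compile(r"\d{17}[0-9xX]")
# Assumed constants: per-position weights and the check-code table.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"

def id18_checksum_ok(number):
    """Hypothetical helper: shape check by regex, then the check digit."""
    if not ID18.fullmatch(number):
        return False
    # Weighted sum of the first 17 digits, reduced modulo 11.
    total = sum(int(d) * w for d, w in zip(number[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == number[-1].upper()

print(id18_checksum_ok("11010519491231002X"))  # True
print(id18_checksum_ok("110105194912310021"))  # False
```

This only confirms internal consistency of the number; it says nothing about whether the number was ever issued.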

5) Word counting: use a helper such as makekey to look up the words.
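The makekey helper is not shown in this text, so the sketch below does not reproduce it. As an alternative under that assumption, it counts words with re.findall and collections.Counter from the standard library; the sample sentence is invented.

```python
import re
from collections import Counter

text = "The quick brown fox. The lazy dog; the end!"

# Lower-case first, then take runs of letters (and apostrophes) as words.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

print(counts["the"])          # 3
print(counts.most_common(1))  # [('the', 3)]
```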