Python and Regular Expressions [0] -> Regular Expression Matching with the re Module

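Regular Expression

RE Pattern
Introduction to the re module
Matching with regular expressions

A regular expression (Regular Expression, often abbreviated regex, regexp, or RE) is a concept from computer science: a single string that describes, and is used to match, a set of strings conforming to some syntactic rule. Many text editors use regular expressions to search for and replace text that matches a pattern.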

1 RE Pattern

The core of working with regular expressions is building a pattern. A pattern is usually made up of special characters and symbols; the table below lists them.

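Character	Description
\	Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", "\n" matches a newline, "\\" matches "\", and "\(" matches "(".
^	Matches the start of the input string. With re.MULTILINE set, ^ also matches immediately after each "\n".
$	Matches the end of the input string (or just before a newline at the very end of the string). With re.MULTILINE set, $ also matches immediately before each "\n".
*	Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" as well as "zoo". Equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but not "z". Equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" as well as "does". Equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob" but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}	m and n are non-negative integers with n<=m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that no spaces are allowed between the comma and the two numbers.
?	When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy matching consumes as little of the searched string as possible, whereas the default greedy matching consumes as much as possible. For example, against the string "oooo", "o+?" matches a single "o" while "o+" matches all of them.
.	Matches any single character except "\n". To match any character including "\n", use a pattern such as "(.|\n)", or set re.DOTALL.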
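(pattern)	Matches pattern and captures the match. In Python the captured text is available through the match object's group() and groups() methods. To match literal parentheses, use "\(" or "\)".
(?:pattern)	Matches pattern without capturing the result; this is a non-capturing group that is not stored for later use. It is useful when combining parts of a pattern with the alternation character "|". For example, "industr(?:y|ies)" is a shorter expression than "industry|industries".
(?=pattern)	Positive lookahead: matches at any position where the text that follows matches pattern. This is a non-capturing match. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts right after the last match, not after the characters examined by the lookahead.
(?!pattern)	Negative lookahead: matches at any position where the text that follows does not match pattern. This is a non-capturing match. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1" but not the "Windows" in "Windows2000". As with positive lookahead, no characters are consumed.
(?<=pattern)	Positive lookbehind: like positive lookahead, but looking in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches the "Windows" in "2000Windows" but not the "Windows" in "3.1Windows". Note that Python's re only accepts fixed-width lookbehind patterns, so alternatives of different lengths such as these are rejected; "(?<=95|98|NT)Windows" is accepted.
(?<!pattern)	Negative lookbehind: like negative lookahead, but looking in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches the "Windows" in "3.1Windows" but not the "Windows" in "2000Windows". The same fixed-width restriction applies in Python.
(?P<name>pattern)	Named group: a capturing group labeled with name.
(?P=name)	Backreference to a named group. Note that it matches the text the group actually matched, not the group's pattern.
(?aiLmsux)	Inline flags, embedded at the start of a regular expression to control matching behavior. For example, "(?im)Python" matches case-insensitively in multiline mode.
(?(id/name)Y|N)	Conditional match: if the group with the given id or name took part in the match, pattern Y is used, otherwise N. The |N part is optional.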
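x|y	Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]	Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]	Negated character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]	Character range. Matches any character in the given range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
[^a-z]	Negated character range. Matches any character not in the given range. For example, "[^a-z]" matches any character outside "a" to "z".
\b	Matches a word boundary, that is, the position between a word character and a non-word character (for example, between a word and a space). For example, "er\b" matches the "er" in "never" but not the "er" in "verb".
\B	Matches a non-word boundary. "er\B" matches the "er" in "verb" but not the "er" in "never".
\cx	Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. x must be one of A-Z or a-z; otherwise c is treated as a literal "c". Note: this escape comes from other regex dialects; Python's re module does not support it.
\d	Matches a digit character. Equivalent to [0-9] in ASCII mode; for str patterns Python matches any Unicode decimal digit by default.
\D	Matches a non-digit character. Equivalent to [^0-9] in ASCII mode.
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v] in ASCII mode.
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] in ASCII mode.
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.
\w	Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]" in ASCII mode.
\W	Matches any non-word character. Equivalent to "[^A-Za-z0-9_]" in ASCII mode.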
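\xn	Matches n, where n is a hexadecimal escape value that must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num	Matches num, where num is a positive integer: a reference back to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n	Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm	Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml	If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un	Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).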
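
As a quick check of a few table entries against Python's re module, the sketch below evaluates some of the examples listed above. The variable names are illustrative only and are not part of the original article.

```python
import re

# Quantifiers: "zo*" matches "z" and "zoo"; "zo+" does not match "z"
star_matches = [re.fullmatch(r'zo*', s) is not None for s in ('z', 'zoo')]
plus_on_z = re.fullmatch(r'zo+', 'z')

# {n}: "o{2}" finds nothing in "Bob" but finds "oo" in "food"
two_o_in_bob = re.search(r'o{2}', 'Bob')
two_o_in_food = re.search(r'o{2}', 'food').group()

# Non-capturing alternation: "industr(?:y|ies)"
industry_words = re.findall(r'industr(?:y|ies)', 'industry and industries')

# Backreference: "(.)\1" matches two identical consecutive characters
doubled = re.search(r'(.)\1', 'balloon').group()

print(star_matches, plus_on_z, two_o_in_bob, two_o_in_food)
print(industry_words, doubled)
```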

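2 re Module

2.1 Constants

2.1.1 I / IGNORECASE

Value: 2

Purpose: case-insensitive matching

2.1.2 L / LOCALE

Value: 4

Purpose: makes \w, \W, \b, \B depend on the current locale (in Python 3 this flag can only be used with bytes patterns)

2.1.3 M / MULTILINE

Value: 8

Purpose: multiline matching, so that ^ and $ also match at the start and end of each line

2.1.4 S / DOTALL

Value: 16

Purpose: lets "." match the newline character "\n" as well

2.1.5 U / UNICODE

Value: 32

Purpose: interprets characters according to the Unicode character set, which affects \w, \W, \b, \B. This is the default for str patterns; use ASCII to switch it off.

2.1.6 X / VERBOSE

Value: 64

Purpose: ignores whitespace inside the pattern and treats # as the start of a comment, which allows more readable patterns

2.1.7 A / ASCII

Value: 256

Purpose: makes \w, \W, \b, \B, \d, \D match ASCII characters only, instead of the default full Unicode matching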
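
Because the constants above are integer-valued, several of them can be combined with the bitwise OR operator. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# Flag constants are integer-valued, so they combine with bitwise OR
combined = re.IGNORECASE | re.MULTILINE
print(int(re.I), int(re.M), int(combined))

text = 'first line\nSECOND line'

# Without flags, ^ only anchors at the very start and case must match
no_flags = re.findall(r'^second', text)

# With I | M, ^ anchors at the start of every line and case is ignored
with_flags = re.findall(r'^second', text, re.I | re.M)
print(no_flags, with_flags)
```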

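2.2 Functions

2.2.1 compile()

Call: pt = re.compile(pattern, flags=0)

Purpose: compiles a regular expression and returns a compiled pattern object

Parameters: pattern, flags

pattern: str, the regular expression to compile

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: obj, the compiled pattern object

Note: Compiling converts the pattern string into an internal compiled form. When the same regular expression is used many times, compiling it once up front avoids going through the pattern string again on every call, which can improve performance.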
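
A minimal sketch of the compile-once, reuse-many-times idea described in the note. The pattern and sample lines are invented for illustration.

```python
import re

# Compile once, then reuse the compiled object for every input line
word_pattern = re.compile(r'\b\w+ing\b', re.IGNORECASE)

lines = ['Running fast', 'nothing to see', 'plain text']
hits = [word_pattern.findall(line) for line in lines]
print(hits)

# The compiled object offers the same operations as the module-level functions
first_hit = word_pattern.search('singing loud').group()
print(word_pattern.pattern, first_hit)
```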

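2.2.2 match()

Call: pt = re.match(pattern, string, flags=0)

Purpose: matches the pattern at the beginning of the string

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: None/obj, the match object, or None if the match fails

2.2.3 fullmatch()

Call: pt = re.fullmatch(pattern, string, flags=0)

Purpose: matches from the beginning of the string and succeeds only if the whole string matches the expression

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: None/obj, the match object, or None if the match fails

2.2.4 search()

Call: pt = re.search(pattern, string, flags=0)

Purpose: scans through the string and returns the first location where the pattern matches

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: None/obj, the match object, or None if the match fails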

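2.2.5 sub()

Call: pt = re.sub(pattern, repl, string, count=0, flags=0)

Purpose: finds matches of the pattern in the target string and replaces them with the given text

Parameters: pattern, repl, string, count, flags

pattern: str/obj, a regular expression string or compiled pattern object

repl: str/function, the replacement string, or a function that returns a string

string: str, the string to match and replace in

count: int, the maximum number of replacements; by default every match is replaced

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: str, the string after replacement (the original string unchanged if nothing matched)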
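
The effect of the count parameter can be seen in a short sketch; the sample string is made up for illustration.

```python
import re

text = 'a-b-c-d'

# count=0 (the default) replaces every match
replace_all = re.sub(r'-', '+', text)

# count=2 replaces only the first two matches
replace_two = re.sub(r'-', '+', text, count=2)

print(replace_all, replace_two)
```

In both cases the return value is an ordinary str.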

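2.2.6 subn()

Call: pt = re.subn(pattern, repl, string, count=0, flags=0)

Purpose: same as sub(), but also returns the number of replacements made

Parameters: pattern, repl, string, count, flags

pattern: str/obj, a regular expression string or compiled pattern object

repl: str/function, the replacement string, or a function that returns a string

string: str, the string to match and replace in

count: int, the maximum number of replacements; by default every match is replaced

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: (pt, n)

pt: str, the string after replacement

n: int, the number of replacements performed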

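2.2.7 split()

Call: pt = re.split(pattern, string, maxsplit=0, flags=0)

Purpose: splits the target string at the places where the regular expression matches

Parameters: pattern, string, maxsplit, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to split

maxsplit: int, the maximum number of splits; by default the string is split at every match

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: list, the list of strings after splitting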
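
A small sketch of split() with and without maxsplit. It also shows one behavior not mentioned above: when the pattern contains a capturing group, the separators are kept in the result. The sample strings are invented for illustration.

```python
import re

text = 'one, two;three  four'

# Split on any run of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', text)

# maxsplit=1 stops after the first split; the rest stays in one piece
limited = re.split(r'[,;\s]+', text, maxsplit=1)

# A capturing group in the pattern keeps the separators in the result
with_seps = re.split(r'([,;])', 'a,b;c')

print(parts, limited, with_seps)
```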

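2.2.8 findall()

Call: pt = re.findall(pattern, string, flags=0)

Purpose: returns a list of all non-overlapping matches

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: list, the list of matched strings (not match objects)

Note: If the pattern contains more than one group, each element of the returned list is a tuple whose items are the strings matched by the individual groups.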
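
How the shape of the findall() result depends on the number of groups can be sketched as follows; the sample text is made up for illustration.

```python
import re

text = 'x=1, y=22'

# No group: each element is the whole match
no_group = re.findall(r'\w=\d+', text)

# One group: each element is the text of that group
one_group = re.findall(r'\w=(\d+)', text)

# Several groups: each element is a tuple with one item per group
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_group, one_group, two_groups)
```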

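2.2.9 finditer()

Call: pt = re.finditer(pattern, string, flags=0)

Purpose: returns an iterator over all non-overlapping matches

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: pt

pt: iterator, an iterator that yields match objects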

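2.3 Classes

2.3.1 __Regex class

Instantiation: pt = re.compile(pattern, flags=0)

Purpose: the compiled pattern class; its instances provide the basic re functions as methods, such as search/match/sub

Parameters: see compile()

Returns: see compile()

2.3.2 __Match class

Instantiation: mt = re.match/search(pattern, string, flags=0)

Purpose: holds the result of a successful match

Parameters: pattern, string, flags

pattern: str/obj, a regular expression string or compiled pattern object

string: str, the string to match against

flags: int, re flags (a, i, L, s, u, m, x) that control matching behavior

Returns: mt

mt: None/obj, the match object, or None if the match fails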

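2.3.2.1 group() method

Call: pt = mt.group(num=0)

Purpose: returns the whole match or the specified subgroup

Parameters: num

num: int, the subgroup number (0 means the whole match); a group name can also be passed for named groups

Returns: pt

pt: str, the text of the whole match or of the subgroup

2.3.2.2 groups() method

Call: pt = mt.groups()

Purpose: returns a tuple containing all subgroups of the match

Parameters: none

Returns: pt

pt: tuple, the text matched by each subgroup; an empty tuple if the pattern has no groups

2.3.2.3 groupdict() method

Call: pt = mt.groupdict()

Purpose: returns a dictionary of all named subgroups of the match

Parameters: none

Returns: pt

pt: dict, mapping each group name to the text it matched

2.3.2.4 span() method

Call: pt = mt.span(group=0)

Purpose: returns a tuple with the start and end positions of the match within the searched string

Parameters: group

group: int, the subgroup number

Returns: pt

pt: tuple, the position information, (start, end)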
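
A brief sketch of span() for the whole match and for individual subgroups. The input string is invented for illustration; the end index is exclusive, so the tuple can be used directly for slicing.

```python
import re

source = 'id: 123-abcd'
mt = re.search(r'(\d{3})-(\w{4})', source)

whole = mt.span()      # span of the whole match
first = mt.span(1)     # span of subgroup 1
second = mt.span(2)    # span of subgroup 2

# Slicing the source with the span reproduces the matched text
sliced = source[whole[0]:whole[1]]
print(whole, first, second, sliced)
```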

3 Matching with Regular Expressions

3.1 Common Usage Examples

The following examples show commonly used re patterns and functions.

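After importing re, a helper function regexp is defined for convenience. It compiles the pattern, runs the match, and prints the result. By default it matches with search and calls group on the result to get the matched text; both defaults can be changed through the keyword arguments.

Note: this helper trades some readability for convenience in the examples that follow.

 import re

 """
 Regular Expression
 """

 def regexp(pattern, target, *args, grp='group', prt=True, func='search'):
     # Compile once, then call the requested method (search/match) on the compiled pattern
     pat = re.compile(pattern)
     try:
         # func returns a match object or None; grp names the match method to call on it
         r = getattr(getattr(pat, func)(target), grp)(*args)
     except AttributeError:
         # No match: None has no group()/groups() method
         r = None
     if prt:
         print(r)
     return r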

The regexp function above is used in the examples that follow; the trailing comments show the output.

Use "." to match any character except "\n"

 # Use . to match all
 print(30*'-')
 regexp('.+', 'exit soon\nHurry up.') # 'exit soon'

Use "()" to create subgroups

 # Use () to make sub-groups
 print(30*'-')
 regexp('(.+) (.+)', 'exit soon', 1) # 'exit'
 regexp('(.+) (.+)', 'exit soon', 2) # 'soon'
 regexp('(.+) (.+)', 'exit soon', grp='groups')  # ('exit', 'soon')

Use "^" to anchor the match at the start ("$" anchors at the end)

 # Use ^ to search from head
 print(30*'-')
 regexp('^The', 'The End')   # 'The'
 regexp('^The', 'In The End')    # None

Use "\b" to match a word boundary (whitespace and so on)

 # Use \b to search boundary
 print(30*'-')
 regexp(r'\bThe\b', 'In The End') # 'The'
 regexp(r'\bThe', 'In TheEnd')   # 'The'
 regexp(r'The\b', 'In TheEnd')   # None

Comparing match and search: the difference is where matching starts

 # match and search
 print(30*'-')
 regexp('The', 'In The End', func='search') # 'The'
 regexp('The', 'In The End', func='match')   # None

Comparing findall and finditer: the difference is whether a list or an iterator is returned

 # findall and finditer
 # Note:
 # findall returns a list that contains the matched strings
 # finditer returns an iterator that yields match objects
 # re.IGNORECASE makes the matching case-insensitive
 print(30*'-')
 print(re.findall('The', 'In The End, these things merged', re.I)) # ['The', 'the']
 itera = re.finditer('The', 'In The End, these things merged', re.I)
 for x in itera:
     print(x.group())                                        # 'The', then 'the'

Comparing sub and subn: the difference is whether the result includes the replacement count

 # sub and subn
 print(re.sub('X', 'LIKE', 'This is X, X is acting'))     # 'This is LIKE, LIKE is acting'
 print(re.subn('X', 'LIKE', 'This is X, X is acting'))    # ('This is LIKE, LIKE is acting', 2)

Use split to split the target string

 # split: split(re-expression, string)
 print(re.split(', |\n', 'This is amazing\nin the end, those things merged'))    # ['This is amazing', 'in the end', 'those things merged']

Use \N to refer to the text matched by a subgroup, and (?P<name>) to create a named subgroup

 # \N: use \N to represent sub group, N is the number of sub group
 print(re.sub(r'(.{3})-(.{3})-(.{3})', r'\2-\3-\1', '123-def-789'))  # 'def-789-123'
 # (?P<name>): similar to \N, adds a tag name to each sub group,
 # and \g<name> fetches the sub group
 print(re.sub(r'(?P<first>\d{3})-(?P<second>\d{3})-(?P<third>\d{3})', r'\g<second>-\g<third>-\g<first>', '123-456-789')) # '456-789-123'

Use (?P=name) to reuse the text matched by a named subgroup

Note: (?P=name) must appear in the same pattern as (?P<name>), and it refers to the matched text, not to the group's pattern. To use a subgroup's match in a different expression, such as the replacement string of sub, use \N or \g<name> instead.

 # (?P=name): use this expression to reuse a former sub group result
 # Note: this expression only gets the matched result, not the re pattern
 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-abc-123'))    # Match obj, '123-abc-abc-123'
 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-def-456'))    # None
 # Note: (?P=name) should be used inside a pattern that defines the named group earlier
 print(re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato'))  # 'YYXXXYY'
 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd')) # not substituted: '(?P=char)-(?P=digit)'
 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))    # 'abcd-123'
 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))   # 'abcd-123'

Use groupdict() to get the match results as a dictionary

 # groupdict(): return dict of named pattern
 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').groupdict())    # {'digit': '123', 'char': 'abcd'}
 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('char'))  # 'abcd'
 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('digit')) # '123'

Using the aiLmsux inline flags in a pattern

 # re extensions
 # use (?aiLmsux): re.A, re.I, re.L, re.M, re.S, re.U, re.X
 # use (?imsx)
 # (?i) --> re.I/re.IGNORECASE
 print(re.findall(r'(?i)yes', 'yes, Yes, YES'))  # ['yes', 'Yes', 'YES']
 # (?m) --> re.M/re.MULTILINE: match multiline, ^ for head of line, $ for end of line
 print(re.findall(r'(?im)^The|\w+e$', 'The first day, the last one,\nthe great guy, the puppy love'))  # ['The', 'the', 'love']
 # (?s) --> re.S/re.DOTALL: dot (.) also matches \n (it does not by default)
 print(re.findall(r'(?s)Th.+', "This is the biggest surprise\nI'd ever seen"))  # ["This is the biggest surprise\nI'd ever seen"]
 # (?x) --> re.X/re.VERBOSE: makes re ignore whitespace and anything after '#' in the pattern
 # This extension makes the pattern easier to read and allows comments inside it
 print(re.search(r'''(?x)
                     \((\d{3})\) # Area code
                     [ ]         # blank
                     (\d{3})     # prefix
                     -           # connecting line
                     (\d{4})     # suffix
                     ''', '(123) 456-7890').groups())    # ('123', '456', '7890')

Use (?:...) to create a subgroup that is not saved for later reuse

 # (?:...): make a sub group that is not saved and cannot be referenced later
 print(re.findall(r'(?:\w{3}\.)(\w+\.com)', 'www.google.com'))   # ['google.com']

Using positive and negative lookahead

 # (?=...) and (?!...)
 # (?=...): match the pattern only if it is followed by ...; the pattern is placed before (?=...)
 print(re.findall(r'\d{3}(?=Start)', '222Start333, this is foo, 777End666')) # ['222']
 # (?!...): match the pattern only if it is not followed by ...; the pattern is placed before (?!...)
 print(re.findall(r'\d{3}(?!End)', '222Start333, this is foo, 777End666')) # ['222', '333', '666']

Using positive and negative lookbehind

 # (?<=...) and (?<!...)
 # (?<=...): match the pattern only if it is preceded by ...; the pattern is placed after (?<=...)
 print(re.findall(r'(?<=Start)\d{3}', '222Start333, this is foo, 777End666')) # ['333']
 # (?<!...): match the pattern only if it is not preceded by ...; the pattern is placed after (?<!...)
 print(re.findall(r'(?<!End)\d{3}', '222Start333, this is foo, 777End666')) # ['222', '333', '777']

Using conditional matching

 # (?(id/name)Y|X): if sub group \id or name exists, match Y, otherwise match X
 # The code below first matches the first char: if it is 'x', a sub group is stored; if it is 'y', nothing is stored.
 # It then matches the second char: if the sub group was stored ('x' matched), it must be 'y', otherwise 'x'.
 print(re.search(r'(?:(x)|y)(?(1)y|x)', 'yx'))    # Match obj, 'yx'
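
To see both branches of the conditional at work, the same pattern can be tried against all four two-character combinations of 'x' and 'y'. This is a small sketch; the results dictionary is only an illustrative name.

```python
import re

# Same conditional pattern as above: group 1 is set only when the first char is 'x'
pattern = re.compile(r'(?:(x)|y)(?(1)y|x)')

# fullmatch requires the whole two-character string to match
results = {s: pattern.fullmatch(s) is not None for s in ('xy', 'yx', 'xx', 'yy')}
print(results)
```

Only the mixed strings match: after 'x' the pattern demands 'y', and after 'y' it demands 'x'.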

Greedy and non-greedy matching

 # Greedy match: '+' and '*' are greedy; append '?' to make them non-greedy
 print(re.search(r'.+(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123
 print(re.search(r'.+?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123
 print(re.search(r'.*(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123
 print(re.search(r'.*?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123

Full match versus partial match

 # match and fullmatch
 print(re.match(r'This is a full match', 'this is a full match string', re.I))   # Match obj, 'this is a full match'
 print(re.fullmatch(r'This is a full match', 'this is a full match string', re.I))   # None

Use span() to see where the match is located

 # span function
 print(re.search(r'Google', 'www.google.com', re.I).span())  # (4, 10)

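3.2 Replacement with a Function

This example uses a function (or an anonymous function) in place of the replacement string.

First the string to be matched is assigned, then a double function is defined that takes the matched subgroup, multiplies it by 2, and returns the result as a string. The double function is passed to sub through the repl parameter to perform the replacement. The same thing can be done with a lambda function.

Note: when the repl argument of sub is a function, it must accept one parameter, which is the match object for the current match.

 import re

 s = 'AD892SDA213VC2'

 def double(matched):
     # matched is the match object; 'value' is the named group holding the digits
     value = int(matched.group('value'))
     return str(value*2)

 print(re.sub(r'(?P<value>\d+)', double, s))
 print(re.sub(r'(?P<value>\d+)', lambda x: str(int(x.group('value'))*2), s))
 # Output: 'AD1784SDA426VC4'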
