Python re module (regular expressions)

Introduction to regular expressions (RE)

The re module is Python's standard-library module for working with regular expressions.

 r"""Support for regular expressions (RE).

 This module provides regular expression matching operations similar to

 those found in Perl.  It supports both 8-bit and Unicode strings; both

 the pattern and the strings being processed can contain null bytes and

 characters outside the US ASCII range.

 Regular expressions can contain both special and ordinary characters.

 Most ordinary characters, like "A", "a", or "0", are the simplest

 regular expressions; they simply match themselves.  You can

 concatenate ordinary characters, so last matches the string 'last'.

 The special characters are:

     "."      Matches any character except a newline.

     "^"      Matches the start of the string.

     "$"      Matches the end of the string or just before the newline at

              the end of the string.

     "*"      Matches 0 or more (greedy) repetitions of the preceding RE.

              Greedy means that it will match as many repetitions as possible.

     "+"      Matches 1 or more (greedy) repetitions of the preceding RE.

     "?"      Matches 0 or 1 (greedy) of the preceding RE.

     *?,+?,?? Non-greedy versions of the previous three special characters.

     {m,n}    Matches from m to n repetitions of the preceding RE.

     {m,n}?   Non-greedy version of the above.

     "\\"     Either escapes special characters or signals a special sequence.

     []       Indicates a set of characters.

              A "^" as the first character indicates a complementing set.

     "|"      A|B, creates an RE that will match either A or B.

     (...)    Matches the RE inside the parentheses.

              The contents can be retrieved or matched later in the string.

     (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).

     (?:...)  Non-grouping version of regular parentheses.

     (?P<name>...) The substring matched by the group is accessible by name.

     (?P=name)     Matches the text matched earlier by the group named name.

     (?#...)  A comment; ignored.

     (?=...)  Matches if ... matches next, but doesn't consume the string.

     (?!...)  Matches if ... doesn't match next.

     (?<=...) Matches if preceded by ... (must be fixed length).

     (?<!...) Matches if not preceded by ... (must be fixed length).

     (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,

                        the (optional) no pattern otherwise.

 The special sequences consist of "\\" and a character from the list

 below.  If the ordinary character is not on the list, then the

 resulting RE will match the second character.

     \number  Matches the contents of the group of the same number.

     \A       Matches only at the start of the string.

     \Z       Matches only at the end of the string.

     \b       Matches the empty string, but only at the start or end of a word.

     \B       Matches the empty string, but not at the start or end of a word.

     \d       Matches any decimal digit; equivalent to the set [0-9] in

              bytes patterns or string patterns with the ASCII flag.

              In string patterns without the ASCII flag, it will match the whole

              range of Unicode digits.

     \D       Matches any non-digit character; equivalent to [^\d].

     \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in

              bytes patterns or string patterns with the ASCII flag.

              In string patterns without the ASCII flag, it will match the whole

              range of Unicode whitespace characters.

     \S       Matches any non-whitespace character; equivalent to [^\s].

     \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]

              in bytes patterns or string patterns with the ASCII flag.

              In string patterns without the ASCII flag, it will match the

              range of Unicode alphanumeric characters (letters plus digits

              plus underscore).

              With LOCALE, it will match the set [0-9_] plus characters defined

              as letters for the current locale.

     \W       Matches the complement of \w.

     \\       Matches a literal backslash.

 This module exports the following functions:

     match     Match a regular expression pattern to the beginning of a string.

     fullmatch Match a regular expression pattern to all of a string.

     search    Search a string for the presence of a pattern.

     sub       Substitute occurrences of a pattern found in a string.

     subn      Same as sub, but also return the number of substitutions made.

     split     Split a string by the occurrences of a pattern.

     findall   Find all occurrences of a pattern in a string.

     finditer  Return an iterator yielding a match object for each match.

     compile   Compile a pattern into a RegexObject.

     purge     Clear the regular expression cache.

     escape    Backslash all non-alphanumerics in a string.

 Some of the functions in this module takes flags as optional parameters:

     A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D

                    match the corresponding ASCII character categories

                    (rather than the whole Unicode categories, which is the

                    default).

                    For bytes patterns, this flag is the only available

                    behaviour and needn't be specified.

     I  IGNORECASE  Perform case-insensitive matching.

     L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.

     M  MULTILINE   "^" matches the beginning of lines (after a newline)

                    as well as the string.

                    "$" matches the end of lines (before a newline) as well

                    as the end of the string.

     S  DOTALL      "." matches any character at all, including the newline.

     X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.

     U  UNICODE     For compatibility only. Ignored for string patterns (it

                    is the default), and forbidden for bytes patterns.

 This module also defines an exception 'error'.

 """

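A few entries in the docstring above are easier to follow with a concrete case. The following is a minimal sketch of greedy versus non-greedy repetition, named groups, and lookahead; the sample strings and variable names are made up for illustration and do not come from the re documentation.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" takes as much as it can, so the match runs from the
# first "<" to the last ">" in the string.
greedy = re.findall(r"<.*>", html)

# Non-greedy: ".*?" stops at the earliest ">" it can, one match per tag.
lazy = re.findall(r"<.*?>", html)

# Named groups: (?P<name>...) makes the group accessible by name.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2021-07")
year, month = m.group("year"), m.group("month")

# Lookahead: (?=...) checks what follows without consuming it.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(lazy)         # ['<b>', '</b>', '<i>', '</i>']
print(year, month)  # 2021 07
print(usd_amounts)  # ['10', '30']
```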
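Using regular expressions in Python takes a few steps, but each one is simple.

1. Import the regular expression module with import re.

2. Create a Regex object with the re.compile() function (remember to use a raw string).

3. Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.

4. Call the Match object's group() method to get the text that actually matched.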
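The four steps can be sketched as follows. The phone-number pattern is the one used later in these notes; the sample sentence and the variable names phone_regex and mo are only illustrative.

```python
# Step 1: import the module.
import re

# Step 2: compile the pattern (a raw string, so "\d" reaches the regex engine intact).
phone_regex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

# Step 3: search() returns a Match object, or None when nothing matches.
mo = phone_regex.search("My number is 415-555-4242.")

# Step 4: group() returns the text that actually matched.
if mo is not None:
    print("Phone number found:", mo.group())
```

Checking for None before calling group() avoids an AttributeError when the text contains no match.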

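Passing raw strings to re.compile()

In Python string literals the backslash (\) is the escape character. The string '\n' is a single newline character,

not a backslash followed by a lowercase n. To get a literal backslash you have to type the escape sequence \\,

so '\\n' is a backslash followed by a lowercase n. Putting an r before the opening quote

marks the string as a raw string, in which backslashes are not treated as escape characters.

Because regular expressions use backslashes heavily, it is convenient to pass raw strings to re.compile()

instead of typing the extra backslashes.

Typing r'\d\d\d-\d\d\d-\d\d\d\d'

is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.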
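A quick check of the claims above, as a small sketch with illustrative variable names: the escaped and raw spellings produce the same string, and a raw '\n' really is two characters.

```python
import re

escaped = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d"
raw = r"\d\d\d-\d\d\d-\d\d\d\d"

# Both spellings describe exactly the same pattern text.
same = escaped == raw

# '\n' is one newline character; r'\n' is a backslash plus the letter n.
newline_len = len("\n")
raw_len = len(r"\n")

print(same, newline_len, raw_len)  # True 1 2

# Either spelling compiles to a pattern that matches a phone number.
found = re.search(raw, "call 415-555-4242").group()
print(found)  # 415-555-4242
```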

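1. compile(pattern, flags=0)

Python source code is compiled to bytecode before the interpreter executes it.
In a similar way, a regular expression pattern has to be compiled into a regex object before it can be matched, and re.compile() does this explicitly. The module-level functions such as re.search() compile the pattern for you and cache the result, so precompiling mainly pays off when the same pattern is reused many times, for example inside a loop.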

def compile(pattern, flags=0):

    "Compile a regular expression pattern, returning a pattern object."

    return _compile(pattern, flags)
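As a rough sketch of how a compiled pattern is typically reused, the example below compiles once and applies the object to several strings. The names word_re and lines, and the sample data, are assumptions made for illustration.

```python
import re

# Compile once; the flags are stored in the pattern object.
word_re = re.compile(r"[a-z]+", re.IGNORECASE)

lines = ["Hello world", "123", "Regex FTW"]

# Reuse the same compiled object for every line.
counts = [len(word_re.findall(line)) for line in lines]
print(counts)  # [2, 0, 2]

# The module-level function gives the same result; it compiles
# (and caches) the pattern internally on each call.
same = re.findall(r"[a-z]+", "Hello world", re.IGNORECASE) == word_re.findall("Hello world")
print(same)  # True
```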

2. findall(pattern, string, flags=0)

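match and search each return at most one match. To get every match in the string, use findall.
findall returns a list of all non-overlapping matches. The list is not de-duplicated: text that matches several times appears several times.

If the pattern has no group, each item is the full matched string; if it has exactly one group, each item is the string matched by that group.
If the pattern has more than one group, each item is a tuple of the group strings. Empty matches are included in the result.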

def findall(pattern, string, flags=0):

    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return

    a list of groups; this will be a list of tuples if the pattern

    has more than one group.

    Empty matches are included in the result."""

    return _compile(pattern, flags).findall(string)
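The effect of groups on the shape of the result can be seen in a small sketch. The key=value text and the variable names are invented for this example.

```python
import re

text = "alice=1, bob=22, carol="

# No group: each item is the whole match.
no_group = re.findall(r"\w+=\d*", text)

# One group: each item is the string matched by that group.
one_group = re.findall(r"(\w+)=\d*", text)

# Two groups: each item is a tuple; a group that matched nothing is ''.
two_groups = re.findall(r"(\w+)=(\d*)", text)

print(no_group)    # ['alice=1', 'bob=22', 'carol=']
print(one_group)   # ['alice', 'bob', 'carol']
print(two_groups)  # [('alice', '1'), ('bob', '22'), ('carol', '')]
```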

3. match(pattern, string, flags=0)

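Matches from the start of the string. Returns a match object on success, or None on failure.

Some common values for flags:

X (VERBOSE): ignore whitespace and comments in the pattern
I (IGNORECASE): case-insensitive matching
S (DOTALL): "." matches any character, including a newline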

def match(pattern, string, flags=0):

    """Try to apply the pattern at the start of the string, returning

    a match object, or None if no match was found."""

    return _compile(pattern, flags).match(string)
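The three flags listed above can be tried out with match() as in the sketch below. The sample strings and the names m1 to m6 are illustrative only.

```python
import re

# match() only tries the start of the string.
m1 = re.match(r"world", "Hello world")            # None: "world" is not at position 0

# I (IGNORECASE): letter case is ignored.
m2 = re.match(r"hello", "Hello world")            # None without the flag
m3 = re.match(r"hello", "Hello world", re.I)      # matches "Hello"

# S (DOTALL): "." also matches a newline.
m4 = re.match(r"a.b", "a\nb")                     # None: "." skips newlines by default
m5 = re.match(r"a.b", "a\nb", re.S)               # matches "a\nb"

# X (VERBOSE): whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.X)
m6 = pattern.match("555-4242")

print(m1, m2, m4)                          # None None None
print(m3.group(), repr(m5.group()), m6.group())
```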

4. search(pattern, string, flags=0)

Scans the whole string and returns the first match; returns None if nothing matches.

def search(pattern, string, flags=0):

    """Scan through string looking for a match to the pattern, returning

    a match object, or None if no match was found."""

    return _compile(pattern, flags).search(string)
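A short sketch of search() returning only the first match, along with the position information the match object carries. The sentence and variable names are made up for the example.

```python
import re

text = "order 66 shipped in 3 days"

# search() stops at the first match, even though "3" would also match.
m = re.search(r"\d+", text)
first = m.group()
position = m.span()

# When nothing matches, the result is None rather than an empty match.
missing = re.search(r"\d+", "no digits here")

print(first, position)  # 66 (6, 8)
print(missing)          # None
```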

search() vs. match()

# Python offers two different primitive operations based on regular expressions:

# re.match() checks for a match only at the beginning of the string,

# while re.search() checks for a match anywhere in the string (this is what Perl does by default).

>>> re.match("c", "abcdef")    # No match

>>> re.search("c", "abcdef")   # Match

<re.Match object; span=(2, 3), match='c'>

# Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

# re.match('str', "string") is equivalent to re.search('^str', "string") when MULTILINE is not set

>>> re.match("c", "abcdef")    # No match

>>> re.search("^c", "abcdef")  # No match

>>> re.search("^a", "abcdef")  # Match

<re.Match object; span=(0, 1), match='a'>

# Note however that in MULTILINE mode match() only matches at the beginning of the string,

# whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

# In other words, MULTILINE does not let match() start at a later line,

# while search() with a '^' pattern tries the start of every line in the string.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match

>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match

<re.Match object; span=(4, 5), match='X'>

5. sub(pattern, repl, string, count=0, flags=0)

Replaces the parts of the string that match the pattern with repl.

def sub(pattern, repl, string, count=0, flags=0):

    """Return the string obtained by replacing the leftmost

    non-overlapping occurrences of the pattern in string by the

    replacement repl.  repl can be either a string or a callable;

    if a string, backslash escapes in it are processed.  If it is

    a callable, it's passed the match object and must return

    a replacement string to be used."""

    return _compile(pattern, flags).sub(repl, string, count)
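The docstring mentions that repl can be a string with backslash escapes or a callable. Both forms, plus count and subn, are sketched below; the sample strings and the helper name double are hypothetical.

```python
import re

# String repl: \1, \2, \3 refer back to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2021-07-04 and 2022-12-25")

# Callable repl: it receives the match object and returns the replacement string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples, 10 pears")

# count limits how many of the leftmost matches are replaced.
first_only = re.sub(r"a", "A", "banana", count=1)

# subn() also reports how many substitutions were made.
result, n = re.subn(r"a", "A", "banana")

print(swapped)     # 04/07/2021 and 25/12/2022
print(doubled)     # 6 apples, 20 pears
print(first_only)  # bAnana
print(result, n)   # bAnAnA 3
```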

6. split(pattern, string, maxsplit=0, flags=0)
Splits the string at each match of the pattern.

def split(pattern, string, maxsplit=0, flags=0):

    """Split the source string by the occurrences of the pattern,

    returning a list containing the resulting substrings.  If

    capturing parentheses are used in pattern, then the text of all

    groups in the pattern are also returned as part of the resulting

    list.  If maxsplit is nonzero, at most maxsplit splits occur,

    and the remainder of the string is returned as the final element

    of the list."""

    return _compile(pattern, flags).split(string, maxsplit)
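The three behaviours described in the docstring (plain splitting, capturing parentheses, and maxsplit) can be sketched like this. The input strings and variable names are chosen only for illustration.

```python
import re

# Split on a comma or semicolon followed by optional whitespace.
parts = re.split(r"[,;]\s*", "a, b;c,  d")

# With capturing parentheses, the captured separators are kept in the list.
with_seps = re.split(r"([,;])\s*", "a, b;c")

# maxsplit=2: at most two splits, the rest of the string is the last element.
limited = re.split(r"[,;]\s*", "a, b;c,  d", maxsplit=2)

print(parts)      # ['a', 'b', 'c', 'd']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
print(limited)    # ['a', 'b', 'c,  d']
```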

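7. group() and groups()

Two main methods of a match object:

group() returns the entire match when called without arguments (or with 0), or a specific subgroup when given its number or name

groups() returns a tuple containing all the subgroups; if the pattern has no subgroups, it returns an empty tuple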
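The difference between the two methods is easiest to see side by side. This is a small sketch; the pattern, sample text, and variable names are assumptions for the example.

```python
import re

m = re.search(r"(\d{3})-(\d{4})", "call 555-4242 now")

whole = m.group()       # same as m.group(0): the entire match
first = m.group(1)      # text matched by the first subgroup
second = m.group(2)     # text matched by the second subgroup
pair = m.group(1, 2)    # several group numbers give a tuple
all_groups = m.groups() # every subgroup, as a tuple

# A pattern without subgroups still has group(), but groups() is empty.
plain = re.search(r"\d+", "abc 42")
plain_groups = plain.groups()

print(whole, first, second)  # 555-4242 555 4242
print(pair, all_groups)      # ('555', '4242') ('555', '4242')
print(plain.group(), plain_groups)  # 42 ()
```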
