【362】Python Regular Expressions

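A regular expression defines a rule for strings; a string that satisfies the rule is said to "match".
Regular expressions are themselves written as strings, so you need to know how to use characters to describe characters.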

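re.match tries to match a pattern at the start of the string; if the pattern does not match there, match() returns None.

Beyond simply testing whether a string matches, regular expressions can also extract substrings. Parentheses () mark the groups to extract. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from a matching string.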

re.match(pattern, string, flags=0)

import re

test = 'string entered by the user'

if re.match(r'regular expression', test):

    print('ok')

else:

    print('failed')

re.search scans the whole string and returns the first successful match.

re.search(pattern, string, flags=0)
span(): returns the (start, end) index range of the match
group(): returns the matched text
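
As a small illustration of span() and group(), the sketch below searches a made-up string (the text and the variable names are assumptions, not part of the original notes):

```python
import re

text = 'order 66 shipped on day 12'

# re.search scans the whole string and stops at the first match.
m = re.search(r'\d+', text)

first_number = m.group()   # the matched text
first_span = m.span()      # (start, end) indices into text

print(first_number, first_span)
# Slicing the original string with the span gives back the matched text.
print(text[first_span[0]:first_span[1]] == first_number)
```

Only the first number is reported; the later '12' is never reached because search stops at the first match.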

re.compile compiles a pattern into a regular expression object whose match(), search() and other methods can then be called.

The two snippets below have the same effect:

prog = re.compile(pattern)

result = prog.match(string)

# is equivalent to

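result = re.match(pattern, string)

The main benefit is that the compiled object can be reused, much like defining a function once and calling it many times. This saves recompiling the pattern, although the re module also caches recently used patterns.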
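
A minimal sketch of reusing one compiled object across several inputs. The pattern and the candidate strings are hypothetical and chosen only for illustration:

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
phone = re.compile(r'\d{3}-\d{3,8}')

candidates = ['010-12345', '01-12345', 'tel: 021-7654321']

# The compiled object exposes the same methods as the module-level functions.
from_start = [bool(phone.match(s)) for s in candidates]
anywhere = [bool(phone.search(s)) for s in candidates]

print(from_start)
print(anywhere)
```

The third string only matches with search(), because the number does not sit at the start of the string.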

re.sub replaces the matches of a pattern in a string.

re.sub(pattern, repl, string, count=0, flags=0)
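re.match only matches at the beginning of the string: if the beginning does not fit the pattern, the match fails and the function returns None. re.search scans through the whole string until it finds a match.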
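
The difference can be sketched with a short example. The string is made up, and re.fullmatch is an addition not mentioned in the notes, shown only to contrast with match:

```python
import re

s = 'abc123'

at_start = re.match(r'\d+', s)      # None: the string does not begin with a digit
anywhere = re.search(r'\d+', s)     # finds '123' further along
print(at_start, anywhere.group())

# match anchors only the start; it does not require the pattern to reach the end.
prefix_only = re.match(r'[a-z]+', s).group()
whole_string = re.fullmatch(r'[a-z]+', s)   # None: '123' is left over
print(prefix_only, whole_string)
```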

ref: Python re.sub Examples

☀☀☀<< Example >>☀☀☀

import re

name = "alex@bingnan#is a good boy!!!! Hahahaha-?-=-=-=+_+_+_+_+$%$%#@#@#$!#@)(!$&)*#(@)*$#(@467749237492365)"

name_alpha = re.sub("[^a-zA-Z]", " ", name)

print(name_alpha)

# Eliminate duplicate whitespaces

print(re.sub(r"\s+", " ", name_alpha))

# output

# alex bingnan is a good boy     Hahahaha

# alex bingnan is a good boy Hahahaha

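Python's re module provides re.sub for replacing matches in a string.

The compile function compiles a regular expression into a pattern (Pattern) object, whose match() and search() methods can then be used.

findall finds every substring that the pattern matches and returns them as a list; if nothing matches, it returns an empty list.

Note: match and search find a single match; findall finds all of them.

finditer is similar to findall: it finds every substring that the pattern matches, but returns the matches as an iterator.

split splits the string at each substring the pattern matches and returns the pieces as a list.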
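
A brief sketch of findall, finditer and split side by side, using a made-up input string (all names here are illustrative assumptions):

```python
import re

text = 'a1, b22; c333'

# findall returns the matched substrings as a list of strings.
all_numbers = re.findall(r'\d+', text)

# finditer yields match objects, so positions are available as well.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

# split cuts the string wherever the pattern matches (here: , or ; plus spaces).
parts = re.split(r'[,;]\s*', text)

# With capturing groups, findall returns the groups rather than the whole match.
pairs = re.findall(r'([a-z])(\d+)', text)

print(all_numbers, spans, parts, pairs, sep='\n')
```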

\d matches a single digit;

\d matches any digit, while \D matches any nondigit:

\w matches a single letter, digit or underscore;

\w matches any character that can be part of a word (Python identifier), that is, a letter, the underscore or a digit, while \W matches any other character:

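\W matches any character that is not a letter, digit or underscore;

\s matches a single whitespace character (including Tab, carriage return and similar);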

\s matches any space, while \S matches any nonspace character:

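. matches any single character (except a newline by default);

* means zero or more repetitions (>= 0) of the preceding character, or of a group of characters enclosed in parentheses;

+ means one or more repetitions (>= 1). It is read together with the preceding element: \s matches one whitespace character (including Tab and similar), so \s+ means at least one whitespace character, matching for example ' ' or '  ';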

# The leftmost match wins, which here is zero characters long

>>> re.search(r'\d*', 'a123456b')

<re.Match object; span=(0, 0), match=''>

# The match is as long as possible

>>> re.search(r'\d\d\d*', 'a123456b')

<re.Match object; span=(1, 7), match='123456'>

>>> re.search(r'\d\d*', 'a123456b')

<re.Match object; span=(1, 7), match='123456'>

# The group repeats in units of two digits

>>> re.search(r'\d(\d\d)*', 'a123456b')

<re.Match object; span=(1, 6), match='12345'>

+ means one or more repetitions (>= 1) of the preceding character, or of a group of characters enclosed in parentheses;

>>> re.search(r'.\d+', 'a123456b')

<re.Match object; span=(0, 7), match='a123456'>

>>> re.search(r'(.\d)+', 'a123456b')

<re.Match object; span=(0, 6), match='a12345'>

? means zero or one occurrence of the preceding character, or of a group of characters enclosed in parentheses;

>>> re.search(r'\s(\d\d)?\s', 'a 12 b')

<re.Match object; span=(1, 5), match=' 12 '>

>>> re.search(r'\s(\d\d)?\s', 'a  b')

<re.Match object; span=(1, 3), match='  '>

>>> re.search(r'\s(\d\d)?\s', 'a 1 b')

# Nothing is printed: the match failed and None was returned

[ ] matches any one of the enclosed characters. Most characters that otherwise need escaping do not need it inside the brackets; for example, [.] matches a literal dot

>>> re.search('[.]', 'abcabc.123456.defdef')

<re.Match object; span=(6, 7), match='.'>

>>> # Each repetition matches any one of the characters inside the brackets

>>> re.search('[cba]+', 'abcabc.123456.defdef')

<re.Match object; span=(0, 6), match='abcabc'>

>>> re.search(r'.[\d]*', 'abcabc.123456.defdef')

<re.Match object; span=(0, 1), match='a'>

>>> re.search(r'\.[\d]*', 'abcabc.123456.defdef')

<re.Match object; span=(6, 13), match='.123456'>

>>> re.search(r'[.\d]+', 'abcabc.123456.defdef')

<re.Match object; span=(6, 14), match='.123456.'>

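{n} means exactly n repetitions of the preceding element; for example, \d{3} matches 3 digits, such as '010';

{n,m} means n to m repetitions of the preceding element; for example, \d{3,8} matches 3 to 8 digits, such as '1234567'.

[0-9a-zA-Z_] matches a single digit, letter or underscore;

[0-9a-zA-Z_]+ matches a string of at least one digit, letter or underscore, such as 'a100', '0_Z' or 'Py3000';

[a-zA-Z_][0-9a-zA-Z_]* matches a letter or underscore followed by any number of digits, letters or underscores, that is, a valid (ASCII) Python variable name;

[a-zA-Z_][0-9a-zA-Z_]{0,19} restricts the length of the variable name more precisely to 1-20 characters (1 leading character plus at most 19 following characters). There must be no space inside the braces: {0, 19} would be read as literal text.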
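
The length-limited pattern can be tried out with a sketch like the following. The helper name is_short_identifier is hypothetical, re.fullmatch is used so that the whole string must fit the pattern, and the check is ASCII-only and does not exclude Python keywords:

```python
import re

# ASCII-only identifier check, 1 to 20 characters long.
# Note that there is no space inside {0,19}.
identifier = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_identifier(name):
    """Return True if the whole of name fits the pattern (hypothetical helper)."""
    return identifier.fullmatch(name) is not None

checks = {name: is_short_identifier(name)
          for name in ['Py3000', '_tmp', '3d', 'a' * 20, 'a' * 21]}
print(checks)

# With a space, '{0, 19}' is matched as literal text rather than as a quantifier.
literal_braces = re.fullmatch(r'a{0, 19}', 'a{0, 19}') is not None
print(literal_braces)
```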

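- denotes a range inside [ ]; a hyphen directly next to one of the brackets is treated as a literal hyphen.
Example: how would you match a number such as '010-12345'? Outside square brackets '-' is not a special character, so \d{3}-\d{3,8} is enough; the escaped form \d{3}\-\d{3,8} is also accepted and means the same thing.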
Ranges of letters or digits can be provided within square brackets, letting a hyphen separate the first and last characters in the range. A hyphen placed after the opening square bracket or before the closing square bracket is interpreted as a literal character:

>>> re.search('[e-h]+', 'ahgfea')

<re.Match object; span=(1, 5), match='hgfe'>

>>> re.search('[B-D]+', 'ABCBDA')

<re.Match object; span=(1, 5), match='BCBD'>

>>> re.search('[4-7]+', '154465571')

<re.Match object; span=(1, 8), match='5446557'>

>>> re.search('[-e-gb]+', 'a--bg--fbe--z')

<re.Match object; span=(1, 12), match='--bg--fbe--'>

>>> re.search('[73-5-]+', '14-34-576')

<re.Match object; span=(1, 8), match='4-34-57'>

^ placed first inside [ ] means any character other than the ones that follow

Within square brackets, a caret placed right after the opening square bracket excludes the characters that follow within the brackets:

>>> re.search('[^4-60]+', '0172853')

<re.Match object; span=(1, 5), match='1728'>

>>> re.search('[^-u-w]+', '-stv')

<re.Match object; span=(1, 3), match='st'>

A|B matches either A or B, so (P|p)ython matches 'Python' or 'python'.

Whereas square brackets surround alternative characters, a vertical bar separates alternative patterns:

>>> re.search('two|three|four', 'one three two')

<re.Match object; span=(4, 9), match='three'>

>>> re.search('|two|three|four', 'one three two')

<re.Match object; span=(0, 0), match=''>

>>> re.search('[1-3]+|[4-6]+', '01234567')

<re.Match object; span=(1, 4), match='123'>

>>> re.search('([1-3]|[4-6])+', '01234567')

<re.Match object; span=(1, 7), match='123456'>

>>> re.search(r'_\d+|[a-z]+_', '_abc_def_234_')

<re.Match object; span=(1, 5), match='abc_'>

>>> re.search(r'_(\d+|[a-z]+)_', '_abc_def_234_')

<re.Match object; span=(0, 5), match='_abc_'>

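^ matches the start of the string (or of each line when re.MULTILINE is set); ^\d means the text must start with a digit.

$ matches the end of the string (or of each line when re.MULTILINE is set); \d$ means the text must end with a digit.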

A caret at the beginning of the pattern string matches the beginning of the data string; a dollar at the end of the pattern string matches the end of the data string:

>>> re.search(r'\d*', 'abc')

<re.Match object; span=(0, 0), match=''>

>>> re.search(r'^\d*', 'abc')

<re.Match object; span=(0, 0), match=''>

>>> re.search(r'\d*$', 'abc')

<re.Match object; span=(3, 3), match=''>

>>> re.search(r'^\d*$', 'abc')

>>> re.search(r'^\s*\d*\s*$', ' 345 ')

<re.Match object; span=(0, 5), match=' 345 '>

To match a literal ^ or $, escape it with a backslash; outside square brackets an unescaped ^ or $ keeps its special meaning wherever it appears in the pattern.

Escaping a dollar at the end of the pattern string, or escaping a caret at the beginning of the pattern string or after the opening square bracket of a character class, makes the dollar and the caret lose the special meaning they have in those contexts and lets them be treated as literal characters:

>>> re.search(r'\$', '$*')

<re.Match object; span=(0, 1), match='$'>

>>> re.search(r'\^', '*^')

<re.Match object; span=(1, 2), match='^'>

>>> re.search(r'[\^]', '^*')

<re.Match object; span=(0, 1), match='^'>

>>> re.search('[^^]', '^*')

<re.Match object; span=(1, 2), match='*'>

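^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from a matching string:

group(0): always the entire matched text;
group(1): the first group;
group(2): the second group, and so on.

Group order: groups are numbered in the order of their opening parentheses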
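
A minimal sketch of extracting the two groups, using the phone-number pattern from these notes (the sample numbers are made up):

```python
import re

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')

whole = m.group(0)    # the entire matched text
area = m.group(1)     # first group: the area code
local = m.group(2)    # second group: the local number
both = m.groups()     # all groups as a tuple

print(whole, area, local, both)

# A string that does not fit the pattern gives None, so check before calling group().
no_match = re.match(r'^(\d{3})-(\d{3,8})$', '010 12345')
print(no_match)
```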

Parentheses allow matched parts to be saved. The object returned by re.search() has a group() method that, without an argument, returns the whole match and, with arguments, returns partial matches; it also has a groups() method that returns all partial matches:

>>> R = re.search(r'((\d+) ((\d+) \d+)) (\d+ (\d+))',

              '  1 23 456 78 9 0 '

             )

>>> R

<re.Match object; span=(2, 15), match='1 23 456 78 9'>

>>> R.group()

'1 23 456 78 9'

>>> R.groups()

('1 23 456', '1', '23 456', '23', '78 9', '9')

>>> [R.group(i) for i in range(len(R.groups()) + 1)]

['1 23 456 78 9', '1 23 456', '1', '23 456', '23', '78 9', '9']

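(?:...) is a non-capturing group (used here around an alternation): its parentheses are not counted among the groups

>>> R = re.search(r'([+-]?(?:0|[1-9]\d*)).*([+-]?(?:0|[1-9]\d*))',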

              ' a = -3014, b = 0 '

             )

>>> R

<re.Match object; span=(5, 17), match='-3014, b = 0'>

>>> R.groups()

('-3014', '0')

.* matches any run of zero or more characters other than the newline character (\n); note that in Python . does match \r.
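
The effect of the newline on .* can be sketched as follows; the sample text is made up, and re.DOTALL is the flag described in the table entry for the dot:

```python
import re

text = 'line one\nline two'

# By default '.' stops at a newline, so '.*' covers only the first line.
first_line = re.search(r'.*', text).group()

# With re.DOTALL, '.' also matches '\n' and '.*' runs to the end of the string.
everything = re.search(r'.*', text, flags=re.DOTALL).group()

# '.*' can also match nothing at all.
empty = re.search(r'.*', '\nabc').group()

print(repr(first_line), repr(everything), repr(empty))
```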

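Pattern	Description
^	Matches the start of the string.
$	Matches the end of the string (or just before a newline at the very end).
.	Matches any character except a newline; when the re.DOTALL flag is given, it matches any character including a newline.
[...]	A set of characters, listed individually: [amk] matches 'a', 'm' or 'k'.
[^...]	Any character not in the set: [^abc] matches any character other than a, b and c.
re*	Matches 0 or more occurrences of the preceding expression.
re+	Matches 1 or more occurrences of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression (greedy; the non-greedy forms are re*?, re+? and re??).
re{n}	Matches exactly n occurrences of the preceding expression. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
re{n,}	Matches n or more occurrences of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}	Matches from n to m occurrences of the preceding expression, greedily.
a|b	Matches a or b.
(re)	Matches the expression inside the parentheses and also defines a group.
(?imx)	Turns on the optional flags i, m or x. In Python the flags apply to the whole pattern and should be placed at its start.
(?-imx)	Turns off the optional flags i, m or x. Python's re accepts this only in the scoped form (?-imx:re).
(?:re)	Like (...), but does not define a group.
(?imx:re)	Uses the optional flags i, m or x inside the parentheses.
(?-imx:re)	Turns off the optional flags i, m or x inside the parentheses.
(?#...)	A comment.
(?=re)	Positive lookahead. Succeeds if re matches at the current position, but consumes no characters: the rest of the pattern is then tried from the same position.
(?!re)	Negative lookahead. The opposite of positive lookahead: succeeds when re does not match at the current position.
(?>re)	Atomic (independent) group: no backtracking into it (supported by re from Python 3.11).
\w	Matches a letter, digit or underscore.
\W	Matches any character other than a letter, digit or underscore.
\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; for ASCII text, equivalent to [0-9].
\D	Matches any non-digit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Python, \Z matches only at the very end; in Perl-style engines it also matches before a final newline.
\z	Matches the end of the string in Perl-style engines; older Python versions do not accept it, so use \Z there.
\G	Matches the position where the previous match ended (Perl-style; not supported by Python's re).
\b	Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline, a tab, and so on.
\1...\9	Matches the contents of the nth group.
\10	Matches the contents of group 10 (the pattern must define at least ten groups). A number that starts with 0, or that is three octal digits long, is read as an octal character code instead.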
----
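
A few of the table entries can be tried out with a short sketch. The sample strings are made up for illustration, and only constructs that Python's re supports are used:

```python
import re

# \b: 'er' must be followed by a word boundary.
boundary = [bool(re.search(r'er\b', w)) for w in ('never', 'verb')]

# \1: a backreference to whatever group 1 matched (here, a repeated word).
repeated = re.search(r'\b(\w+) \1\b', 'it is is fine').group(1)

# (?=...): lookahead checks what follows without consuming it.
before_px = re.findall(r'\d+(?=px)', '10px 20em 30px')

# (?i): inline flag at the start of the pattern, making it case-insensitive.
inline_flag = re.search(r'(?i)python', 'I like PYTHON').group()

print(boundary, repeated, before_px, inline_flag)
```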

Examples:

\d{3}: matches 3 digits

\s+: at least one whitespace character

\d{3,8}: 3 to 8 digits

>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

>>> mySent.split(' ')

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

>>> import re

>>> listOfTokens = re.split(r'\W+', mySent)  # \W+ rather than \W*: since Python 3.7, re.split also splits on empty matches

>>> listOfTokens

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

>>> [tok for tok in listOfTokens if len(tok) > 0]

['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]

['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

>>> [tok.lower() for tok in listOfTokens if len(tok) > 2]

['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']

>>>

Reference: python爬虫（5）--正则表达式 - 小学森也要学编程 - 博客园

Removing quoted content (including the quotation marks); note that .* is used to match arbitrary text

a = 'Sir Nina said: \"I am a Knight,\" but I am not sure'

b = "Sir Nina said: \"I am a Knight,\" but I am not sure"

print(re.sub(r'"(.*)"', '', a),

re.sub(r'"(.*)"', '', b), sep='\n')

Output:

Sir Nina said:  but I am not sure

Sir Nina said:  but I am not sure
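
One caveat worth sketching: .* is greedy, so with more than one quoted passage it removes everything between the first and the last quotation mark. The non-greedy form .*? is an addition not used in the example above, and the sample sentence is made up:

```python
import re

s = 'He said "yes" and then "no" and left'

# Greedy: .* runs from the first quote to the last one.
greedy = re.sub(r'".*"', '', s)

# Non-greedy: .*? stops at the nearest closing quote, so each pair is removed separately.
lazy = re.sub(r'".*?"', '', s)

print(greedy)
print(lazy)
```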

Example from Eric Martin's learning materials of COMP9021

The following function checks that its argument is a string:

that from the beginning: ^
consists of possibly some spaces: ␣*
followed by an opening parenthesis: \(
possibly followed by spaces: ␣*
possibly followed by either + or -: [+-]?
followed by either 0, or a nonzero digit followed by any sequence of digits: 0|[1-9]\d*
possibly followed by spaces: ␣*
followed by a comma: ,
followed by the same sequence again: optional spaces, an optional sign, a number, optional spaces
followed by a closing parenthesis: \)
possibly followed by some spaces: ␣*
all the way to the end: $

Pairs of parentheses surround both numbers to capture them. The alternation 0|[1-9]\d* also needs a surrounding pair of parentheses; ?: makes that pair non-capturing:

>>> def validate_and_extract_payoffs(provided_input):

	pattern = r'^ *\( *([+-]?(?:0|[1-9]\d*)) *,'\
		      r' *([+-]?(?:0|[1-9]\d*)) *\) *$'

	match = re.search(pattern, provided_input)

	if match:

		return (match.groups())

>>> validate_and_extract_payoffs('(+0, -7 )')

('+0', '-7')

>>> validate_and_extract_payoffs('  (-3014,0)  ')

('-3014', '0')
