Using regular expressions in Python
- import re
- r1 = re.compile(r'(?im)(?P<name></html>)$')
- content = """
- <HTML>
- boxsuch as 'box' and 'boxes', but not 'inbox'. In other words
- box
- <html>dsafdsafdas </html> </ahtml>
- </html>
- </HTML>
- """
- reobj = re.compile("(?im)(?P<name></.*?html>)$")
- for match in reobj.finditer(content):
-     # match start: match.start()
-     # match end (exclusive): match.end()
-     # matched text: match.group()
-     print("start>>", match.start())
-     print("end>>", match.end())
-     print("span>>", match.span())
-     print("match.group()>>", match.group())
-     print("*" * 20)
- if r1.match(content): print('match succeeds')
- else: print('match fails')  # prints: match fails
- if r1.search(content): print('search succeeds')  # prints: search succeeds
- else: print('search fails')
- print(r1.flags)
- print(r1.groupindex)
- print(r1.pattern)
- l = r1.split(content)
- print("l>>", l)
- for item in r1.findall(content):
-     print("item>>", item)
- s = r1.sub("aa", content)
- print("s>>", s)
- s_subn, s_sub_count = r1.subn("aaaaaaaaaaaa", content)
- print("s_subn>>", s_subn)
- print("s_sub_count>>", s_sub_count)
9.7 Regular Expressions and the re Module
A regular expression is a string that represents a pattern. With regular expression functionality, you can compare that pattern to another string and see if any part of the string matches the pattern. The re module supplies all of Python's regular expression functionality. The compile function builds a regular expression object from a pattern string and optional flags. The methods of a regular expression object look for matches of the regular expression in a string and/or perform substitutions. Module re also exposes functions equivalent to a regular expression's methods, but with the regular expression's pattern string as their first argument.
Regular expressions can be difficult to master, and this book does not purport to teach them; I cover only the ways in which you can use them in Python. For general coverage of regular expressions, I recommend the book Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly). Friedl's book offers thorough coverage of regular expressions at both the tutorial and advanced levels.
9.7.1 Pattern-String Syntax
The pattern string representing a regular expression follows a specific syntax:
Since regular expression patterns often contain backslashes, you generally want to specify them using raw-string syntax (covered in Chapter 4). Pattern elements (e.g., r'\t', which is equivalent to the non-raw string literal '\\t') do match the corresponding special characters (e.g., the tab character '\t'). Therefore, you can use raw-string syntax even when you do need a literal match for some such special character. Table 9-2 lists the special elements in regular expression pattern syntax. The exact meanings of some pattern elements change when you use optional flags, together with the pattern string, to build the regular expression object. The optional flags are covered later in this chapter. Table 9-2. Regular expression pattern syntax
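As a small sketch of the raw-string point above (the variable names and the sample string are illustrative, not from the original text), both spellings below hand the re module the same two-character pattern, and that pattern matches a real tab character:

```python
import re

raw_pattern = r'\t'      # two characters: a backslash and the letter t
plain_pattern = '\\t'    # the same two characters, without raw-string syntax

# Both literals denote the same pattern string.
same_pattern = (raw_pattern == plain_pattern)

# The re module interprets backslash-t as "match a tab character".
tab_match = re.search(raw_pattern, 'name\tvalue')
tab_index = tab_match.start()

print(same_pattern, tab_index)
```

Writing every pattern literal as a raw string keeps the pattern text identical to what the re module sees, which is why the chapter recommends it even when no backslash is present.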
9.7.2 Common Regular Expression Idioms
'.*' as a substring of a regular expression's pattern string means "any number of repetitions (zero or more) of any character." In other words, '.*' matches any substring of a target string, including the empty substring. '.+' is similar, but it matches only a non-empty substring. For example:
'pre.*post'
matches a string containing a substring 'pre' followed by a later substring 'post', even if the latter is adjacent to the former (e.g., it matches both 'prepost' and 'pre23post'). On the other hand:
'pre.+post'
matches only if 'pre' and 'post' are not adjacent (e.g., it matches 'pre23post' but does not match 'prepost'). Both patterns also match strings that continue after the 'post'. To constrain a pattern to match only strings that end with 'post', end the pattern with \Z. For example:
r'pre.*post\Z'
matches 'prepost', but not 'preposterous'. Note that we need to express the pattern with raw-string syntax (or escape the backslash \ by doubling it into \\), as it contains a backslash. Using raw-string syntax for all regular expression pattern literals is good practice in Python, as it's the simplest way to ensure you'll never fail to escape a backslash.
Another frequently used element in regular expression patterns is \b, which matches a word boundary. If you want to match the word 'his' only as a whole word and not its occurrences as a substring in such words as 'this' and 'history', the regular expression pattern is:
r'\bhis\b'
with word boundaries both before and after. To match the beginning of any word starting with 'her', such as 'her' itself but also 'hermetic', but not words that just contain 'her' elsewhere, such as 'ether', use:
r'\bher'
with a word boundary before, but not after, the relevant string.
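The idioms described so far can be tried out with a short sketch like the following. The sample sentence and variable names are made up for illustration; only the patterns come from the text.

```python
import re

# '.*' allows an empty gap between 'pre' and 'post'; '.+' needs at least one character.
star_adjacent = re.match(r'pre.*post', 'prepost') is not None
plus_adjacent = re.match(r'pre.+post', 'prepost') is not None
plus_separated = re.match(r'pre.+post', 'pre23post') is not None

# Without \Z the pattern also accepts strings that continue after 'post'.
unanchored = re.match(r'pre.*post', 'preposterous') is not None
anchored = re.match(r'pre.*post\Z', 'preposterous') is not None

# \b restricts where a match may begin or end relative to word boundaries.
text = 'this is his history; her hermetic ether'
whole_his = re.findall(r'\bhis\b', text)   # only the standalone word 'his'
her_starts = re.findall(r'\bher', text)    # 'her' at the start of a word

print(star_adjacent, plus_adjacent, plus_separated, unanchored, anchored)
print(whole_his, her_starts)
```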
To match the end of any word ending with 'its', such as 'its' itself but also 'fits', but not words that contain 'its' elsewhere, such as 'itsy', use:
r'its\b'
with a word boundary after, but not before, the relevant string. To match whole words thus constrained, rather than just their beginning or end, add a pattern element \w* to match zero or more word characters. For example, to match any full word starting with 'her', use:
r'\bher\w*'
And to match any full word ending with 'its', use:
r'\w*its\b'
9.7.3 Sets of Characters
You denote sets of characters in a pattern by listing the characters within brackets ([ ]). In addition to listing single characters, you can denote a range by giving the first and last characters of the range separated by a hyphen (-). The last character of the range is included in the set, which is different from other Python ranges. Within a set, special characters stand for themselves, except \, ], and -, which you must escape (by preceding them with a backslash) when their position is such that, unescaped, they would form part of the set's syntax. In a set, you can also denote a class of characters by escaped-letter notation, such as \d or \S. However, \b in a set denotes a backspace character, not a word boundary. If the first character in the set's pattern, right after the [, is a caret (^), the set is complemented. In other words, the set matches any character except those that follow ^ in the set pattern notation.
A frequent use of character sets is to match a word, using a definition of what characters can make up a word that differs from \w's default (letters, digits, and the underscore). To match a word of one or more characters, each of which can be a letter, an apostrophe, or a hyphen, but not a digit (e.g., 'Finnegan-O'Hara'), use:
r"[a-zA-Z'\-]+"
It's not strictly necessary to escape the hyphen with a backslash in this case, since its position makes it syntactically unambiguous.
However, the backslash makes the pattern somewhat more readable, by visually distinguishing the hyphen that you want to have as a character in the set from those used to denote ranges.
9.7.4 Alternatives
A vertical bar (|) in a regular expression pattern, used to specify alternatives, has low precedence. Unless parentheses change the grouping, | applies to the whole pattern on either side, up to the start or end of the string, or to another |. A pattern can be made up of any number of subpatterns joined by |. To match such a regular expression, the first subpattern is tried first, and if it matches, the others are skipped. If the first subpattern does not match, the second subpattern is tried, and so on. | is neither greedy nor non-greedy, as it doesn't take into consideration the length of the match.
If you have a list L of words, a regular expression pattern that matches any of the words is:
'|'.join([r'\b%s\b' % word for word in L])
If the items of L can be more-general strings, not just words, you need to escape each of them with function re.escape, covered later in this chapter, and you probably don't want the \b word boundary markers on either side. In this case, use the regular expression pattern:
'|'.join(map(re.escape,L))
9.7.5 Groups
A regular expression can contain any number of groups, from none up to 99 (any number is allowed, but only the first 99 groups are fully supported). Parentheses in a pattern string indicate a group. Element (?P<id>...) also indicates a group, and in addition gives the group a name, id, that can be any Python identifier. All groups, named and unnamed, are numbered from left to right, 1 to 99, with group number 0 indicating the whole regular expression.
For any match of the regular expression with a string, each group matches a substring (possibly an empty one). When the regular expression uses |, some of the groups may not match any substring, although the regular expression as a whole does match the string.
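A minimal sketch of the two join idioms and of group numbering follows. The word lists, sample strings, and the group name first are hypothetical, chosen only to make the behavior visible.

```python
import re

# Alternation over whole words: \b on both sides keeps 'dogs' from matching.
words = ['cat', 'dog', 'bird']
word_pattern = '|'.join([r'\b%s\b' % word for word in words])
found_words = re.findall(word_pattern, 'a dog chased the cat; dogs bark')

# Alternation over arbitrary strings: re.escape neutralizes '.' and '+'.
literals = ['a.b', 'c+d']
literal_pattern = '|'.join(map(re.escape, literals))
found_literals = re.findall(literal_pattern, 'a.b axb c+d ccd')

# Named and unnamed groups share one left-to-right numbering; 0 is the whole match.
pair = re.compile(r'(?P<first>\w+) (\w+)')
m = pair.match('hello world')
whole, by_number, by_name, second = m.group(0), m.group(1), m.group('first'), m.group(2)

print(found_words, found_literals)
print(whole, by_number, by_name, second)
```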
When a group doesn't match any substring, we say that the group does not participate in the match. An empty string '' is used to represent the matching substring for a group that does not participate in a match, except where otherwise indicated later in this chapter. For example:
r'(.+)\1+\Z'
matches a string made up of two or more repetitions of any non-empty substring. The (.+) part of the pattern matches any non-empty substring (any character, one or more times), and defines a group thanks to the parentheses. The \1+ part of the pattern matches one or more repetitions of the group, and the \Z anchors the match to end-of-string.
9.7.6 Optional Flags
A regular expression pattern element with one or more of the letters "iLmsux" between (? and ) lets you set regular expression options within the regular expression's pattern, rather than by the flags argument to function compile of module re. Options apply to the whole regular expression, no matter where the options element occurs in the pattern. For clarity, options should always be at the start of the pattern. Placement at the start is mandatory if x is among the options, since x changes the way Python parses the pattern. Recent Python versions go further and reject an options element that appears anywhere other than the start of the pattern.
Using the explicit flags argument is more readable than placing an options element within the pattern. The flags argument to function compile is a coded integer, built by bitwise ORing (with Python's bitwise OR operator, |) one or more of the following attributes of module re. Each attribute has both a short name (one uppercase letter), for convenience, and a long name (an uppercase multiletter identifier), which is more readable and thus normally preferable:
For example, here are three ways to define equivalent regular expressions with function compile, covered later in this chapter. Each of these regular expressions matches the word "hello" in any mix of upper- and lowercase letters:
import re
The third approach is clearly the most readable, and thus the most maintainable, even though it is slightly more verbose. Note that the raw-string form is not necessary here, since the patterns do not include backslashes. However, using raw strings is still innocuous, and is the recommended style for clarity.
Option re.VERBOSE (or re.X) lets you make patterns more readable and understandable by appropriate use of whitespace and comments. Complicated and verbose regular expression patterns are generally best represented by strings that take up more than one line, and therefore you normally want to use the triple-quoted raw-string format for such pattern strings. For example:
repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
The two patterns defined in this example are equivalent, but the second one is made somewhat more readable by the comments and the free use of whitespace to group portions of the pattern in logical ways.
9.7.7 Match Versus Search
So far, we've been using regular expressions to match strings. For example, the regular expression with pattern r'box' matches strings such as 'box' and 'boxes', but not 'inbox'. In other words, a regular expression match can be considered as implicitly anchored at the start of the target string, as if the regular expression's pattern started with \A. Often, you're interested in locating possible matches for a regular expression anywhere in the string, without any anchoring (e.g., find the r'box' match inside such strings as 'inbox', as well as in 'box' and 'boxes'). In this case, the Python term for the operation is a search, as opposed to a match.
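The listings for the passages above did not survive in this copy, so what follows is a reconstruction under assumptions rather than the original code. It sketches three equivalent ways to ask for case-insensitive matching, a verbose spelling of repat_num1 (the name repat_num2 and its comments are my own), and the difference between matching and searching with r'box'.

```python
import re

# Three equivalent ways to match "hello" in any mix of cases.
r_inline = re.compile(r'(?i)hello')             # option inside the pattern
r_short = re.compile(r'hello', re.I)            # short flag name
r_long = re.compile(r'hello', re.IGNORECASE)    # long flag name
all_match = all(r.match('HeLLo') is not None for r in (r_inline, r_short, r_long))

# A one-line pattern and a verbose, commented spelling intended to be equivalent.
repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
repat_num2 = r'''(?x)        # verbose mode: whitespace and comments are ignored
    ( 0 [0-7]*        |      # octal: a leading 0, then zero or more octal digits
      0x [\da-fA-F]+  |      # hex: 0x, then one or more hex digits
      [1-9] \d*       )      # decimal: a leading non-zero digit, then more digits
    L? \Z                    # optional trailing L, then end of string
    '''
samples = ['0', '017', '0x1F', '42', '42L', '08', 'x12']
agree = all(
    (re.match(repat_num1, s) is None) == (re.match(repat_num2, s) is None)
    for s in samples
)

# match is anchored at the start of the string; search looks anywhere.
rebox = re.compile(r'box')
match_boxes = rebox.match('boxes') is not None
match_inbox = rebox.match('inbox') is not None
search_inbox = rebox.search('inbox') is not None

print(all_match, agree, match_boxes, match_inbox, search_inbox)
```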
For such searches, you use the search method of a regular expression object, while the match method only deals with matching from the start. For example:
import re
9.7.8 Anchoring at String Start and End
The pattern elements ensuring that a regular expression search (or match) is anchored at string start and string end are \A and \Z respectively. More traditionally, elements ^ for start and $ for end are also used in similar roles. ^ is the same as \A, and $ is the same as \Z, for regular expression objects that are not multiline (i.e., that do not contain pattern element (?m) and are not compiled with the flag re.M or re.MULTILINE). For a multiline regular expression object, however, ^ anchors at the start of any line (i.e., either at the start of the whole string or at the position right after a newline character \n). Similarly, with a multiline regular expression, $ anchors at the end of any line (i.e., either at the end of the whole string or at the position right before \n). On the other hand, \A and \Z anchor at the start and end of the whole string whether the regular expression object is multiline or not. For example, here's how to check if a file has any lines that end with digits:
import re
A pattern of r'\d\n' would be almost equivalent, but in that case the search would fail if the very last character of the file were a digit not followed by a terminating end-of-line character. With the example above, the search succeeds if a digit is at the very end of the file's contents, as well as in the more usual case where a digit is followed by an end-of-line character.
9.7.9 Regular Expression Objects
A regular expression object r has the following read-only attributes detailing how r was built (by function compile of module re, covered later in this chapter):
These attributes make it easy to get back from a compiled regular expression object to its pattern string and flags, so you never have to store those separately. A regular expression object r also supplies methods to locate matches for r's regular expression within a string, as well as to perform substitutions on such matches. Matches are generally represented by special objects, covered in the later Section 9.7.10.
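As a sketch, the attributes flags, groupindex, and pattern (the same three printed by the snippet at the top of this post) can be inspected on a multiline pattern like the one discussed for lines ending in digits. The sample text and the group name last are assumptions made for illustration.

```python
import re

# A multiline pattern: $ anchors at the end of any line, not just the whole string.
digit_at_eol = re.compile(r'(?P<last>\d)$', re.MULTILINE)
plain = re.compile(r'\d$')   # same idea without MULTILINE

text = 'abc\nline 7\nxyz'
multiline_hit = digit_at_eol.search(text) is not None   # the 7 ends the second line
plain_hit = plain.search(text) is not None              # the string itself ends with 'z'

# Read-only attributes that describe how the object was built.
pattern_string = digit_at_eol.pattern
last_group_number = digit_at_eol.groupindex['last']
is_multiline = bool(digit_at_eol.flags & re.MULTILINE)

print(multiline_hit, plain_hit, pattern_string, last_group_number, is_multiline)
```

The flags value is tested with a bitwise AND rather than compared to a fixed integer, since the integer may include other bits that the re module sets by default.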
findall
When r has no groups, findall returns a list of strings, each a substring of s that is a non-overlapping match with r. For example, here's how to print out all words in a file, one per line:
import re
When r has one group, findall also returns a list of strings, but each is the substring of s matching r's group. For example, if you want to print only words that are followed by whitespace (not punctuation), you need to change only one statement in the previous example:
reword = re.compile(r'(\w+)\s')
When r has n groups (where n is greater than 1), findall returns a list of tuples, one per non-overlapping match with r. Each tuple has n items, one per group of r, the substring of s matching the group. For example, here's how to print the first and last word of each line that has at least two words:
import re
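The three cases can be sketched on an in-memory string instead of a file. The sample text is invented, and the two-group pattern is one plausible way to capture the first and last word of a line, not necessarily the original listing.

```python
import re

text = 'alpha beta, gamma\nsingle\none two three\n'

# No groups: each item is a whole match.
reword = re.compile(r'\w+')
all_words = reword.findall(text)

# One group: each item is the text of that group (words followed by whitespace).
reword_ws = re.compile(r'(\w+)\s')
spaced_words = reword_ws.findall(text)

# Two groups: each item is a tuple (first word, last word) for lines with 2+ words.
first_last = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$', re.MULTILINE)
pairs = first_last.findall(text)

for word in all_words:
    print(word)
print(spaced_words)
print(pairs)
```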
match
Returns an appropriate match object when a substring of s, starting at index start and not reaching as far as index end, matches r. Otherwise, match returns None. Note that match is implicitly anchored at the starting position start in s. To search for a match with r through s, from start onwards, call r.search, not r.match. For example, here's how to print all lines in a file that start with digits:
import re
search
Returns an appropriate match object for the leftmost substring of s, starting not before index start and not reaching as far as index end, that matches r. When no such substring exists, search returns None. For example, to print all lines containing digits, one simple approach is as follows:
import re
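A sketch contrasting the two methods on a list of lines (a stand-in for a file; the lines and names are hypothetical), including the optional start index:

```python
import re

lines = ['42 is the answer', 'no digits here', 'room 101', '7up']
digits = re.compile(r'\d+')

# match: only lines that begin with digits.
lines_starting = [line for line in lines if digits.match(line)]

# search: lines that contain digits anywhere.
lines_containing = [line for line in lines if digits.search(line)]

# With a start index, match is anchored at that position instead of index 0.
anchored_at_0 = digits.match('room 101')
anchored_at_5 = digits.match('room 101', 5)
matched_text = anchored_at_5.group()

print(lines_starting)
print(lines_containing)
print(anchored_at_0, matched_text)
```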
split
Returns a list L of the splits of s by r (i.e., the substrings of s that are separated by non-overlapping, non-empty matches with r). For example, to eliminate all occurrences of substring 'hello' from a string, in any mix of lowercase and uppercase letters, one way is:
import re
When r has n groups, n more items are interleaved in L between each pair of splits. Each of the n extra items is the substring of s matching r's corresponding group in that match, or None if that group did not participate in the match. For example, here's one way to remove whitespace only when it occurs between a colon and a digit:
import re
If maxsplit is greater than 0, at most maxsplit splits are in L, each followed by n items as above, while the trailing substring of s after maxsplit matches of r, if any, is L's last item. For example, to remove only the first occurrence of substring 'hello' rather than all of them, change the last statement in the first example above to:
astring = ''.join(rehello.split(astring, 1))
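The three uses of split described above might look like this. The name rehello and the maxsplit statement come from the text; the sample strings and the name re_col_ws_dig are assumptions.

```python
import re

rehello = re.compile(r'hello', re.IGNORECASE)
astring = 'Hello world, hello again, HELLO!'

# Splitting on every 'hello' and rejoining removes all occurrences.
pieces = rehello.split(astring)
no_hello = ''.join(pieces)

# maxsplit=1 splits only at the first occurrence.
first_only = ''.join(rehello.split(astring, 1))

# With groups, the group texts are interleaved in the result, so rejoining
# keeps the colon and the digit but drops the whitespace between them.
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
interleaved = re_col_ws_dig.split('a: 1 b:  2 c: x')
tightened = ''.join(interleaved)

print(pieces)
print(repr(no_hello), repr(first_only))
print(interleaved, repr(tightened))
```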
sub
Returns a copy of s where non-overlapping matches with r are replaced by repl, which can be either a string or a callable object, such as a function. An empty match is replaced only when not adjacent to the previous match. When count is greater than 0, only the first count matches of r within s are replaced. When count equals 0, all matches of r within s are replaced. For example, here's another way to remove only the first occurrence of substring 'hello' in any mix of cases:
import re
Without the final 1 argument to method sub, this example would remove all occurrences of 'hello'.
When repl is a callable object, repl must accept a single argument (a match object) and return a string to use as the replacement for the match. In this case, sub calls repl, with a suitable match-object argument, for each match with r that sub is replacing. For example, to uppercase all occurrences of words starting with 'h' and ending with 'o' in any mix of cases, you can use the following:
import re
Method sub is a good way to get a callback to a callable you supply for every non-overlapping match of r in s, without an explicit loop, even when you don't need to perform any substitution. The following example shows this by using the sub method to build a function that works just like method findall for a regular expression without groups:
import re
The example needs Python 2.2, not just because it uses lexically nested scopes, but because in Python 2.2 re tolerates repl returning None and treats it as if it returned '', while in Python 2.1 re was more pedantic and insisted on repl returning a string.
When repl is a string, sub uses repl itself as the replacement, except that it expands back references. A back reference is a substring of repl of the form \g<id>, where id is the name of a group in r (as established by syntax (?P<id>) in r's pattern string), or \dd, where dd is one or two digits, taken as a group number.
Each back reference, whether named or numbered, is replaced with the substring of s matching the group of r that the back reference indicates. For example, here's how to enclose every word in braces: import re
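Since the listings for sub are missing here, the following is only a sketch of the behaviors described: a count argument, a callable repl, and numbered and named back references. The helper name to_upper, the group name word, and the sample strings are hypothetical.

```python
import re

# count=1 replaces only the first match.
rehello = re.compile(r'hello', re.IGNORECASE)
first_removed = rehello.sub('', 'Hello there, hello again', 1)

# A callable repl receives a match object and returns the replacement string.
h_to_o = re.compile(r'\bh\w*o\b', re.IGNORECASE)

def to_upper(mo):
    """Return the matched word in uppercase."""
    return mo.group(0).upper()

uppercased = h_to_o.sub(to_upper, 'hello halo hippo help')

# String repl with back references: enclose every word in braces.
by_number = re.sub(r'(\w+)', r'{\1}', 'one two')
by_name = re.sub(r'(?P<word>\w+)', r'{\g<word>}', 'one two')

print(repr(first_removed), uppercased, by_number, by_name)
```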
subn
subn is the same as sub, except that subn returns a pair (new_string, n) where n is the number of substitutions that subn has performed. For example, to count the number of occurrences of substring 'hello' in any mix of cases, one way is:
import re
9.7.10 Match Objects
Match objects are created and returned by methods match and search of a regular expression object. They are also implicitly created by methods sub and subn when argument repl is callable, since in that case a suitable match object is passed as the actual argument on each call to repl. A match object m supplies the following attributes detailing how m was created:
A match object m also supplies several methods.
These methods return the delimiting indices, within m.string, of the substring matching the group identified by groupid, where groupid can be a group number or name. When the matching substring is m.string[i:j], m.start returns i, m.end returns j, and m.span returns (i, j). When the group did not participate in the match, i and j are -1.
expand
Returns a copy of s where escape sequences and back references are replaced in the same way as for method r.sub, covered in the previous section.
group
When called with a single argument groupid (a group number or name), group returns the substring matching the group identified by groupid, or None if that group did not participate in the match. The common idiom m.group( ), also spelled m.group(0), returns the whole matched substring, since group number 0 implicitly means the whole regular expression. When group is called with multiple arguments, each argument must be a group number or name. group then returns a tuple with one item per argument, the substring matching the corresponding group, or None if that group did not participate in the match.
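A short sketch of start, end, span, and group on a match object, including a group that does not participate. The pattern, group names, and input strings are invented for the example.

```python
import re

setting = re.compile(r'(?P<key>\w+)=(?P<val>\d+)?')

# Both groups participate.
m = setting.search('set width=42;')
whole_span = m.span()                 # indices of the whole match within m.string
val_span = m.span('val')              # indices of the named group
key_and_val = m.group('key', 'val')   # several arguments give a tuple
whole_text = m.group()                # same as m.group(0)

# The optional 'val' group does not participate here.
m2 = setting.search('height=;')
missing_val = m2.group('val')
missing_span = m2.span('val')

print(m.start(), m.end(), whole_span, val_span, key_and_val, whole_text)
print(missing_val, missing_span)
```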
groups
Returns a tuple with one item per group in r. Each item is the substring matching the corresponding group, or default if that group did not participate in the match.
groupdict
Returns a dictionary whose keys are the names of all named groups in r. The value for each name is the substring matching the corresponding group, or default if that group did not participate in the match.
9.7.11 Functions of Module re
The re module supplies the attributes listed in the earlier Section 9.7.6. It also provides a function that corresponds to each method of a regular expression object (findall, match, search, split, sub, and subn), each with an additional first argument, a pattern string that the function implicitly compiles into a regular expression object. It's generally preferable to compile pattern strings into regular expression objects explicitly and call the regular expression object's methods, but sometimes, for a one-off use of a regular expression pattern, calling functions of module re can be slightly handier. For example, to count the number of occurrences of substring 'hello' in any mix of cases, one function-based way is:
import re
In cases such as this one, regular expression options (here, for example, case insensitivity) can be encoded as regular expression pattern elements (here, (?i)). When this text was written, the functions of module re did not accept a flags argument, so that was the only way; current Python versions also accept an optional flags argument in these functions. Module re also supplies error, the class of exceptions raised upon errors (generally, errors in the syntax of a pattern string), and two additional functions.
compile
Creates and returns a regular expression object, parsing string pattern as per the syntax covered in Section 9.7.1, and using integer flags as in Section 9.7.6, both earlier in this chapter.
escape
Returns a copy of string s where each non-alphanumeric character is escaped (i.e., preceded by a backslash \). This is handy when you need to match string s literally as part (or all) of a regular expression pattern string.
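To round off the module-level functions, here is a sketch (with invented sample strings) of counting matches through re.subn, matching a string literally with re.escape, and catching re.error for a malformed pattern:

```python
import re

# Function-based counting: the option travels inside the pattern as (?i).
astring = 'Hello, hello, HELLO world'
new_string, hello_count = re.subn(r'(?i)hello', '', astring)

# re.escape lets a string with special characters be matched literally.
literal = 'a+b (v1.0)'
haystack = 'formula: a+b (v1.0) ok'
escaped_hit = re.search(re.escape(literal), haystack)
unescaped_hit = re.search(literal, haystack)   # '+', '(' and '.' keep their special meaning

# re.error is raised for a pattern with a syntax error.
try:
    re.compile(r'(unclosed')
    bad_pattern_rejected = False
except re.error:
    bad_pattern_rejected = True

print(repr(new_string), hello_count)
print(escaped_hit.group(), unescaped_hit, bad_pattern_rejected)
```

The exact text returned by re.escape differs between Python versions, so the sketch checks only that the escaped pattern matches the original string.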