Regular expressions are powerful, but with great power comes great responsibility. Because of the way most regex engines work, it is surprisingly easy to construct a regular expression that can take a very long time to run. In my previous post on regex performance, I discussed why and under what conditions certain regexes take forever to match their input. If my last post answered the question of why regexes are sometimes slow, this post aims to answer the question of what to do about it, as well as show how much faster certain techniques can make your regexes.

Like my previous post, this post assumes you are somewhat familiar with regexes. Check out this excellent site if you need an intro, a refresher, or clarification on some of the techniques discussed below.

Finally, keep in mind that different regex engines work in different ways and incorporate different optimizations. The following tricks will likely help the performance of your regexes. To avoid needlessly obfuscating your regexes with performance enhancements that make no real difference, I urge you to benchmark your regular expressions with a set of your expected input. Don’t forget to include matching and non-matching input if you expect to have both. Try each of the techniques and see for yourself which one offers the best performance boost.
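As a starting point for that kind of benchmarking, here is a minimal sketch. It assumes Python's re and timeit modules, which is my choice for illustration and not necessarily the engine behind the numbers in this post; the function name benchmark and the repeats parameter are made up for the example.

```python
import re
import timeit


def benchmark(pattern, inputs, repeats=100_000):
    """Return {input: total seconds} for matching each input `repeats` times."""
    compiled = re.compile(pattern)  # compile once, outside the timed loop
    timings = {}
    for text in inputs:
        timings[text] = timeit.timeit(lambda: compiled.match(text), number=repeats)
    return timings


# Include both matching and non-matching input, as recommended above.
sample_inputs = [
    "field1=cat field2=dog",   # matches
    "field1=cat dog",          # does not match
]
for text, seconds in benchmark(r"field1=([^ ]*) field2=([^ ]*)", sample_inputs, repeats=1000).items():
    print(f"{seconds:.4f}s  {text}")
```

Absolute timings will differ from engine to engine and machine to machine, so compare candidates against each other on the same setup instead of reading too much into any single number.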

Without further ado, here are five regular expression techniques that can dramatically reduce processing time:

  1. Character classes
  2. Possessive quantifiers (and atomic groups)
  3. Lazy quantifiers
  4. Anchors and boundaries
  5. Optimizing regex order

Character Classes

This is the most important thing to keep in mind when crafting performant regexes. Character classes specify what characters you are trying, or not trying, to match. The more specific you can be here, the better. You should almost always aim to replace the . in your .*s with something more specific. The .* will invariably shoot to the end of your line (or even your whole input if you have dot-all mode enabled) and will then backtrack. When using a specific character class, you have control over how many characters the * will cause the regex engine to consume, giving you the power to stop the rampant backtracking.

To demonstrate this, let’s consider the two regular expressions:

1. field1=(.*) field2=(.*) field3=(.*) field4=(.*).*

2. field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*

I ran a (quick and dirty) benchmark against the following inputs:

1. field1=cat field2=dog field3=parrot field4=mouse field5=hamster

2. field1=cat dog parrot mouse

3. field1=cat field2=dog field3=parrot field5=mouse

This benchmark and all the other benchmarks in this post were conducted in the same way. Each regex was fed each input 1,000,000 times and the overall time was measured on average. These are the numbers I got for this particular experiment:

                            Regex 1 (the .* one)    Regex 2 (the character class one)    Performance improvement
Input 1 (matching)          3606ms                  736ms                                79.6%
Input 2 (not matching)      591ms                   225ms                                61.9%
Input 3 (almost matching)   2520ms                  597ms                                76.3%

Here we can see that even with matching input, the vague dot-star regex takes way longer. In all cases, the specific regex performed way better. This will almost always be the case no matter what your regex is and no matter what your input is. Specificity is the number one way to improve the performance of your regexes. Just say that over and over again. Like a mantra.
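To see what the two regexes actually capture, here is a small sketch that runs them over the three benchmark inputs. It assumes Python's re module (my choice for illustration); the helper name captures is made up for the example.

```python
import re

DOT_STAR = re.compile(r"field1=(.*) field2=(.*) field3=(.*) field4=(.*).*")
CHAR_CLASS = re.compile(r"field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*")

INPUTS = [
    "field1=cat field2=dog field3=parrot field4=mouse field5=hamster",  # matching
    "field1=cat dog parrot mouse",                                      # not matching
    "field1=cat field2=dog field3=parrot field5=mouse",                 # almost matching
]


def captures(regex, text):
    """Return the captured groups, or None when the regex does not match."""
    m = regex.match(text)
    return m.groups() if m else None


for text in INPUTS:
    print(captures(DOT_STAR, text), captures(CHAR_CLASS, text))
```

On the matching input, the last (.*) group swallows everything up to the end of the line, so it captures "mouse field5=hamster", while ([^ ]*) stops at the space and captures just "mouse". Being specific can therefore tighten what you capture as well as how long it takes.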

Possessive Quantifiers (and Atomic Groups)

Possessive quantifiers (denoted by a + appended to another quantifier, as in *+ or {1,3}+) and atomic groups (?>…) both do the same thing: once they consume text, they will never let it go. This can be nice for performance reasons because it cuts down on the backtracking that regexes are wont to do so much of. Generally speaking, though, you may be hard pressed to find a use case where atomic groups will be a real game changer in terms of performance. This is because the main performance heavy hitter is the infamous .* that causes lots of backtracking. If you changed the .* to a .*+ to make it possessive, you eliminate all backtracking, but you can’t match anything else after that point since the possessive quantifier never gives back any text. Thus, your regex already has to be fairly specific in order to even use atomic groups; therefore, your performance boost will be incremental. Nonetheless, the possessive quantifier can still be surprisingly helpful. Consider these two regexes to match an IPv4 address:

1. ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*

2. ^(\d{1,3}+\.\d{1,3}+\.\d{1,3}+\.\d{1,3}+).*

on the following two inputs:

1. 107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /" 200 2144

2. 9.21.2015 non matching text that kind of matches

When matching the non-matching text, the regex without the possessive quantifier consumes the first few characters and, on not seeing a match, it backtracks all the characters one by one hoping to still find a match. With the possessive quantifier, as soon as the regex doesn’t find a match, it stops looking and doesn’t bother backtracking.
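The two regexes above can be compared with a short sketch like the one below. It assumes Python's re module, which only accepts possessive syntax from version 3.11 onward, so the sketch falls back gracefully on older interpreters; the names PLAIN, POSSESSIVE and first_group are made up for the example.

```python
import re

PLAIN = re.compile(r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*")

try:
    # Assumption: Python 3.11+; older versions reject {1,3}+ with re.error.
    POSSESSIVE = re.compile(r"^(\d{1,3}+\.\d{1,3}+\.\d{1,3}+\.\d{1,3}+).*")
except re.error:
    POSSESSIVE = None  # possessive quantifiers unavailable on this interpreter

LOG_LINE = '107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /" 200 2144'
NEAR_MISS = "9.21.2015 non matching text that kind of matches"


def first_group(regex, text):
    """Return the captured address, or None when there is no match."""
    m = regex.match(text)
    return m.group(1) if m else None


for name, regex in (("plain", PLAIN), ("possessive", POSSESSIVE)):
    if regex is None:
        print(name, "is not supported by this interpreter")
        continue
    print(name, first_group(regex, LOG_LINE), first_group(regex, NEAR_MISS))
```

Both versions accept and reject the same inputs here. The possessive one simply gives up sooner on the near miss, because \d{1,3}+ refuses to hand back digits it has already consumed.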

When running my benchmark on these regexes, I got the following results:



How much of a boost in performance you’ll get from this is use-case specific, but if you can use the possessive quantifier, then you should, as it can pretty much only help.

Lazy Quantifiers

The lazy quantifier is a powerful performance booster. In many naive regexes, greedy quantifiers (*’s) can be safely replaced by lazy quantifiers (*?’s), giving the regex a performance kick without changing the result.

Consider the following example. When given the input

# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_examined: 1 Rows_affected: 0 Rows_read: 4505295

and the greedy regex:

.* Lock_time: (\d\.\d+) .*

the regex engine would first shoot to the end of the string. It then backtracks until it gets to Lock_time, where it can consume the rest of the input. The alternative lazy regex

.*? Lock_time: (\d\.\d+) .*

would consume starting from the beginning of the string until it reaches Lock_time, at which point it could proceed to match the rest of the string. If the Lock_time field appears toward the beginning of the string, the lazy quantifier should be used. If the Lock_time field appears toward the end, it might be appropriate to use the greedy quantifier.
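Here is a small sketch of the greedy and lazy versions side by side, assuming Python's re module (my choice for illustration). The second input, REPEATED, is a made-up line in which Lock_time appears twice; it is not from the benchmark.

```python
import re

GREEDY = re.compile(r".* Lock_time: (\d\.\d+) .*")
LAZY = re.compile(r".*? Lock_time: (\d\.\d+) .*")

LINE = ("# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 "
        "Rows_examined: 1 Rows_affected: 0 Rows_read: 4505295")

# Hypothetical input with the field repeated, to show where the two differ.
REPEATED = "# Lock_time: 0.10 Lock_time: 0.20 end"

for text in (LINE, REPEATED):
    greedy_value = GREEDY.match(text).group(1)
    lazy_value = LAZY.match(text).group(1)
    print(greedy_value, lazy_value)
```

With a single Lock_time field the two regexes capture the same value and only the amount of work differs. When the field is repeated, the greedy version ends up on the last occurrence and the lazy version on the first, so the swap is only safe when the text after the quantifier can match in just one place.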

Some regex performance guides will advise you to be wary when using the lazy quantifier because it does its own kind of backtracking. It consumes one character at a time and then attempts to match the rest of the regex. If that fails, it “backtracks” and moves the cursor one character over and repeats. This can sometimes make the lazy star not at all faster or even slower than the greedy star. I saw this slight performance degradation in only one of my benchmarks.

I ran my benchmark on the following three inputs. The first input matches toward the beginning, the second input matches toward the end, and the third input doesn’t match at all.

1. # Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Rows_examined: 1

2. # Query_time: 0.304 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Lock_time: 0.81 Rows_examined: 1

3. # Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Rows_examined: 1

I matched against the two regexes mentioned above:

1. .*Lock_time: (\d\.\d+).*

2. .*?Lock_time: (\d\.\d+).*



The performance characteristics change when you add more .*s. Consider these regexes which match two fields:

3. .*Lock_time: (\d\.\d+).*Rows_examined: (\d+).*

4. .*?Lock_time: (\d\.\d+).*?Rows_examined: (\d+).*

I ran the benchmark against the same inputs and got these results:



Given these results, I’d say it’s generally a good idea to use the lazy quantifier wherever possible, but it is still important to benchmark just to be sure, as different regex engines optimize in different ways.

Anchors and Boundaries

Anchors and boundaries tell the regex engine that you intend the cursor to be in a particular place in the string. The most common anchors are ^ and $, indicating the beginning and end of the line (as opposed to \A and \Z which match the beginning and end of the input). Common boundaries include the word boundary \b and non-word boundary \B. For example, \bhttp\b matches http but not https. These techniques are useful when crafting regexes that are as specific as possible.
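A quick sketch of these anchors and boundaries, assuming Python's re module (my choice for illustration). Note that in Python, ^ only matches at the start of every line when the MULTILINE flag is set, while \A matches at the start of the input regardless. The example.com URLs are placeholders.

```python
import re

HTTP_WORD = re.compile(r"\bhttp\b")

# The word boundary after "http" holds before ":" but not before "s".
print(bool(HTTP_WORD.search("http://example.com")))    # True
print(bool(HTTP_WORD.search("https://example.com")))   # False

TEXT = "one\ntwo"

# With MULTILINE, ^ anchors to the start of each line...
line_starts = re.findall(r"^\w+", TEXT, re.MULTILINE)
# ...while \A anchors to the start of the whole input only.
input_start = re.findall(r"\A\w+", TEXT, re.MULTILINE)
print(line_starts, input_start)
```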

The following is a pretty simple example, but it should serve as a reminder to use anchors whenever possible, considering the impact they have on performance.

Here are two regexes to find an IPv4 address.

1. \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

2. ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The second regex is specific about the IP address appearing at the beginning of the string.

We’re searching for the regex in input that looks like this:

107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200 2144

Non-matching input would look something like this:

[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" HTTP/1.1 200 82726 "-" "ELB-HealthChecker/1.0"

Here are the results of my benchmark:



Regex 2 of course runs much faster on non-matching input because it throws out the non-matching input almost immediately. In short, if you can use an anchor or a boundary, then you should because they can pretty much only help the performance of your regex.
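For reference, here is a sketch of the two IPv4 regexes run against the two log lines, assuming Python's re module (my choice for illustration). Both are used with search, so the unanchored regex is free to retry at every position in the line, while the anchored one can only succeed at the very start.

```python
import re

UNANCHORED = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
ANCHORED = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

MATCHING = ('107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] '
            '"GET /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200 2144')
NON_MATCHING = ('[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" '
                'HTTP/1.1 200 82726 "-" "ELB-HealthChecker/1.0"')


def found(regex, text):
    """Return the matched text, or None when nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None


for regex in (UNANCHORED, ANCHORED):
    print(found(regex, MATCHING), found(regex, NON_MATCHING))
```

The results are identical for both regexes on these inputs; the difference is only in how many starting positions the engine has to try before giving up on the second line.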

Order Matters

Here I am talking about the ordering of alternations—when a regex has two or more valid options separated by a | character. Order will also matter if you have multiple lookaheads or lookbehinds. The idea is to order each option in the way that will minimize the amount of work the regex engine will need to do. For alternations, you want the most common option to be first, followed by the rarer options. If the rarer options are first, the regex engine will waste time checking those before checking the more common options which are likelier to succeed. For multiple lookaheads and lookbehinds, you want the rarest to be first, since all lookaheads and lookbehinds must match for the regex to proceed. If you start with the one that is least likely to match, the regex will fail faster.

This one is a bit of a micro-optimization, but it can give you a decent boost depending on your use case, and it can’t possibly hurt because the two expressions are equivalent. I ran a benchmark on the following two regexes:

1. .*(?<='field5' : '|"field5" : ")([^'"]*).*

2. .*(?<="field5" : "|'field5' : ')([^"']*).*

On the following input:

{"field1" : "wool", "field2" : "silk", "field3" : "linen", "field4" : "merino", "field5" : "alpaca"}

I’m searching for a JSON field “field5” and I check to see if the JSON is formatted with double quotes or single quotes. Since double quotes are far more common in JSON, the option to check double quotes should be first. The benchmark showed the following performance difference:
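To check that the two orderings really do extract the same value, here is a small sketch assuming Python's re module (my choice for illustration). Python requires every alternative inside a lookbehind to have the same width, which happens to hold here since both options are 12 characters long. The single-quoted line is a made-up variant of the input.

```python
import re

SINGLE_FIRST = re.compile(r""".*(?<='field5' : '|"field5" : ")([^'"]*).*""")
DOUBLE_FIRST = re.compile(r""".*(?<="field5" : "|'field5' : ')([^"']*).*""")

DOUBLE_QUOTED = ('{"field1" : "wool", "field2" : "silk", "field3" : "linen", '
                 '"field4" : "merino", "field5" : "alpaca"}')

# Hypothetical single-quoted variant, to exercise the other alternative.
SINGLE_QUOTED = "{'field1' : 'wool', 'field5' : 'alpaca'}"

for text in (DOUBLE_QUOTED, SINGLE_QUOTED):
    print(SINGLE_FIRST.match(text).group(1), DOUBLE_FIRST.match(text).group(1))
```

Reordering is safe in this case because the two alternatives can never both match at the same position. When alternatives overlap (for example a|ab), a backtracking engine takes the first one that works, so changing the order can change the match and not just the speed.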

Concluding Thoughts

Regex performance is an interesting topic. For most people, regexes are whipped out only in special circumstances to solve a very specific type of problem. Normally, it doesn’t matter if a regex is a bit slower than it could be. Many people who develop very latency-sensitive applications avoid regexes as they are notoriously slow. If a regex is really the only tool to get the job done but it must be blazingly fast, your options are to use a regex engine that is backed by the Thompson NFA algorithm (and, consequently, to say goodbye to backreferences) or to live and breathe the time-saving regex techniques in this post. Lastly, as is always the case when optimizing performance, benchmarking is key. Regex performance depends heavily on the input and the regex. Your benchmark should use the same regex engine and should measure against input that is similar to what you expect to match in your production application.

I hope that these posts have made you wiser and that your regexes are now much defter. You are blessed now with the knowledge of what makes a good regex and what makes a bad regex. Equipped with these new instruments and knowledge, you are ready to craft your own powerful, yet efficient regular expressions. Happy regexing!
