Regular expressions are powerful, but with great power comes great responsibility. Because of the way most regex engines work, it is surprisingly easy to construct a regular expression that takes a very long time to run. In my previous post on regex performance, I discussed why, and under what conditions, certain regexes take forever to match their input. If my last post answered the question of why regexes are sometimes slow, this post aims to answer the question of what to do about it, as well as show how much faster certain techniques can make your regexes.

Like my previous post, this post assumes you are somewhat familiar with regexes. Check out this excellent site if you need an intro, a refresher, or clarification on some of the techniques discussed below.

Finally, keep in mind that different regex engines work in different ways and incorporate different optimizations. The following tricks will likely help the performance of your regexes. To avoid needlessly obfuscating your regexes with performance enhancements that make no real difference, I urge you to benchmark your regular expressions with a set of your expected input. Don’t forget to include matching and non-matching input if you expect to have both. Try each of the techniques and see for yourself which one offers the best performance boost.

Without further ado, here are five regular expression techniques that can dramatically reduce processing time:

  1. Character classes
  2. Possessive quantifiers (and atomic groups)
  3. Lazy quantifiers
  4. Anchors and boundaries
  5. Optimizing regex order

Character Classes

This is the most important thing to keep in mind when crafting performant regexes. Character classes specify which characters you are trying, or not trying, to match. The more specific you can be here, the better. You should almost always aim to replace the . in your .* with something more specific. A .* will invariably shoot to the end of your line (or even your whole input, if you have dot-all enabled) and will then backtrack. When you use a specific character class instead, you control how many characters the * causes the regex engine to consume, giving you the power to stop the rampant backtracking.

To demonstrate this, let’s consider the two regular expressions:

  1. field1=(.*) field2=(.*) field3=(.*) field4=(.*).*

  2. field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*

I ran a (quick and dirty) benchmark against the following inputs:

  1. field1=cat field2=dog field3=parrot field4=mouse field5=hamster

  2. field1=cat dog parrot mouse

  3. field1=cat field2=dog field3=parrot field5=mouse

This benchmark and all the other benchmarks in this post were conducted in the same way: each regex was fed each input 1,000,000 times, and the average overall time was measured. These are the numbers I got for this particular experiment:

                           Regex 1 (the .* one)   Regex 2 (the character class one)   Performance improvement
Input 1 (matching)         3606ms                 736ms                               79.6%
Input 2 (not matching)     591ms                  225ms                               61.9%
Input 3 (almost matching)  2520ms                 597ms                               76.3%

Here we can see that even with matching input, the vague .* regex takes far longer. In all cases, the specific regex performed far better. This will almost always be true, no matter what your regex is and no matter what your input is. Specificity is the number one way to improve the performance of your regexes. Just say that over and over again. Like a mantra.
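A benchmark along these lines can be sketched with Python's timeit module. The patterns and inputs come from the example above; the variable names are mine, the iteration count is reduced for illustration, and absolute numbers will vary by machine and regex engine:

```python
import re
import timeit

# The two regexes from the example: vague .* versus a specific character class.
vague = re.compile(r"field1=(.*) field2=(.*) field3=(.*) field4=(.*).*")
specific = re.compile(r"field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*")

inputs = [
    "field1=cat field2=dog field3=parrot field4=mouse field5=hamster",  # matching
    "field1=cat dog parrot mouse",                                      # not matching
    "field1=cat field2=dog field3=parrot field5=mouse",                 # almost matching
]

for pattern, label in [(vague, "vague"), (specific, "specific")]:
    for line in inputs:
        # 10,000 iterations instead of 1,000,000 to keep the sketch quick.
        elapsed = timeit.timeit(lambda: pattern.match(line), number=10_000)
        print(f"{label:8} {line[:30]:30} {elapsed * 1000:.1f} ms")
```

Don't read too much into a single run; repeat the measurement and compare relative, not absolute, numbers.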

Possessive Quantifiers (and Atomic Groups)

Possessive quantifiers (denoted with a +) and atomic groups (?>…) both do the same thing: once they consume text, they never give it back. This can be nice for performance because it cuts down on the backtracking that regexes are wont to do so much of. Generally speaking, though, you may be hard pressed to find a use case where atomic groups are a real game changer for performance. This is because the main performance heavy hitter is the infamous .*, which causes lots of backtracking. If you change the .* to a .*+ to make it possessive, you eliminate all backtracking, but you also can't match anything else after that point, since the possessive quantifier never gives back any text. Thus, your regex already has to be fairly specific in order to even use atomic groups, so the performance boost will be incremental. Nonetheless, the possessive quantifier can still be surprisingly helpful. Consider these two regexes to match an IPv4 address:

  1. ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*

  2. ^(\d{1,3}+\.\d{1,3}+\.\d{1,3}+\.\d{1,3}+).*

on the following two inputs:

  1. 107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /" 200 2144

  2. 9.21.2015 non matching text that kind of matches

When matching the non-matching text, the regex without the possessive quantifier consumes the first few characters and, on failing to find a match, backtracks through them one by one, hoping a match can still be found. With the possessive quantifier, as soon as the regex fails to match, it stops looking and doesn't bother backtracking.

When running my benchmark on these regexes, I got the following results:

[benchmark results chart]
How much of a boost in performance you'll get from this is use-case specific, but if you can use a possessive quantifier (or atomic group), then you should, as it can pretty much only help.

Lazy Quantifiers

The lazy quantifier is a powerful performance booster. In many naive regexes, greedy quantifiers (*’s) can be safely replaced by lazy quantifiers (*?’s), giving the regex a performance kick without changing the result.

Consider the following example. When given the input

# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_examined: 1 Rows_affected: 0 Rows_read: 4505295

and the greedy regex:

.* Lock_time: (\d\.\d+) .*

the regex engine would first shoot to the end of the string. It then backtracks until it gets to Lock_time, where it can consume the rest of the input. The alternative lazy regex

.*? Lock_time: (\d\.\d+) .*

would consume starting from the beginning of the string until it reaches Lock_time, at which point it could proceed to match the rest of the string. If the Lock_time field appears toward the beginning of the string, the lazy quantifier should be used. If the Lock_time field appears toward the end, it might be appropriate to use the greedy quantifier.
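On this input the two variants capture the same value; only the direction of the search differs. A quick check in Python (variable names are mine):

```python
import re

line = ("# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_examined: 1 "
        "Rows_affected: 0 Rows_read: 4505295")

greedy = re.compile(r".* Lock_time: (\d\.\d+) .*")
lazy = re.compile(r".*? Lock_time: (\d\.\d+) .*")

# The greedy .* runs to the end of the line and backtracks to Lock_time;
# the lazy .*? advances from the left until Lock_time is found.
print(greedy.match(line).group(1))  # 0.81
print(lazy.match(line).group(1))    # 0.81
```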

Some regex performance guides will advise you to be wary when using the lazy quantifier because it does its own kind of backtracking. It consumes one character at a time and then attempts to match the rest of the regex. If that fails, it "backtracks", moves the cursor one character over, and repeats. This can sometimes make the lazy star no faster, or even slower, than the greedy star. I saw this slight performance degradation in only one of my benchmarks.

I ran my benchmark on the following three inputs. The first input matches toward the beginning, the second input matches toward the end, and the third input doesn’t match at all.

  1. # Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Rows_examined: 1

  2. # Query_time: 0.304 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Lock_time: 0.81 Rows_examined: 1

  3. # Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Rows_examined: 1

I matched against the two regexes mentioned above:

  1. .*Lock_time: (\d\.\d+).*

  2. .*?Lock_time: (\d\.\d+).*

[benchmark results chart]
The performance characteristics change when you add more .*s. Consider these regexes which match two fields:

  3. .*Lock_time: (\d\.\d+).*Rows_examined: (\d+).*

  4. .*?Lock_time: (\d\.\d+).*?Rows_examined: (\d+).*
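Both two-field variants extract the same pair of values from the first input; a quick sanity check (variable names are mine, input from the list above):

```python
import re

line = ("# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_read: 4505295 "
        "Rows_affected: 0 Rows_examined: 1")

greedy = re.compile(r".*Lock_time: (\d\.\d+).*Rows_examined: (\d+).*")
lazy = re.compile(r".*?Lock_time: (\d\.\d+).*?Rows_examined: (\d+).*")

# Both capture the same groups; they differ only in how much backtracking
# (greedy) or incremental advancing (lazy) the engine performs.
print(greedy.match(line).groups())  # ('0.81', '1')
print(lazy.match(line).groups())    # ('0.81', '1')
```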

I ran the benchmark against the same inputs and got these results:

[benchmark results chart]
Given these results, I’d say it’s generally a good idea to use the lazy quantifier wherever possible, but it is still important to benchmark just to be sure, as different regex engines optimize in different ways.

Anchors and Boundaries

Anchors and boundaries tell the regex engine that you intend the cursor to be in a particular place in the string. The most common anchors are ^ and $, indicating the beginning and end of the line (as opposed to \A and \Z which match the beginning and end of the input). Common boundaries include the word boundary \b and non-word boundary \B. For example, \bhttp\b matches http but not https. These techniques are useful when crafting regexes that are as specific as possible.
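The \bhttp\b example can be verified directly (the sample strings are mine):

```python
import re

# \b asserts a transition between a word character and a non-word character,
# so "http" followed by "s" (another word character) does not match.
assert re.search(r"\bhttp\b", "http://example.com") is not None
assert re.search(r"\bhttp\b", "https://example.com") is None

# \A anchors to the very start of the whole input, even in multiline mode,
# where ^ also matches after every line break.
assert re.search(r"^world", "hello\nworld", re.MULTILINE) is not None
assert re.search(r"\Aworld", "hello\nworld", re.MULTILINE) is None
```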

The following example is simple, but it should serve as a reminder to use anchors whenever possible, considering the impact they have on performance.

Here are two regexes to find an IPv4 address.

  1. \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

  2. ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The second regex is specific about the IP address appearing at the beginning of the string.

We’re searching for the regex in input that looks like this:

107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200 2144

Non-matching input would look something like this:

[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" HTTP/1.1 200 82726 "-" "ELB-HealthChecker/1.0"

Here are the results of my benchmark:

[benchmark results chart]
Regex 2 of course runs much faster on non-matching input because it throws out the non-matching input almost immediately. In short, if you can use an anchor or a boundary, then you should because they can pretty much only help the performance of your regex.
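The behavior is easy to reproduce: on the non-matching line, the unanchored pattern must attempt a match at every offset before failing, while the anchored one fails once, at offset 0. A quick check with the log lines from the example above (variable names are mine):

```python
import re

unanchored = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
anchored = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

matching = ('107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] '
            '"GET /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200 2144')
non_matching = ('[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" '
                'HTTP/1.1 200 82726 "-" "ELB-HealthChecker/1.0"')

# Both find the IP at the start of the matching line.
print(unanchored.search(matching).group())  # 107.21.20.1
print(anchored.search(matching).group())    # 107.21.20.1

# Both fail on the non-matching line, but the anchored version fails
# immediately instead of retrying at every position.
print(unanchored.search(non_matching))  # None
print(anchored.search(non_matching))    # None
```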

Order Matters

Here I am talking about the ordering of alternations—when a regex has two or more valid options separated by a | character. Order will also matter if you have multiple lookaheads or lookbehinds. The idea is to order each option in the way that will minimize the amount of work the regex engine will need to do. For alternations, you want the most common option to be first, followed by the rarer options. If the rarer options are first, the regex engine will waste time checking those before checking the more common options which are likelier to succeed. For multiple lookaheads and lookbehinds, you want the rarest to be first, since all lookaheads and lookbehinds must match for the regex to proceed. If you start with the one that is least likely to match, the regex will fail faster.

This one is a bit of a micro-optimization, but it can give you a decent boost depending on your use case, and it can’t possibly hurt because the two expressions are equivalent. I ran a benchmark on the following two regexes:

  1. .*(?<='field5' : '|"field5" : ")([^'"]*).*

  2. .*(?<="field5" : "|'field5' : ')([^"']*).*

On the following input:

{"field1" : "wool", "field2" : "silk", "field3" : "linen", "field4" : "merino", "field5" : "alpaca"}

I'm searching for the JSON field "field5" and checking whether the JSON is formatted with double quotes or single quotes. Since double quotes are far more common in JSON, the double-quote option should come first. The benchmark showed the following performance difference:

[benchmark results chart]
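The ordering can be tried out directly. Note that Python's re module only allows fixed-width lookbehinds; both branches here are twelve characters, so the alternation is accepted (variable names and the single-quoted sample are mine):

```python
import re

# Double-quote branch first, since double quotes dominate real-world JSON.
pattern = re.compile(r'''.*(?<="field5" : "|'field5' : ')([^"']*).*''')

double_quoted = ('{"field1" : "wool", "field2" : "silk", "field3" : "linen", '
                 '"field4" : "merino", "field5" : "alpaca"}')
single_quoted = "{'field5' : 'alpaca'}"

# The lookbehind matches whichever quoting style the input uses.
print(pattern.match(double_quoted).group(1))  # alpaca
print(pattern.match(single_quoted).group(1))  # alpaca
```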

Concluding Thoughts

Regex performance is an interesting topic. For most people, regexes are whipped out only in special circumstances to solve a very specific type of problem. Normally, it doesn't matter if a regex is a bit slower than it could be. Many people who develop very latency-sensitive applications avoid regexes altogether, as they are notoriously slow. If a regex really is the only tool for the job but it must be blazingly fast, your options are to use a regex engine backed by the Thompson NFA algorithm (and, consequently, to say goodbye to backreferences) or to live and breathe the time-saving regex techniques in this post. Lastly, as is always the case when optimizing performance, benchmarking is key. Regex performance depends heavily on the input and the regex. Your benchmark should use the same regex engine and should measure against input similar to what you expect to match in your production application.

I hope that these posts have made you wiser and that your regexes are now much defter. You are blessed now with the knowledge of what makes a good regex and what makes a bad regex. Equipped with these new instruments and knowledge, you are ready to craft your own powerful, yet efficient regular expressions. Happy regexing!
