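Comparing three regular expression options in C++ (C regex, C++ regex, boost regex)

My work requires regular expressions in C++. The three options below are worth considering.

1. C regex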

#include <regex.h>
#include <sys/time.h>
#include <sys/types.h>
#include <cstdio>
#include <iostream>

using namespace std;

const int times = 1000000;

int main(int argc, char** argv)
{
    // Backslashes are doubled so the regex engine receives "\." (a literal dot).
    // The dots in "dp.sina.cn" are left unescaped, as in the original test.
    char pattern[512] = "finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&";
    const size_t nmatch = 10;
    regmatch_t pm[10];   // ignored by regexec because REG_NOSUB is set below
    int z;
    regex_t reg;
    char buf[3][256] = {"finance.sina.cn/google.com/baidu.com.google.sina.cndddddddddddddddddddddda.sdfasdfeoasdfnahsfonadsdf",
                    "3g.com.sina.cn.google.com.dddddddddddddddddddddddddddddddddddddddddddddddddddddbaidu.com.sina.egooooooooo",
                    "http://3g.sina.com.cn/google.baiduchannel=financegogo.sjdfaposif;lasdjf.asdofjas;dfjaiel.sdfaosidfj"};
    printf("input strings:\n");

    timeval end, start;
    gettimeofday(&start, NULL);

    // Compile once; REG_NOSUB means only match / no match is reported.
    z = regcomp(&reg, pattern, REG_EXTENDED | REG_NOSUB);
    if (z != 0)
    {
        char errbuf[256];
        regerror(z, &reg, errbuf, sizeof(errbuf));
        fprintf(stderr, "regcomp failed: %s\n", errbuf);
        return 1;
    }

    for (int i = 0; i < times; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            z = regexec(&reg, buf[j], nmatch, pm, REG_NOTBOL);
/*          if(z==REG_NOMATCH)
                printf("no match\n");
            else
                printf("ok\n");
                */
        }
    }

    gettimeofday(&end, NULL);
    regfree(&reg);   // release the compiled pattern

    // Elapsed time in microseconds.
    long long time = (end.tv_sec - start.tv_sec) * 1000000LL + end.tv_usec - start.tv_usec;
    cout << time / 1000000 << " s and " << time % 1000000 << " us." << endl;
    return 0;
}

Using a regular expression breaks down into three steps:

1. Compile the regular expression

2. Execute the match

3. Free the memory

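First, compile the regular expression

int regcomp(regex_t *preg, const char *regex, int cflags);

regcomp() compiles the regular expression into an internal form that makes the subsequent matching more efficient.

preg: the regex_t structure that stores the compiled regular expression;

regex: pointer to the regular expression string;

cflags: compilation flags

There are four compilation flags, which can be combined with bitwise OR:

REG_EXTENDED: use the more powerful extended regular expression syntax

REG_ICASE: ignore case

REG_NOSUB: do not store match positions; only report whether there was a match

REG_NEWLINE: treat newlines specially, so that '$' matches at the end of each line and '^' matches at the start of each line. Without it, newlines are ordinary characters and the whole text is treated as a single string.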

Next, execute the match

int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);

preg: pointer to the compiled regular expression;

string: the target string;

nmatch: the length of the pmatch array;

pmatch: array of structures that receives the positions of the matched text;

eflags: matching flags

regexec returns 0 on a match and REG_NOMATCH otherwise. There are two matching flags:

REG_NOTBOL: The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec and the beginning of the string should not be interpreted as the beginning of the line.

REG_NOTEOL: The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).
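
When positions are wanted, the pattern is compiled without REG_NOSUB and the `rm_so` / `rm_eo` offsets in `pmatch` are read after a successful `regexec`. The following is a sketch under assumptions: the helper name `capture_groups` and the fixed limit of 10 slots are illustrative choices, not something the original test code does.

```cpp
#include <regex.h>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match followed by each
// parenthesised sub-match. Returns an empty vector when there is no match
// or the pattern does not compile.
std::vector<std::string> capture_groups(const char* pattern, const char* text)
{
    assert(pattern != nullptr && text != nullptr);

    std::vector<std::string> groups;
    regex_t reg;
    // No REG_NOSUB here, otherwise pmatch would not be filled in.
    if (regcomp(&reg, pattern, REG_EXTENDED) != 0)
        return groups;

    const size_t nmatch = 10;
    regmatch_t pm[nmatch];
    if (regexec(&reg, text, nmatch, pm, 0) == 0)
    {
        // Slot 0 is the whole match; slots 1..re_nsub are the sub-expressions.
        for (size_t i = 0; i <= reg.re_nsub && i < nmatch; ++i)
        {
            if (pm[i].rm_so == -1)
                groups.push_back(std::string());  // group did not participate
            else
                groups.push_back(std::string(text + pm[i].rm_so,
                                             static_cast<size_t>(pm[i].rm_eo - pm[i].rm_so)));
        }
    }
    regfree(&reg);
    return groups;
}
```

Calling `capture_groups("([a-z]+)=([0-9]+)", "ch=9&")` yields the whole match "ch=9" followed by the two sub-matches "ch" and "9".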

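Finally, free the memory
void regfree(regex_t *preg);
After you finish with a compiled regular expression, or before reusing the same regex_t to compile a different one, be sure to call this function to release it.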

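Additionally, error handling
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
When regcomp or regexec reports an error, call this function to obtain a string describing it.
errcode: the error code returned by regcomp or regexec.
preg: the regular expression that was passed to regcomp; this value may be NULL.
errbuf: pointer to the memory that receives the error message.
errbuf_size: the length of the buffer. If the error message is longer than this, regerror truncates it, but still returns the length needed for the complete message. This makes it possible to query the required length first by calling regerror with a buffer size of 0.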
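
The original post does not show the length-query idiom, so the following is a sketch of one common way to do it: call regerror once with a zero-size buffer to learn the required length, then again with a buffer of that size. The helper name `compile_error` is hypothetical.

```cpp
#include <regex.h>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns an empty string if `pattern` compiles as an
// extended regular expression, otherwise the complete error message.
std::string compile_error(const char* pattern)
{
    assert(pattern != nullptr);

    regex_t reg;
    int rc = regcomp(&reg, pattern, REG_EXTENDED);
    if (rc == 0)
    {
        regfree(&reg);
        return std::string();
    }

    // First call: zero-size buffer, only to learn the required length
    // (the returned size includes the terminating NUL).
    size_t needed = regerror(rc, &reg, NULL, 0);
    if (needed == 0)
        return std::string("unknown error");

    // Second call: fill a buffer of exactly that size, so nothing is truncated.
    std::vector<char> buf(needed);
    regerror(rc, &reg, buf.data(), buf.size());
    return std::string(buf.data());
}
```

A pattern with an unterminated bracket expression, such as "[a", is a convenient way to trigger a compile error and see the message.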

My own test used only the simplest form of the API, so I called it directly. The speed results come later.

2. C++ regex

/*  written by xingming
 *  time: 2012-10-19 15:51:53
 *  for: test regex
 *  */
#include <regex>
#include <iostream>
#include <stdio.h>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
    // The pattern is a single digit; regex_match below succeeds only if the
    // whole input matches it.
    regex pattern("[[:digit:]]", regex_constants::extended);
    printf("input strings:\n");
    string buf;

    // Read whitespace-separated words until "quit" or end of input.
    while (cin >> buf)
    {
        printf("*******\n%s\n********\n", buf.c_str());
        if (buf == "quit")
        {
            printf("quit just now!\n");
            break;
        }

        match_results<string::const_iterator> result;
        printf("run compare now!  '%s'\n", buf.c_str());
        bool valid = regex_match(buf, result, pattern);
        printf("compare over now!  '%s'\n", buf.c_str());

        if (!valid)
            printf("no match!\n");
        else
            printf("ok\n");
    }
    return 0;
}

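I really don't want to say much about the C++ one. In testing, the pattern 'a' matched, a+ matched too, and [[:w:]] matched any single character, but [[:w:]]+ still matched only one character, as if the + had no effect. So I gave up on the great C++ regex. If an expert knows what I did wrong here, I would sincerely appreciate being told. Thank you.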
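
The question is left open in the post. Two possible explanations, offered as assumptions rather than a confirmed diagnosis: `regex_match` succeeds only when the whole input matches the pattern, and the `<regex>` header shipped with GCC's libstdc++ before version 4.9 was incomplete, which could explain `+` misbehaving in 2012. On a standard library with a complete `<regex>`, the difference between matching and searching can be sketched as follows (the function names are hypothetical).

```cpp
#include <regex>
#include <string>

// regex_match succeeds only if the ENTIRE string matches the pattern.
bool whole_string_is_digits(const std::string& s)
{
    static const std::regex digits("[[:digit:]]+", std::regex_constants::extended);
    return std::regex_match(s, digits);
}

// regex_search succeeds if ANY part of the string matches the pattern.
bool contains_digits(const std::string& s)
{
    static const std::regex digits("[[:digit:]]+", std::regex_constants::extended);
    return std::regex_search(s, digits);
}
```

With these, "12345" passes both checks, while "abc123" passes only `contains_digits`.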

3. boost regex

#include <iostream>
#include <string>
#include <sys/time.h>
#include "boost/regex.hpp"

using namespace std;
using namespace boost;

const int times = 10000000;

int main()
{
    // Names are qualified with boost:: so they cannot be confused with std::regex.
    boost::regex pattern("finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp\\.sina\\.cn/.*ch=9&");
    cout << "input strings:" << endl;

    timeval start, end;
    gettimeofday(&start, NULL);

    string input[] = {"finance.sina.cn/google.com/baidu.com.google.sina.cn",
                      "3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo",
                      "http://3g.sina.com.cn/google.baiduchannel=financegogo"};

    for (int i = 0; i < times; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            //if(input=="quit")
            //  break;
            //cout<<"string:'"<<input<<'\''<<endl;
            boost::cmatch what;
            if (boost::regex_search(input[j].c_str(), what, pattern))
            {
                //  cout<<"OK!"<<endl;
            }
            else
            {
                //  cout<<"error!"<<endl;
            }
        }
    }

    gettimeofday(&end, NULL);

    // Elapsed time in microseconds.
    long long time = (end.tv_sec - start.tv_sec) * 1000000LL + end.tv_usec - start.tv_usec;
    cout << time / 1000000 << " s and " << time % 1000000 << " us." << endl;
    return 0;
}

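boost regex needs little introduction. Ask around how to use regular expressions in C++ and 90% of people will recommend boost regex. It is easy to use, the library is powerful, and plenty of material is available, so I won't elaborate further.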

4. Comparison

Elapsed time in microseconds (us). Each iteration matches all three strings. A "-" marks a run that was not performed.

boost regex
Iterations   Run 1         Run 2         Run 3         Run 4         Run 5         Average
10,000       218,699       -             -             -             -             218,700
100,000      2,186,109     2,194,524     2,188,762     2,186,343     2,192,902     2,191,350
1,000,000    25,606,021    28,633,984    28,956,997    26,912,245    26,909,788    27,669,546
10,000,000   218,126,580   -             -             -             -             218,126,581

C regex
Iterations   Run 1         Run 2         Run 3         Run 4         Run 5         Average
10,000       90,631        -             -             -             -             90,632
100,000      902,658       907,547       915,934       891,250       903,899       900,113
1,000,000    9,030,497     9,016,080     8,939,238     8,953,076     9,041,565     8,983,831
10,000,000   89,609,061    -             -             -             -             89,609,062

Regex (boost regex): finance\\.sina\\.cn|stock1\\.sina\\.cn|3g\\.sina\\.com\\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp\\.sina\\.cn/.*ch=9&

Regex (C regex): finance\.sina\.cn|stock1\.sina\.cn|3g\.sina\.com\.cn.*(channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&

Strings (the same for both):
{"finance.sina.cn/google.com/baidu.com.google.sina.cn",
 "3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo",
 "http://3g.sina.com.cn/google.baiduchannel=financegogo"};

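Summary:

The speed of C regex surprised me. Compared with boost, C regex is nearly 3 times as fast (between about 2.4x and 3.1x by the averages in the table), so it looks like the choice of regex engine is settled!

The table uses the same regex and strings for both engines (the C regex strings in the code above were lengthened afterwards). C regex manages a little over 300,000 matches per second, while boost stays just under 150,000 per second, so the comparison speaks for itself!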
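
The per-second figures can be reproduced from the table, assuming each loop iteration counts as three matches (one per test string). The function name below is hypothetical; it is only a sketch of the arithmetic.

```cpp
#include <cassert>

// Matches per second for a run of `iterations` loop passes over
// `strings_per_iteration` inputs that took `elapsed_us` microseconds.
double matches_per_second(long iterations, int strings_per_iteration, long elapsed_us)
{
    assert(elapsed_us > 0);
    return static_cast<double>(iterations) * strings_per_iteration * 1e6 / elapsed_us;
}
```

Using the 100,000-iteration averages, `matches_per_second(100000, 3, 900113)` gives roughly 333,000 for C regex and `matches_per_second(100000, 3, 2191350)` gives roughly 137,000 for boost regex, in line with the figures quoted above.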

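The speed of C regex had already surprised me, but my next test surprised me even more.

I used to work a lot with .NET regular expressions, so I wrote a .NET version for comparison: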

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace 平常测试
{
    class Program
    {
        static int times = 1000000;

        static void Main(string[] args)
        {
            // (?>...) is an atomic group and (?:...) is a non-capturing group.
            Regex reg = new Regex(@"(?>finance\.sina\.cn|stock1\.sina\.cn|3g\.sina\.com\.cn.*(?:channel=finance|_finance$|ch=stock|/stock/)|dp.sina.cn/.*ch=9&)", RegexOptions.Compiled);
            string[] str = new string[]{@"finance.sina.cn/google.com/baidu.com.google.sina.cn",
                    @"3g.com.sina.cn.google.com.baidu.com.sina.egooooooooo",
                    @"http://3g.sina.com.cn/google.baiduchannel=financegogo"};
            int tt = 0;   // never updated in the original test, so it prints 0

            DateTime start = DateTime.Now;
            for (int i = 0; i < times; ++i)
            {
                for (int j = 0; j < 3; ++j)
                {
                    if (reg.IsMatch(str[j]))
                    {
                        //Console.WriteLine("OK!");
                    }
                    //else
                        //Console.WriteLine("Error!");
                }
            }
            DateTime end = DateTime.Now;

            // Elapsed time in milliseconds.
            Console.WriteLine((end - start).TotalMilliseconds);
            Console.WriteLine(tt);
            Console.ReadKey();
        }
    }
}

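The result: without RegexOptions.Compiled, the speed is about the same as C regex; with it, the speed is about twice that of C regex. I couldn't help feeling a sense of awe for those people at Microsoft.

But then I went back to the description of C regex above and found that I could also pass a compilation flag when declaring the regex. So I added REG_NOSUB as in the code above (it was not there in the earlier tests), and the result got me very excited: C regex reached over 3,000,000 matches per second, nearly 10 times as fast as the original code without REG_NOSUB.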
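
One way to reproduce the REG_NOSUB comparison is a small timing harness that takes the compilation flags as a parameter. This is a sketch under assumptions: the name `time_regexec_us` is hypothetical, it uses `std::chrono` rather than the `gettimeofday` calls of the original test, and the numbers it produces will depend on the machine and C library.

```cpp
#include <regex.h>
#include <cassert>
#include <chrono>

// Hypothetical harness: compiles `pattern` with `cflags`, runs regexec on
// `text` `iterations` times and returns the elapsed microseconds,
// or -1 if the pattern does not compile.
long long time_regexec_us(const char* pattern, int cflags, const char* text, int iterations)
{
    assert(pattern != nullptr && text != nullptr && iterations >= 0);

    regex_t reg;
    if (regcomp(&reg, pattern, cflags) != 0)
        return -1;

    regmatch_t pm[10];
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        regexec(&reg, text, 10, pm, 0);  // pm is ignored if REG_NOSUB was used
    auto end = std::chrono::steady_clock::now();

    regfree(&reg);
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Calling it once with `REG_EXTENDED` and once with `REG_EXTENDED | REG_NOSUB` on the same pattern and string gives the two timings to compare.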

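After that I changed the test strings, doubling their length to about 100 characters each (as shown in the code). The matching speed dropped, but still reached about 1,000,000 matches per second, which certainly meets our current needs.

The result is clear: C regex is the obvious choice.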
