完整格式链接：https://blog.imakiseki.cf/2022/03/07/techdev/python-cpp-string-find-perf-test/

背景

最近在备战一场算法竞赛，语言误选了 Python ，无奈只能着手对常见场景进行语言迁移。而字符串查找的场景在算法竞赛中时有出现。本文即对此场景在 Python 和竞赛常用语言 C++ 下的速度进行对比，并提供相关参数和运行结果供他人参考。

参数

硬件和操作系统

                   -`                    root@<hostname>

                  .o+`                   ------------

                 `ooo/                   OS: Arch Linux ARM aarch64

                `+oooo:                  Host: Raspberry Pi 4 Model B

               `+oooooo:                 Kernel: 5.16.12-1-aarch64-ARCH

               -+oooooo+:                Uptime: 3 hours, 32 mins

             `/:-:++oooo+:               Packages: 378 (pacman)

            `/++++/+++++++:              Shell: zsh 5.8.1

           `/++++++++++++++:             Terminal: /dev/pts/0

          `/+++ooooooooooooo/`           CPU: (4) @ 1.500GHz

         ./ooosssso++osssssso+`          Memory: 102MiB / 7797MiB

        .oossssso-````/ossssss+`

       -osssssso.      :ssssssso.

      :osssssss/        osssso+++.

     /ossssssss/        +ssssooo/-

   `/ossssso+/:-        -:/+osssso+-

  `+sso+:-`                 `.-/+oso:

 `++:.                           `-/+/

 .`                                 `/

编译环境和解释环境

Python
- 解释器：Python 3.10.2 (main, Jan 23 2022, 21:20:14) [GCC 10.2.0] on linux
- 交互环境：IPython 8.0.1
C++
- 编译器：g++ (GCC) 11.2.0
- 编译命令：g++ test.cpp -Wall -O2 -g -std=c++11 -o test

场景

本次实测设置两个场景：场景 1 的源串字符分布使用伪随机数生成器生成，表示字符串查找的平均情况；场景 2 的源串可连续分割成 20,000 个长度为 50 的字符片段，其中第 15,001 个即为模式串，形如“ab…b”（1 个“a”，49 个 “b”），其余的字符片段形如“ab…c”（1 个“a”，48 个“b”，1 个“c”）。

项目	场景 1：平均情况	场景 2：较坏情况
字符集	小写字母	`abc`
字符分布	`random.choice`	有较强规律性
源串长度	1,000,000	1,000,000
模式串长度	1,000	50
模式串出现位置	250,000、500,000、750,000	750,000
模式串出现次数	1	1

测试方法

本次实测中，Python 语言使用内置类型 str 的 .find() 成员函数，C++ 语言分别使用 string 类的 .find() 成员函数、strstr 标准库函数和用户实现的 KMP 算法。

测试对象	核心代码
Python	`src.find(pat)`
C++ - `test.cpp`	`src.find(pat)`
C++ - `test_strstr.cpp`	`strstr(src, pat)`
C++ - `test_kmp.cpp`	`KMP(src, pat)`

源代码

生成源串和模式串

import random

# 场景 1：

# 源串

s = "".join(chr(random.choice(range(ord("a"), ord("z") + 1))) for _ in range(1000000))

# 模式串列表，三个元素各对应一个模式串

p = [s[250000:251000], s[500000:501000], s[750000:751000]]

# 场景 2：

# 模式串

p = 'a' + 'b' * 49

# 其他字符片段

_s = "a" + "b" * 48 + "c"

# 源串

s = _s * 15000 + p + _s * 4999

# 存储到文件，便于 C++ 程序获取

with open('source.in', 'w') as f:

    f.write(s)

with open('pattern.in', 'w') as f:

    f.write(p[0])

测试代码

Python

In []: %timeit s.find(p[0])

C++ - `test.cpp`

#include <chrono>

#include <iostream>

#include <cstring>

#include <fstream>

#define LOOP_COUNT (1000)

using namespace std;

using std::chrono::high_resolution_clock;

using std::chrono::duration_cast;

using std::chrono::duration;

using std::chrono::milliseconds;

double test(string s, string p, size_t* pos_ptr) {

    auto t1 = high_resolution_clock::now();

    *pos_ptr = s.find(p);

    auto t2 = high_resolution_clock::now();

    duration<double, milli> ms_double = t2 - t1;

    return ms_double.count();

}

int main() {

    string s, p;

    size_t pos;

    ifstream srcfile("source.in");

    ifstream patfile("pattern.in");

    srcfile >> s;

    patfile >> p;

    double tot_time = 0;

    for (int i = 0; i < LOOP_COUNT; ++i) {

        tot_time += test(s, p, &pos);

    }

    cout << "Loop count:            " << LOOP_COUNT << endl;

    cout << "Source string length:  " << s.length() << endl;

    cout << "Pattern string length: " << p.length() << endl;

    cout << "Search result:         " << pos << endl;

    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;

}

C++ - `test_strstr.cpp`

#include <chrono>

#include <iostream>

#include <cstring>

#include <fstream>

#define LOOP_COUNT (1000)

using namespace std;

using std::chrono::high_resolution_clock;

using std::chrono::duration_cast;

using std::chrono::duration;

using std::chrono::milliseconds;

char s[1000005], p[1005], *pos=NULL;

double test(char* s, char* p, char** pos_ptr) {

    auto t1 = high_resolution_clock::now();

    *pos_ptr = strstr(s, p);

    auto t2 = high_resolution_clock::now();

    duration<double, milli> ms_double = t2 - t1;

    return ms_double.count();

}

int main() {

    ifstream srcfile("source.in");

    ifstream patfile("pattern.in");

    srcfile >> s;

    patfile >> p;

    double tot_time = 0;

    for (int i = 0; i < LOOP_COUNT; ++i) {

        tot_time += test(s, p, &pos);

    }

    cout << "Loop count:            " << LOOP_COUNT << endl;

    cout << "Source string length:  " << strlen(s) << endl;

    cout << "Pattern string length: " << strlen(p) << endl;

    cout << "Search result:         " << pos - s << endl;

    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;

}

C++ - `test_kmp.cpp`

#include <chrono>

#include <iostream>

#include <cstring>

#include <fstream>

#include <cstdlib>

#define LOOP_COUNT (1000)

using namespace std;

using std::chrono::high_resolution_clock;

using std::chrono::duration_cast;

using std::chrono::duration;

using std::chrono::milliseconds;

int dp[1005];

int KMP(string s, string p) {

    int m = s.length(), n = p.length();

    if (n == 0) return 0;

    if (m < n) return -1;

    memset(dp, 0, sizeof(int) * (n+1));

    for (int i = 1; i < n; ++i) {

        int j = dp[i+1];

        while (j > 0 && p[j] != p[i]) j = dp[j];

        if (j > 0 || p[j] == p[i]) dp[i+1] = j + 1;

    }

    for (int i = 0, j = 0; i < m; ++i)

        if (s[i] == p[j]) { if (++j == n) return i - j + 1; }

        else if (j > 0) {

            j = dp[j];

            --i;

        }

    return -1;

}

double test(string s, string p, int* pos_ptr) {

    auto t1 = high_resolution_clock::now();

    *pos_ptr = KMP(s, p);

    auto t2 = high_resolution_clock::now();

    duration<double, milli> ms_double = t2 - t1;

    return ms_double.count();

}

int main() {

    string s, p;

    int pos;

    ifstream srcfile("source.in");

    ifstream patfile("pattern.in");

    srcfile >> s;

    patfile >> p;

    double tot_time = 0;

    for (int i = 0; i < LOOP_COUNT; ++i) {

        tot_time += test(s, p, &pos);

    }

    cout << "Loop count:            " << LOOP_COUNT << endl;

    cout << "Source string length:  " << s.length() << endl;

    cout << "Pattern string length: " << p.length() << endl;

    cout << "Search result:         " << pos << endl;

    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;

}

结果

IPython 的 %timeit 魔法命令可以输出代码多次执行的平均时间和标准差，在此取平均时间。C++ 的代码对每个模式串固定运行 1,000 次后取平均时间。

以下时间若无特别说明，均以微秒为单位，保留到整数位。

场景	模式串出现位置	Python	C++ - `test.cpp`	C++ - `test_strstr.cpp`	C++ - `test_kmp.cpp`
场景 1	250,000	105	523	155	2564
场景 1	500,000	183	1053	274	3711
场景 1	750,000	291	1589	447	4900
场景 2	750,000	2630*	618	353	3565

* 原输出为“2.63 ms”。IPython 的 %timeit 输出的均值保留 3 位有效数字，由于此时间已超过 1 毫秒，微秒位被舍弃。此处仍以微秒作单位，数值记为“2630”。

局限性

本次实测时使用的设备硬件上劣于算法竞赛中的标准配置机器，实测结果中的“绝对数值”参考性较低。

总结

根据上表中的结果，在给定环境和相关参数条件下，场景 1 中 Python 的运行时间大约为 C++ 中 string::find 的五分之一，与 std:strstr 接近；而在场景 2 中 Python 的运行时间明显增长，但 C++ 的前两种测试方法的运行时间与先前接近甚至更短。四次测试中，C++ 的用户实现的 KMP 算法运行时间均较长，长于同条件下 Python 的情况。

Python 中的内置类型 str 的快速查找（.find()）和计数（.count()）算法基于 Boyer-Moore 算法和 Horspool 算法的混合，其中后者是前者的简化，而前者与 Knuth-Morris-Pratt 算法有关。

有关 C++ 的 string::find 比 std::strstr 运行时间长的相关情况，参见 Bug 66414 - string::find ten times slower than strstr。

值得关注的是：C++ 中自行实现的 KMP 算法的运行时间竟然远长于 C++ 标准库甚至 Python 中的算法。这也类似于常说的“自己设计汇编代码运行效率低于编译器”的情况。Stack Overflow 的一个问题 strstr faster than algorithms? 下有人回答如下：

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

KMP 算法并非是所有线性复杂度算法中最快的。在不同的环境（软硬件、测试数据等）下，KMP 与其变种乃至其他线性复杂度算法，孰优孰劣都无法判断。编译器在设计时考虑到诸多可能的因素，尽可能使不同环境下都能有相对较优的策略来得到结果。因而，在保证结果正确的情况下，与其根据算法原理自行编写，不如直接使用标准库中提供的函数。

同时本次实测也在运行时间角度再次印证 Python 并不适合在算法竞赛中取得高成绩的说法。

参考

【实测】Python 和 C++ 下字符串查找的速度对比的更多相关文章

Python复杂场景下字符串处理相关问题与解决技巧
1.如何拆分含有多种分隔符的字符串¶ ''' 实际案例: 我们要把某个字符串依据分隔符号拆分不同的字段,该字符串包含多种不同的分隔符,例如: s=’ab;cd|efg|hi,jkl|mn\topq ...
Python实现Linux下文件查找
import os, sys def search(curpath, s): L = os.listdir(curpath) #列出当前目录下所有文件 for subpath in L: #遍历当前目 ...
python多继承下的查找顺序-MRO原则演变与C3算法
在python历史版本中的演变史 python2.2之前: MRO原则: 只有经典类,遵循深度优先(从左到右)原则, 存在的问题:在有重叠的多继承中,违背重写可用原则解决办法是再设计类的时候不要设计 ...
python 字符串查找
python 字符串查找有4个方法,1 find,2 index方法,3 rfind方法,4 rindex方法. 1 find()方法: )##从下标1开始,查找在字符串里第一个出现的子串:返回结果3 ...
详解 Python 中的下划线命名规则
在 python 中,下划线命名规则往往令初学者相当疑惑:单下划线.双下划线.双下划线还分前后……那它们的作用与使用场景到底有何区别呢?今天就来聊聊这个话题. 1.单下划线(_) 通常情况下,单 ...
【循序渐进学Python】3. Python中的序列——字符串
字符串是零个或多个的字符所组成的序列,字符串是Python内建的6种序列之一,在Python中字符串是不可变的. 1. 格式化字符串字符串格式化使用字符串格式化操作符即百分号%来实现.在%左侧放置一 ...
第二百九十五节，python操作redis缓存-字符串类型
python操作redis缓存-字符串类型首先要安装redis-py模块 python连接redis方式,有两种连接方式,一种是直接连接,一张是通过连接池连接注意:以后我们都用的连接池方式连接,直 ...
python初学者日记01(字符串操作方法)
时间:2018/12/16 作者:永远的码农(博客园) 环境: win10,pycharm2018,python3.7.1 1.1 基础操作(交互输入输出) input = input(" ...
『Python基础-4』字符串
# 『Python基础-4』字符串目录 1.什么是字符串 2.修改字符串 2.1 修改字符串大小 2.2 合并(拼接)字符串 2.3 使用乘号'*'来实现字符串的叠加效果. 2.4 在字符串中添加空 ...

随机推荐

unixbench服务器性能测试
使用 Unixbench5 进行linux系统服务器性能综合测试 1.简介 Unixbench是一个类unix系(Unix,BSD,Linux)统下的性能测试工具,一个开源工具,被广泛用与测试linu ...
字节Android Native Crash治理之Memory Corruption工具原理与实践
作者:字节跳动终端技术--庞翔宇内容摘要 MemCorruption工具是字节跳动AppHealth (Client Infrastructure - AppHealth) 团队开发的一款用于定 ...
Java之static静态关键字详解|final关键字详解
前言在Java语言中,static表示"静态"的意思,使用场景可以用来修饰成员变量和成员方法,当然也可以是静态代码块.static的主要作用在于创建独立于具体对象的域变量或者方法 ...
Mac和Linux远程连接服务器异常修复（WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!）
感谢大佬:https://blog.csdn.net/wd2014610/article/details/79945424 一.今天在使用SSH,连接远程服务器的时候,刚开始是没问题的. 后来阿里云主 ...
git rebase git merge
Git rebase 使用方法 1. git checkout feature 2. git rebase master feature 相当于git rebase master + git chec ...
js Object.prototype.hasOwnProperty() 与 for in 区别
hasOwnProperty() 方法会返回一个布尔值,指示对象自身属性中是否具有指定的属性语法 obj.hasOwnProperty(prop) 参数 prop要检测的属性 [字符串] 名称或者 ...
python——schedule库实现定时任务
今天给脚本加定时任务用到一个schedule库,使用起来很简单方便.https://github.com/dbader/schedulehttps://schedule.readthedocs.io/ ...
《Effective Python》笔记——第3章类与继承
一.尽量用辅助类来维护程序的状态如下,用字典存储简单数据 class SimpleGradebook(): def __init__(self): self.__grades = {} def ad ...
《手把手教你》系列技巧篇（六十五）-java+ selenium自动化测试 - cookie -下篇（详细教程）
1.简介今天这一篇,宏哥主要讲解:利用WebDriver 提供可以读取.添加和删除cookie 信息的相关操作方法.验证浏览器中是否存在某个cookie.原因是:因为基于真实的cookie 的测试是 ...
Elasticsearch按请求体基本查询
1 分页: localhost:9200/get-together/_search {"query": {"match_all": {}},"from ...

【实测】Python 和 C++ 下字符串查找的速度对比

背景

参数