Python实现C代码统计工具(四)

标签： Python 计时持久化

Python实现C代码统计工具(四)
- 标签： Python 计时持久化
声明
运行测试环境
一. 自定义计时函数
二. 性能优化

声明

本文介绍若干种有别于cProfile/profile模块的Python程序计时方法，并对《Python实现C代码统计工具(三)》中的C代码统计工具进行性能优化。本文所述的方法也适用于其他Python脚本。

运行测试环境

本文基于Windows和Linux两种系统平台，运行和测试所涉及的代码实例。平台信息如下：

>>> import sys, platform

>>> print '%s %s, Python %s' %(platform.system(), platform.release(), platform.python_version())

Windows XP, Python 2.7.11

>>> sys.version

'2.7.11 (v2.7.11:6d1b6a68f775, Dec  5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)]'

和

>>> import sys, platform

>>> print '%s %s, Python %s' %(platform.system(), platform.release(), platform.python_version())

Linux 2.6.18-194.el5PAE, Python 2.4.3

>>> sys.version

'2.4.3 (#1, Jun 11 2009, 14:09:58) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)]'

注意，Python不同版本间语法存在差异，故文中某些代码实例需要稍作修改，以便在低版本Python环境中运行。

一. 自定义计时函数

Python标准库提供的cProfile/profile模块，计时输出信息较多。本节将介绍其他几种精度略低但简单易用的计时工具。根据代码粒度不同，将其分为三类。

1.1 整个程序计时

Unix/Linux系统中，可用time命令简单地统计整个程序的耗时。例如：

[wangxiaoyuan_@localhost PyTest]$ time python BCLineCounter.py bulk

FileLines  CodeLines  CommentLines  EmptyLines  CommentPercent

15450      10437      3264          2538        0.24            <Total:40 Code Files>

real    0m2.803s

user    0m1.124s

sys     0m0.052s

统计值real表示程序运行的实际耗时，user表示程序执行用户态代码(内核外)耗费的CPU时间，sys表示程序在内核态运行所耗费的CPU时间(即调用特定内核函数的耗时)。若user和sys时间之和小于real时间，表明程序为I/O密集型(I/O bound)，即程序的性能问题很可能与等待I/O有关。

time命令的详细描述参见《Linux用户态程序计时方式详解》。

1.2 代码片段计时

代码片段计时分为函数计时和语句块计时。这两种计时均可使用Python标准库timeit模块，该模块的详细介绍参见官方帮助。

本小节将使用timeit模块的timeit()方法，即timeit(stmt='pass', setup='pass', timer=<defaulttimer>, number=1000000)。其中，参数stmt为待计时的目标代码；setup为执行代码的准备工作(通常是import之类的语句)，不计入时间；timer在Windows系统中为time.clock()，Linux系统中则为time.time()，取默认值即可；number指示stmt重复执行的次数。该方法返回执行stmt代码number遍所用的时间，单位为秒，float类型。

除timeit()方法外，对于特定函数的计时，可使用装饰器(decorator)；对于语句块计时，则可使用上下文管理器(context manager)。

以装饰器为例：

import functools, sys, time

def FuncTimer(repeats=10000):

    def decorator(func):

        @functools.wraps(func)

        def wrapper(*args, **kwargs):

            #Windows系统中clock()粒度为毫秒，time()粒度为1/60秒；

            #Unix系统中clock()粒度为1/100秒，time()精度较其更高。

            if sys.platform == "win32":

                timerFunc = time.clock

            else:

                timerFunc = time.time

            try:

                startTime = timerFunc()

                for i in range(repeats):

                    ret = func(*args, **kwargs)

            finally:  #当目标函数发生异常时，仍旧输出计时信息

                endTime = timerFunc()

                print '%s.%s() =>' %(func.__module__, func.__name__),

                print 'Time Elasped: %.3f msec, repeated %d time(s).' \

                      %(((endTime-startTime)*1000.0), repeats)

            return ret

        return wrapper

    return decorator

运行如下代码，对比自定义装饰器FuncTimer与timeit模块计时效果：

@FuncTimer(10)

def DecoratedFunc():

    L = []

    for i in range(100): L.append(i)

def RawFunc():

    L = []

    for i in range(100): L.append(i)

DecoratedFunc()

import timeit; print '%.6f sec' %timeit.timeit(stmt=RawFunc, number=10)

输出如下：

__main__.DecoratedFunc() => Time Elasped: 0.164 msec, repeated 10 time(s).

0.000174 sec

可见，计时效果非常接近。

注意，FuncTimer装饰器内根据系统选用不同的计时器，这是考虑到time.clock()的精度因系统平台而异。在Unix/Linux系统中，该方法返回当前所耗的CPU时间；而在Windows系统中，该方法基于Win32函数 QueryPerformanceCounter()，返回从首次调用待计时函数起所经历的挂钟时间(wall clock time)，精度较time.time()更高。相比而言，timeit方法中使用的缺省计时器总是测量挂钟时间，这也意味着关于某函数的计时可能会受到同一计算机上运行的其他进程的影响。

time.clock()计时器的平台差异性参考以下示例(假定所在脚本名为Timing.py)：

@FuncTimer(5)

def SqrtTiming(loops):

    import math

    try:

        from math import fsum  #Python2.6+

        return fsum([math.sqrt(x) for x in range(loops)])

    except ImportError:        #Python2.5-

        return sum([math.sqrt(x) for x in range(loops)])

@FuncTimer(1)

def SleepTiming():

    time.sleep(2)

    file = open(r'out.txt', "w+")

    for i in range(10000):

        file.write('hello world!')

    SqrtTiming(100000)

SleepTiming()

在Windows系统控制台和IDLE Shell里的运行结果如下：

E:\PyTest>Timing.py

SqrtTiming() => Time Elasped: 150.124 msec, repeated 5 time(s).

SleepTiming() => Time Elasped: 2155.140 msec, repeated 1 time(s).

__main__.SqrtTiming() => Time Elasped: 151.809 msec, repeated 5 time(s).

__main__.SleepTiming() => Time Elasped: 2185.594 msec, repeated 1 time(s).

>>> import Timing

Timing.SqrtTiming() => Time Elasped: 148.892 msec, repeated 5 time(s).

Timing.SleepTiming() => Time Elasped: 2223.157 msec, repeated 1 time(s).

在Linux系统中运行结果与之类似。若将timerFunc改为time.clock()，则计时输出为：

[wangxiaoyuan_@localhost ~]$ time python Timing.py

__main__.SqrtTiming() => Time Elasped: 320.000 msec, repeated 5 time(s).

__main__.SleepTiming() => Time Elasped: 330.000 msec, repeated 1 time(s).

real    0m2.381s

user    0m0.332s

sys     0m0.019s

可见，time.sleep(2)并未计入SleepTiming()耗时，导致计时结果与real时间相差很大。

对于代码片段计时，以上下文管理器为例：

import contextlib, sys, time

@contextlib.contextmanager

def BlockTimer(label='Block'):

    if sys.platform == "win32": timerFunc = time.clock

    else: timerFunc = time.time

    startTime = timerFunc()

    try:

        yield

    finally:

        endTime = timerFunc()

        print '%s =>' %label,

        print 'Time Elasped: %.3f msec.' \

               %((endTime-startTime)*1000.0)

基于BlockTimer测量代码片段的示例如下：

with BlockTimer('cPickle'):

    from cPickle import dumps, loads

    s = dumps([x*2.4 for x in range(100000)])

    loads(s)

with BlockTimer('json'):

    from json import dumps, loads

    s = dumps([x*2.4 for x in range(100000)])

    loads(s)

运行结果如下：

cPickle => Time Elasped: 237.569 msec.

json => Time Elasped: 181.714 msec.

可见，对于浮点型对象，json模块执行速度比cPickle模块更快。

当然，借助timeit模块也可对代码片段计时。例如：

from timeit import timeit

sep = 'from cPickle import dumps, loads'

stp = 's=dumps([x*2 for x in range(100000)]); loads(s)'

print 'cPickle: %.6f sec' %timeit(stmt=stp, setup=sep, number=1)

sej = 'from json import dumps, loads'

stj = 's=dumps([x*2 for x in range(100000)]); loads(s)'

print 'json:    %.6f sec' %timeit(stmt=stj, setup=sej, number=1)

本例改为整型对象，且模块导入语句不计入总耗时。运行结果如下：

cPickle: 0.100775 sec

json:    0.064752 sec

可见，对于整型对象，json模块执行速度也比cPickle模块快。

1.3 单条语句计时

上下文管理器和timeit.timeit()方法也适合单条语句计时。除此之外，也可通过命令行执行timeit计时，命令选项详见官方帮助。

通过命令行执行timeit计时，可方便地测量和对比相同功能不同实现表达式的性能：

E:\PyTest>python -m timeit -r 5 "'-'.join(str(n) for n in range(100))"

10000 loops, best of 5: 32.8 usec per loop

E:\PyTest>python -m timeit -r 5 "'-'.join([str(n) for n in range(100)])"

10000 loops, best of 5: 29.8 usec per loop

E:\PyTest>python -m timeit -r 5 "'-'.join(map(str, range(100)))"

10000 loops, best of 5: 21.8 usec per loop

以上三条待计时的Python表达式均被重复测试5次，每次测试中调用该表达式10000次。相同机器上运行的其他程序可能会影响计时测量，因此重复测量若干次再选取最佳结果(best time)，以保证计时的准确性。考虑到后台进程等因素通常会增加总耗时，因此最佳结果取重复测量的最小值(添加-v选项可查看各次测量值)，而非平均值。由本例输出结果可知，map快于列表解析，而列表解析快于生成器(生成器表达式用作参数时可省略外部的括号)。注意，这三者的快慢对比会因场景而异。例如，[x for x in range(10000)]比map(lambda x:x, range(10000))快三倍。因此，应根据Profiler工具的实测值而非"经验"来分析性能。

此外，需注意双引号和单引号的内外位置。在作者的Windows主机上外单内双会触发"EOL while scanning string literal"的语法错误，而在Linux主机上外单内双和外双内单均可正确解析。

再举一例：

E:\PyTest>python -m timeit -r 5 -s "import math" "math.sqrt(255)"

10000000 loops, best of 5: 0.188 usec per loop

E:\PyTest>

E:\PyTest>python -m timeit -r 5 -s "from math import sqrt" "sqrt(255)"

10000000 loops, best of 5: 0.143 usec per loop

类似地，此处Windows主机只接受双引号，而Linux主机单双"通吃"。由输出结果可知，使用from语句导入比常规导入更快。同时也可看到，通过导入目标模块，timeit命令行也可对其函数甚至整个模块(存在入口函数时)计时。

最后是两点忠告：

进行性能评估时，要牢记任何测量结果只是一个估算值。即使相当可靠的计时工具，多次测量的结果也很可能有所出入。谨慎选取最佳结果，才能尽可能保证评估的准确性。
尽量使用标准库提供的计时工具，而不要自己编写计时函数。因为普通开发者难以全面考虑处理器时间片、后台持续或间歇运行的程序(如定时更新或杀毒)、网络连接、结果缓存、编译优化等多方面因素，导致自定义计时工具精度不高，某些情况下甚至完全偏离真实数据。

二. 性能优化

《Python实现C代码统计工具(三)》一文提到BCLineCounter.py统计可靠但性能偏低。本节旨在简单高效地提升BCLineCounter.py性能，而不借助C扩展或just-in-time(JIT)编译器等复杂方案。

众所周知，过早优化乃万恶之源。此外，优化应关注热点(hotspot)，寻找程序中最耗时的地方去优化。既然性能偏低，此时着手优化恰逢其时。然而，性能瓶颈CalcLines()函数优化难度较高，故考虑另辟蹊径：存储当前统计结果，下次遇到相同统计需求(命令行参数相同)时直接读取结果输出，从而免去重新统计的耗时。

Python标准库cPickle/pickle和json模块提供持久性机制，正好满足上述存取要求。因此，本节将先后使用cPickle(pickle的C版本)和json模块进行统计结果的存储和读取。

cPickle/pickle模块的详细介绍参见官方帮助或博文《python pickle模块》。json是与pickle类似的轻量级数据持久化方式，也是网络数据传输时的常见标准化格式(通用性更强)。此外，json可读性更强(human-readable)，使用者可在文本编辑器中打开json文件并检视数据。然而，json只能序列化(将内存中的变量转换为可存储和传输的流或文件)一些内置的Python对象，而pickle可存储几乎所有对象。

要实现缓存机制，除rawCountInfo和detailCountInfo外，还需存储命令行参数。接着，使用cPickle模块对Python对象序列化和反序列化：

CACHE_FILE = 'Counter.dump'

CACHE_DUMPER, CACHE_LOADER = None, None

from cPickle import dump, load

#from json import dump, load

def CounterDump(data):

    global CACHE_DUMPER

    if CACHE_DUMPER == None:

        CACHE_DUMPER = open(CACHE_FILE, 'wb')

    dump(data, CACHE_DUMPER)

def CounterLoad():

    global CACHE_LOADER

    if CACHE_LOADER == None:

        CACHE_LOADER = open(CACHE_FILE, 'rb')

    return load(CACHE_LOADER)

序列化的结果存入CACHE_FILE文件。因为打开json文件也可使用"b"模式，所以期望通过导入不同的模块(如from导入语句所示)，即可切换CounterDump()和CounterLoad()函数的持久化方式。然而，实测结果表明，使用json方式时，CounterLoad()会触发"ValueError: Extra data: line 1 column 32 - line 1 column 3386 (char 31 - 3385)"的异常。

这是因为json序列化的对象过于复杂，load()方法无法正确处理。对于这种可能包含无关数据(extraneous data)的json文件，可使用json.JSONDecoder.raw_decode()解码。如下所示：

from json import dump, JSONDecoder

#CounterDump()无需修改或将'wb'改为'r'即可

jsonEndPos = 0

def CounterLoad():

    global CACHE_LOADER, jsonEndPos

    if CACHE_LOADER == None:

        CACHE_LOADER = open(CACHE_FILE, 'r')

    CACHE_LOADER.seek(jsonEndPos)

    jsonData = CACHE_LOADER.read()

    pyObj, jsonEndPos = JSONDecoder().raw_decode(jsonData)

    return pyObj

首次执行CounterLoad()时，返回值pyObj为json文件存储的命令行参数，jsonEndPos之后为rawCountInfo和detailCountInfo列表组合。该组合也可用json.loads(jsonData[jsonEndPos:])反序列化，但为写法统一仍采用raw_decode()解码。

上述CounterLoad()也可改造为生成器的写法：

def ParseJson(jsonData):

    from json import JSONDecoder

    endPos = 0

    while True:

        jsonData = jsonData[endPos:].lstrip()

        try:

            pyObj, endPos = JSONDecoder().raw_decode(jsonData)

            yield pyObj

        except ValueError:

            break

CACHE_GEN = None

def CounterLoad():

    global CACHE_GEN

    if CACHE_GEN == None:

        CACHE_GEN = ParseJson(open(CACHE_FILE, 'r').read())

    try:

        return next(CACHE_GEN)

    except StopIteration, e:

        return []

基于CounterDump()和CounterLoad()函数，实现统计结果的存取：

def shouldUseCache(stay, detail, basename, cache, target):

    if not cache:  #未指定启用缓存

        return False

    try:

        (_stay, _detail, _basename, _target) = CounterLoad()

    except (IOError, EOFError, ValueError): #缓存文件不存在或内容为空或不合法

        return False

    if stay == _stay and detail == _detail and basename == _basename \

       and sorted(target) == sorted(_target):

        return True

    else:

        return False

@FuncTimer(1) #因序列化限制，只能调用一次

def main():

    global rawCountInfo, detailCountInfo

    (stay, detail, basename, cache, target) = ParseCmdArgs()

    cacheUsed = shouldUseCache(stay, detail, basename, cache, target)

    if cacheUsed:

        try:

            (rawCountInfo, detailCountInfo) = CounterLoad()

        except (EOFError, ValueError), e: #不太可能出现

            print 'Unexpected Cache Corruption(%s), Try Counting Directly.'%e

            CLineCounter(stay, not detail, basename, target)

    else:

       CLineCounter(stay, not detail, basename, target)

    ReportCounterInfo(not detail)

    if cacheUsed:

        return

    CounterDump((stay, detail, basename, target))

    CounterDump((rawCountInfo, detailCountInfo))

if __name__ == '__main__':

    main()

最后，在原来的ParseCmdArgs()命令行解析函数内增加-c选项，并返回该选项值：

    #... ...

    parser.add_argument('-c', '--cache', action='store_true',

           help='use cache to count faster(unreliable when files are modified)')

    #... ...

    return (args.stay, args.detail, args.basename, args.cache, args.target)

除-c选项外的命令行参数与前次相同时，将通过缓存直接输出统计结果。如帮助信息所示，当代码文件被修改时，前次统计缓存的结果是不可靠的，应去除该选项重新统计。

将引入缓存机制的BCLineCounter.py重命名为FBCLineCounter.py。首先使用cPickle持久化方式，通过cProfile命令分析其性能：

python -m cProfile -s tottime FBCLineCounter.py source -d -b > out.txt

注意，给定命令首次执行时，总会从头统计，此时-c选项可有可无。截取out.txt文件部分内容如下：

2503       1624       543           362         0.25            xtm_mgr.c

------------------------------------------------------------------------------------------

140872     93736      32106         16938       0.26            <Total:82 Code Files>

__main__.main() => Time Elasped: 4150.499 msec, repeated 1 time(s).

         286091 function calls (286057 primitive calls) in 4.161 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

   140872    3.604    0.000    3.740    0.000 FBCLineCounter.py:12(CalcLines)

       83    0.374    0.005    4.132    0.050 FBCLineCounter.py:41(CountFileLines)

141596/141588    0.136    0.000    0.136    0.000 {len}

       83    0.014    0.000    0.014    0.000 {open}

        1    0.004    0.004    0.008    0.008 argparse.py:62(<module>)

        1    0.003    0.003    0.004    0.004 collections.py:1(<module>)

      416    0.003    0.000    0.004    0.000 ntpath.py:96(splitdrive)

       84    0.003    0.000    0.003    0.000 {nt._isdir}

        1    0.002    0.002    4.161    4.161 FBCLineCounter.py:5(<module>)

注意，FuncTimer装饰器计时值略小于cProfile计时。

此时，生成的cPickle文件部分内容如下(几乎不具可读性)：

(I00

I01

I01

(lp1

S'source'

p2

atp3

...

添加-c选项后再次执行cProfile命令，截取out.txt文件部分内容如下：

2503       1624       543           362         0.25            xtm_mgr.c

------------------------------------------------------------------------------------------

140872     93736      32106         16938       0.26            <Total:82 Code Files>

__main__.main() => Time Elasped: 9.407 msec, repeated 1 time(s).

         3672 function calls (3567 primitive calls) in 0.029 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

        1    0.004    0.004    0.029    0.029 FBCLineCounter.py:5(<module>)

        1    0.003    0.003    0.007    0.007 argparse.py:62(<module>)

可见，耗时大为减少，几乎可以忽略不计。同时可看到，装饰器计时与cProfile存在较大偏差。

接着，切换至json对应的序列化和反序列化函数，仍通过cProfile命令评估程序统计source源码时的性能。首次执行(未缓存)时耗时4.152秒，此时，生成的json文件部分内容如下(可读性良好)：

[false, true, true, ["source"]][[0, 0, 0, 0, 0], [["access_dev_mgr.c", [2840, 1879, 689, 317]], ["app_webpop_mgr.c", [289, 163, 67, 64]], ["bbxrtrace_mgr.c", [472, 316, 142, 23]], ...

添加-c选项后再次执行cProfile命令，耗时为0.047秒，略大于cPickle方式。这是因为反序列化时未使用load()方法，而使用更复杂的JSONDecoder().raw_decode()方法。