Counting Words Across a Batch of Files in a Given Directory with Python: Concurrent Version

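In the article "Counting Words Across a Batch of Files in a Given Directory with Python: Serial Version", the overall approach was: A. obtain all matching files under the directory at once -> B. read all lines of all files at once -> C. count the words in all of those lines -> D. sort by word count and print the top N. Steps A, B, C and D run strictly in sequence.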

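This article implements a concurrent version. Its main idea is: A. obtain one matching file at a time -> B. read all lines of that single file -> C. count the words in that single file -> D. aggregate all word counts, sort them, and print the top N. Steps A, B and C run concurrently; if D can be made to sort incrementally, it could later be made concurrent as well.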
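Steps C and D can be sketched in isolation before any threads are involved. The following is a minimal Python 3 sketch, not the article's implementation: the names `count_words` and `top_n` are hypothetical, and `collections.Counter` stands in for the hand-written dictionary counting used in the listings.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")

def count_words(lines):
    """Step C: count the words in the lines of a single file."""
    return Counter(WORD_RE.findall("".join(lines)))

def top_n(per_file_counts, n):
    """Step D: merge the per-file counts and return the n most frequent words."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # adds counts key by key
    return total.most_common(n)

# Two "files", each given as a list of lines
counts = [count_words(["int a;\n", "int b;\n"]), count_words(["int a;\n"])]
print(top_n(counts, 2))
```

Because each file yields its own counter, step C can run on many files independently, and only the merge in step D has to see all of the results.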

I. Converting to threads

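First, the serial version is converted to threads. Each ordinary class becomes a thread, and the direct calls that passed values between classes are replaced by queues. The code, written for Python 2, is as follows: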

#-------------------------------------------------------------------------------

# Name:        wordstat_threading.py

# Purpose:     count words in the Java files of a given directory using threads

#

# Author:      qin.shuq

#

# Created:     09/10/2014

# Copyright:   (c) qin.shuq 2014

# Licence:     <your licence>

#-------------------------------------------------------------------------------

import re

import os

import time

import logging

import threading, Queue

LOG_LEVELS = {

    'DEBUG': logging.DEBUG, 'INFO': logging.INFO,

    'WARN': logging.WARNING, 'ERROR': logging.ERROR,

    'CRITICAL': logging.CRITICAL

}

def initlog(filename) :

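    # Use one named logger per file; getLogger() with no name returns the shared
    # root logger, so both handlers would receive every message.
    logger = logging.getLogger(filename)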

    hdlr = logging.FileHandler(filename)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    hdlr.setFormatter(formatter)

    logger.addHandler(hdlr)

    logger.setLevel(LOG_LEVELS['INFO'])

    return logger

errlog = initlog("error.log")

infolog = initlog("info.log")

timeoutInSecs = 0.05

class FileObtainer(threading.Thread):

    def __init__(self, dirpath, qOut, threadID, fileFilterFunc=None):

        threading.Thread.__init__(self)

        self.dirpath = dirpath

        self.fileFilterFunc = fileFilterFunc

        self.qOut = qOut

        self.threadID = threadID

        infolog.info('FileObtainer Initialized')

    def obtainFile(self, path):

        fileOrDirs = os.listdir(path)

        if len(fileOrDirs) == 0:

            return

        for name in fileOrDirs:

            fullPath = os.path.join(path, name)

            if os.path.isfile(fullPath):

                if self.fileFilterFunc is None:

                    self.qOut.put(fullPath)

                elif self.fileFilterFunc(fullPath):

                    self.qOut.put(fullPath)

            elif os.path.isdir(fullPath):

                self.obtainFile(fullPath)

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        self.obtainFile(self.dirpath)

        endtime = time.time()

        print 'ObtainFile cost: ', (endtime-starttime)*1000 , 'ms'

class WordReading(threading.Thread):

    def __init__(self, qIn, qOut, threadID):

        threading.Thread.__init__(self)

        self.qIn = qIn

        self.qOut = qOut

        self.threadID = threadID

        infolog.info('WordReading Initialized')

    def readFileInternal(self):

        lines = []

        try:

            filename = self.qIn.get(True, timeoutInSecs)

            #print filename

        except Queue.Empty, emp:

            errlog.error('In WordReading:' + str(emp))

            return None

        try:

            f = open(filename, 'r')

            lines = f.readlines()

            infolog.info('[successful read file %s]\n' % filename)

            f.close()

        except IOError, err:

            errorInfo = 'file %s Not found \n' % filename

            errlog.error(errorInfo)

        return lines

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        while True:

            lines = self.readFileInternal()

            if lines is None:

                break

            self.qOut.put(lines)

        endtime = time.time()

        print 'WordReading cost: ', (endtime-starttime)*1000 , 'ms'

class WordAnalyzing(threading.Thread):

    '''

     return Map<Word, count>  the occurrence times of each word

    '''

    wordRegex = re.compile(r"\w+")

    def __init__(self, qIn, threadID):

        threading.Thread.__init__(self)

        self.qIn = qIn

        self.threadID = threadID

        self.result = {}

        infolog.info('WordAnalyzing Initialized')

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        lines = []

        while True:

            try:

                lines = self.qIn.get(True, timeoutInSecs)

            except Queue.Empty, emp:

                errlog.error('In WordAnalyzing:' + str(emp))

                break

            linesContent = ''.join(lines)

            matches = WordAnalyzing.wordRegex.findall(linesContent)

            if matches:

                for word in matches:

                    if self.result.get(word) is None:

                        self.result[word] = 0

                    self.result[word] += 1

        endtime = time.time()

        print 'WordAnalyzing analyze cost: ', (endtime-starttime)*1000 , 'ms'

    def obtainResult(self):

        return self.result

class PostProcessing(object):

    def __init__(self, resultMap):

        self.resultMap = resultMap

    def sortByValue(self):

        return sorted(self.resultMap.items(),key=lambda e:e[1], reverse=True)

    def obtainTopN(self, topN):

        sortedResult = self.sortByValue()

        sortedNum = len(sortedResult)

        topN = sortedNum if topN > sortedNum else topN

        for i in range(topN):

            topi = sortedResult[i]

            print topi[0], ' counts: ', topi[1]

if __name__ == "__main__":

    dirpath = "c:\\Users\\qin.shuq\\Desktop\\region_master\\src"

    if not os.path.exists(dirpath):

        print 'dir %s not found.' % dirpath

        exit(1)

    qFile = Queue.Queue()

    qLines = Queue.Queue()

    fileObtainer = FileObtainer(dirpath, qFile, "Thread-FileObtainer", lambda f: f.endswith('.java'))

    wr = WordReading(qFile, qLines, "Thread-WordReading")

    wa = WordAnalyzing(qLines, "Thread-WordAnalyzing")

    fileObtainer.start()

    wr.start()

    wa.start()

    wa.join()

    starttime = time.time()

    postproc = PostProcessing(wa.obtainResult())

    postproc.obtainTopN(30)

    endtime = time.time()

    print 'PostProcessing cost: ', (endtime-starttime)*1000 , 'ms'

    print 'exit the program.'

Timing measurements:

    $ time python wordstat_serial.py
    ObtainFile cost:  92.0000076294 ms
    WordReading cost:  504.00018692 ms
    WordAnalyzing cost:  349.999904633 ms
    PostProcessing cost:  16.0000324249 ms
    real    0m1.100s
    user    0m0.000s
    sys     0m0.046s

    $ time python wordstat_threading.py
    ObtainFile cost:  402.99987793 ms
    WordReading cost:  1477.99992561 ms
    WordAnalyzing analyze cost:  1528.00011635 ms
    PostProcessing cost:  16.0000324249 ms
    real    0m1.690s
    user    0m0.000s
    sys     0m0.046s

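Judging by these timings, the concurrent version is actually slower than the serial one. The main reasons are that file reading is still single-threaded and that passing messages through the queues is blocking, which costs some time. In addition, the concurrent version does not yet take advantage of multiple cores, which is another point to improve later.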

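Note that WordAnalyzing and WordReading take nearly the same time, which indicates that the two run concurrently. PostProcessing takes negligible time and is not optimized for now. The next optimization targets are ObtainFile and WordReading.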
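The per-stage costs in the threaded listing also include the final `timeoutInSecs` wait, since each loop only ends once `get` times out, and a consumer can exit early if its producer stalls for longer than the timeout. One common alternative, which the original code does not use, is to send an explicit end-of-stream sentinel. The following is a minimal Python 3 sketch under that assumption; `_DONE`, `producer`, `consumer` and `run_pipeline` are hypothetical names, and upper-casing stands in for the real work.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel: "no more items will arrive"

def producer(items, q_out):
    """Put every item on the queue, then signal the end of the stream."""
    for item in items:
        q_out.put(item)
    q_out.put(_DONE)

def consumer(q_in, results):
    """Process items until the sentinel arrives; no timeout is needed."""
    while True:
        item = q_in.get()  # blocks until an item is available
        if item is _DONE:
            break
        results.append(item.upper())  # stand-in for the real processing

def run_pipeline(items):
    q = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(items, q)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_pipeline(["a.java", "b.java"]))
```

With several consumers on one queue, the producer would need to put one sentinel per consumer so that each of them sees the end of the stream.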

II. Optimizing with multiple threads and multiple processes

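1. Queue.put takes some time (about 1 ms on average), so putting a large number of file names one at a time wastes a lot of time; the improved version puts lists of files instead, which reduces the number of put calls;

2. WordReading uses multiple threads to read the large number of files;

3. WordAnalyzing uses multiple processes to count the words.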
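The three points above can be sketched compactly in Python 3. This is an illustration under stated assumptions rather than the article's code: it uses a bounded `ThreadPoolExecutor` instead of one thread per file, the names `read_text`, `count_words` and `count_files` are hypothetical, and `map_func` is a hypothetical hook that defaults to the built-in `map`.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

WORD_RE = re.compile(r"\w+")

def read_text(path):
    """Read one file; return an empty string if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.read()
    except OSError:
        return ""

def count_words(text):
    """Count the words in the text of a single file."""
    return Counter(WORD_RE.findall(text))

def count_files(paths, max_workers=8, map_func=map):
    """Read a whole list of files with a thread pool, then count and merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(read_text, paths))
    total = Counter()
    for counts in map_func(count_words, texts):
        total.update(counts)
    return total
```

Passing the `map` method of a `multiprocessing.Pool` as `map_func` would move the counting into worker processes, which corresponds to point 3; whether that pays off depends on how expensive it is to ship the file contents to the workers.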

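After optimization, WordReading and WordAnalyzing take roughly the same time as in the serial version. The bottleneck is now FileObtainer. I have timed os.walk and the for loop, but the measured time is always far below the reported ObtainFile cost, and I have not yet found where the time actually goes.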
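One way to narrow this down is to compare wall-clock time with the CPU time consumed by the measuring thread itself. A large gap would suggest the thread is mostly waiting (for example on other threads or on I/O) rather than walking the directory; that explanation is a guess, not something the measurements above confirm. The following Python 3 sketch assumes Python 3.7 or later for `time.thread_time`; `measure` and `list_files` are hypothetical helper names.

```python
import os
import time

def list_files(root, suffix=".java"):
    """Walk root and return the paths of all files ending with suffix."""
    return [os.path.join(path, name)
            for path, _dirs, names in os.walk(root)
            for name in names if name.endswith(suffix)]

def measure(func, *args):
    """Run func(*args); return (wall_seconds, thread_cpu_seconds, result)."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the calling thread only
    result = func(*args)
    cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, result
```

Calling `measure(list_files, dirpath)` once on its own and once from inside a thread while the other stages are running would show whether the extra time appears only under contention.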

#-------------------------------------------------------------------------------

# Name:        wordstat_threading_improved.py

# Purpose:     count words in the Java files of a given directory using threads

#              improved

#

# Author:      qin.shuq

#

# Created:     09/10/2014

# Copyright:   (c) qin.shuq 2014

# Licence:     <your licence>

#-------------------------------------------------------------------------------

import re

import os

import time

import logging

import threading, Queue

from multiprocessing import Process, Pool, cpu_count

LOG_LEVELS = {

    'DEBUG': logging.DEBUG, 'INFO': logging.INFO,

    'WARN': logging.WARNING, 'ERROR': logging.ERROR,

    'CRITICAL': logging.CRITICAL

}

def initlog(filename) :

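    # Use one named logger per file; getLogger() with no name returns the shared
    # root logger, so both handlers would receive every message.
    logger = logging.getLogger(filename)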

    hdlr = logging.FileHandler(filename)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    hdlr.setFormatter(formatter)

    logger.addHandler(hdlr)

    logger.setLevel(LOG_LEVELS['INFO'])

    return logger

errlog = initlog("error.log")

infolog = initlog("info.log")

timeoutInSecs = 0.1

class FileObtainer(threading.Thread):

    def __init__(self, dirpath, qOut, threadID, fileFilterFunc=None):

        threading.Thread.__init__(self)

        self.dirpath = dirpath

        self.fileFilterFunc = fileFilterFunc

        self.qOut = qOut

        self.threadID = threadID

        infolog.info('FileObtainer Initialized')

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        for path, dirs, filenames in os.walk(self.dirpath):

            if len(filenames) > 0:

                files = []

                for filename in filenames:

                    start = time.time()

                    fullPath = os.path.join(path, filename)

                    files.append(fullPath)

                    end = time.time()

                if self.fileFilterFunc is None:

                    self.qOut.put_nowait(files)

                else:

                    filteredFiles = filter(self.fileFilterFunc, files)

                    if len(filteredFiles) > 0:

                        self.qOut.put_nowait(filteredFiles)

        endtime = time.time()

        print 'ObtainFile cost: ', (endtime-starttime)*1000 , 'ms'

def readFile(filename, qOut):

    lines = []  # stays empty if the file cannot be read, so the put below is safe

    try:

        f = open(filename, 'r')

        lines = f.readlines()

        infolog.info('[successful read file %s]\n' % filename)

        f.close()

    except IOError, err:

        errorInfo = 'file %s Not found \n' % filename

        errlog.error(errorInfo)

    qOut.put(lines)

class WordReading(threading.Thread):

    def __init__(self, qIn, qOut, threadID):

        threading.Thread.__init__(self)

        self.qIn = qIn

        self.qOut = qOut

        self.threadID = threadID

        self.threads = []

        infolog.info('WordReading Initialized')

    def readFileInternal(self):

        try:

            filelist = self.qIn.get(True, timeoutInSecs)

            for filename in filelist:

                t = threading.Thread(target=readFile, args=(filename, self.qOut), name=self.threadID+'-'+filename)

                t.start()

                self.threads.append(t)

            return []

        except Queue.Empty, emp:

            errlog.error('In WordReading:' + str(emp))

            return None

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        while True:

            lines = self.readFileInternal()

            if lines is None:

                break

        for t in self.threads:

            t.join()

        endtime = time.time()

        print 'WordReading cost: ', (endtime-starttime)*1000 , 'ms'

def processLines(lines):

    result = {}

    linesContent = ''.join(lines)

    matches = WordAnalyzing.wordRegex.findall(linesContent)

    if matches:

        for word in matches:

            if result.get(word) is None:

                result[word] = 0

            result[word] += 1

    return result

def mergeToSrcMap(srcMap, destMap):

    for key, value in destMap.iteritems():

        if srcMap.get(key):

            srcMap[key] = srcMap.get(key)+destMap.get(key)

        else:

            srcMap[key] = destMap.get(key)

    return srcMap

class WordAnalyzing(threading.Thread):

    '''

     return Map<Word, count>  the occurrence times of each word

    '''

    wordRegex = re.compile(r"\w+")

    def __init__(self, qIn, threadID):

        threading.Thread.__init__(self)

        self.qIn = qIn

        self.threadID = threadID

        self.resultMap = {}

        self.pool = Pool(cpu_count())

        infolog.info('WordAnalyzing Initialized')

    def run(self):

        print threading.currentThread()

        starttime = time.time()

        lines = []

        futureResult = []

        while True:

            try:

                lines = self.qIn.get(True, timeoutInSecs)

                futureResult.append(self.pool.apply_async(processLines, args=(lines,)))

            except Queue.Empty, emp:

                errlog.error('In WordAnalyzing:' + str(emp))

                break

        self.pool.close()

        self.pool.join()

        for res in futureResult:

            mergeToSrcMap(self.resultMap, res.get())

        endtime = time.time()

        print 'WordAnalyzing analyze cost: ', (endtime-starttime)*1000 , 'ms'

    def obtainResult(self):

        #print len(self.resultMap)

        return self.resultMap

class PostProcessing(object):

    def __init__(self, resultMap):

        self.resultMap = resultMap

    def sortByValue(self):

        return sorted(self.resultMap.items(),key=lambda e:e[1], reverse=True)

    def obtainTopN(self, topN):

        sortedResult = self.sortByValue()

        sortedNum = len(sortedResult)

        topN = sortedNum if topN > sortedNum else topN

        for i in range(topN):

            topi = sortedResult[i]

            print topi[0], ' counts: ', topi[1]

if __name__ == "__main__":

    dirpath = "E:\\workspace\\java\\javastudy\\src"

    if not os.path.exists(dirpath):

        print 'dir %s not found.' % dirpath

        exit(1)

    qFile = Queue.Queue()

    qLines = Queue.Queue()

    fileObtainer = FileObtainer(dirpath, qFile, "Thread-FileObtainer", lambda f: f.endswith('.java'))

    wr = WordReading(qFile, qLines, "Thread-WordReading")

    wa = WordAnalyzing(qLines, "Thread-WordAnalyzing")

    fileObtainer.start()

    wr.start()

    wa.start()

    wa.join()

    starttime = time.time()

    postproc = PostProcessing(wa.obtainResult())

    postproc.obtainTopN(30)

    endtime = time.time()

    print 'PostProcessing cost: ', (endtime-starttime)*1000 , 'ms'

    print 'exit the program.'

[To be continued]
