Counting words across a batch of files in a given directory with Python: serial version

Straight to the code.

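Practice goals:

1. Use Python's object-oriented features to encapsulate the logic and express it clearly;

2. Use exception handling and the logging API;

3. Use the file and directory read/write APIs;

4. Use three data structures: list, map (a dict in Python), and tuple;

5. Use lambda, regular expressions, and a few other features.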
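For goal 2, one detail of the logging API is worth keeping in mind: `logging.getLogger()` called without a name always returns the same root logger, whereas named loggers are separate objects that can each have their own handler. The following is a minimal sketch of that behaviour, assuming Python 3. The helper `make_logger` and the logger names `wordstat.info` and `wordstat.error` are hypothetical, and in-memory streams stand in for log files so the example leaves nothing on disk.

```python
import io
import logging


def make_logger(name, stream, level=logging.INFO):
    """Return a named logger that writes to `stream` (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Keep records from also reaching handlers attached to ancestor loggers.
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# In-memory buffers stand in for info.log and error.log.
info_buf = io.StringIO()
err_buf = io.StringIO()
info_log = make_logger("wordstat.info", info_buf)
err_log = make_logger("wordstat.error", err_buf)

info_log.info("read ok")
err_log.error("read failed")

# Each buffer holds only the records sent to its own logger.
print(info_buf.getvalue(), end="")
print(err_buf.getvalue(), end="")
```

Because the two loggers have different names, a message sent to one does not show up in the other's output.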

The next post will implement a concurrent version.

#-------------------------------------------------------------------------------
# Name:        wordstat_serial.py
# Purpose:     count words in the Java files of a given directory, serially
#
# Author:      qin.shuq
#
# Created:     08/10/2014
# Copyright:   (c) qin.shuq 2014
# Licence:     <your licence>
#-------------------------------------------------------------------------------

import re
import os
import time
import logging

# Map level names to the logging module's constants.
LOG_LEVELS = {
    'DEBUG': logging.DEBUG, 'INFO': logging.INFO,
    'WARN': logging.WARNING, 'ERROR': logging.ERROR,
    'CRITICAL': logging.CRITICAL
}

def initlog(filename):
    # Use a logger named after the file. logging.getLogger() without a name
    # always returns the root logger, so both log files would otherwise
    # receive every message.
    logger = logging.getLogger(filename)
    hdlr = logging.FileHandler(filename)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    hdlr.setFormatter(formatter)
    logger.addHandler(hdlr)
    logger.setLevel(LOG_LEVELS['INFO'])
    return logger

errlog = initlog("error.log")
infolog = initlog("info.log")

class WordReading(object):
    '''Read every file in fileList and collect all of their lines.'''

    def __init__(self, fileList):
        self.fileList = fileList

    def readFileInternal(self, filename):
        lines = []
        try:
            # The with block closes the file even if reading fails.
            with open(filename, 'r') as f:
                lines = f.readlines()
            infolog.info('[successful read file %s]' % filename)
        except IOError as err:
            # IOError covers more than a missing file, so log the actual cause.
            errlog.error('failed to read file %s: %s' % (filename, err))
        return lines

    def readFile(self):
        allLines = []
        for filename in self.fileList:
            allLines.extend(self.readFileInternal(filename))
        return allLines

class WordAnalyzing(object):
    '''
    Count how many times each word occurs.
    analyze() returns a dict mapping word -> count.
    '''
    wordRegex = re.compile(r"\w+")

    def __init__(self, allLines):
        self.allLines = allLines

    def analyze(self):
        result = {}
        # Match line by line: joining all lines into one string could fuse the
        # last word of a file that lacks a trailing newline with the first
        # word of the next file.
        for line in self.allLines:
            for word in WordAnalyzing.wordRegex.findall(line):
                result[word] = result.get(word, 0) + 1
        return result

class FileObtainer(object):
    '''Collect the files under dirpath, optionally filtered by fileFilterFunc.'''

    def __init__(self, dirpath, fileFilterFunc=None):
        self.dirpath = dirpath
        self.fileFilterFunc = fileFilterFunc

    def findAllFilesInDir(self):
        files = []
        for path, dirs, filenames in os.walk(self.dirpath):
            for filename in filenames:
                # os.path.join uses the platform's path separator.
                files.append(os.path.join(path, filename))
        if self.fileFilterFunc is None:
            return files
        # Build a real list so the result is the same under Python 2 and 3.
        return [f for f in files if self.fileFilterFunc(f)]

class PostProcessing(object):
    '''Sort the word counts and print the most frequent words.'''

    def __init__(self, resultMap):
        self.resultMap = resultMap

    def sortByValue(self):
        # items() yields (word, count) tuples; sort by count, descending.
        return sorted(self.resultMap.items(), key=lambda e: e[1], reverse=True)

    def obtainTopN(self, topN):
        sortedResult = self.sortByValue()
        # Never ask for more entries than there are distinct words.
        topN = min(topN, len(sortedResult))
        for i in range(topN):
            word, count = sortedResult[i]
            print('%s counts: %d' % (word, count))

if __name__ == "__main__":
    # Directory to scan; change this to a path on your own machine.
    dirpath = "c:\\Users\\qin.shuq\\Desktop\\region_master\\src"

    # Step 1: collect all .java files under dirpath.
    starttime = time.time()
    fileObtainer = FileObtainer(dirpath, lambda f: f.endswith('.java'))
    fileList = fileObtainer.findAllFilesInDir()
    endtime = time.time()
    print('ObtainFile cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Step 2: read every line of every file.
    starttime = time.time()
    wr = WordReading(fileList)
    allLines = wr.readFile()
    endtime = time.time()
    print('WordReading cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Step 3: count the occurrences of each word.
    starttime = time.time()
    wa = WordAnalyzing(allLines)
    resultMap = wa.analyze()
    endtime = time.time()
    print('WordAnalyzing cost: %.3f ms' % ((endtime - starttime) * 1000))

    # Step 4: sort by count and print the 30 most frequent words.
    starttime = time.time()
    postproc = PostProcessing(resultMap)
    postproc.obtainTopN(30)
    endtime = time.time()
    print('PostProcessing cost: %.3f ms' % ((endtime - starttime) * 1000))
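
As a side note, the counting and top-N steps above can also be expressed with `collections.Counter` from the standard library. The sketch below is not part of the script. It swaps the hand-written dict counting and sorting for `Counter.update` and `Counter.most_common`, and the names `WORD_RE`, `count_words`, `top_n`, and `sample` are made up for illustration.

```python
import re
from collections import Counter

# Same word pattern as the script: runs of letters, digits, and underscores.
WORD_RE = re.compile(r"\w+")


def count_words(lines):
    """Return a Counter mapping each word to its number of occurrences."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of every word in the list.
        counts.update(WORD_RE.findall(line))
    return counts


def top_n(counts, n):
    """Return up to n (word, count) tuples, most frequent first."""
    return counts.most_common(n)


# A few made-up lines standing in for the contents of a Java file.
sample = ["public class Foo {\n", "  public int bar;\n", "}\n"]
for word, count in top_n(count_words(sample), 3):
    print("%s counts: %d" % (word, count))
```

`Counter` is a dict subclass, so the result can still be handed to code that expects a plain word-to-count mapping, such as `PostProcessing`.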
