Counting Words in a Document with Python

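While working through the challenges on hacker.org, I ran into one that asks you to count the 9-letter words in RFC3280 and find the one that occurs most often: Didactic Byte

RFC3280 runs to more than 7,000 lines, so counting by hand is out of the question and the job calls for a program. A scripting language is clearly the convenient choice for this kind of problem, so I went straight for Python.

At first I misread the problem as asking for the nine most frequent letters, and only noticed the mistake when I submitted the answer after the program had run. Rather embarrassing.

Since the program is already written, I'll post it anyway.

Counting letter frequencies in the document: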

 # find the number of occurrences of each single letter in a document
 #import fileinput
 
 test = ""
 dicts = {}
 
 # read the content of the document
 # we could use readline, readlines, etc.
 # we could also use fileinput.input(filename)
 
 with open('RFC3280.txt') as text:
     for line in text.readlines():
         test += line.lower().replace(" ", "")
 
 # use a dictionary to map each letter to its number of occurrences
 for i in range(97, 123):
     letter = chr(i)
     count = test.count(letter)
     #print("the count of %c is %d" % (letter, count))
     dicts.setdefault(letter, count)
 
 # sort the dict by value; here I use a lambda expression
 # to sort by key instead:
 # sorted_dict = sorted(dicts.items(), key=lambda d: d[0])
 
 sorted_dict = sorted(dicts.items(), key=lambda d: d[1], reverse=True)
 print(sorted_dict)

The code is short and thoroughly commented, and it shows that a scripting language really is convenient for this kind of task.
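As an alternative to calling `count` once per letter, the standard library's `collections.Counter` can tally every character in a single pass. The following is a minimal sketch, not the script used for the challenge: the function name `letter_frequencies` and the inline sample string are made up for illustration, and unlike the loop above, letters that never occur are simply absent from the result.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return (letter, count) pairs for the letters a-z, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return counts.most_common()


# A short inline string stands in for the contents of RFC3280.txt.
sample = "Hello, World"
print(letter_frequencies(sample)[:2])  # [('l', 3), ('o', 2)]
```

To run it on the real file, the sample string could be replaced with the result of reading `RFC3280.txt`.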

Next, let's extract the words of a given length from the document, count how often each one occurs, and sort them by frequency.

 # this script finds all 9-letter words in a given document
 # and then picks out the most common one
 # as I use a dictionary to store them, I cannot output the one I want directly:
 # unlike a sorted list, a dictionary does not keep its key:value pairs ordered by value
 
 test = ""
 storage = []
 dicts = {}
 word_length = 9
 
 # read the content of the document
 # every line read from the file ends with '\n', which is replaced by a space
 # so that words on adjacent lines are not glued together
 # a sentence ends with '.', which is removed with str.replace as well
 
 with open('RFC3280.txt') as text:
     for line in text.readlines():
         test += line.lower().replace('\n', ' ').replace('.', '')
 
 # convert the string to a list of words
 lists = test.split()
 
 # select the words whose length is word_length, here 9
 for word in lists:
     if len(word) == word_length:
         storage.append(word)
 
 #print(storage)
 #print(len(storage))
 
 # now use dict.setdefault to add the elements to the dictionary
 for word in storage:
     count = storage.count(word)
     dicts.setdefault(word, count)
 
 # sort the dictionary by value
 sorted_dict = sorted(dicts.items(), key=lambda d: d[1], reverse=True)
 print(sorted_dict)

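The two scripts follow much the same approach: both read the file's content into a single string. The file is read line by line, so the lines have to be concatenated. Once the text is in a string it still needs some cleanup, because the original text contains a large number of spaces and newlines, and the '.' at the end of a sentence would otherwise be treated as part of a word. Fortunately there are not many symbols to handle here; otherwise regular expressions would probably have to be called in. The string is then converted into a list object, whose count method counts the occurrences of an element directly. Since the number of occurrences of each element is needed, I store "key => value" pairs in a dictionary object and add elements with its setdefault method.

Note that the code uses lambda expressions. If you are not familiar with them, it is worth looking them up. Unfortunately I have not used Python for a while, so I no longer remember these details very well, and I have no reference material at hand (the manual is good, but looking up a small detail in it can be inconvenient; having bpython installed might make this much easier). For reference, see: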

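Using lambda functions

A closer look at lambda functions

python:lambda function

Official documentation: Built-in Types

Applications of Python dictionaries

Common usage of dict, list, and string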
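For readers who want a quick reminder of what the lambda in the `sorted` call does, here is a small sketch. The counts in the dictionary are made up purely for illustration and are not results from RFC3280. The `key` argument receives each `(word, count)` tuple and returns the value to sort on.

```python
from operator import itemgetter

# Made-up counts, purely for illustration.
counts = {"algorithm": 3, "extension": 7, "described": 5, "specified": 5}

# Sort by value (the count), largest first; item is a (word, count) tuple.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# itemgetter(1) builds the same key function without a lambda.
by_count_alt = sorted(counts.items(), key=itemgetter(1), reverse=True)

# Sort by key (the word) in alphabetical order.
by_word = sorted(counts.items(), key=lambda item: item[0])

# Break ties between equal counts alphabetically by sorting on a tuple.
by_count_then_word = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(by_count_then_word)
```

Sorting on a tuple key is one way to make the order of equally frequent words predictable.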

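Both scripts produce the desired result, but a big pile of dictionary items printed in one lump is not pleasant to read and needs improvement. I will tidy that up later.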
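One possible way to make the output easier to read is to print only the top few entries, one per line, in aligned columns. This is just a sketch of the idea: the helper name `format_top`, the column widths, and the sample counts are all assumptions made for illustration.

```python
def format_top(counts, n=5):
    """Return the n most frequent entries as aligned 'word  count' lines."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = []
    for word, count in ranked[:n]:
        # Left-align the word in 12 columns, right-align the count in 5.
        lines.append("{:<12}{:>5}".format(word, count))
    return "\n".join(lines)


# Made-up counts, purely for illustration.
sample_counts = {"algorithm": 3, "extension": 7, "described": 5}
print(format_top(sample_counts, n=2))
```

The same function could be applied to `dicts` from either script in place of printing `sorted_dict` directly.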

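Finally, here is part of the test file. The scripts can open it by its bare filename because the Python script and the text file are in the same directory.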

   This memo profiles the X.509 v3 certificate and X.509 v2 Certificate
   Revocation List (CRL) for use in the Internet.  An overview of this
   approach and model are provided as an introduction.  The X.509 v3
   certificate format is described in detail, with additional
   information regarding the format and semantics of Internet name
   forms.  Standard certificate extensions are described and two
   Internet-specific extensions are defined.  A set of required
   certificate extensions is specified.  The X.509 v2 CRL format is
   described in detail, and required extensions are defined.  An
   algorithm for X.509 certification path validation is described.  An
   ASN.1 module and examples are provided in the appendices.
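The excerpt above also makes a handy test input. The sketch below tries the regular-expression route mentioned earlier instead of chained `replace` calls, and counts with `collections.Counter` instead of `list.count`. It is an alternative under stated assumptions, not the script used for the challenge: the function name `count_words_of_length` is made up, and a "word" is assumed to be any run of the letters a-z, so punctuation, digits, and hyphens all act as separators (for example, "Internet-specific" becomes two words).

```python
import re
from collections import Counter

# Two sentences taken from the RFC excerpt shown above.
excerpt = (
    "The X.509 v3 certificate format is described in detail, with additional "
    "information regarding the format and semantics of Internet name forms.  "
    "Standard certificate extensions are described and two Internet-specific "
    "extensions are defined."
)


def count_words_of_length(text, length):
    """Count words with exactly `length` letters, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) == length)


nine_letter = count_words_of_length(excerpt, 9)
print(nine_letter.most_common(1))  # [('described', 2)]
```

Because this definition of a word differs from splitting on spaces, the counts it gives on the full document may not match those of the script above exactly.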
