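Word frequency counting with awk

2018-01-03 @ Zhongguancun

Given the text file a.log below, count how often each word occurs and sort the result by frequency in descending order.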

The Zen of Python, by Tim Peters

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!

Method 1

egrep -o "\b[[:alpha:]]+\b" a.log |

awk '{++count[$0]} END{for (word in count){ printf("%-20s%d\n",word,count[word]);}}' |

sort -n -r -k2,2

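- First, egrep splits the text so that each output line holds a single word

  - egrep -o prints only the matched text, one match per line

  - \b is the word-boundary anchor in regular expressions

  - [:alpha:] is the character class for letters

- Next, awk counts how many times each word occurs, using the word as the key of the count array

- Finally, sort -n -r -k2,2 orders the lines numerically by the second column (the count), largest first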
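The pipeline above can be tried one stage at a time on a small input. The following is a minimal sketch, assuming GNU grep (for `\b` and `-o`); the two-line `sample` and the variable names `words` and `counts` are illustrative and not part of the original post. It uses `grep -E`, the current spelling of `egrep`, and adds a second sort key so that ties come out in a stable order.

```shell
# Hypothetical two-line sample, not the real a.log.
sample='Now is better than never.
Although never is often better than right now.'

# Stage 1: one word per line. grep -E is equivalent to egrep.
words=$(printf '%s\n' "$sample" | grep -E -o "\b[[:alpha:]]+\b")

# Stage 2: count each distinct word. awk walks the array in unspecified order,
# which is why a sort stage is needed at all.
# Stage 3: sort by count (descending), then by word to break ties.
counts=$(printf '%s\n' "$words" |
  awk '{ ++count[$0] } END { for (word in count) printf("%-20s%d\n", word, count[word]) }' |
  sort -k2,2nr -k1,1)

printf '%s\n' "$counts"
```

Note that the match is case-sensitive, so `Now` and `now` are counted as two different words here, just as `Although` and `If` keep their capitals in the real output.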

root@standby [13:39:48]$ egrep -o "\b[[:alpha:]]+\b" a.log |awk '{++count[$0]} END{for (word in count){ printf("%-20s%d\n",word,count[word]);}}' |sort -n -r -k2,2 |head -20

is                  10

than                8

better              8

to                  5

the                 5

one                 3

of                  3

never               3

it                  3

idea                3

be                  3

Although            3

way                 2

should              2

s                   2

obvious             2

may                 2

implementation      2

If                  2

explain             2

root@standby [13:42:38]$

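Method 2

awk '{for(i=1;i<=NF;i++) count[$i]++} END{ for(patten in count) printf("%-20s%d\n",patten,count[patten])}' a.log

Note: this version counts whitespace-separated fields rather than words, so punctuation stays attached and entries such as idea. and explain, are counted separately.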

root@standby [15:45:06]$ awk '{for(i=1;i<=NF;i++) count[$i]++} END{ for(patten in count) printf("%-20s%d\n",patten,count[patten])}' a.log |sort -n -r -k2,2 |head -20

is                  10

than                8

better              8

to                  5

the                 5

of                  3

be                  3

Although            3

way                 2

should              2

one                 2

never               2

may                 2

implementation      2

If                  2

idea.               2

explain,            2

do                  2

a                   2

Zen                 1

root@standby [15:45:14]$
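The difference between counting fields and counting words is easy to see on a single line. The sketch below is an illustration only: the variable names are made up, and the `gsub` step is one possible way to get word-like tokens inside awk, not something the original post uses. It uses `[^A-Za-z]` rather than `[:alpha:]` on the assumption that some older awk implementations do not support POSIX character classes.

```shell
# One line taken from the sample text above.
line="If the implementation is hard to explain, it's a bad idea."

# Field-based: split on whitespace only, so "explain," and "idea." keep their punctuation.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print $i }')

# Word-based inside awk: turn every run of non-letters into a space first.
# gsub without a target edits $0, and awk then re-splits the record into fields.
words=$(printf '%s\n' "$line" |
  awk '{ gsub(/[^A-Za-z]+/, " "); for (i = 1; i <= NF; i++) print $i }')

printf 'fields:\n%s\n\nwords:\n%s\n' "$fields" "$words"
```

With this change the apostrophe in `it's` also becomes a separator, yielding `it` and `s`. The same thing happens in Method 1, which is why `s` shows up with a count of 2 in its output.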

Reference: https://www.cnblogs.com/Peter2014/p/7596128.html

Reference: http://bbs.chinaunix.net/thread-4102008-1-1.html
