Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf’s Law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf’s law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
b. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf’s Law in the light of this?

 from nltk.corpus import gutenberg as gb

 def validate_zipf(text,ranklimit):
fdist=nltk.FreqDist([w for w in text if w.isalpha()])
x=range(ranklimit)
freq=[]
for key in fdist.keys():
freq.append(fdist[key])
y=sorted(freq,reverse=True)[:ranklimit]
pylab.plot(x,y) def test():
text=gb.words(fileids=['shakespeare-hamlet.txt'])
validate_zipf(text,150)

运行的结果为:

Zipf’s Law的更多相关文章

  1. 齐夫定律, Zipf's law,Zipfian distribution

    齐夫定律(英语:Zipf's law,IPA英语发音:/ˈzɪf/)是由哈佛大学的语言学家乔治·金斯利·齐夫(George Kingsley Zipf)于1949年发表的实验定律. 它可以表述为: 在 ...

  2. Zipf's law

    w https://www.bing.com/knows/search?q=马太效应&mkt=zh-cn&FORM=BKACAI 马太效应(Matthew Effect),指强者愈强. ...

  3. Zipf定律

    http://www.360doc.com/content/10/0811/00/84590_45147637.shtml 英美在互联网具有绝对霸权 Zipf定律是美国学者G.K.齐普夫提出的.可以表 ...

  4. 幂次法则power law

    幂次法则分布和高斯分布是两种广泛存在的数学分布.可以预测和统计相关数据. pig中用其处理数据倾斜,实现负载均衡. 个体的规模和其名次之间存在着幂次方的反比关系,R(x)=ax(-b次方) 其中,x为 ...

  5. 齐普夫-Zipf定律

    python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...

  6. 【机器学习Machine Learning】资料大全

    昨天总结了深度学习的资料,今天把机器学习的资料也总结一下(友情提示:有些网站需要"科学上网"^_^) 推荐几本好书: 1.Pattern Recognition and Machi ...

  7. [IR] Information Extraction

    阶段性总结 Boolean retrieval 单词搜索 [Qword1 and Qword2]               O(x+y) [Qword1 and Qword2]- 改进: Gallo ...

  8. [IR] Compression

    关系:Vocabulary vs. collection size Heaps’ law: M = kTbM is the size of the vocabulary, T is the numbe ...

  9. TF/IDF(term frequency/inverse document frequency)

    TF/IDF(term frequency/inverse document frequency) 的概念被公认为信息检索中最重要的发明. 一. TF/IDF描述单个term与特定document的相 ...

随机推荐

  1. Ant 参考

    http://ant.apache.org/manual/Tasks/exec.html Exec Description Executes a system command. When the os ...

  2. FileSystemXmlApplicationContext方法的绝对路径问题

    public AgentServer(Socket c,String confDir) { this.client = c; ApplicationContext ac = new FileSyste ...

  3. LightOJ 1336 Sigma Function 算数基本定理

    题目大意:f(n)为n的因子和,给出 n 求 1~n 中f(n)为偶数的个数. 题目思路:算数基本定理: n=p1^e1*p2^e1 …… pn^en (p为素数): f(n)=(1+p1+p1^2+ ...

  4. Provisioning profile 浅析

    转载自:    http://blog.csdn.net/chenyufeng1991/article/details/48976245 一般在我们代码编写中不会用到Provisioning prof ...

  5. java socket网络编程(多线程技术)

    Client.java import java.io.*; import java.net.*; import java.util.*; public class Client { public st ...

  6. JavaScript(四)---- 函数

    函数主要用来封装具体的功能代码. 函数是由这样的方式进行声明的:关键字 function.函数名.一组参数,以及置于括号中的待执行代码. 格式: function 函数名(形参列表){         ...

  7. 关于mysql 删除数据后物理空间未释放(转载)

    转自 关于mysql 删除数据后物理空间未释放(转载) - NETDATA - 博客园http://www.cnblogs.com/shawnloong/archive/2013/02/07/2908 ...

  8. 고서--做完A之后做B, B受A影响

    1. 합격 소식을 듣고서 매우 기뻤어요.. 2. 친구하고 심하게 다투고서 마음이 안 좋았어요. 3. 급한 일을 먼저 끝내고서 이야기합시다.' 4. 창문을 열고서 상쾌한 공기를 마서 ...

  9. CentOS6.5 安装snort

    本机CentOS6.5最大化安装,以下安装所需组件也是最大化安装之后仍需自己安装的. 1.安装libpcap与libpcap-devel yum -y install libpcap* 2.安装lib ...

  10. 转:WebDriver(Selenium2)模拟鼠标经过事件

    在自动化测试过程中,由于javascript的使用,我们常常需要点击一些鼠标经过显示的菜单等元素,这时需要触发该元素的鼠标经过事件.使用WebDriver有以下两种实现. 1.使用Action pub ...