-- This post is a blog entry written for study and analysis --

1. Preparing and preprocessing the data

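First you need a reasonably large Chinese corpus. Chinese Wikipedia is one option (the Sogou news corpus is another). The Chinese Wikipedia dump can be downloaded from
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

The Chinese Wikipedia data is not very large: the compressed XML file is about 1 GB. Start by processing this compressed XML file with process_wiki_data.py: python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text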

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # process_wiki_data.py: parse the Wikipedia XML dump and convert it to plain
    # text, one article per line. Written against the gensim release current in
    # 2016; gensim 4.0 and later no longer accept the lemmatize argument.
    import logging
    import os.path
    import sys

    from gensim.corpora import WikiCorpus

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 3:
            # the script has no module docstring, so print an explicit usage line
            print("Usage: python %s <wiki-dump.xml.bz2> <output-text-file>" % program)
            sys.exit(1)
        inp, outp = sys.argv[1:3]
        space = " "
        i = 0

        output = open(outp, 'w')
        # dictionary={} skips building a vocabulary, which is not needed here
        wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
        for text in wiki.get_texts():
            # each article becomes one line of space-separated tokens
            output.write(space.join(text) + "\n")
            i = i + 1
            if i % 10000 == 0:
                logger.info("Saved " + str(i) + " articles")

        output.close()
        logger.info("Finished Saved " + str(i) + " articles")

This produces the following log output:

 
    2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    2016-08-11 20:40:08,329: INFO: Saved 10000 articles
    2016-08-11 20:40:45,501: INFO: Saved 20000 articles
    2016-08-11 20:41:23,659: INFO: Saved 30000 articles
    2016-08-11 20:42:01,748: INFO: Saved 40000 articles
    2016-08-11 20:42:33,779: INFO: Saved 50000 articles
    ......
    2016-08-11 20:55:23,094: INFO: Saved 200000 articles
    2016-08-11 20:56:14,692: INFO: Saved 210000 articles
    2016-08-11 20:57:04,614: INFO: Saved 220000 articles
    2016-08-11 20:57:57,979: INFO: Saved 230000 articles
    2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
    2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles

In Python, jieba can be used for word segmentation, producing the segmented file wiki.zh.text.seg.
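
The post does not show the segmentation script itself, so the following is only a minimal sketch of what that step might look like. The function names segment_lines and segment_file are hypothetical, and the only jieba call assumed is jieba.cut, which yields the tokens of a string. The tokenizer is passed in as a parameter so the line-handling logic can be tried without jieba installed.

```python
import io


def segment_lines(lines, cut):
    """Yield each input line as space-separated tokens produced by `cut`.

    `cut` is any callable that maps a string to an iterable of tokens,
    for example jieba.cut.
    """
    for line in lines:
        # drop whitespace-only tokens so the output has single spaces only
        tokens = [tok for tok in cut(line.strip()) if tok.strip()]
        yield u" ".join(tokens)


def segment_file(src_path, dst_path, cut=None):
    """Segment src_path line by line and write the result to dst_path."""
    if cut is None:
        import jieba  # third-party package (pip install jieba)
        cut = jieba.cut
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for segmented in segment_lines(src, cut):
            dst.write(segmented + u"\n")


# Hypothetical usage, matching the file names used in this post:
# segment_file("wiki.zh.text", "wiki.zh.text.seg")
```

Each article stays on its own line, which is the layout that LineSentence in the training script below expects.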
Next, train the model with word2vec:
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # train_word2vec_model.py: train the word2vec model on the segmented corpus.
    # Written against the gensim release current in 2016; gensim 4.0 and later
    # rename size to vector_size and move save_word2vec_format to model.wv.
    import logging
    import multiprocessing
    import os.path
    import sys

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 4:
            # the script has no module docstring, so print an explicit usage line
            print("Usage: python %s <segmented-corpus> <model-output> <vector-output>" % program)
            sys.exit(1)
        inp, outp1, outp2 = sys.argv[1:4]

        # LineSentence streams the corpus: one sentence per line, tokens split on spaces
        model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                         workers=multiprocessing.cpu_count())

        # trim unneeded model memory = use (much) less RAM
        # model.init_sims(replace=True)

        model.save(outp1)                                # full gensim model
        model.save_word2vec_format(outp2, binary=False)  # plain-text vectors

Training log:

 
    2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
    2016-08-12 09:50:02,592: INFO: collecting all words and their counts
    2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
    2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
    2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
    2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
    ...
    2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
    2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
    2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
    2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
    2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5
    2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words
    2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25
    2016-08-12 09:52:29,683: INFO: resetting layer weights
    2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
    2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
    2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
    2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
    2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
    2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
    2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
    2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
    ......
    2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
    2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
    2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
    2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
    2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
    2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
    2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm
    2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
    2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
    2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
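
The last log line writes the vectors in the plain word2vec text format: a header line with the vocabulary size and the dimensionality (here 278291 and 400), followed by one line per word holding the word and its components separated by spaces. As a rough illustration, the sketch below reads that format with the standard library only. parse_text_vectors and cosine are hypothetical helper names, not part of gensim.

```python
import math


def parse_text_vectors(lines):
    """Parse word2vec text-format lines into a {word: [float, ...]} dict."""
    it = iter(lines)
    header = next(it).split()
    vocab_size, dim = int(header[0]), int(header[1])  # vocab_size is informational
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical usage with the file produced above (needs enough RAM for
# 278291 x 400 Python floats):
# import io
# with io.open("wiki.zh.text.vector", encoding="utf-8") as f:
#     vecs = parse_text_vectors(f)
```

For real work, loading wiki.zh.text.model through gensim as in the session below is far more memory-efficient; the sketch is only meant to show what the text file contains.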

Testing the model:

 
    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
    In [3]: model.most_similar(u"足球")
    Out[3]:
    [(u'\u8054\u8d5b', 0.6553816199302673),
     (u'\u7532\u7ea7', 0.6530429720878601),
     (u'\u7bee\u7403', 0.5967546701431274),
     (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
     (u'\u4e59\u7ea7', 0.5840631723403931),
     (u'\u8db3\u7403\u961f', 0.5560152530670166),
     (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
     (u'allsvenskan', 0.5249762535095215),
     (u'\u4ee3\u8868\u961f', 0.5214947462081909),
     (u'\u7532\u7ec4', 0.5177896022796631)]
    In [4]: result = model.most_similar(u"足球")
    In [5]: for e in result:
       ....:     print e[0], e[1]
       ....:
    联赛 0.65538161993
    甲级 0.653042972088
    篮球 0.596754670143
    俱乐部 0.587228953838
    乙级 0.58406317234
    足球队 0.556015253067
    亚足联 0.530800580978
    allsvenskan 0.52497625351
    代表队 0.521494746208
    甲组 0.51778960228
    In [6]: result = model.most_similar(u"男人")
    In [7]: for e in result:
       ....:     print e[0], e[1]
       ....:
    女人 0.77537125349
    家伙 0.617369174957
    妈妈 0.567102909088
    漂亮 0.560832381248
    잘했어 0.540875017643
    谎言 0.538448691368
    爸爸 0.53660941124
    傻瓜 0.535608053207
    예쁘다 0.535151124001
    mc刘 0.529670000076
    In [8]: result = model.most_similar(u"女人")
    In [9]: for e in result:
       ....:     print e[0], e[1]
       ....:
    男人 0.77537125349
    我的某 0.589010596275
    妈妈 0.576344847679
    잘했어 0.562340974808
    美丽 0.555426716805
    爸爸 0.543958246708
    新娘 0.543640494347
    谎言 0.540272831917
    妞儿 0.531066179276
    老婆 0.528521537781
    In [10]: result = model.most_similar(u"青蛙")
    In [11]: for e in result:
       ....:     print e[0], e[1]
       ....:
    老鼠 0.559612870216
    乌龟 0.489831030369
    蜥蜴 0.478990525007
    猫 0.46728849411
    鳄鱼 0.461885392666
    蟾蜍 0.448014199734
    猴子 0.436584025621
    白雪公主 0.434905380011
    蚯蚓 0.433413207531
    螃蟹 0.4314712286
    In [12]: result = model.most_similar(u"姨夫")
    In [13]: for e in result:
       ....:     print e[0], e[1]
       ....:
    堂伯 0.583935439587
    祖父 0.574735701084
    妃所生 0.569327116013
    内弟 0.562012672424
    早卒 0.558042645454
    曕 0.553856015205
    胤祯 0.553288519382
    陈潜 0.550716996193
    愔之 0.550510883331
    叔父 0.550032019615
    In [14]: result = model.most_similar(u"衣服")
    In [15]: for e in result:
       ....:     print e[0], e[1]
       ....:
    鞋子 0.686688780785
    穿着 0.672499775887
    衣物 0.67173999548
    大衣 0.667605519295
    裤子 0.662670075893
    内裤 0.662210345268
    裙子 0.659705817699
    西装 0.648508131504
    洋装 0.647238850594
    围裙 0.642895817757
    In [16]: result = model.most_similar(u"公安局")
    In [17]: for e in result:
       ....:     print e[0], e[1]
       ....:
    司法局 0.730189085007
    公安厅 0.634275555611
    公安 0.612798035145
    房管局 0.597343325615
    商业局 0.597183346748
    军管会 0.59476184845
    体育局 0.59283208847
    财政局 0.588721752167
    戒毒所 0.575558543205
    新闻办 0.573395550251
    In [18]: result = model.most_similar(u"铁道部")
    In [19]: for e in result:
       ....:     print e[0], e[1]
       ....:
    盛光祖 0.565509021282
    交通部 0.548688530922
    批复 0.546967327595
    刘志军 0.541010737419
    立项 0.517836689949
    报送 0.510296344757
    计委 0.508456230164
    水利部 0.503531932831
    国务院 0.503227233887
    经贸委 0.50156635046
    In [20]: result = model.most_similar(u"清华大学")
    In [21]: for e in result:
       ....:     print e[0], e[1]
       ....:
    北京大学 0.763922810555
    化学系 0.724210739136
    物理系 0.694550514221
    数学系 0.684280991554
    中山大学 0.677202701569
    复旦 0.657914161682
    师范大学 0.656435549259
    哲学系 0.654701948166
    生物系 0.654403865337
    中文系 0.653147578239
    In [22]: result = model.most_similar(u"卫视")
    In [23]: for e in result:
       ....:     print e[0], e[1]
       ....:
    湖南 0.676812887192
    中文台 0.626506924629
    収蔵 0.621356606483
    黄金档 0.582251906395
    cctv 0.536769032478
    安徽 0.536752820015
    非同凡响 0.534517168999
    唱响 0.533438682556
    最强音 0.532605051994
    金鹰 0.531676828861
    In [24]: result = model.most_similar(u"习1近平")  # the blog platform rejects posts that contain the names of national leaders, so a digit has been inserted into each name here
    In [25]: for e in result:
       ....:     print e[0], e[1]
       ....:
    胡2锦涛 0.809472680092
    江3泽民 0.754633367062
    李4克强 0.739740967751
    贾5庆林 0.737033963203
    曾6庆红 0.732847094536
    吴7邦国 0.726941585541
    总书记 0.719057679176
    李8瑞环 0.716384887695
    温9家宝 0.711952567101
    王10岐山 0.703570842743
    In [26]: result = model.most_similar(u"林丹")
    In [27]: for e in result:
       ....:     print e[0], e[1]
       ....:
    黄综翰 0.538035452366
    蒋燕皎 0.52646958828
    刘鑫 0.522252976894
    韩晶娜 0.516120731831
    王晓理 0.512289524078
    王适 0.508560419083
    杨影 0.508159279823
    陈跃 0.507353425026
    龚智超 0.503159761429
    李敬元 0.50262516737
    In [28]: result = model.most_similar(u"语言学")
    In [29]: for e in result:
       ....:     print e[0], e[1]
       ....:
    社会学 0.632598280907
    人类学 0.623406708241
    历史学 0.618442356586
    比较文学 0.604823827744
    心理学 0.600066184998
    人文科学 0.577783346176
    社会心理学 0.575571238995
    政治学 0.574541330338
    地理学 0.573896467686
    哲学 0.573873817921
    In [30]: result = model.most_similar(u"计算机")
    In [31]: for e in result:
       ....:     print e[0], e[1]
       ....:
    自动化 0.674171924591
    应用 0.614087462425
    自动化系 0.611132860184
    材料科学 0.607891201973
    集成电路 0.600370049477
    技术 0.597518980503
    电子学 0.591316461563
    建模 0.577238917351
    工程学 0.572855889797
    微电子 0.570086717606
    In [32]: model.similarity(u"计算机", u"自动化")
    Out[32]: 0.67417196002404789
    In [33]: model.similarity(u"女人", u"男人")
    Out[33]: 0.77537125129824813
    In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    Out[34]: u'\u4e2d\u5fc3'
    In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    中心
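
To make the numbers in the session easier to interpret, here is a small pure-Python sketch of the idea behind the three calls used above, run on made-up two-dimensional vectors. It is not gensim's implementation: the toy vectors and the helper bodies are assumptions for illustration. The scores are cosine similarities, and the odd-one-out is taken to be the word least similar to the mean of the unit-normalised vectors, which is my understanding of how gensim approaches doesnt_match.

```python
import math


def unit(v):
    """Scale v to length 1 (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def similarity(a, b):
    """Cosine similarity: the dot product of the unit-normalised vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, word, topn=10):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, similarity(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the unit vectors."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return min(words, key=lambda w: similarity(vectors[w], mean))


# Made-up vectors: three meal words point one way, "center" another.
toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "center": [0.1, 0.9],
}

print(most_similar(toy, "breakfast", topn=2))
print(doesnt_match(toy, ["breakfast", "dinner", "lunch", "center"]))
```

On the toy data the odd word out is "center", mirroring the 早餐 / 晚餐 / 午餐 / 中心 example in the session.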

Source: https://www.zybuluo.com/hanxiaoyang/note/472184
