Word2Vec Experiments on the Chinese and English Wikipedia Corpora

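I recently tried out Word2Vec and GloVe, along with their Python counterparts, gensim word2vec and python-glove, and wanted to test them on a larger corpus. Wikipedia was the natural choice. Wikimedia provides an excellent official data source at https://dumps.wikimedia.org, where Wikipedia dumps can be downloaded in many languages and formats. I had previously used gensim on the English Wikipedia corpus to train LSI and LDA models for computing the similarity of two documents, so I wanted to see whether gensim offered a simple way to process Wikipedia data and train a word2vec model for measuring semantic similarity between words. Thanks to Google, I found a long discussion thread in the gensim Google group, "training word2vec on full Wikipedia", which essentially settles how to train a word2vec model on the Wikipedia corpus with gensim. Dr. Radim Řehůřek, the author of gensim, took part in the discussion and even added a small fix to a new gensim release. All that was left for me was to verify the approach. There is also a wiki2vec project on GitHub that does the same thing, but I prefer solving the problem with Python and gensim.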

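There is plenty of reference material on word2vec in both English and Chinese. In English, you can read the officially recommended papers or the articles written by gensim's author, Dr. Radim Řehůřek. In Chinese, I recommend @licstar's 《Deep Learning in NLP （一）词向量和语言模型》, the Youdao tech salon's 《Deep Learning实战之word2vec》, @飞林沙's 《word2vec的学习思路》, and falao_beiliu's 《深度学习word2vec笔记之基础篇》 and 《深度学习word2vec笔记之算法篇》.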

I. Word2Vec Test on the English Wikipedia

I started with the English Wikipedia data. I downloaded the latest compressed XML dump (on March 1, 2015), about 11 GB, from:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Processing has two stages. The first converts the XML wiki dump to plain text, using the following script (process_wiki.py):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Usage: python process_wiki.py <wiki-dump.xml.bz2> <output-text-file>"""

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        # print the usage string from the module docstring
        print(__doc__)
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # lemmatize=False skips the slow pattern-based lemmatization
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    # write each article as one line of space-separated tokens
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

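This uses WikiCorpus, gensim's class for processing Wikipedia, whose get_texts method converts each article into one line of text and strips punctuation and similar content. Note that in "wiki = WikiCorpus(inp, lemmatize=False, dictionary={})", lemmatize is set to False so that the pattern module is not used to lemmatize English words, whether or not pattern is installed on your machine, because using pattern slows this step down considerably.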
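To make the preprocessing concrete, here is a minimal sketch of the kind of normalization described above: lowercase the article, keep only alphabetic tokens, and skip articles shorter than 50 words (the threshold reported in the processing log). The names `normalize_article` and `articles_to_lines` are hypothetical, and the two-character minimum token length is an assumption inferred from the sample output, where single-letter words such as "a" are missing. This is not gensim's actual tokenizer, only an approximation of its visible effect.

```python
import re

# Assumption: a token is a run of letters; gensim's real tokenizer differs in details.
TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)
MIN_TOKEN_LEN = 2   # assumption inferred from the sample output
MIN_WORDS = 50      # the log mentions pruning articles shorter than 50 words


def normalize_article(raw_text):
    """Lowercase the text and return its alphabetic tokens."""
    tokens = TOKEN_RE.findall(raw_text.lower())
    return [token for token in tokens if len(token) >= MIN_TOKEN_LEN]


def articles_to_lines(articles, min_words=MIN_WORDS):
    """Yield one space-joined line per article that is long enough."""
    for raw_text in articles:
        tokens = normalize_article(raw_text)
        if len(tokens) >= min_words:
            yield " ".join(tokens)
```

Writing each yielded line to a file gives the one-article-per-line layout that the training step expects.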

Running "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text":

2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

2015-03-07 15:11:12,860: INFO: Saved 10000 articles

2015-03-07 15:13:25,369: INFO: Saved 20000 articles

2015-03-07 15:15:19,771: INFO: Saved 30000 articles

2015-03-07 15:16:58,424: INFO: Saved 40000 articles

2015-03-07 15:18:12,374: INFO: Saved 50000 articles

2015-03-07 15:19:03,213: INFO: Saved 60000 articles

2015-03-07 15:19:47,656: INFO: Saved 70000 articles

2015-03-07 15:20:29,135: INFO: Saved 80000 articles

2015-03-07 15:22:02,365: INFO: Saved 90000 articles

2015-03-07 15:23:40,141: INFO: Saved 100000 articles

.....

2015-03-07 19:33:16,549: INFO: Saved 3700000 articles

2015-03-07 19:33:49,493: INFO: Saved 3710000 articles

2015-03-07 19:34:23,442: INFO: Saved 3720000 articles

2015-03-07 19:34:57,984: INFO: Saved 3730000 articles

2015-03-07 19:35:31,976: INFO: Saved 3740000 articles

2015-03-07 19:36:05,790: INFO: Saved 3750000 articles

2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)

2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles

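On my Mac Pro (4 cores, 16 GB of RAM) this took about four and a half hours. After processing about 3.75 million articles, we get wiki.en.text, a 12 GB plain-text version of the English Wikipedia that looks like this: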

anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix…
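The running time quoted for the extraction step can be checked against the first and last timestamps in the log. The short sketch below parses the two timestamps with the standard library and derives the elapsed time and the average number of articles written per second; `parse_log_time` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamp layout used by the logging format in the script: "date time,milliseconds"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp):
    """Parse a timestamp such as '2015-03-07 15:08:39,181'."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


start = parse_log_time("2015-03-07 15:08:39,181")  # first log line
end = parse_log_time("2015-03-07 19:36:32,394")    # last log line
elapsed = end - start
articles = 3758076                                 # article count from the log

print("elapsed:", elapsed)
print("articles per second: %.1f" % (articles / elapsed.total_seconds()))
```

The elapsed time comes out just under four and a half hours, which matches the figure given in the text.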

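With this data, a word2vec model can be trained either with the original word2vec binary or with the Python word2vec implementation in gensim. We tried the former and found it slow, so we went with the gensim word2vec training script from the Google group thread, with one small change: it also saves the vectors in text format, which is handy for debugging. The script, train_word2vec_model.py, is as follows: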

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Usage: python train_word2vec_model.py <corpus-text-file> <gensim-model-file> <word2vec-vector-file>"""

import logging
import multiprocessing
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        # print the usage string from the module docstring
        print(__doc__)
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # one article per line; train 400-dimensional vectors on all CPU cores
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    # save in gensim's native format and in the original word2vec text format
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
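Two ideas in the training script are easy to illustrate without gensim: `LineSentence` streams the corpus one line (one article) at a time instead of loading it into memory, and `min_count=5` drops rare words before training. The sketch below mimics both with the standard library. `iter_sentences` and `build_vocab` are hypothetical names, and the real gensim implementations are more elaborate.

```python
import io
from collections import Counter


def iter_sentences(path):
    """Yield each non-empty line of a whitespace-tokenized corpus as a list of words."""
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            words = line.split()
            if words:
                yield words


def build_vocab(sentences, min_count=5):
    """Count words and keep only those seen at least min_count times."""
    counts = Counter()
    for words in sentences:
        counts.update(words)
    return {word: n for word, n in counts.items() if n >= min_count}
```

Because `iter_sentences` is a generator, something like `build_vocab(iter_sentences("wiki.en.text"))` would make a single pass over the file while holding only the counts in memory.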

Running "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":

2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector

2015-03-09 22:48:29,593: INFO: collecting all words and their counts

2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types

2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types

2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types

2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types

2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types

2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types

2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types

2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types

2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types

2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types

2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types

......

2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types

2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types

2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types

2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types

2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types

2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types

2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences

2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5

2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words

2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29

2015-03-09 23:14:09,790: INFO: resetting layer weights

2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0

2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s

2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s

2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s

2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s

2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s

2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s

2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s

2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s

2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s

2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s

2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s

2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s

2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s

2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s

2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s

2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s

2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s

2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s

2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s

2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s

2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s

2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s

2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s

2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s

2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s

2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s

.......

2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s

2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s

2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s

2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s

2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s

2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs

2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s

2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s

2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s

2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s

2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None

2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm

2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy

2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy

2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector
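Two figures in the training log can be sanity-checked with simple arithmetic: the reported throughput, and the learning rate `alpha` shrinking as training progresses. The sketch below assumes a linear decay from the starting value 0.025 (visible in the log) down to a floor of 0.0001; the floor and the exact decay formula are assumptions that happen to agree with the logged values, not something the post states.

```python
def words_per_second(total_words, seconds):
    """Average training throughput."""
    return total_words / float(seconds)


def linear_alpha(progress, start_alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction (0..1) of the words has been processed.

    Assumption: linear decay with a lower bound of min_alpha.
    """
    return max(min_alpha, start_alpha * (1.0 - progress))


# "training on 1994415728 words took 20916.4s"
print("%.0f words/s" % words_per_second(1994415728, 20916.4))
# the log shows alpha 0.00020 at 99.20% of the words
print("alpha at 99.20%%: %.5f" % linear_alpha(0.9920))
```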

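After roughly six and a half hours (22:48 to 05:11 by the log timestamps), we got a word2vec model in gensim's default format and a model in the vector format of the original C word2vec, wiki.en.text.vector, which looks like this: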

1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817…
…
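The text vector format shown above is simple enough to read by hand: a header line with the vocabulary size and the dimensionality, then one line per word with the word followed by its components. Here is a minimal reader sketch; `parse_word2vec_text` is a hypothetical name, and the sample uses 3 dimensions instead of 400, with illustrative numbers for the second word.

```python
def parse_word2vec_text(lines):
    """Parse 'vocab_size dim' followed by 'word v1 v2 ... v_dim' lines."""
    iterator = iter(lines)
    vocab_size, dim = (int(field) for field in next(iterator).split())
    vectors = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or truncated lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("expected %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Toy data: the real file has 1969354 words with 400 components each.
sample = [
    "2 3",
    "the 0.129255 0.015725 0.049174",
    "of -0.016438 -0.018912 0.032752",
]
vectors = parse_word2vec_text(sample)
print(sorted(vectors))
```

For the real 7 GB file, passing an open file object instead of a list keeps the reading streamed, although the resulting dictionary of Python lists would use far more memory than gensim's numpy arrays.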

In IPython, we load and test this model with gensim. The model is about 7 GB, so loading takes a while:

In [2]: import gensim

 

In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

 

In [4]: model.most_similar("queen")

Out[4]:

[(u'princess', 0.5760838389396667),

 (u'hyoui', 0.5671186447143555),

 (u'janggyung', 0.5598698854446411),

 (u'king', 0.5556215047836304),

 (u'dollallolla', 0.5540223121643066),

 (u'loranella', 0.5522741079330444),

 (u'ramphaiphanni', 0.5310937166213989),

 (u'jeheon', 0.5298476219177246),

 (u'soheon', 0.5243583917617798),

 (u'coronation', 0.5217245221138)]

 

In [5]: model.most_similar("man")

Out[5]:

[(u'woman', 0.7120707035064697),

 (u'girl', 0.58659827709198),

 (u'handsome', 0.5637181997299194),

 (u'boy', 0.5425317287445068),

 (u'villager', 0.5084836483001709),

 (u'mustachioed', 0.49287813901901245),

 (u'mcgucket', 0.48355430364608765),

 (u'spider', 0.4804879426956177),

 (u'policeman', 0.4780033826828003),

 (u'stranger', 0.4750771224498749)]

 

In [6]: model.most_similar("woman")

Out[6]:

[(u'man', 0.7120705842971802),

 (u'girl', 0.6736541986465454),

 (u'prostitute', 0.5765659809112549),

 (u'divorcee', 0.5429972410202026),

 (u'person', 0.5276163816452026),

 (u'schoolgirl', 0.5102938413619995),

 (u'housewife', 0.48748138546943665),

 (u'lover', 0.4858251214027405),

 (u'handsome', 0.4773051142692566),

 (u'boy', 0.47445783019065857)]

 

In [8]: model.similarity("woman", "man")

Out[8]: 0.71207063453821218

 

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())

Out[10]: 'cereal'

 

In [11]: model.similarity("woman", "girl")

Out[11]: 0.67365416785207421

 

In [13]: model.most_similar("frog")

Out[13]:

[(u'toad', 0.6868536472320557),

 (u'barycragus', 0.6607867479324341),

 (u'grylio', 0.626731276512146),

 (u'heckscheri', 0.6208407878875732),

 (u'clamitans', 0.6150864362716675),

 (u'coplandi', 0.612680196762085),

 (u'pseudacris', 0.6108512878417969),

 (u'litoria', 0.6084023714065552),

 (u'raniformis', 0.6044802665710449),

 (u'watjulumensis', 0.6043726205825806)]
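The scores in the session above are cosine similarities between word vectors, which is why similarity("woman", "man") matches the score that most_similar reports for the same pair. A minimal pure-Python sketch of both operations follows; the two-dimensional vectors are made up for illustration, and `cosine` and `most_similar` here are stand-ins, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [0.8, 0.6],
    "frog": [0.0, 1.0],
}
print(most_similar(toy_vectors, "man", topn=2))
```

A brute-force scan like this is linear in the vocabulary size; gensim does the same comparison as one matrix-vector product over pre-normalized vectors.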

Everything works. But loading the model saved in gensim's default numpy-based format ran into a problem:

In [1]: import gensim

 

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

 

In [3]: model.most_similar("man")

... RuntimeWarning: invalid value encountered in divide

  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)

 

Out[3]:

[(u'ahsns', nan),

 (u'ny\xedl', nan),

 (u'indradeo', nan),

 (u'jaimovich', nan),

 (u'addlepate', nan),

 (u'jagello', nan),

 (u'festenburg', nan),

 (u'picatic', nan),

 (u'tolosanum', nan),

 (u'mithoo', nan)]

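This is why I modified the script above. When trained on smaller data, for example the first 100,000 lines of text, it works fine with both the original format and the gensim format. After a full run over the English Wikipedia, however, this problem appears. I tried several fixes without success; suggestions and solutions are welcome.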
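The warning in the session comes from the step that divides each row of the weight matrix by its Euclidean norm. A row that is all zeros, or that already contains non-finite values, would yield NaN at that point. Whether that is what happened here is not established; the sketch below is only a diagnostic one might try, written against plain Python lists, with `find_bad_rows` as a hypothetical helper.

```python
import math


def find_bad_rows(matrix):
    """Return indices of rows whose norm is zero or not finite."""
    bad = []
    for index, row in enumerate(matrix):
        norm = math.sqrt(sum(value * value for value in row))
        if norm == 0.0 or math.isnan(norm) or math.isinf(norm):
            bad.append(index)
    return bad


toy_matrix = [
    [0.1, 0.2],           # fine
    [0.0, 0.0],           # zero norm: normalization would compute 0/0
    [float("nan"), 1.0],  # already contains a non-finite value
]
print(find_bad_rows(toy_matrix))
```

If such a check came back empty on the loaded weights, the cause would have to be sought elsewhere, for example in how the large arrays were stored and reloaded.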

II. Word2Vec Test on the Chinese Wikipedia

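After the English Wikipedia, the Chinese Wikipedia was the natural next step. The process is similar and also has two stages, but the Chinese data needs some extra handling: Traditional-to-Simplified conversion, word segmentation, and removal of non-UTF-8 characters. The Chinese dump can be downloaded from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.

The Chinese Wikipedia dump is fairly small: the compressed XML file is only about 1 GB, far smaller than the English one. First process it with process_wiki.py by running: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text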

2015-03-11 17:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

2015-03-11 17:40:08,329: INFO: Saved 10000 articles

2015-03-11 17:40:45,501: INFO: Saved 20000 articles

2015-03-11 17:41:23,659: INFO: Saved 30000 articles

2015-03-11 17:42:01,748: INFO: Saved 40000 articles

2015-03-11 17:42:33,779: INFO: Saved 50000 articles

......

2015-03-11 17:55:23,094: INFO: Saved 200000 articles

2015-03-11 17:56:14,692: INFO: Saved 210000 articles

2015-03-11 17:57:04,614: INFO: Saved 220000 articles

2015-03-11 17:57:57,979: INFO: Saved 230000 articles

2015-03-11 17:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)

2015-03-11 17:58:16,622: INFO: Finished Saved 232894 articles

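This yields wiki.zh.text, a plain-text corpus of a little over 230,000 Chinese articles, roughly 750 MB. On inspection, besides some English words mixed in, it also contains many Traditional Chinese characters. Following the method in @licstar's 《维基百科简体中文语料的获取》, I installed opencc and converted the Traditional characters in wiki.zh.text to Simplified: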

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
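To illustrate what the conversion step does at the character level, here is a toy sketch using a hand-written five-character mapping. It is not how OpenCC works internally: a real converter relies on far larger tables and phrase-level rules, and `TRAD_TO_SIMP` and `to_simplified` are hypothetical names used only for this example.

```python
# Toy mapping only: real Traditional-to-Simplified conversion needs
# thousands of entries and context-dependent phrase rules.
TRAD_TO_SIMP = str.maketrans({
    "維": "维",
    "語": "语",
    "學": "学",
    "計": "计",
    "機": "机",
})


def to_simplified(text):
    """Replace each mapped Traditional character with its Simplified form."""
    return text.translate(TRAD_TO_SIMP)


print(to_simplified("維基百科"))
print(to_simplified("計算機語言學"))
```

Characters that are identical in both scripts, such as 百 and 科, pass through unchanged.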

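Next comes word segmentation. This time I used a Chinese word segmenter trained with MeCab. It is not yet production-ready, but its speed and segmentation quality are good enough for this experiment: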

mecab -d ../data/ -O wakati wiki.zh.text.jian -o wiki.zh.text.jian.seg -b 10000000

Note that the data directory here holds the segmentation model and dictionary files trained for mecab; see 《用MeCab打造一套实用的中文分词系统》 for details.
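For readers unfamiliar with word segmentation, the sketch below shows the input and output shape of this step: unsegmented Chinese text goes in, space-separated words come out. It uses greedy forward maximum matching against a tiny hand-made dictionary, which is a much simpler technique than the statistical model MeCab applies; `segment` and `toy_dictionary` are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that matches;
    characters not covered by the dictionary become single-character tokens.
    """
    tokens = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for length in range(longest, 0, -1):
            candidate = text[position:position + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                position += length
                break
    return tokens


toy_dictionary = {"清华大学", "北京", "位于"}
print(" ".join(segment("清华大学位于北京", toy_dictionary)))
```

The space-joined output has the same layout as the file produced by the mecab command, one segmented article per line, which is what the training script reads.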

With the segmented Chinese Wikipedia data in hand, I expected to be able to train the word2vec model right away:

python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector

But there was still a problem. The error was:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5394-5395: invalid continuation byte

A quick search suggested that the file contained non-UTF-8 characters, so I cleaned it with iconv:

iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
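A roughly equivalent cleanup can be done in Python by decoding with `errors="ignore"`, which silently drops byte sequences that are not valid UTF-8. This is a sketch of the same idea, not a byte-for-byte reimplementation of iconv; `drop_invalid_utf8` and `clean_file` are hypothetical names.

```python
import io


def drop_invalid_utf8(raw_bytes):
    """Decode bytes as UTF-8, discarding undecodable byte sequences."""
    return raw_bytes.decode("utf-8", errors="ignore")


def clean_file(src_path, dst_path):
    """Copy a file line by line, dropping bytes that are not valid UTF-8."""
    with io.open(src_path, "rb") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for raw_line in src:
            dst.write(drop_invalid_utf8(raw_line))


# A truncated multi-byte sequence (\xe4\xb8) in the middle of valid text.
print(drop_invalid_utf8(b"abc\xe4\xb8 def"))
```

Reading in binary mode and splitting on newlines is safe here because the newline byte never occurs inside a multi-byte UTF-8 sequence.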

That more or less solved it. Running:

python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector

2015-03-11 18:50:02,586: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector

2015-03-11 18:50:02,592: INFO: collecting all words and their counts

2015-03-11 18:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types

2015-03-11 18:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types

2015-03-11 18:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types

2015-03-11 18:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types

...

2015-03-11 18:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types

2015-03-11 18:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types

2015-03-11 18:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types

2015-03-11 18:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences

2015-03-11 18:52:13,672: INFO: total 278291 word types after removing those with count<5

2015-03-11 18:52:13,673: INFO: constructing a huffman tree from 278291 words

2015-03-11 18:52:29,323: INFO: built huffman tree with maximum node depth 25

2015-03-11 18:52:29,683: INFO: resetting layer weights

2015-03-11 18:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0

2015-03-11 18:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s

2015-03-11 18:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s

2015-03-11 18:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s

2015-03-11 18:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s

2015-03-11 18:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s

2015-03-11 18:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s

2015-03-11 18:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s

......

2015-03-11 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s

2015-03-11 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s

2015-03-11 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s

2015-03-11 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s

2015-03-11 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s

2015-03-11 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None

2015-03-11 19:22:13,884: INFO: not storing attribute syn0norm

2015-03-11 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy

2015-03-11 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy

2015-03-11 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
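The training log mentions constructing a Huffman tree and reports its maximum node depth. With hierarchical softmax, each word is a leaf of a Huffman tree built from word counts, so frequent words get short paths and rare words get long ones. The sketch below computes the depth of each word for a toy vocabulary; `huffman_depths` is a hypothetical helper and the counts are made up.

```python
import heapq
import itertools


def huffman_depths(counts):
    """Map each word to its depth (code length) in a Huffman tree."""
    tiebreak = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(count, next(tiebreak), [word]) for word, count in counts.items()]
    heapq.heapify(heap)
    depths = {word: 0 for word in counts}
    while len(heap) > 1:
        # merge the two least frequent subtrees
        count_a, _, words_a = heapq.heappop(heap)
        count_b, _, words_b = heapq.heappop(heap)
        merged = words_a + words_b
        for word in merged:
            depths[word] += 1  # every leaf under the new node moves one level down
        heapq.heappush(heap, (count_a + count_b, next(tiebreak), merged))
    return depths


toy_counts = {"the": 5, "of": 2, "frog": 1, "toad": 1}
depths = huffman_depths(toy_counts)
print(depths, "max depth:", max(depths.values()))
```

Tracking the word lists inside each heap entry keeps the sketch short but is quadratic in the worst case; a real implementation stores parent pointers instead.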

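Let's see how the word2vec model trained on the Chinese Wikipedia performs. The text-format vectors are in "wiki.zh.text.vector"; the session below loads the gensim-format model "wiki.zh.text.model":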

In [1]: import gensim

 

In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")

 

In [3]: model.most_similar(u"足球")

Out[3]:

[(u'\u8054\u8d5b', 0.6553816199302673),

 (u'\u7532\u7ea7', 0.6530429720878601),

 (u'\u7bee\u7403', 0.5967546701431274),

 (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),

 (u'\u4e59\u7ea7', 0.5840631723403931),

 (u'\u8db3\u7403\u961f', 0.5560152530670166),

 (u'\u4e9a\u8db3\u8054', 0.5308005809783936),

 (u'allsvenskan', 0.5249762535095215),

 (u'\u4ee3\u8868\u961f', 0.5214947462081909),

 (u'\u7532\u7ec4', 0.5177896022796631)]

 

In [4]: result = model.most_similar(u"足球")

 

In [5]: for e in result:

    print e[0], e[1]

   ....:

联赛 0.65538161993

甲级 0.653042972088

篮球 0.596754670143

俱乐部 0.587228953838

乙级 0.58406317234

足球队 0.556015253067

亚足联 0.530800580978

allsvenskan 0.52497625351

代表队 0.521494746208

甲组 0.51778960228

 

In [6]: result = model.most_similar(u"男人")

 

In [7]: for e in result:

    print e[0], e[1]

   ....:

女人 0.77537125349

家伙 0.617369174957

妈妈 0.567102909088

漂亮 0.560832381248

잘했어 0.540875017643

谎言 0.538448691368

爸爸 0.53660941124

傻瓜 0.535608053207

예쁘다 0.535151124001

mc刘 0.529670000076

 

In [8]: result = model.most_similar(u"女人")

 

In [9]: for e in result:

    print e[0], e[1]

   ....:

男人 0.77537125349

我的某 0.589010596275

妈妈 0.576344847679

잘했어 0.562340974808

美丽 0.555426716805

爸爸 0.543958246708

新娘 0.543640494347

谎言 0.540272831917

妞儿 0.531066179276

老婆 0.528521537781

 

In [10]: result = model.most_similar(u"青蛙")

 

In [11]: for e in result:

    print e[0], e[1]

   ....:

老鼠 0.559612870216

乌龟 0.489831030369

蜥蜴 0.478990525007

猫 0.46728849411

鳄鱼 0.461885392666

蟾蜍 0.448014199734

猴子 0.436584025621

白雪公主 0.434905380011

蚯蚓 0.433413207531

螃蟹 0.4314712286

 

In [12]: result = model.most_similar(u"姨夫")

 

In [13]: for e in result:

    print e[0], e[1]

   ....:

堂伯 0.583935439587

祖父 0.574735701084

妃所生 0.569327116013

内弟 0.562012672424

早卒 0.558042645454

曕 0.553856015205

胤祯 0.553288519382

陈潜 0.550716996193

愔之 0.550510883331

叔父 0.550032019615

 

In [14]: result = model.most_similar(u"衣服")

 

In [15]: for e in result:

    print e[0], e[1]

   ....:

鞋子 0.686688780785

穿着 0.672499775887

衣物 0.67173999548

大衣 0.667605519295

裤子 0.662670075893

内裤 0.662210345268

裙子 0.659705817699

西装 0.648508131504

洋装 0.647238850594

围裙 0.642895817757

 

In [16]: result = model.most_similar(u"公安局")

 

In [17]: for e in result:

    print e[0], e[1]

   ....:

司法局 0.730189085007

公安厅 0.634275555611

公安 0.612798035145

房管局 0.597343325615

商业局 0.597183346748

军管会 0.59476184845

体育局 0.59283208847

财政局 0.588721752167

戒毒所 0.575558543205

新闻办 0.573395550251

 

In [18]: result = model.most_similar(u"铁道部")

 

In [19]: for e in result:

    print e[0], e[1]

   ....:

盛光祖 0.565509021282

交通部 0.548688530922

批复 0.546967327595

刘志军 0.541010737419

立项 0.517836689949

报送 0.510296344757

计委 0.508456230164

水利部 0.503531932831

国务院 0.503227233887

经贸委 0.50156635046

 

In [20]: result = model.most_similar(u"清华大学")

 

In [21]: for e in result:

    print e[0], e[1]

   ....:

北京大学 0.763922810555

化学系 0.724210739136

物理系 0.694550514221

数学系 0.684280991554

中山大学 0.677202701569

复旦 0.657914161682

师范大学 0.656435549259

哲学系 0.654701948166

生物系 0.654403865337

中文系 0.653147578239

 

In [22]: result = model.most_similar(u"卫视")

 

In [23]: for e in result:

    print e[0], e[1]

   ....:

湖南 0.676812887192

中文台 0.626506924629

収蔵 0.621356606483

黄金档 0.582251906395

cctv 0.536769032478

安徽 0.536752820015

非同凡响 0.534517168999

唱响 0.533438682556

最强音 0.532605051994

金鹰 0.531676828861

 

 

In [26]: result = model.most_similar(u"林丹")

 

In [27]: for e in result:

    print e[0], e[1]

   ....:

黄综翰 0.538035452366

蒋燕皎 0.52646958828

刘鑫 0.522252976894

韩晶娜 0.516120731831

王晓理 0.512289524078

王适 0.508560419083

杨影 0.508159279823

陈跃 0.507353425026

龚智超 0.503159761429

李敬元 0.50262516737

 

In [28]: result = model.most_similar(u"语言学")

 

In [29]: for e in result:

    print e[0], e[1]

   ....:

社会学 0.632598280907

人类学 0.623406708241

历史学 0.618442356586

比较文学 0.604823827744

心理学 0.600066184998

人文科学 0.577783346176

社会心理学 0.575571238995

政治学 0.574541330338

地理学 0.573896467686

哲学 0.573873817921

 

In [30]: result = model.most_similar(u"计算机")

 

In [31]: for e in result:

    print e[0], e[1]

   ....:

自动化 0.674171924591

应用 0.614087462425

自动化系 0.611132860184

材料科学 0.607891201973

集成电路 0.600370049477

技术 0.597518980503

电子学 0.591316461563

建模 0.577238917351

工程学 0.572855889797

微电子 0.570086717606

 

In [32]: model.similarity(u"计算机", u"自动化")

Out[32]: 0.67417196002404789

 

In [33]: model.similarity(u"女人", u"男人")

Out[33]: 0.77537125129824813

 

In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())

Out[34]: u'\u4e2d\u5fc3'

 

In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())

中心
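The doesnt_match calls above pick the word that fits the group least. One common way to do this, and as far as I know the idea behind gensim's method, is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below follows that idea on hand-made two-dimensional vectors; the function names mirror gensim's only for readability and the numbers are illustrative.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(vec[i] for vec in units.values()) / len(units)
                 for i in range(dim)])
    return min(words,
               key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


# Toy vectors, chosen by hand so that three words point the same way.
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.1, 0.9],
}
print(doesnt_match(toy_vectors, "breakfast cereal dinner lunch".split()))
```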

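There are good cases and bad ones, and the bad cases may even outnumber the good. This depends on the size of the corpus, the quality of the word segmenter, and so on, but this is where the experiment ends for now. As for what word2vec is good for: beyond computing word similarity, the industry cares more about how word2vec performs in concrete application tasks. That is the more interesting part, and everyone is welcome to join the discussion.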

Source: "我爱自然语言处理" (I Love Natural Language Processing): www.52nlp.cn

Permalink: http://www.52nlp.cn/中英文维基百科语料上的word2v
