A First Look at word2vec

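In the introduction to natural language processing we covered word vectors and tf-idf, and we used tf-idf in a real project that predicts whether a movie review is positive or negative, with fairly good results.
In this post we try out word2vec, the well-known model from Google.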

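gensim is a widely used Python library for natural language processing. It includes an implementation of word2vec ported from the original C version.

Installing gensim is simple: pip install gensim is all it takes.

Let's get straight to the point and look at the word2vec API. The official documentation is well worth a careful read. The signature below is from gensim 3.x; gensim 4.0 renamed size to vector_size and iter to epochs.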

class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)

Parameters:

sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them).
size (int, optional) – Dimensionality of the word vectors.
window (int, optional) – Maximum distance between the current and predicted word within a sentence.
min_count (int, optional) – Ignores all words with total frequency lower than this.
workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
alpha (float, optional) – The initial learning rate.
min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.
sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.
iter (int, optional) – Number of iterations (epochs) over the corpus.
trim_rule (function, optional) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
The input parameters are of the following types:
- word (str) - the word we are examining
- count (int) - the word’s frequency count in the corpus
- min_count (int) - the minimum count threshold.
sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().
batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().
callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.
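
To make min_count more concrete, here is a rough pure-Python sketch of frequency-based vocabulary pruning. It only illustrates the idea described in the parameter list and is not gensim's actual code; the function name prune_vocab and the toy corpus are made up for this example.

```python
from collections import Counter

def prune_vocab(sentences, min_count=5):
    """Return the set of words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical toy corpus: a list of tokenized sentences
toy_corpus = [
    ["i", "love", "you"],
    ["do", "you", "know"],
    ["you", "know", "me"],
]

kept = prune_vocab(toy_corpus, min_count=2)
print(sorted(kept))
```

With a corpus this small, the default min_count=5 would discard every word, which is why small experiments often lower it.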

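sentences: the corpus we want to analyze.

min_count: words whose total frequency is below this value are ignored. Default: 5.

size: the dimensionality of each word vector, i.e. the number of features. Default: 100.

window: the maximum distance between the current word and a context word. Default: 5.

The remaining parameters depend on the details of the training algorithm. I don't fully understand them yet, so for now I leave them at their defaults and will update this post once I understand them better.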
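
For a corpus too large to hold in memory, the sentences argument can be an object that re-reads the data on every pass. Below is a minimal sketch of such an iterable, assuming a plain-text file with one whitespace-separated sentence per line. The class name LineCorpus and the demo file are hypothetical; gensim's own LineSentence, mentioned in the parameter list, serves the same purpose.

```python
import os
import tempfile

class LineCorpus:
    """Re-iterable corpus: yields one list of tokens per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on each pass so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo with a small temporary file (made-up content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("i love you\n\ndo you know\n")
    demo_path = tmp.name

corpus = LineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again because __iter__ reopens the file
os.remove(demo_path)
print(first_pass)
```

A plain generator would not be enough here, because training needs several passes over the data: one to build the vocabulary and then one per epoch.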

Some of the more important attributes are:

wv

Word2VecKeyedVectors – This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.

vocabulary

Word2VecVocab – This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. Besides keeping track of all unique words, this object provides extra functionality, such as constructing a huffman tree (frequent words are closer to the root), or discarding extremely rare words.

trainables

Word2VecTrainables – This object represents the inner shallow neural network used to train the embeddings. The semantics of the network differ slightly in the two available training modes (CBOW or SG) but you can think of it as a NN with a single projection and hidden layer which we train on the corpus. The weights are then used as our embeddings (which means that the size of the hidden layer is equal to the number of features self.size).

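One pitfall to watch out for here: I fell into it the first time I used the word2vec API.

The first argument, sentences, is a sequence of sentences, and each sentence is itself a sequence of words.

For example, with sentences = [['first', 'sentence'], ['second', 'sentence']]

word2vec produces vectors for the words 'first', 'sentence' and 'second'.

With sentences = [['first sentence'], ['second sentence']],

word2vec produces vectors for 'first sentence' and 'second sentence'. It treats each of those strings as a single word.

With sentences = ['first', 'sentence'], 'first' is treated as one sentence and 'sentence' as another. The words of 'first' are then 'f', 'i', 'r', 's', 't', so the resulting vocabulary contains the characters 'f', 'i', 'r', 's', 't', ... and has no entry for 'first'. See the corresponding Stack Overflow answer for details.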
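
The three cases come down to how Python iterates the input. The sketch below does not call gensim; it only mimics the "iterate sentences, then iterate tokens" loop to show which tokens each input shape produces. The helper name tokens_seen is invented for this illustration.

```python
def tokens_seen(sentences):
    """Collect the distinct tokens found by iterating each sentence."""
    vocab = set()
    for sentence in sentences:
        for token in sentence:  # iterating a str yields single characters
            vocab.add(token)
    return vocab

correct = tokens_seen([["first", "sentence"], ["second", "sentence"]])
unsplit = tokens_seen([["first sentence"], ["second sentence"]])
flat = tokens_seen(["first", "sentence"])

print(sorted(correct))  # whole words
print(sorted(unsplit))  # whole sentences treated as single "words"
print(sorted(flat))     # single characters only
```

A quick check such as isinstance(sentences[0], str) before training catches the third case early.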

Word vectors are obtained as follows:

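%%time
from gensim.models import word2vec

# Train on the tokenized training and test texts together
X_all = train_words + test_words
model = word2vec.Word2Vec(X_all, min_count=1, window=5, size=100)

# Save the trained model to disk
model.save('words.model')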

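Here X_all looks like [['i', 'love', 'you'], ['do', 'you', 'know']]. This converts every word that appears in X_all into its corresponding vector.

We can get the vector for a word with model.wv['love']. wv is a key-value structure that maps word --> vector.

Once the vectors are trained, some commonly used methods are: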

print(model.wv.similar_by_word('family'))  # the 10 words most similar to 'family'

print(model.wv.similarity('family', 'parents'))  # similarity between the two words

print(model.wv.doesnt_match(['family', 'father', 'wife', 'dog']))  # the word that does not fit with the others

[('parents', 0.6177123785018921), ('father', 0.5987046957015991), ('families', 0.5883874297142029), ('mother', 0.5699872970581055), ('children', 0.5613149404525757), ('parent', 0.5575612783432007), ('community', 0.5537818074226379), ('friendship', 0.5431720018386841), ('life', 0.5359925627708435), ('wife', 0.5311812162399292)]

0.6177124

dog
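
The similarity score is the cosine of the angle between two word vectors. Here is a small sketch with made-up 3-dimensional vectors (the real vectors in this post have 100 dimensions); cosine_similarity is a helper written for this example, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, chosen by hand for illustration only
family = [1.0, 2.0, 0.0]
parents = [2.0, 4.0, 0.0]  # same direction as family
dog = [0.0, 0.0, 3.0]      # orthogonal to family

print(cosine_similarity(family, parents))
print(cosine_similarity(family, dog))
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score 0.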

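word2vec can also load an already trained model from a file:

model = word2vec.Word2Vec.load('./words.model')  # load the word vector model

This lets us download a model file that someone else has trained and use it directly, which saves the training time.

Training can also be resumed. For example, when we have more corpus data and want to update the model, we can do: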

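model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.build_vocab(more_sentences, update=True)  # add any new words to the vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)  # both arguments are required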

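At this point every word has been converted into a 100-dimensional vector. Most word2vec material found online stops here and does not explain how to go from word vectors to a feature matrix for the samples. So what do we do next?

Recall that we previously used each word's tf-idf value as its feature. Replacing every word in a sentence with its tf-idf turns the sentence into an N-dimensional vector.

With word2vec, suppose a sentence has 50 words and each word becomes a 100-dimensional vector. Substituting directly would give that sentence 5000 features, so the samples would form an M*5000 matrix. That dimensionality is too high and would slow training down considerably, and sentences of different lengths would not even have the same number of features, so this is clearly not the way to go.
We average instead. Think of it this way: we have an N-dimensional space, and each word is a point (or vector) in it. A sentence of 50 words consists of 50 points in that space. To represent the sentence by a single point, we add up the vectors of those 50 points (vector addition, dimension by dimension) and take the mean. The sentence is now expressed as one point in the N-dimensional space.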

import numpy as np

# Represent each sentence by the mean of its in-vocabulary word vectors
X_all_new = []
for sent in X_all:
    X_all_new.append(np.mean([model.wv[w] for w in sent if w in model.wv], axis=0))

This gives us X_all_new, an M*N feature matrix, which we can then feed to our machine learning algorithm.
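
One edge case in the averaging step: if none of a sentence's words are in the vocabulary, there is nothing to average, and np.mean of an empty list returns nan instead of a vector. The pure-Python sketch below averages with a zero-vector fallback, using a toy dict in place of model.wv. The names sentence_vector and toy_vectors are invented for this example, and the zero fallback is one possible choice, not something the code in this post does.

```python
def sentence_vector(sentence, vectors, dim):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [vectors[w] for w in sentence if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-dimensional "word vectors", made up for illustration
toy_vectors = {
    "i": [1.0, 0.0],
    "love": [0.0, 1.0],
    "you": [2.0, 2.0],
}

X = [sentence_vector(s, toy_vectors, dim=2) for s in [["i", "love", "you"], ["unknown"]]]
print(X)
```

Every row has the same length regardless of sentence length, which is what makes the result usable as an M*N feature matrix.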

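However, logistic regression on the word2vec features did not beat tf-idf: it reached an accuracy of only 0.85. This does not mean word2vec performs poorly. The dataset is fairly small and I tried only one machine learning algorithm. It only shows that tf-idf works quite well in this movie review prediction competition.

For the full code, see here.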
