PyTorch --- word2vec implementation -- "Efficient Estimation of Word Representations in Vector Space"
The paper is "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al.
Paper link: https://arxiv.org/pdf/1301.3781.pdf
The paper introduces two model architectures (CBOW and skip-gram); their underlying theory is not covered here.
Below is a quick skim of the code from https://github.com/graykode/nlp-tutorial, with comments:
# -*- coding: utf-8 -*-
# @time : 2019/11/9 12:53

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import matplotlib.pyplot as plt

dtype = torch.FloatTensor

# Toy corpus of short sentences
sentences = [ "i like dog", "i like cat", "i like animal",
"dog cat animal", "apple cat dog like", "dog fish milk like",
"dog cat eyes like", "i like apple", "apple i hate",
"apple i movie book music like", "cat dog hate", "cat dog like"] word_sequence = " ".join(sentences).split()
word_list = " ".join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)} # Word2Vec Parameter
batch_size = 20 # To show 2 dim embedding graph
embedding_size = 2 # To show 2 dim embedding graph
voc_size = len(word_list) # 产生 batch_size个,每个都是一个input和label, both are ont-hot vector
def random_batch(data, size):
    random_inputs = []
    random_labels = []
    # pick `size` distinct skip-gram pairs at random
    random_index = np.random.choice(range(len(data)), size, replace=False)

    for i in random_index:
        random_inputs.append(np.eye(voc_size)[data[i][0]])  # one-hot vector of the target (center) word
        random_labels.append(data[i][1])                    # index of the context word

    return random_inputs, random_labels

# Build skip-gram pairs with a window size of 1
skip_grams = []
# Starting from the second token (index 1), each word predicts its left and right
# neighbours, i.e. the pairs [index i, index i-1] and [index i, index i+1] are
# added to skip_grams.
for i in range(1, len(word_sequence) - 1):
    target = word_dict[word_sequence[i]]
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
    for w in context:
        skip_grams.append([target, w])
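
# For example, with this window the first tokens of word_sequence
# ("i like dog i like cat ...") produce the pairs, written here as words rather
# than indices: (like, i), (like, dog), (dog, like), (dog, i), (i, dog), (i, like), ...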
# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W and WT are two independent weight matrices; WT is *not* the transpose of W
        self.W = nn.Parameter(-2 * torch.rand(voc_size, embedding_size) + 1).type(dtype)   # voc_size -> embedding_size
        self.WT = nn.Parameter(-2 * torch.rand(embedding_size, voc_size) + 1).type(dtype)  # embedding_size -> voc_size

    def forward(self, X):
        # X : [batch_size, voc_size]
        hidden_layer = torch.matmul(X, self.W)              # hidden_layer : [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.WT)  # output_layer : [batch_size, voc_size]
        return output_layer

model = Word2Vec()
criterion = nn.CrossEntropyLoss()
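
# Note: nn.CrossEntropyLoss applies log-softmax to the raw scores internally and
# expects class indices as targets, which is why the model outputs unnormalised
# scores and target_batch below holds word indices rather than one-hot vectors.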
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
for epoch in range(5000):
    input_batch, target_batch = random_batch(skip_grams, batch_size)
    input_batch = Variable(torch.Tensor(input_batch))
    target_batch = Variable(torch.LongTensor(target_batch))

    optimizer.zero_grad()
    output = model(input_batch)

    # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
    loss = criterion(output, target_batch)
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

    loss.backward()
    optimizer.step()
# Why a row of W is the embedding of a word:
# the input is [batch_size, voc_size] (each word is a one-hot vector of length voc_size)
# and W is [voc_size, embedding_size], so multiplying a one-hot vector by W, e.g.
#   [1, 0, 0] * [[w1, w4],
#                [w2, w5],
#                [w3, w6]]
# simply selects the corresponding row, here [w1, w4].
# In other words, the embedding of word i is (W[i][0], W[i][1]).
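
# A quick sanity check of this row-selection property with plain numpy
# (illustrative only, not part of the original post):
_demo_W = np.arange(voc_size * embedding_size).reshape(voc_size, embedding_size)
assert np.allclose(np.eye(voc_size)[3] @ _demo_W, _demo_W[3])
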
for i, label in enumerate(word_list):
    W, WT = model.parameters()
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')

plt.show()
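As a side note, the same two-matrix skip-gram model can also be written with bias-free nn.Linear layers instead of raw nn.Parameter matrices. The sketch below is my own illustration (the class name Word2VecLinear and the attributes in_proj/out_proj are not from the original post), reusing the imports, voc_size and embedding_size defined above:

class Word2VecLinear(nn.Module):
    """Equivalent skip-gram model built from two bias-free linear layers."""
    def __init__(self, voc_size, embedding_size):
        super().__init__()
        self.in_proj = nn.Linear(voc_size, embedding_size, bias=False)   # plays the role of W
        self.out_proj = nn.Linear(embedding_size, voc_size, bias=False)  # plays the role of WT

    def forward(self, x):
        # x : [batch_size, voc_size] one-hot rows -> [batch_size, voc_size] scores
        return self.out_proj(self.in_proj(x))

One thing to keep in mind with this variant: nn.Linear stores its weight as [out_features, in_features], so the embedding of word i would be in_proj.weight[:, i] rather than a row of W.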