Word Embeddings: Encoding Lexical Semantics

url:http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html?highlight=lookup

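Word embeddings

Word embeddings are dense vectors, one for each word in the vocabulary. In NLP the features are almost always words, but how should a word be represented in a computer?

Storing a word's ASCII codes tells us what the word is, but not what it means. There is also the question of how to combine such representations.

The first attempt is one-hot encoding: w = [0, 0, 1, 0, 0], where the 1 sits in a position unique to w and every other entry is 0.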
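A minimal sketch of one-hot encoding in plain Python, assuming a hypothetical five-word vocabulary (the words are made up for illustration). It also prints the dot product between a few pairs of vectors, which is 0 for any two distinct words.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = ["cat", "dog", "car", "tree", "run"]

def one_hot(word, vocab):
    """Return the one-hot vector of `word` as a list of 0s and 1s."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, dog, car = (one_hot(w, vocab) for w in ("cat", "dog", "car"))

print(cat)            # [1, 0, 0, 0, 0]
print(dot(cat, dog))  # 0
print(dot(cat, car))  # 0
print(dot(cat, cat))  # 1
```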

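The drawback is that this carries no semantic information: the vectors of any two distinct words are orthogonal, so every pair of words looks equally unrelated.

"Words that appear in similar positions and similar contexts are semantically related." This is the distributional hypothesis.

Example

Suppose each dimension stands for some attribute (rather than for one word, as in one-hot encoding). A word is then described by a blend of values across these attributes. Words that are alike on some of the attributes end up close together in the vector space, while dissimilar words are separated by a large angle.

This also avoids the large number of zeros in every vector, which is a weakness of one-hot encoding.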
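The idea can be sketched with hand-picked attribute vectors. Everything here is an assumption made for illustration: the three attributes, the three words, and the scores are invented, not learned from data. Cosine similarity measures the angle between two vectors, so related words should score higher than unrelated ones.

```python
import math

# Made-up attribute scores. Dimensions: [is_animal, is_vehicle, can_move]
cat = [0.9, 0.0, 0.8]
dog = [0.8, 0.1, 0.9]
car = [0.0, 0.9, 0.9]

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_dog = cosine(cat, dog)
sim_cat_car = cosine(cat, car)

print(sim_cat_dog)  # close to 1: the vectors point in nearly the same direction
print(sim_cat_car)  # noticeably smaller: the angle between them is larger
```

Unlike the one-hot case, the similarity between two different words is no longer fixed at 0, so the representation can express degrees of relatedness.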

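The question then is which attribute each dimension should stand for. Designing them by hand is too hard, so we let the neural network work them out instead of the programmer.

Why not make the word embeddings parameters of the model and let training update them? That is exactly what we do.

The price is interpretability: after training, it is unclear what each dimension means.

Even so, synonyms do end up close together along these latent semantic dimensions; the dimensions are just hard to interpret.

In short, a word embedding is a representation of a word's semantics, and an efficient encoding of semantic information.

Anything else can be embedded as well: part-of-speech tags, parse trees.

The idea of feature embeddings is central to the field.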

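PyTorch

Embeddings in PyTorch.

Just as we gave every word a unique index when building one-hot vectors, we need to define an index for every word when using embeddings.

These indices are the keys into a lookup table.

That is, the embeddings are stored as a |V| × D matrix, where D is the embedding dimension, and the embedding of the word with index i is stored in row i of the matrix.

In the code below, the mapping from words to indices is a dictionary named word_to_ix.

The module is nn.Embedding; its arguments are the vocabulary size and the embedding dimension.

Note that indexing into the table requires a torch.LongTensor, because indices are integers, not floats.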
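Before the full model, here is a minimal sketch of a single lookup. The two-word vocabulary is a toy assumption, and the index tensor is built with torch.tensor(..., dtype=torch.long), which yields the same integer tensor type as torch.LongTensor. The embedding values themselves are randomly initialized, so only the shape is predictable.

```python
import torch
import torch.nn as nn

# Toy vocabulary for illustration only.
word_to_ix = {"hello": 0, "world": 1}

# 2 words in the vocabulary, 5-dimensional embeddings.
embeds = nn.Embedding(2, 5)

# Indices must be integers, hence dtype=torch.long.
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)

print(hello_embed.shape)  # torch.Size([1, 5])
```

The result is row 0 of embeds.weight, the |V| × D matrix that the module holds as a trainable parameter.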

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

# The set of distinct words; its size is the vocabulary size, the basis of the encoding
vocab = set(test_sentence)

# Give every word an integer index, the simplest possible encoding
word_to_ix = {word: i for i, word in enumerate(vocab)}

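class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        # This is simply the word embedding matrix (vocab_size x embedding_dim)
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The two preceding words are used for prediction, so their embeddings
        # are concatenated; summing or averaging them would also be possible
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        # Classify over the whole vocabulary
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Look up the embeddings and flatten them into a single row
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # dim=1 normalizes over the vocabulary; omitting dim is deprecated
        log_probs = F.log_softmax(out, dim=1)
        return log_probs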

losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0.0
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in a tensor)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # .item() extracts the Python number from the one-element loss tensor
        total_loss += loss.item()
    losses.append(total_loss)

print(losses)  # The loss decreased every iteration over the training data!

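Summary

For a long time I had not really grasped what word vectors are. There are many ways to train them, but the underlying idea is the same: set up a lookup table, namely the embedding matrix, and use it to map each sparse one-hot code to a dense vector. The lookup table is learned through training; it is simply one of the network's parameters.

In the end, each row of this weight matrix serves as the word vector for the word at that index, and multiplying the matrix by a word's one-hot vector amounts to nothing more than a table lookup.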
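The equivalence between the matrix product and the lookup can be checked in a few lines of plain Python. This is a sketch under stated assumptions: the 4 × 3 matrix E and its values are made up, standing in for a trained embedding matrix.

```python
# Hypothetical embedding matrix: 4 words, 3 dimensions (made-up values).
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(i, size):
    """One-hot row vector with a 1 at position i."""
    return [1 if k == i else 0 for k in range(size)]

def vec_mat(v, m):
    """Row vector v (length |V|) times matrix m (|V| x D)."""
    return [sum(v[r] * m[r][c] for r in range(len(m)))
            for c in range(len(m[0]))]

i = 2
product = vec_mat(one_hot(i, len(E)), E)

print(product)          # [0.7, 0.8, 0.9]
print(product == E[i])  # True
```

This is why an embedding layer can skip the multiplication and index the row directly: the result is the same, without ever building the sparse one-hot vector.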
