Word Embeddings: Encoding Lexical Semantics
- Getting Dense Word Embeddings
- Word Embeddings in Pytorch
- An Example: N-Gram Language Modeling
- Exercise: Computing Word Embeddings: Continuous Bag-of-Words
Word Embeddings in Pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
Out:
tensor([[ 0.6614, 0.2669, 0.0617, 0.6213, -0.4519]],
grad_fn=<EmbeddingBackward>)
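An nn.Embedding table can also be indexed with a whole batch of indices in one call. A minimal sketch, assuming the `embeds` and `word_to_ix` defined above:

# Look up several words at once; each row of the result is one embedding.
batch = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
print(embeds(batch).shape)  # torch.Size([2, 5]): one 5-dimensional vector per word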
An Example: N-Gram Language Modeling
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
vocab = set(test_sentence)  # the elements of a set are distinct
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Turn the two context words into a tensor of integer indices.
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        # PyTorch accumulates gradients, so zero them before each instance.
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)
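The printed losses should decrease every epoch. To inspect what was learned, you can read a word's vector straight out of the embedding table. A minimal sketch, assuming the trained `model` and `word_to_ix` from above (any vocabulary word works; "beauty" is just an example):

# Look up the learned embedding for one word of the trained model.
print(model.embeddings.weight[word_to_ix["beauty"]])  # a 10-dim tensor (EMBEDDING_DIM)

Note that nn.NLLLoss expects log probabilities, which is why `forward` ends with F.log_softmax; the pair is equivalent to applying nn.CrossEntropyLoss to the raw scores.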
Exercise: Computing Word Embeddings: Continuous Bag-of-Words
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):
    def __init__(self):
        pass  # TODO (exercise): define the embeddings and any linear layers

    def forward(self, inputs):
        pass  # TODO (exercise): return log probabilities over the vocabulary


# Helper to turn a context into a tensor of word indices.
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example use
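The class above is deliberately left blank as the exercise. One possible sketch of a solution, under the assumption that summing the context embeddings and applying a single linear layer is acceptable (an illustration, not the tutorial's reference answer; the name CBOWSketch is ours, chosen so the exercise class stays untouched):

# One possible CBOW sketch (an assumption, not the reference solution).
class CBOWSketch(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWSketch, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # Sum the context word embeddings into one vector, then score every word.
        embeds = self.embeddings(inputs).sum(dim=0).view((1, -1))
        return F.log_softmax(self.linear(embeds), dim=1)

model = CBOWSketch(vocab_size, embedding_dim=10)
log_probs = model(make_context_vector(data[0][0], word_to_ix))
print(log_probs.shape)  # torch.Size([1, vocab_size])

It can be trained with nn.NLLLoss and SGD exactly as in the n-gram example above.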