Basic Principles

Loss Function

A (linear-chain) CRF is commonly used for sequence labeling tasks. For an input sequence \(x\) and a tag sequence \(y\), define the matching score:

\[s(x,y) = \sum_{i=0}^l T(y_i, y_{i+1}) + \sum_{i=1}^l U(x_i, y_i)
\]

Here \(l\) is the sequence length, and \(T\) and \(U\) are both learnable parameters. \(T(y_i, y_{i+1})\) is the transition score for the tag at step \(i\) being \(y_i\) and the tag at step \(i+1\) being \(y_{i+1}\), and \(U(x_i, y_i)\) is the emission score for the input \(x_i\) at step \(i\) taking the tag \(y_i\). Note that when computing the transition scores \(T\), the chain of state transitions is \(y_0\rightarrow y_1 \rightarrow \dots \rightarrow y_l \rightarrow y_{l+1}\), because the artificial START_TAG and STOP_TAG tags are added at the two ends.
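
To make the score concrete, here is a minimal sketch (an illustration under assumed shapes, not part of the tutorial code below) that evaluates \(s(x,y)\) for one hypothetical tag sequence, given an emission matrix indexed as emissions[step, tag] and a transition matrix indexed as transitions[from_tag, to_tag], with rows/columns for START and STOP included:

import torch

def sequence_score(emissions, transitions, tags, start_idx, stop_idx):
    # s(x, y): sum of transition scores plus sum of emission scores,
    # including the START -> y_1 and y_l -> STOP transitions.
    score = transitions[start_idx, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score + transitions[tags[-1], stop_idx]

# Hypothetical toy setup: 3 real tags (B=0, I=1, O=2) plus START=3 and STOP=4,
# and a 4-token sentence with random scores.
emissions = torch.randn(4, 5)
transitions = torch.randn(5, 5)
print(sequence_score(emissions, transitions, [0, 1, 2, 2], start_idx=3, stop_idx=4))

Note that the tutorial code later indexes its transition matrix the other way around (entry [next_tag, previous_tag]); only the indexing convention differs, not the model.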

To avoid the label bias problem, the CRF performs global normalization. Concretely, the probability that the input \(x\) has tag sequence \(y\) is defined as:

\[P(y|x)=\frac{e^{s(x,y)}}{Z(x)} = \frac{e^{s(x,y)}}{\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}}
\]

The troublesome part here is computing the partition function \(Z(x)\), because it has to sum over every possible path.
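
To see what "summing over every path" means, the following brute-force sketch (illustration only, reusing the hypothetical tensors from the previous sketch) enumerates every possible tag sequence; it costs \(O(N^l)\) and is only feasible for toy sizes, but it gives a reference value for \(\log Z(x)\):

import itertools

def log_partition_brute_force(emissions, transitions, start_idx, stop_idx, real_tags):
    # log Z(x) by scoring every possible tag sequence and log-sum-exp-ing the scores.
    seq_len = emissions.size(0)
    scores = []
    for path in itertools.product(real_tags, repeat=seq_len):
        scores.append(sequence_score(emissions, transitions, list(path), start_idx, stop_idx))
    return torch.logsumexp(torch.stack(scores), dim=0)

log_z = log_partition_brute_force(emissions, transitions, 3, 4, real_tags=[0, 1, 2])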

During training we want to maximize the log-probability of the correct tag sequence, i.e.:

\[\log P(y|x)=\log \frac{e^{s(x,y)}}{Z(x)} = s(x,y) - \log Z(x) = s(x,y) - \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right)
\]

This is equivalent to minimizing the negative log-likelihood, so the loss function is:

\[-\log P(y|x)= \log Z(x) - s(x,y) = \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right) - s(x,y)
\]
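
Combining the two sketches above, the negative log-likelihood of one hypothetical gold sequence is simply:

gold_score = sequence_score(emissions, transitions, [0, 1, 2, 2], start_idx=3, stop_idx=4)
loss = log_z - gold_score  # always >= 0, since P(y|x) <= 1
print(loss)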

Computing the Partition Function

Next we discuss how to compute \(Z(x)\). We use the forward algorithm; the pseudocode is as follows (a vectorized sketch is given right after the list):

1. Initialization: for every possible value \(y_2^*\) of \(y_2\), define

\[\alpha_1(y_2^*) = \sum_{y_1^*} \exp(U(x_1, y_1^*) + T(y_1^*, y_2^*))
\]

Here \(y_k\) denotes the tag at step \(k\); its values range over the tag space, e.g. B, I, O, and a particular value is written \(y_k^*\). \(\alpha_k(y_{k+1}^*)\) can be regarded as an unnormalized probability at step \(k\). Note that although \(y_{k+1}^*\) is written as a single tag here, we actually sweep over the whole tag space and compute this once for every possible value of \(y_{k+1}\).

2. Recursion: for \(k = 2, 3, \dots, l-1\) and every possible value \(y_{k+1}^*\) of \(y_{k+1}\):

\[\log (\alpha_k(y_{k+1}^*)) = \log \sum_{y_k^*}\exp \left(U(x_k, y_k^*)+T(y_k^*, y_{k+1}^*) + \log(\alpha_{k-1}(y_k^*)) \right)
\]

Here \(y_k^*\) and \(y_{k+1}^*\) each range over all concrete tag values: for each of the \(N\) values of \(y_{k+1}^*\) we sum over the \(N\) values of \(y_k^*\), so each step of the recursion costs \(O(N^2)\), where \(N\) is the number of tags.

3. Termination:

\[Z(x) = \sum_{y_l^*} \exp \left(U(x_l, y_l^*) + \log(\alpha_{l-1}(y_l^*)) \right)
\]
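
Below is a minimal vectorized sketch of these three steps (again an illustration using the same hypothetical tensors, not the tutorial code). Unlike the pseudocode, it also adds the START and STOP transitions, as the tutorial code below does; each loop iteration performs the \(O(N^2)\) update over all \((y_k^*, y_{k+1}^*)\) pairs at once:

def log_partition_forward(emissions, transitions, start_idx, stop_idx, real_tags):
    # log Z(x) with the forward algorithm: O(l * N^2) instead of O(N^l).
    real_tags = torch.tensor(real_tags)
    # Step 1: alpha[j] = score of all length-1 prefixes ending in tag j.
    alpha = transitions[start_idx, real_tags] + emissions[0, real_tags]
    # Step 2: recursion over the remaining tokens.
    for i in range(1, emissions.size(0)):
        # scores[a, b] = alpha[a] + T(a, b) + U(x_i, b)
        scores = alpha.unsqueeze(1) + transitions[real_tags][:, real_tags] + emissions[i, real_tags]
        alpha = torch.logsumexp(scores, dim=0)
    # Step 3: add the transition into STOP and sum out the final tag.
    return torch.logsumexp(alpha + transitions[real_tags, stop_idx], dim=0)

# On toy inputs this matches the brute-force log Z(x) computed earlier.
print(log_partition_forward(emissions, transitions, 3, 4, real_tags=[0, 1, 2]), log_z)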

Note that step 2 of the pseudocode is exactly the so-called logsumexp, which can be problematic: if the values being exponentiated are very large, the computation may overflow. A small trick makes it numerically stable:

\[\log \sum_k \exp(z_k) = \max (\mathbf{z}) + \log \sum_k \exp(z_k - \max(\mathbf{z}))
\]

The proof is as follows:

\[\log \sum_k \exp(z_k) = \log \sum_k (\exp(z_k -c) \cdot \exp(c)) = \log\left[\exp(c) \cdot \sum_k \exp(z_k -c)\right] = c + \log \sum_k \exp(z_k -c), \qquad \text{taking} \ c = \max(\mathbf{z})
\]
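
A quick numerical illustration of the difference (hypothetical numbers): the naive computation overflows in float32, while the shifted version gives the exact answer:

z = torch.tensor([1000., 1001., 1002.])
print(torch.log(torch.sum(torch.exp(z))))                      # inf: exp(1000) overflows
print(z.max() + torch.log(torch.sum(torch.exp(z - z.max()))))  # about 1002.41, stable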

Code Implementation

The code below follows the PyTorch tutorial on Bi-LSTM + CRF. First, import the required modules:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

To keep the model readable, we define a few helper functions:

def argmax(vec):
    """Return the argmax as a python int."""
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """word2id"""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """Compute log sum exp in a numerically stable way for the forward algorithm.
    This function actually exists in both PyTorch and TensorFlow; the tutorial
    author re-implements it here for the sake of explanation.
    """
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
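
As the docstring notes, an equivalent already ships with PyTorch; a quick check (illustration only) that the two agree:

vec = torch.randn(1, 5)
print(log_sum_exp(vec), torch.logsumexp(vec, dim=1))  # the two values should match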

Next we define the full model:

class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Map the LSTM output into tag space; this plays the role of the
        # emission matrix U in the formulas above.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Transition matrix: entry (i, j) is the score of transitioning to tag i
        # from tag j. tagset_size includes the artificially added START_TAG and STOP_TAG.
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two constraints: we never transition into START_TAG,
        # and we never transition out of STOP_TAG.
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Initialize the LSTM hidden state."""
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        """Compute the partition function, i.e. log Z(x), with the forward algorithm."""
        # Corresponds to step 1 of the pseudocode.
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Corresponds to the loop in step 2 of the pseudocode: iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp;
                # this is the three-term sum in step 2 of the pseudocode.
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        # Corresponds to step 3 of the pseudocode. The loss needs log Z(x) in the end,
        # so this is one more logsumexp.
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        """Run the LSTM to get a hidden state for every token. Any feature function
        could be substituted here; the LSTM features are the x in the formulas.
        """
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        """Compute the matching score of a given input sequence and tag sequence,
        i.e. the function s(x, y) in the formulas."""
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        """Viterbi decoding: given the input x and the relevant parameters (the emission
        and transition matrices), find the highest-scoring tag sequence.
        """
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        """Loss function = log Z(x) - s(x, y)."""
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):
        """Prediction function; note that this is not the same as _forward_alg.
        Given a sentence, it predicts the most likely tag sequence.
        """
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

Finally, putting the pieces above together gives a complete runnable example, which we will not walk through line by line:

START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
    print(model(precheck_sent))

# Make sure prepare_sequence from earlier in the LSTM section is loaded
for epoch in range(
        300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))
# We got it!

References

[1]. https://towardsdatascience.com/implementing-a-linear-chain-conditional-random-field-crf-in-pytorch-16b0b9c4b4ea

[2]. https://zhuanlan.zhihu.com/p/27338210
