Basic Principles

Loss Function

A (linear-chain) CRF is commonly used for sequence labeling tasks. For an input sequence \(x\) and a tag sequence \(y\), define the matching score:

\[s(x,y) = \sum_{i=0}^l T(y_i, y_{i+1}) + \sum_{i=1}^l U(x_i, y_i)
\]

Here \(l\) is the sequence length, and \(T\) and \(U\) are both learnable parameters. \(T(y_i, y_{i+1})\) is the transition score for the tag at step \(i\) being \(y_i\) and the tag at step \(i+1\) being \(y_{i+1}\), and \(U(x_i, y_i)\) is the emission score for the input \(x_i\) at step \(i\) taking the tag \(y_i\). Note that when computing the transition scores \(T\), the chain of state transitions is \(y_0\rightarrow y_1 \rightarrow \dots \rightarrow y_l \rightarrow y_{l+1}\), because the artificial START_TAG and STOP_TAG tags are added at the two ends.
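
To make the score concrete, here is a minimal sketch (an illustration under assumed shapes, not part of the tutorial code below) that evaluates \(s(x,y)\) for one hypothetical tag sequence, given an emission matrix indexed as emissions[step, tag] and a transition matrix indexed as transitions[from_tag, to_tag], with rows/columns for START and STOP included:

import torch

def sequence_score(emissions, transitions, tags, start_idx, stop_idx):
    # s(x, y): sum of transition scores plus sum of emission scores,
    # including the START -> y_1 and y_l -> STOP transitions.
    score = transitions[start_idx, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score = score + transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score + transitions[tags[-1], stop_idx]

# Hypothetical toy setup: 3 real tags (B=0, I=1, O=2) plus START=3 and STOP=4,
# and a 4-token sentence with random scores.
emissions = torch.randn(4, 5)
transitions = torch.randn(5, 5)
print(sequence_score(emissions, transitions, [0, 1, 2, 2], start_idx=3, stop_idx=4))

Note that the tutorial code later indexes its transition matrix the other way around (entry [next_tag, previous_tag]); only the indexing convention differs, not the model.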

To avoid the label bias problem, the CRF performs global normalization. Concretely, the probability that the input \(x\) has tag sequence \(y\) is defined as:

\[P(y|x)=\frac{e^{s(x,y)}}{Z(x)} = \frac{e^{s(x,y)}}{\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}}
\]

The troublesome part here is computing the partition function \(Z(x)\), because it has to sum over every possible path.
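
To see what "summing over every path" means, the following brute-force sketch (illustration only, reusing the hypothetical tensors from the previous sketch) enumerates every possible tag sequence; it costs \(O(N^l)\) and is only feasible for toy sizes, but it gives a reference value for \(\log Z(x)\):

import itertools

def log_partition_brute_force(emissions, transitions, start_idx, stop_idx, real_tags):
    # log Z(x) by scoring every possible tag sequence and log-sum-exp-ing the scores.
    seq_len = emissions.size(0)
    scores = []
    for path in itertools.product(real_tags, repeat=seq_len):
        scores.append(sequence_score(emissions, transitions, list(path), start_idx, stop_idx))
    return torch.logsumexp(torch.stack(scores), dim=0)

log_z = log_partition_brute_force(emissions, transitions, 3, 4, real_tags=[0, 1, 2])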

During training we want to maximize the log-probability of the correct tag sequence, i.e.:

\[\log P(y|x)=\log \frac{e^{s(x,y)}}{Z(x)} = s(x,y) - \log Z(x) = s(x,y) - \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right)
\]

This is equivalent to minimizing the negative log-likelihood, so the loss function is:

\[-\log P(y|x)= \log Z(x) - s(x,y) = \log \left(\sum_{\tilde{y}\in Y_x}e^{s(x,\tilde{y})}\right) - s(x,y)
\]
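
Combining the two sketches above, the negative log-likelihood of one hypothetical gold sequence is simply:

gold_score = sequence_score(emissions, transitions, [0, 1, 2, 2], start_idx=3, stop_idx=4)
loss = log_z - gold_score  # always >= 0, since P(y|x) <= 1
print(loss)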

Computing the Partition Function

Next we discuss how to compute \(Z(x)\). We use the forward algorithm; the pseudocode is as follows (a vectorized sketch is given right after the list):

1. Initialization: for every possible value \(y_2^*\) of \(y_2\), define

\[\alpha_1(y_2^*) = \sum_{y_1^*} \exp(U(x_1, y_1^*) + T(y_1^*, y_2^*))
\]

Here \(y_k\) denotes the tag at step \(k\); its values range over the tag space, e.g. B, I, O, and a particular value is written \(y_k^*\). \(\alpha_k(y_{k+1}^*)\) can be regarded as an unnormalized probability at step \(k\). Note that although \(y_{k+1}^*\) is written as a single tag here, we actually sweep over the whole tag space and compute this once for every possible value of \(y_{k+1}\).

2. Recursion: for \(k = 2, 3, \dots, l-1\) and every possible value \(y_{k+1}^*\) of \(y_{k+1}\):

\[\log (\alpha_k(y_{k+1}^*)) = \log \sum_{y_k^*}\exp \left(U(x_k, y_k^*)+T(y_k^*, y_{k+1}^*) + \log(\alpha_{k-1}(y_k^*)) \right)
\]

Here \(y_k^*\) and \(y_{k+1}^*\) each range over all concrete tag values: for each of the \(N\) values of \(y_{k+1}^*\) we sum over the \(N\) values of \(y_k^*\), so each step of the recursion costs \(O(N^2)\), where \(N\) is the number of tags.

3. Termination:

\[Z(x) = \sum_{y_l^*} \exp \left(U(x_l, y_l^*) + \log(\alpha_{l-1}(y_l^*)) \right)
\]
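
Below is a minimal vectorized sketch of these three steps (again an illustration using the same hypothetical tensors, not the tutorial code). Unlike the pseudocode, it also adds the START and STOP transitions, as the tutorial code below does; each loop iteration performs the \(O(N^2)\) update over all \((y_k^*, y_{k+1}^*)\) pairs at once:

def log_partition_forward(emissions, transitions, start_idx, stop_idx, real_tags):
    # log Z(x) with the forward algorithm: O(l * N^2) instead of O(N^l).
    real_tags = torch.tensor(real_tags)
    # Step 1: alpha[j] = score of all length-1 prefixes ending in tag j.
    alpha = transitions[start_idx, real_tags] + emissions[0, real_tags]
    # Step 2: recursion over the remaining tokens.
    for i in range(1, emissions.size(0)):
        # scores[a, b] = alpha[a] + T(a, b) + U(x_i, b)
        scores = alpha.unsqueeze(1) + transitions[real_tags][:, real_tags] + emissions[i, real_tags]
        alpha = torch.logsumexp(scores, dim=0)
    # Step 3: add the transition into STOP and sum out the final tag.
    return torch.logsumexp(alpha + transitions[real_tags, stop_idx], dim=0)

# On toy inputs this matches the brute-force log Z(x) computed earlier.
print(log_partition_forward(emissions, transitions, 3, 4, real_tags=[0, 1, 2]), log_z)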

Note that step 2 of the pseudocode is exactly the so-called logsumexp, which can be problematic: if the values being exponentiated are very large, the computation may overflow. A small trick makes it numerically stable:

\[\log \sum_k \exp(z_k) = \max (\mathbf{z}) + \log \sum_k \exp(z_k - \max(\mathbf{z}))
\]

The proof is as follows:

\[\log \sum_k \exp(z_k) = \log \sum_k (\exp(z_k -c) \cdot \exp(c)) = \log\left[\exp(c) \cdot \sum_k \exp(z_k -c)\right] = c + \log \sum_k \exp(z_k -c), \qquad \text{taking} \ c = \max(\mathbf{z})
\]
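
A quick numerical illustration of the difference (hypothetical numbers): the naive computation overflows in float32, while the shifted version gives the exact answer:

z = torch.tensor([1000., 1001., 1002.])
print(torch.log(torch.sum(torch.exp(z))))                      # inf: exp(1000) overflows
print(z.max() + torch.log(torch.sum(torch.exp(z - z.max()))))  # about 1002.41, stable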

Code Implementation

The code below follows the PyTorch tutorial on Bi-LSTM + CRF. First, import the required modules:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

To keep the model readable, we define a few helper functions:

def argmax(vec):
    """Return the argmax as a python int."""
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """word2id"""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """Compute log sum exp in a numerically stable way for the forward algorithm.
    This function actually exists in both PyTorch and TensorFlow; the tutorial
    author re-implements it here for the sake of explanation.
    """
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
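
As the docstring notes, an equivalent already ships with PyTorch; a quick check (illustration only) that the two agree:

vec = torch.randn(1, 5)
print(log_sum_exp(vec), torch.logsumexp(vec, dim=1))  # the two values should match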

Next we define the full model:

class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Map the LSTM output into tag space; this plays the role of the
        # emission matrix U in the formulas above.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Transition matrix: entry (i, j) is the score of transitioning to tag i
        # from tag j. tagset_size includes the artificially added START_TAG and STOP_TAG.
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two constraints: we never transition into START_TAG,
        # and we never transition out of STOP_TAG.
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        """Initialize the LSTM hidden state."""
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        """Compute the partition function, i.e. log Z(x), with the forward algorithm."""
        # Corresponds to step 1 of the pseudocode.
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Corresponds to the loop in step 2 of the pseudocode: iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp;
                # this is the three-term sum in step 2 of the pseudocode.
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        # Corresponds to step 3 of the pseudocode. The loss needs log Z(x) in the end,
        # so this is one more logsumexp.
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        """Run the LSTM to get a hidden state for every token. Any feature function
        could be substituted here; the LSTM features are the x in the formulas.
        """
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        """Compute the matching score of a given input sequence and tag sequence,
        i.e. the function s(x, y) in the formulas."""
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        """Viterbi decoding: given the input x and the relevant parameters (the emission
        and transition matrices), find the highest-scoring tag sequence.
        """
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        """Loss function = log Z(x) - s(x, y)."""
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):
        """Prediction function; note that this is not the same as _forward_alg.
        Given a sentence, it predicts the most likely tag sequence.
        """
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

Finally, putting the pieces above together gives a complete runnable example, which we will not walk through line by line:

START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [(
    "the wall street journal reported today that apple corporation made money".split(),
    "B I I I O O O B I O O".split()
), (
    "georgia tech is a university in georgia".split(),
    "B I O O O O B".split()
)]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long)
    print(model(precheck_sent))

# Make sure prepare_sequence from earlier in the LSTM section is loaded
for epoch in range(
        300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))
# We got it!

References

[1]. https://towardsdatascience.com/implementing-a-linear-chain-conditional-random-field-crf-in-pytorch-16b0b9c4b4ea

[2]. https://zhuanlan.zhihu.com/p/27338210
