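Overview

This post is for readers interested in using the BERT family of models with PyTorch. It covers the papers behind the BERT family, the HuggingFace implementation, and how to use pretrained models in different downstream tasks.

After reading this post, you will understand:

- How the Transformers library is organized, and how to use its different Tokenizer and Model classes.
- How to build your own model on top of the HuggingFace implementation. If you want to use the library for your own downstream task without digging into its internals, this post should be a useful reference.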

Prerequisites

Install the HuggingFace Transformers library (PyTorch must be installed first).

Reading or watching the following material before this post will greatly reduce the chance of getting confused later on.

Videos (best watched in order):

Or, if you would rather read the papers:

Papers (best read in order):
- arXiv:1810.04805, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- arXiv:1901.02860, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Authors: Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
- The XLNet paper
- The ALBERT paper

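Loading HuggingFace models and using them in downstream tasks

Project components

A complete transformer model consists of three main parts:

Config controls the model name, the form of the final output, the width and depth of the hidden layers, the type of activation function, and so on. A Config object is exported as a JSON file, like this: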

{

  "attention_probs_dropout_prob": 0.1,

  "hidden_act": "gelu",

  "hidden_dropout_prob": 0.1,

  "hidden_size": 768,

  "initializer_range": 0.02,

  "intermediate_size": 3072,

  "max_position_embeddings": 512,

  "num_attention_heads": 12,

  "num_hidden_layers": 12,

  "type_vocab_size": 2,

  "vocab_size": 30522

}

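Conversely, a Config object can be instantiated from config.json, so the two operations are inverses of each other.

Tokenizer converts plain text into token ids. Note that the Tokenizer does not map words to word vectors. It only splits the text into tokens, inserts special tokens such as [CLS] and [SEP] (and recognizes [MASK]), and converts the tokens into vocabulary indices. A Tokenizer is exported as three files:
- vocab.txt
  
  The vocabulary file. Each line holds one word or word piece.
- special_tokens_map.json defines the special tokens:
```
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]",

 "cls_token": "[CLS]", "mask_token": "[MASK]"}
```
- tokenizer_config.json is the configuration file, which mainly stores tokenizer-specific settings.

Model covers the various model classes. Besides base models such as Bert and GPT, the library defines task-specific models such as BertForQuestionAnswering. Exporting a model produces config.json and the parameter file pytorch_model.bin. The former is the configuration file described under Config above, which matches intuition: config and model are two closely coupled classes. The latter is the same kind of file that torch.save() produces, because every Model inherits, directly or indirectly, from PyTorch's Module class. This shows that HuggingFace's implementation respects PyTorch's native API.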
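The Config round trip described above (export to config.json, then rebuild the object from that file) can be sketched without the library. This is only a minimal illustration using the standard library: `ToyConfig`, `save`, `load` and `round_trip` are hypothetical names, not the Transformers API, and only a handful of the fields from the JSON example are included.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config (not the library class)."""
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    hidden_act: str = "gelu"
    vocab_size: int = 30522

    def save(self, directory: str) -> str:
        """Export the config as config.json inside `directory`."""
        path = os.path.join(directory, "config.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2, sort_keys=True)
        return path

    @classmethod
    def load(cls, directory: str) -> "ToyConfig":
        """Rebuild the config from config.json inside `directory`."""
        path = os.path.join(directory, "config.json")
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))


def round_trip(config: ToyConfig) -> ToyConfig:
    """Save the config to a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        config.save(tmp)
        return ToyConfig.load(tmp)


original = ToyConfig(num_hidden_layers=6)
restored = round_trip(original)
print(restored == original)
```

The real library classes carry many more fields, but the idea is the same: the JSON file holds everything needed to rebuild the object.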

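How to load the base Bert-family models

Automatic download from the official site

The introductory tutorial in the official documentation uses:

# Load pre-trained model (weights)

# model = BertModel.from_pretrained('bert-base-uncased')

This call downloads the model configuration, weights, and other files from the official S3 storage (the locations are already configured in the code). It is simple, but the download is often unreachable from mainland China. You can try it first, but there is a good chance the download will fail.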

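Downloading the model files manually

Find the model you need in the official HuggingFace model hub and click its link. This example uses the bert-base-uncased model.
Click "List all files in model" and download every file into the same directory. For example, for XLNet: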
```
# List of model files

config.json	782.0B

pytorch_model.bin	445.4MB

special_tokens_map.json	202.0B

spiece.model	779.3KB

tokenizer_config.json	2.0B
```
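This approach also fails at times. If you are able to upload the Transformers pretrained models to a network drive such as Xunlei, please let me know in the comments. I will add the link to this post and add a link to your blog.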

Load the model from the download directory:

import transformers

MODEL_PATH = r"D:\transformr_files\bert-base-uncased/"

# a. Load the tokenizer from the vocabulary file

tokenizer = transformers.BertTokenizer.from_pretrained(r"D:\transformr_files\bert-base-uncased\bert-base-uncased-vocab.txt")

# b. Load the configuration file

model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)

# Modify the configuration so that hidden states and attention weights are returned

model_config.output_hidden_states = True

model_config.output_attentions = True

# Load the model from the path, using the modified configuration

model = transformers.BertModel.from_pretrained(MODEL_PATH, config=model_config)
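Since a manual download can easily miss a file, it may help to check the directory before loading from it. The following is a small sketch using only the standard library. `missing_model_files` and `BERT_FILES` are hypothetical helper names, and the list of required file names is an assumption based on the export description earlier in this post; the exact set differs between models.

```python
import os
import tempfile

# File names taken from the export description earlier in the post. Treat the
# exact set as an assumption: XLNet, for example, ships spiece.model instead
# of vocab.txt.
BERT_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def missing_model_files(model_dir, required=BERT_FILES):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Usage with a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "config.json"), "w").close()
    print(missing_model_files(tmp))
```

An empty list means every expected file is present, so a failed load is more likely caused by something else (a wrong path or a corrupted download, for example).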

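Tokenizing with the tokenizer

Encoding text with the tokenizer

For a single sentence:

# encode returns only input_ids

tokenizer.encode("i like you")

Out : [101, 1045, 2066, 2017, 102]

For a sentence pair:

# encode_plus returns all of the encoding information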

tokenizer.encode_plus("i like you", "but not him")

Out :

    {'input_ids': [101, 1045, 2066, 2017, 102, 2021, 2025, 2032, 102],

     'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1],

     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

All of the model tokenizers are implemented on top of PreTrainedTokenizer. The encoding result can contain the following fields:

{

input_ids: list[int],

token_type_ids: list[int] if return_token_type_ids is True (default)

attention_mask: list[int] if return_attention_mask is True (default)

overflowing_tokens: list[int] if a max_length is specified and return_overflowing_tokens is True

num_truncated_tokens: int if a max_length is specified and return_overflowing_tokens is True

special_tokens_mask: list[int] if add_special_tokens is set to True and return_special_tokens_mask is True

}

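What each field means:

- 'input_ids': as the name suggests, the index of each token in the vocabulary
- 'token_type_ids': segment ids that tell the two sentences apart
- 'attention_mask': marks which positions take part in self-attention (1 for real tokens, 0 for padding)
- 'overflowing_tokens': the tokens that were cut off when a maximum length is specified
- 'num_truncated_tokens': the number of tokens that were cut off
- 'special_tokens_mask': when special tokens are added, a list of 0s and 1s in which 1 marks an added special token and 0 marks an ordinary sequence token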
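To make the fields concrete, here is a toy encoder that reproduces the layout of the sentence-pair output shown earlier. It is only a sketch: `encode_pair` and `VOCAB` are hypothetical, the text is split on whitespace instead of using WordPiece, the vocabulary contains just the ids that appear in the example outputs, and the mask conventions (1 for special tokens, 0 for padding in attention_mask) are assumptions about how the library behaves.

```python
# Toy vocabulary. The ids match the example outputs earlier in this section.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "i": 1045, "like": 2066, "you": 2017,
         "but": 2021, "not": 2025, "him": 2032}
SPECIAL = ("[CLS]", "[SEP]", "[PAD]")


def encode_pair(first, second=None, max_length=None):
    """Whitespace-split toy encoder that mimics the layout of a BERT encoding."""
    tokens = ["[CLS]"] + first.split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # first segment
    if second is not None:
        second_tokens = second.split() + ["[SEP]"]
        tokens += second_tokens
        token_type_ids += [1] * len(second_tokens)  # second segment
    attention_mask = [1] * len(tokens)          # real tokens are attended to
    if max_length is not None and len(tokens) < max_length:
        pad = max_length - len(tokens)
        tokens += ["[PAD]"] * pad
        token_type_ids += [0] * pad
        attention_mask += [0] * pad             # padding is masked out
    return {
        "input_ids": [VOCAB[t] for t in tokens],
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
        "special_tokens_mask": [1 if t in SPECIAL else 0 for t in tokens],
    }


encoded = encode_pair("i like you", "but not him")
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```

Passing `max_length=12` to the same call appends three [PAD] ids and three zeros to attention_mask, which is the case where attention_mask stops being all ones.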

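Feeding the tokenizer output into the model to get the encoding

import torch

# Encode a sentence pair, then add a batch dimension and convert to tensors

encoded = tokenizer.encode_plus("i like you", "but not him")

input_ids = torch.tensor([encoded['input_ids']])

token_type_ids = torch.tensor([encoded['token_type_ids']])

# Put the model in eval mode

model.eval()

# Move the model and data to cuda if it is available, otherwise stay on cpu

device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokens_tensor = input_ids.to(device)

segments_tensors = token_type_ids.to(device)

model.to(device)

# Run the encoder

with torch.no_grad():

    # See the models docstrings for the detail of the inputs

    outputs = model(tokens_tensor, token_type_ids=segments_tensors)

    # Transformers models return a tuple (or a tuple-like output object in newer versions).

    # See the models docstrings for the detail of all the outputs

    # In our case, the first element is the hidden state of the last layer of the Bert model

    encoded_layers = outputs[0]

# encoded_layers now holds the final encoding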

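The final Bert output is:

sequence_output, pooled_output, (hidden_states), (attentions)

Taking an input sequence of length 14 as an example:

| index | name | shape | description |
| --- | --- | --- | --- |
| 0 | sequence_output | torch.Size([1, 14, 768]) | The output sequence |
| 1 | pooled_output | torch.Size([1, 768]) | The result of pooling the output sequence |
| 2 | (hidden_states) | tuple, 13 × torch.Size([1, 14, 768]) | Hidden states of every layer (including the Embedding layer); depends on output_hidden_states in model_config |
| 3 | (attentions) | tuple, 12 × torch.Size([1, 12, 14, 14]) | Attention weights of every layer; depends on output_attentions |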
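The shapes in this table follow directly from the config values shown earlier (hidden_size 768, 12 hidden layers, 12 attention heads). As a quick sanity check, the arithmetic can be written out as below. `bert_output_shapes` is a hypothetical helper that works with plain tuples rather than tensors, so it only illustrates the bookkeeping.

```python
def bert_output_shapes(batch_size, seq_len, hidden_size=768,
                       num_hidden_layers=12, num_attention_heads=12):
    """Return the expected shapes of the four Bert outputs as plain tuples."""
    hidden = (batch_size, seq_len, hidden_size)
    attention = (batch_size, num_attention_heads, seq_len, seq_len)
    return {
        "sequence_output": hidden,
        "pooled_output": (batch_size, hidden_size),
        # one entry for the embedding output plus one per encoder layer
        "hidden_states": [hidden] * (num_hidden_layers + 1),
        # one attention map per encoder layer
        "attentions": [attention] * num_hidden_layers,
    }


shapes = bert_output_shapes(batch_size=1, seq_len=14)
print(len(shapes["hidden_states"]), shapes["hidden_states"][0])
print(len(shapes["attentions"]), shapes["attentions"][0])
```

The 13 in the hidden_states row is therefore 12 encoder layers plus the embedding output, and each attention map is seq_len × seq_len per head.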

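Bert summary

In this section we used Bert as an example to walk through the overall model workflow. Many later models build on Bert with only small adjustments, so their inputs and outputs overlap with Bert's in many places.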

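Fine-tuning pretrained models on downstream tasks

As stated at the beginning, this post focuses on how to adapt a model and set up its inputs and outputs, and only briefly touches on how Transformers is implemented. We therefore do not cover topics such as writing a training loop and concentrate on the model alone. In other words, we stop once a model runs end to end, leaving out batched preprocessing, training, and validation.

The emphasis here is on fine-tuning a base model such as Bert for a real task, so we do not simply load model parameters that were already trained on a downstream task, because using those models is almost identical to the previous section.

The inputs and outputs below are described for the model's prediction pass.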

Question answering via Bert

Building the model:

import transformers

from transformers import BertTokenizer, BertForQuestionAnswering

import torch

MODEL_PATH = r"D:\transformr_files\bert-base-uncased/"

# Instantiate the tokenizer

tokenizer = BertTokenizer.from_pretrained(r"D:\transformr_files\bert-base-uncased\bert-base-uncased-vocab.txt")

# Load Bert's model_config

model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)

# First create bert_model with the pretrained weights

bert_model = transformers.BertModel.from_pretrained(MODEL_PATH, config=model_config)

# There are two outputs in the end: the start position and the end position (explained below)

model_config.num_labels = 2

# Create BertForQuestionAnswering from the same model_config (its weights are randomly initialized)

model = BertForQuestionAnswering(model_config)

# Swap in the pretrained encoder

model.bert = bert_model

In general, one base model corresponds to one Tokenizer, so there is no Tokenizer specific to a downstream task. Here BertForQuestionAnswering is initialized from bert_model.

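Task input: a question and the passage containing the answer, "Who was Jim Henson?", "Jim Henson was a nice puppet"

Task output: the answer, "a nice puppet"

# Set eval mode

model.eval()

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

# Get the input_ids encoding

input_ids = tokenizer.encode(question, text)

# Build token_type_ids by hand: 0 up to and including the first [SEP] (id 102), 1 afterwards. encode_plus can do this instead.

token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]

# Get the scores. Indexing works whether the model returns a tuple or an output object.

outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))

start_scores, end_scores = outputs[0], outputs[1]

# Decode the ids back into the original tokens

all_tokens = tokenizer.convert_ids_to_tokens(input_ids)

#['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']

Model input: input_ids, token_type_ids

Model output: start_scores and end_scores, both of shape torch.Size([1, 14]), where 14 is the sequence length. They hold each position's score for being the start or the end of the answer (unnormalized scores rather than probabilities).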

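Turning the model output into the task output:

# Decode the answer span from the scores

answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])

# assert answer == "a nice puppet"

# The model has not been fine-tuned for question answering, so the output is poor.

print(answer)

# Example output (it varies between runs because the question-answering head is randomly initialized):

# 'was jim henson ? [SEP] jim henson was a nice puppet [SEP]'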
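Taking the two argmax positions independently, as above, can produce an empty answer when the best start comes after the best end. A common alternative, which is not what the code above does, is to pick the pair with the highest combined score subject to start <= end. Here is a minimal sketch on plain Python lists. `best_span`, `decode_answer` and `max_answer_len` are hypothetical names, and the scores are made-up numbers rather than real model output.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[s] + end_scores[e], s <= e."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def decode_answer(tokens, start_scores, end_scores):
    """Join the tokens of the best valid span into an answer string."""
    start, end = best_span(start_scores, end_scores)
    return " ".join(tokens[start:end + 1])


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']
# Made-up scores that peak at 'a' (start) and 'puppet' (end).
toy_start = [0.0] * 14
toy_end = [0.0] * 14
toy_start[10] = 5.0
toy_end[12] = 5.0
print(decode_answer(tokens, toy_start, toy_end))
```

With real model output, the score tensors would first need to be converted to lists, for example with `start_scores[0].tolist()`. A fine-tuned model is still needed for the selected span to be meaningful.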

Text classification (sentiment analysis, etc.) via XLNet

Building the model:

from transformers import XLNetConfig, XLNetModel, XLNetTokenizer, XLNetForSequenceClassification

import torch

# Define the path and initialize the tokenizer

XLN_PATH = r"D:\transformr_files\XLNetLMHeadModel"

tokenizer = XLNetTokenizer.from_pretrained(XLN_PATH)

# Load the configuration

model_config = XLNetConfig.from_pretrained(XLN_PATH)

# Set the number of classes to 3

model_config.num_labels = 3

# Create XLNetForSequenceClassification directly from the XLNet path and config (equivalent to the approach in the previous section)

cls_model = XLNetForSequenceClassification.from_pretrained(XLN_PATH, config=model_config)

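Task input: a sentence, "i like you, what about you"

Task output: the class the sentence belongs to, e.g. class1

# Set eval mode

cls_model.eval()

token_codes = tokenizer.encode_plus("i like you, what about you")

# Run the model; the first output is the logits

with torch.no_grad():

    logits = cls_model(torch.tensor([token_codes['input_ids']]), token_type_ids=torch.tensor([token_codes['token_type_ids']]))[0]

Model input: input_ids, token_type_ids

Model output: logits and hidden states. logits has shape torch.Size([1, 3]), where 3 is the number of classes. During training, when labels are passed in, the first item is the loss.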
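The logits still have to be turned into the task output, a class label. One simple way is a softmax followed by an argmax, sketched here on a plain list. `softmax` and `predict_class` are hypothetical helpers, and the class names and logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Made-up logits for a single sentence and three hypothetical class names.
label, prob = predict_class([0.2, 1.5, -0.3], ["class0", "class1", "class2"])
print(label, round(prob, 3))
```

With a real model, the input would be something like `logits[0].tolist()` for the first sentence in the batch. Before fine-tuning, the predicted class is essentially arbitrary.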

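Other tasks (to be updated)

Other models work in much the same way as the two above, so you can experiment on your own. I will keep experimenting with the library and will add any differences in usage here.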

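References

This post gives a brief introduction to the HuggingFace library. For installation and similar details, see the official GitHub repository.

This post is mainly based on the official documentation.

While working to understand the models, I also consulted several Kaggle notebooks, chiefly one by Abhishek Thakur.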
