直接把自己的工作文档导入的,由于是在外企工作,所以都是英文写的

chinese and english tokens result

input: "我爱中国",tokens:["我","爱","中","国"]

input: "I love china habih", tokens:["I","love","china","ha","##bi","##h"] (here "##bi","##h" are all in vocabulary)

Implementation

chinese and english text would call two tokens,one is basic_tokenizer and other one is wordpiece_tokenizer as you can see from the the code below.

basic_tokenizer

if the input is chinese, the _tokenize_chinese_chars would add whitespace between chinese charater, and then call whitespace_tokenizer which separate text with whitespace,so

if the input is the query "我爱中国",would return ["我","爱","中","国"],if the input is the english query "I love china", would return ["I","love","china"]

wordpiece_tokenizer

if the input is chinese ,if would iterate tokens from basic_tokenizer result, if the character is in vocabulary, just keep the same character ,otherwise append unk_token.

if the input is english, we would iterate over one word ,for example: the word is "shabi", while it is not in vocabulary,so the end index would rollback until it found "sh" in vocabulary,

in the following process, once it found a substr in vocabulary ,it would append "##" then append it to output tokens, so we can get ["sh","##ab","##i"] finally.

#65 How are out of vocabulary words handled for Chinese?

The top 8000 characters are character-tokenized, other characters are mapped to [UNK]. I should've commented this section better but it's here

Basically that's saying if it tries to apply WordPiece tokenization (the character tokenization happens previously), and it gets to a single character that it can't find, it maps it to unk_token.

#62 Why Chinese vocab contains ##word?

This is the character used to denote WordPieces, it's just an artifact of the WordPiece vocabulary generator that we use, but most of those words were never actually used during training (for Chinese). So you can just ignore those tokens. Note that for the English characters that appear in Chinese text they are actually used.

bert中的分词的更多相关文章

  1. 一文读懂BERT中的WordPiece

    1. 前言 2018年最火的论文要属google的BERT,不过今天我们不介绍BERT的模型,而是要介绍BERT中的一个小模块WordPiece. 2. WordPiece原理 现在基本性能好一些的N ...

  2. 广告行业中那些趣事系列8:详解BERT中分类器源码

    最新最全的文章请关注我的微信公众号:数据拾光者. 摘要:BERT是近几年NLP领域中具有里程碑意义的存在.因为效果好和应用范围广所以被广泛应用于科学研究和工程项目中.广告系列中前几篇文章有从理论的方面 ...

  3. Python中结巴分词使用手记

    手记实用系列文章: 1 结巴分词和自然语言处理HanLP处理手记 2 Python中文语料批量预处理手记 3 自然语言处理手记 4 Python中调用自然语言处理工具HanLP手记 5 Python中 ...

  4. Elasticsearch中的分词器比较及使用方法

    Elasticsearch 默认分词器和中分分词器之间的比较及使用方法 https://segmentfault.com/a/1190000012553894 介绍:ElasticSearch 是一个 ...

  5. ES中的分词器

    基本概念: 全文搜索引擎会用某种算法对要建索引的文档进行分析, 从文档中提取出若干Token(词元), 这些算法称为Tokenizer(分词器), 这些Token会被进一步处理, 比如转成小写等, 这 ...

  6. 开源自然语言处理工具包hanlp中CRF分词实现详解

     CRF简介 CRF是序列标注场景中常用的模型,比HMM能利用更多的特征,比MEMM更能抵抗标记偏置的问题. [gerative-discriminative.png] CRF训练 这类耗时的任务,还 ...

  7. sklearn中的分词函数countVectorizer()的改动--保留长度为1的字符串

    1简述问题 使用countVectorizer()将文本向量化时发现,文本中长度唯一的字符串会被自动过滤掉,这对于我在做的情感分析来讲,一些表较重要的表达情感倾向的词汇被过滤掉,比如文本'没用的东西, ...

  8. nlp任务中的传统分词器和Bert系列伴生的新分词器tokenizers介绍

    layout: blog title: Bert系列伴生的新分词器 date: 2020-04-29 09:31:52 tags: 5 categories: nlp mathjax: true ty ...

  9. 中文分词中的战斗机-jieba库

    英文分词的第三方库NLTK不错,中文分词工具也有很多(盘古分词.Yaha分词.Jieba分词等).但是从加载自定义字典.多线程.自动匹配新词等方面来看. 大jieba确实是中文分词中的战斗机. 请随意 ...

随机推荐

  1. 斯坦福大学公开课机器学习:Neural Networks,representation: non-linear hypotheses(为什么需要做非线性分类器)

    如上图所示,如果用逻辑回归来解决这个问题,首先需要构造一个包含很多非线性项的逻辑回归函数g(x).这里g仍是s型函数(即 ).我们能让函数包含很多像这的多项式,当多项式足够多时,那么你也许能够得到可以 ...

  2. MATLAB:图像的移动(move函数)

    图像移动涉及到move函数,实现过程如下: close all; %关闭当前所有图形窗口,清空工作空间变量,清除工作空间所有变量 clear all; clc; I=imread('lenna.bmp ...

  3. bin内置函数

    光棍们对1总是那么敏感,因此每年的11.11被戏称为光棍节.小Py光棍几十载,光棍自有光棍的快乐.让我们勇敢地面对光棍的身份吧,现在就证明自己:给你一个整数a,数出a在二进制表示下1的个数,并输出. ...

  4. range与xrange的区别

    一.Python中range()与xrange()有什么区别 range([start,] stop[, step]),根据start与stop指定的范围以及step设定的步长,生成一个序列 rang ...

  5. 无法使用备份文件 'D:\20160512.bak',因为原先格式化该文件时所用扇区大小为 512,而目前所在设备的扇区大小为 4096

    删除原先备份的记录 这里再加一条,如果你备份的文件还原有兼容性的问题,那就用低版本的sql做备份,这样的话哪里都能用

  6. mysql alter 用法,修改表,字段等信息

    一: 修改表信息 1.修改表名 alter table test_a rename to sys_app; 2.修改表注释 alter table sys_application comment '系 ...

  7. JavaScript 无刷新修改浏览器URL地址栏

    //发现地址栏已改为:newUrlvar stateObject = {}; var title = "Wow Title"; var newUrl = "/my/awe ...

  8. luogu P4161 [SCOI2009]游戏

    传送门 我们发现整个大置换中,会由若干形如\((a_1\rightarrow a_2,a_2\rightarrow a_3,...a_{n-1}\rightarrow a_n,a_n\rightarr ...

  9. 第16月第15天 glut

    1. https://tokoik.github.io/opengl/libglut.html https://github.com/wistaria/wxtest/tree/master/C htt ...

  10. POJ 1035 Spell checker (模拟)

    题目链接 Description You, as a member of a development team for a new spell checking program, are to wri ...