Comparison of FastText and Word2Vec
Facebook Research open sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are based upon word2vec.
Download data
import nltk
nltk.download()
# Only the brown corpus is needed in case you don't have it.
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training

# Generate brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
# the archive contains a single file named text8
!unzip text8.zip
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
Train models
If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
For training the gensim models -
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

MODELS_DIR = 'models/'

brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')

text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
Download models
In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.
Comparisons
from gensim.models import Word2Vec

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        # Avoid division by zero: a section with no evaluable questions prints as 0/1
        total = total if total else 1
        accuracy = 100*float(correct)/total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))
    # The first five sections of questions-words.txt are the semantic ones
    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100*float(sem_correct)/sem_total))
    # The remaining sections are syntactic; the last entry is the overall total
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100*float(syn_correct)/syn_total))

MODELS_DIR = 'models/'

word_analogies_file = 'questions-words.txt'

print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)

print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
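The semantic/syntactic split in print_accuracy relies on the first five sections being the semantic ones. A minimal sketch of an alternative that splits by section name instead is shown below. It assumes the results are a list of dicts with 'section', 'correct' and 'incorrect' keys (the shape used above) and that syntactic section names start with 'gram', as they do in the output. The helper name summarize and the toy data are made up for illustration.

```python
def summarize(sections):
    """Aggregate analogy results into semantic and syntactic totals.

    `sections` is assumed to be a list of dicts with the keys 'section',
    'correct' and 'incorrect', like the result of model.accuracy() above.
    """
    totals = {'semantic': [0, 0], 'syntactic': [0, 0]}
    for sec in sections:
        if sec['section'] == 'total':
            continue  # skip the aggregate row so nothing is counted twice
        kind = 'syntactic' if sec['section'].startswith('gram') else 'semantic'
        totals[kind][0] += len(sec['correct'])
        totals[kind][1] += len(sec['correct']) + len(sec['incorrect'])
    # Report (correct, total, accuracy in %) and guard against empty groups
    return {kind: (c, t, 100.0 * c / t if t else 0.0)
            for kind, (c, t) in totals.items()}


# Toy data (not real results), only to show the expected input shape
toy = [
    {'section': 'family', 'correct': ['q1', 'q2'], 'incorrect': ['q3', 'q4']},
    {'section': 'gram8-plural', 'correct': ['q5', 'q6', 'q7'], 'incorrect': ['q8']},
    {'section': 'total', 'correct': ['q'] * 5, 'incorrect': ['q'] * 3},
]
summary = summarize(toy)
print(summary)
```

Empty sections contribute nothing here, whereas the per-section printout above shows them as 0/1 because of the division-by-zero guard.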
Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks (29.12% vs 14.84%, though only the family section had any questions that could be evaluated on this corpus), while the fastText embeddings do significantly better on the syntactic analogies (53.28% vs 2.88%). This makes sense, since fastText embeddings are trained to capture morphological nuances, and most of the syntactic analogies are morphology based.
Let me explain that better.
According to the paper [1], embeddings for words are represented by the sum of their n-gram embeddings. This is meant to be useful for morphologically rich languages - so theoretically, the embedding for apparently would include information from both character n-grams apparent and ly (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.
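To make the n-gram idea concrete, here is a small sketch of how a word can be broken into character n-grams. It assumes the scheme described in the paper: the word is wrapped in the boundary markers < and >, and n-grams of length 3 to 6 are extracted (these lengths match fastText's default minn and maxn settings). The function name char_ngrams is my own, and this is an illustration rather than fastText's actual implementation, which also hashes the n-grams into buckets and adds a vector for the whole word.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, using < and > as boundary markers."""
    token = '<' + word + '>'
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]


ngrams = char_ngrams('apparently')
print(ngrams[:3])       # the first few 3-grams
print('ly>' in ngrams)  # the suffix shows up as the n-gram 'ly>'
```

With these lengths, apparent and ly would not literally appear as single n-grams, but their information is carried by overlapping pieces such as appare and ly>.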
Example analogy:
amazing amazingly calm calmly
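This analogy is marked correct if, approximately:
embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)
(in practice, the check is that calmly is the word whose vector is closest to embedding(amazingly) - embedding(amazing) + embedding(calm)).
Both these subtractions would leave a very similar set of remaining n-grams, so it is no surprise that the fastText embeddings do extremely well on this.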
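The overlap between the two subtractions can be checked directly on the n-gram sets. The sketch below again assumes < and > boundary markers and n-gram lengths 3 to 6. The helper names char_ngrams and added_ngrams are hypothetical, and real fastText vectors are sums of learned n-gram vectors rather than plain sets.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of `word`, with < and > as boundaries."""
    token = '<' + word + '>'
    return {token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)}


def added_ngrams(base, derived):
    """N-grams that appear in `derived` but not in `base`."""
    return char_ngrams(derived) - char_ngrams(base)


# N-grams introduced by the -ly suffix in both word pairs
shared = added_ngrams('amazing', 'amazingly') & added_ngrams('calm', 'calmly')
print(shared)  # {'ly>'}
```

For this pair the shared part is just the suffix n-gram ly>, which is the morphological signal both differences have in common.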
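A brief note on hyperparameters - the Gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). One difference worth keeping in mind is that gensim's Word2Vec trains CBOW by default (sg=0), whereas the fastText models above were trained with the skipgram command. Of course, they are two completely different models (albeit with a few similarities).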
Let's try with a larger corpus now - text8 (a collection of wiki articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, both the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps, and the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)
print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
With the text8 corpus, the semantic accuracy for the fastText model increases significantly (from 14.84% to 34.68%), and it surpasses word2vec on accuracies for both semantic (34.68% vs 19.47%) and syntactic (65.52% vs 36.56%) analogies. However, the increase in syntactic accuracy from the increase in corpus size is much higher for word2vec (from 2.88% to 36.56%, compared with 53.28% to 65.52% for fastText).
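The size of the syntactic gain for each model can be worked out from the printed accuracies. The values below are copied from the two evaluation outputs above, and the variable names are only for illustration.

```python
# Syntactic accuracy (%) copied from the evaluation outputs above
syntactic = {
    'fastText': {'brown': 53.28, 'text8': 65.52},
    'word2vec': {'brown': 2.88, 'text8': 36.56},
}

# Gain in percentage points when moving from the brown corpus to text8
gains = {name: round(acc['text8'] - acc['brown'], 2)
         for name, acc in syntactic.items()}
print(gains)  # {'fastText': 12.24, 'word2vec': 33.68}
```

So word2vec gains roughly 34 percentage points against roughly 12 for fastText, although it starts from a much lower baseline.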
These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.
References