Text Classification

Text classification is used here for the extrinsic evaluation of word embeddings, that is, as a downstream task.

Some concepts are drawn from the Fudan University NLP Group (复旦大学NLP组).

Statistics-Based Methods

Logistic Regression

From a statistical perspective, text classification is described as follows [Li Y 2015].

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.
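For reference, the objective that LIBLINEAR minimizes for ℓ2-regularized logistic regression (primal form) can be written as below. Here C is the penalty parameter, the labels y_i take values in {-1, +1}, and x_i is the feature vector of the i-th title. How the quoted work builds x_i from the learned embeddings is not stated in the excerpt, so treat x_i as an assumption. For the four-category setting, LIBLINEAR trains one such binary problem per class (one-vs-rest).

```latex
\min_{w} \; \frac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{l} \log\!\left(1 + e^{-y_i\, w^{\top} x_i}\right)
```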

It is also described as follows [Kiros 2015].

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also applied this kind of method.

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.
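The averaging-then-logistic-regression pipeline quoted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-d embeddings in `EMB`, the toy documents, and the helper names `doc_vector`, `train_logreg`, and `predict` are all hypothetical, and a plain gradient-descent trainer for the binary case stands in for a library solver.

```python
import numpy as np

# Hypothetical 2-d word embeddings; real ones would come from a trained model.
EMB = {
    "goal": np.array([1.0, 0.1]),
    "match": np.array([0.9, 0.2]),
    "team": np.array([0.8, 0.0]),
    "film": np.array([0.1, 1.0]),
    "actor": np.array([0.0, 0.9]),
    "scene": np.array([0.2, 0.8]),
}


def doc_vector(tokens, emb, dim=2):
    """Average the embeddings of in-vocabulary tokens (zeros if none match)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def train_logreg(X, y, lr=0.5, epochs=500, l2=1e-3):
    """Binary logistic regression trained by batch gradient descent with an L2 penalty."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)  # gradient of loss + penalty
        b -= lr * np.mean(p - y)
    return w, b


def predict(X, w, b):
    """Return class 1 where the decision value is positive, else class 0."""
    return (X @ w + b > 0).astype(int)


# Toy corpus: label 0 = sports, label 1 = movies (made up for illustration).
train_docs = [["goal", "match"], ["team", "goal"], ["film", "actor"], ["scene", "film"]]
train_y = np.array([0, 0, 1, 1])
test_docs = [["match", "team"], ["actor", "scene"]]
test_y = np.array([0, 1])

# Document-level representations are built the same way for train and test.
X_train = np.stack([doc_vector(d, EMB) for d in train_docs])
X_test = np.stack([doc_vector(d, EMB) for d in test_docs])

w, b = train_logreg(X_train, train_y)
test_pred = predict(X_test, w, b)
accuracy = float(np.mean(test_pred == test_y))
print("test accuracy:", accuracy)
```

Averaging keeps the document vector on the same scale regardless of document length, which is one reason it is a common default for this kind of evaluation.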

Linear SVM

It is described as follows [Kiela 2015].

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.
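A rough sketch of the summing-then-SVM procedure with ten-fold cross-validation is given below. Everything concrete in it is an assumption made for illustration: the 2-d embeddings, the synthetic documents, and the helper names `doc_vector`, `train_linear_svm`, and `cross_val_accuracy`. The linear SVM is trained with a simple batch subgradient method on the hinge loss and has no bias term, in place of a library solver, and the development-set tuning step from the quote is omitted.

```python
import numpy as np

# Hypothetical 2-d word embeddings for two toy topics.
EMB = {
    "goal": np.array([1.0, -0.1]),
    "match": np.array([0.9, 0.0]),
    "team": np.array([0.8, -0.2]),
    "film": np.array([-0.1, 1.0]),
    "actor": np.array([0.0, 0.9]),
    "scene": np.array([-0.2, 0.8]),
}
SPORT = ["goal", "match", "team"]
MOVIE = ["film", "actor", "scene"]


def doc_vector(tokens, emb, dim=2):
    """Sum the embeddings of in-vocabulary tokens (zeros if none match)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM without bias: subgradient descent on L2-regularized hinge loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        viol = y * (X @ w) < 1  # examples inside the margin or misclassified
        hinge_grad = -(y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
        w -= lr * (lam * w + hinge_grad)
    return w


def cross_val_accuracy(X, y, k=10):
    """Mean accuracy over k contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        w = train_linear_svm(X[train_mask], y[train_mask])
        pred = np.where(X[test_idx] @ w >= 0, 1, -1)
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))


# Build 20 synthetic three-word documents, alternating sports (+1) and movies (-1).
docs, labels = [], []
for i in range(10):
    docs.append([SPORT[i % 3], SPORT[(i + 1) % 3], SPORT[(2 * i) % 3]])
    labels.append(1)
    docs.append([MOVIE[i % 3], MOVIE[(i + 1) % 3], MOVIE[(2 * i) % 3]])
    labels.append(-1)

X = np.stack([doc_vector(d, EMB) for d in docs])
y = np.array(labels)

cv_accuracy = cross_val_accuracy(X, y, k=10)
print("10-fold CV accuracy:", cv_accuracy)
```

Unlike averaging, summing makes the norm of the document vector grow with document length, so in practice the features may need scaling before they are passed to a margin-based classifier.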

Bibliography

Fudan University NLP Group (复旦大学NLP组). NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y 2015] Li, Y., Li, W., Sun, F., et al. "Component-Enhanced Chinese Character Embeddings." EMNLP (2015): 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28 (2015).

[Yan Song 2018] Song, Yan, et al. "Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks." IJCAI (2018).

[Kiela 2015] Kiela, Douwe, et al. "Specializing Word Embeddings for Similarity or Relatedness." EMNLP (2015).
