Hi Vikas --

the optimum number of topics (K in LDA) is dependent on a at least two factors:

Firstly, your data set may have an intrinsic number of topics, i.e., may derive

from some natural clusters that your data have. This number will in the best

case make your ppx minimal. A non-parametric approach like HDP would ideally

result in the same K as the one that minimises ppx for LDA.  The second type of

influence is that of the hyperparameters. If you fix the Dirichlet parameters

alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and

(phi | beta)), you bias the optimum K. For instance, larger alpha will force

more " "decisive" choices of z for each token, leading to a concentration of

theta to fewer weights, which influences K.

Trouble minimizing perplexity in LDA

up vote1down vote favorite

I am running LDA from Mark Steyver's MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (for e.g. words such Apache, java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases with increasing number of topics. I tried different values of ALPHA but no difference.

I need to find optimal number of topics and for that perplexity plot should reach a minimum. Please suggest what may be wrong.

Definition and details regarding calculation of perplexity of a topic model is explained in this post.

Edit: I played with hyperparameters alpha and beta and now perplexity seems to reach a minimum. It is not clear to me as to how these hyperparameters affect perplexity. Initially I was plotting results till 200 topics without any success. Now on the same range minimum is reached at around 50-60 topics (which was my intuition) after modifying hyperparameters. Also, as this postnotes, you bias optimal number of topics according to specific values of hyperparameters.

machine-learning topic-models hyperparameter

share improve this question

edited Sep 15 '12 at 2:13

asked Sep 14 '12 at 5:22

abhinavkulkarni
2586

Many of us probably don't know what perplexity means and what aperplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54

@MichaelChernick: I edited post to include a link detailing perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27

Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52

How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22

@Nick: Indeep HDP, a nonparametric topic modelling algorithm is an alternative to LDA, wherein you don't have to tune hyperparameters. However at this point I would like to stick to LDA and know how and why perplexity behaviour changes drastically with regards to small adjustments in hyperparameters. Also, my corpus size is quite large. For e.g. I have tokenized Apache Lucene source code with ~1800 java files and 367K source code lines. So that's a pretty big corpus I guess. – abhinavkulkarni Sep 15 '12 at 2:21

1 Answer

active oldest votes

up vote2down vote

You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which according to this paper, leads to the model being much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set number of topics pretty high without negatively affecting results.

In my experience hyperparameter optimization and asymmetric priors gave significantly better topics than without it, but I haven't tried the Matlab Topic Modelling toolkit.

share improve this answer

lda topic number的更多相关文章

如何确定LDA的主题个数
本文参考自:https://www.zhihu.com/question/32286630 LDA中topic个数的确定是一个困难的问题. 当各个topic之间的相似度的最小的时候,就可以算是找到了合 ...
（转) Parameter estimation for text analysis 暨LDA学习小结
Reading Note : Parameter estimation for text analysis 暨LDA学习小结原文:http://www.xperseverance.net/blogs ...
Spark MLlib LDA 源代码解析
1.Spark MLlib LDA源代码解析 http://blog.csdn.net/sunbow0 Spark MLlib LDA 应该算是比較难理解的,当中涉及到大量的概率与统计的相关知识,并且 ...
【转】LDA数学八卦
转自LDA数学八卦在 Machine Learning 中,LDA 是两个常用模型的简称: Linear Discriminant Analysis 和 Latent Dirichlet Alloc ...
cvpr2015papers
@http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...
用python+selenium抓取微博24小时热门话题的前15个并保存到txt中
抓取微博24小时热门话题的前15个,抓取的内容请保存至txt文件中,需要抓取排行.话题和阅读数 #coding=utf-8 from selenium import webdriver import ...
【原创】Kakfa api包源代码分析
既然包名是api,说明里面肯定都是一些常用的Kafka API了. 一.ApiUtils.scala 顾名思义,就是一些常见的api辅助类,定义的方法包括: 1. readShortString: 从 ...
Machine and Deep Learning with Python
Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...
Citect:How do I translate Citect error messages?
http://www.opcsupport.com/link/portal/4164/4590/ArticleFolder/51/Citect To decode the error messag ...

随机推荐

IIS上部署MVC网站，打开后ExtensionlessUrlHandler-4.0
IIS上部署MVC网站,打开后ExtensionlessUrlHandler-Integrated-4.0解决方法IIS上部署MVC网站,打开后500错误 IS上部署MVC网站,打开后Extensio ...
[osg]osg背景图设置
转自:https://blog.csdn.net/qq_30754211/article/details/61190698 #include <osg/Geometry> #include ...
webpack2的配置属性说明entry,output,state,plugins,node,module,context
Webpack2配置属性详解 webpack说明 webpack是前端构建的一个核心所在,如果说后端构建就是把高级语言代码编译成机器码,那么前端的构建就是重新组合原有的代码,虽然并不编译成机器码,但实 ...
IPC 之 AIDL 的使用
一.AIDL 知识储备 1. AIDL 文件支持的数据类型: 基本数据类型 (int , long , char , boolean ,double 等): String 和 CharSequence ...
Django 基础介绍
Django 介绍 Python下有许多款不同的 Web 框架.Django是重量级选手中最有代表性的一位.许多成功的网站和APP都基于Django. Django是一个开放源代码的Web应用框架,由 ...
composer修改中文镜像
composer config -g repo.packagist composer https://packagist.phpcomposer.com
利用unittest+ddt进行接口测试(二)：使用yaml文件管理测试数据
知道ddt的基本使用方法之后,练习把之前用excel文件来维护的接口测试用例改用unittest+ddt来实现. 这里我选用yaml文件来管理接口参数,开始本来想用json,但是json无法添加注释, ...
表结构中updated_time设计为ON UPDATE CURRENT_TIMESTAMP时，使用过程的一个坑
一.mysql表结构中存在如下设计时表结构中updated_time设计为ON UPDATE CURRENT_TIMESTAMP时,如下 `updated_time` datetime NOT NU ...
java测试感想
package ATM; public class Account { private String accountID; private String accountname; private St ...
第一章：IPsecVPN
第一章一.VPN(virtual private Network,虚拟专用网)的基本概念 VPN连接模式分为两种,分别是传输模式和隧道模式传输模式:在整个VPN传输中,ip包头并没有被封装进去隧 ...

lda topic number

Trouble minimizing perplexity in LDA

1 Answer

lda topic number的更多相关文章

随机推荐

热门专题