Hi Vikas --

the optimum number of topics (K in LDA) is dependent on a at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. This number will in the best
case make your ppx minimal. A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, larger alpha will force
more " "decisive" choices of z for each token, leading to a concentration of
theta to fewer weights, which influences K.

Trouble minimizing perplexity in LDA

I am running LDA from Mark Steyver's MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (for e.g. words such Apache, java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases with increasing number of topics. I tried different values of ALPHA but no difference.

I need to find optimal number of topics and for that perplexity plot should reach a minimum. Please suggest what may be wrong.

Definition and details regarding calculation of perplexity of a topic model is explained in this post.

Edit: I played with hyperparameters alpha and beta and now perplexity seems to reach a minimum. It is not clear to me as to how these hyperparameters affect perplexity. Initially I was plotting results till 200 topics without any success. Now on the same range minimum is reached at around 50-60 topics (which was my intuition) after modifying hyperparameters. Also, as this postnotes, you bias optimal number of topics according to specific values of hyperparameters.

asked Sep 14 '12 at 5:22
 
1  
Many of us probably don't know what perplexity means and what aperplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54
1  
@MichaelChernick: I edited post to include a link detailing perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27
1  
Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52
 
How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22
 
@Nick: Indeep HDP, a nonparametric topic modelling algorithm is an alternative to LDA, wherein you don't have to tune hyperparameters. However at this point I would like to stick to LDA and know how and why perplexity behaviour changes drastically with regards to small adjustments in hyperparameters. Also, my corpus size is quite large. For e.g. I have tokenized Apache Lucene source code with ~1800 java files and 367K source code lines. So that's a pretty big corpus I guess. – abhinavkulkarni Sep 15 '12 at 2:21

You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which according to this paper, leads to the model being much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set number of topics pretty high without negatively affecting results.

In my experience hyperparameter optimization and asymmetric priors gave significantly better topics than without it, but I haven't tried the Matlab Topic Modelling toolkit.

 

lda topic number的更多相关文章

  1. 如何确定LDA的主题个数

    本文参考自:https://www.zhihu.com/question/32286630 LDA中topic个数的确定是一个困难的问题. 当各个topic之间的相似度的最小的时候,就可以算是找到了合 ...

  2. (转) Parameter estimation for text analysis 暨LDA学习小结

    Reading Note : Parameter estimation for text analysis 暨LDA学习小结 原文:http://www.xperseverance.net/blogs ...

  3. Spark MLlib LDA 源代码解析

    1.Spark MLlib LDA源代码解析 http://blog.csdn.net/sunbow0 Spark MLlib LDA 应该算是比較难理解的,当中涉及到大量的概率与统计的相关知识,并且 ...

  4. 【转】LDA数学八卦

    转自LDA数学八卦 在 Machine Learning 中,LDA 是两个常用模型的简称: Linear Discriminant Analysis 和 Latent Dirichlet Alloc ...

  5. cvpr2015papers

    @http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...

  6. 用python+selenium抓取微博24小时热门话题的前15个并保存到txt中

    抓取微博24小时热门话题的前15个,抓取的内容请保存至txt文件中,需要抓取排行.话题和阅读数 #coding=utf-8 from selenium import webdriver import ...

  7. 【原创】Kakfa api包源代码分析

    既然包名是api,说明里面肯定都是一些常用的Kafka API了. 一.ApiUtils.scala 顾名思义,就是一些常见的api辅助类,定义的方法包括: 1. readShortString: 从 ...

  8. Machine and Deep Learning with Python

    Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...

  9. Citect:How do I translate Citect error messages?

    http://www.opcsupport.com/link/portal/4164/4590/ArticleFolder/51/Citect   To decode the error messag ...

随机推荐

  1. Activity回传值报错:Failure delivering result ResultInfo{who=null,request=7,result = 0,data=null}

    Activity  A   -----值------->  Activity  B   -----值----->  Activity  A     场景:当A跳转到B,再从B直接点击返回按 ...

  2. 《剑指offer》第四十一题(数据流中的中位数)

    // 面试题41:数据流中的中位数 // 题目:如何得到一个数据流中的中位数?如果从数据流中读出奇数个数值,那么 // 中位数就是所有数值排序之后位于中间的数值.如果从数据流中读出偶数个数值, // ...

  3. 虹软2.0免费离线人脸识别 Demo [C++]

    环境: win10(10.0.16299.0)+ VS2017 sdk版本:ArcFace v2.0 OPENCV3.43版本 x64平台Debug.Release配置都已通过编译 下载地址:http ...

  4. js中use或者using方法

    看Vue.use方法,想起了以前工作中别人用过的use方法. var YANMethod = { using:function() { var a = arguments, o = this, i = ...

  5. Java使用Spring初识

    1.首先是引用了,然后pom.xml如下: <dependency> <groupId>org.springframework</groupId> <arti ...

  6. 邂逅明下 HDU - 2897

    Problem description: 有三个数字n,p,q,表示一堆硬币一共有n枚,从这个硬币堆里取硬币,一次最少取p枚,最多q枚,如果剩下少于p枚就要一次取完.两人轮流取,直到堆里的硬币取完,最 ...

  7. python基础之面向对象的多继承以及MRO算法

    内容梗概: 1. python多继承 2. python经典类的MRO 3. python新式类的MRO C3算法 1.python多继承 class Shen: def fly(self): pri ...

  8. Excel文件的读写

    import xlsxwriter,xlrd import sys,os.path fname = 'zm6.xlsx' if not os.path.isfile(fname): print ('文 ...

  9. vivado实现模16的计数器

    `timescale 1ns / 1ps module ctr_mod_16( clk, rst_n, count ); input clk, rst_n; :] count; wire clk, r ...

  10. python-部分redis

    数据量非常大时想向数据库中保存的时候,可以在中间加一个队列(队列的长度可以控制),可能数据库一个个取会效率慢一些,但是不会服务端不会蹦 redis: 端口6379 1.本质:向内存中存数据 2.对内存 ...