Reading Note: Parameter estimation for text analysis, with a summary of learning LDA

Original post: http://www.xperseverance.net/blogs/2013/03/1744/

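The great Parameter estimation for text analysis! Once you have more or less worked through this paper, the basics of LDA are done, which means you understand the basic model. So my study of the model itself ends here. The next stage would be the endless variants of LDA, but those are not very useful any more, because LDA has already been milked for papers at every major "forum"...

Leaving aside the deep and complicated mathematics behind it, LDA itself really does not contain much. I still do not understand variational methods, but I have finally understood the sentence "LDA is just a simple model".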

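A summary of the learning path:

1. Basic probability concepts: CDF, PDF, Bayes' rule, the simple distributions (Bernoulli, binomial, multinomial), and an understanding of prior, likelihood, and posterior (PRML 1.2)

2. Conjugacy: why is the Beta distribution conjugate to the Bernoulli? The Dirichlet distribution

3. Probabilistic Graphical Models: PRML Chapter 8, the basic concepts are enough

4. Sampling algorithms: Basic Sampling, Sampling Methods (PRML Chapter 11), Markov chain Monte Carlo (MCMC), Gibbs Sampling

5. Reading notes on the original paper: [JMLR] LDA

6. Further material: "Gibbs Sampling for the Uninitiated", and the paper discussed here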

-------------------- The great dividing line! PETA! --------------------

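I. The preliminary part (not specific to LDA)

On ML, MAP, and Bayesian inference

II. Fixing the model in memory

From the figure of the model, the points to remember are:

1. θm: every document has its own θ, so M documents give M vectors θm, and θ as a whole is an M*K matrix (M docs, each with a K-dimensional topic distribution vector).

2. There are only K vectors φk in total, one per topic. These parameters are independent of the documents, i.e. they are sampled only once for the whole corpus. Unlike θm, which belongs to one document and differs from document to document, φk is the same for all documents. Together the φk form a K*V matrix (K topics, each with a V-dimensional distribution over the words that topic generates).

That is all.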
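To make the two shapes concrete, here is a minimal sketch of how a document would be generated once θ (M x K) and φ (K x V) are known. The class name `LdaShapesSketch`, the helper names, and all numbers are made up for illustration; in the real model θm and φk would themselves be drawn from Dirichlet priors, which this sketch skips.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of the shapes of theta (M x K) and phi (K x V); all values are made up. */
final class LdaShapesSketch {

    /** Draw an index from a normalised discrete distribution. */
    static int draw(double[] dist, Random rng) {
        double u = rng.nextDouble();
        double cum = 0.0;
        for (int i = 0; i < dist.length; i++) {
            cum += dist[i];
            if (u < cum) {
                return i;
            }
        }
        return dist.length - 1; // guard against rounding in the cumulative sum
    }

    /** Generate the word ids of one document of the given length for document index m. */
    static int[] generate(double[][] theta, double[][] phi, int m, int length, Random rng) {
        int[] words = new int[length];
        for (int n = 0; n < length; n++) {
            int k = draw(theta[m], rng);  // topic from the document's own theta_m
            words[n] = draw(phi[k], rng); // word from the corpus-wide phi_k
        }
        return words;
    }

    public static void main(String[] args) {
        double[][] theta = { {0.9, 0.1}, {0.2, 0.8}, {0.5, 0.5} };        // M = 3, K = 2
        double[][] phi = { {0.7, 0.2, 0.1, 0.0}, {0.0, 0.1, 0.2, 0.7} }; // K = 2, V = 4
        Random rng = new Random(42);
        for (int m = 0; m < theta.length; m++) {
            System.out.println(Arrays.toString(generate(theta, phi, m, 8, rng)));
        }
    }
}
```

Each document reads only its own row of θ, while every document shares the same rows of φ.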

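III. Derivation

Equation (39): P(p|α)=Dir(p|α) is the probability density of drawing the multinomial parameter p from a Dirichlet distribution with parameter α; it is the standard Dirichlet PDF. The Dirichlet delta function here is:

$$\Delta(\vec{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$$

Remember this function, because everything that follows is built from it.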
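Since the Dirichlet delta function keeps coming back, a small sketch of how it could be evaluated may help. It works in log space because the Gamma values overflow quickly. Java has no built-in log-Gamma, so the sketch assumes a standard Lanczos approximation; the class and method names are hypothetical and not part of the sampler code later in this post.

```java
/** Sketch: log of the Dirichlet delta function, log Delta(alpha). */
final class DirichletDelta {

    // Coefficients of a widely used Lanczos approximation to log Gamma.
    private static final double[] COF = {
        76.18009172947146, -86.50532032941677, 24.01409824083091,
        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5
    };

    /** log Gamma(x) for x > 0 (approximation, not an exact evaluation). */
    static double logGamma(double x) {
        double y = x;
        double tmp = x + 5.5;
        tmp -= (x + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (double c : COF) {
            y += 1.0;
            ser += c / y;
        }
        return -tmp + Math.log(2.5066282746310005 * ser / x);
    }

    /** log Delta(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k). */
    static double logDelta(double[] alpha) {
        double sumLogGamma = 0.0;
        double sumAlpha = 0.0;
        for (double a : alpha) {
            sumLogGamma += logGamma(a);
            sumAlpha += a;
        }
        return sumLogGamma - logGamma(sumAlpha);
    }

    public static void main(String[] args) {
        // Delta((1,1,1)) = Gamma(1)^3 / Gamma(3) = 1/2, up to approximation error.
        System.out.println(Math.exp(logDelta(new double[] {1.0, 1.0, 1.0})));
    }
}
```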

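Equation (43) is the likelihood of the unigram language model: given a corpus W and the count of each word in W, maximising L by maximum likelihood estimation yields the multinomial parameter p.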

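Equation (44) covers the case with a prior: given the corpus W and the parameter α, the posterior of the multinomial parameter p is Dir(p|α+n). As I recall, this derivation is explained in PRML 2.1 (the Beta case; the Dirichlet case is in 2.2). Setting aside the heavy proof, just by comparing with the normalisation term of the standard Dirichlet distribution it is easy to see that the normalisation term in equation (46) is Δ(α+n). To estimate p from W at this point, use Bayesian inference: take the expectation of p under this Dirichlet PDF.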
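The difference between the two estimates is easy to see in code. The following is a minimal sketch, assuming the maximum likelihood solution p_t = n_t / N and the posterior mean (n_t + α_t) / (N + Σ α_t) of Dir(p|α+n); the class and method names are invented for this note.

```java
/** Sketch: ML estimate vs. Bayesian posterior mean for a unigram model. */
final class UnigramEstimates {

    /** Maximum likelihood: p_t = n_t / N, where N is the total word count. */
    static double[] maximumLikelihood(int[] counts) {
        double total = 0.0;
        for (int n : counts) {
            total += n;
        }
        double[] p = new double[counts.length];
        for (int t = 0; t < counts.length; t++) {
            p[t] = counts[t] / total;
        }
        return p;
    }

    /** Mean of Dir(p | alpha + n): p_t = (n_t + alpha_t) / (N + sum of alpha). */
    static double[] posteriorMean(int[] counts, double[] alpha) {
        double total = 0.0;
        for (int t = 0; t < counts.length; t++) {
            total += counts[t] + alpha[t];
        }
        double[] p = new double[counts.length];
        for (int t = 0; t < counts.length; t++) {
            p[t] = (counts[t] + alpha[t]) / total;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] counts = {3, 1, 0};
        double[] alpha = {1.0, 1.0, 1.0};
        System.out.println(java.util.Arrays.toString(maximumLikelihood(counts)));
        System.out.println(java.util.Arrays.toString(posteriorMean(counts, alpha)));
    }
}
```

With counts {3, 1, 0}, maximum likelihood gives the unseen word probability zero, while the prior acts as pseudo-counts and keeps it positive. The same (count + prior) / (total + sum of prior) pattern shows up again in the sampler's theta and phi readouts.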

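The key derivation, (63)-(78): the goal of (63)-(73) is the joint probability of the whole LDA model, so that (63) can serve as the numerator in the Gibbs sampler. First, (63) factors the joint probability into two independent parts, p(w|z,β) and p(z|α), and an expression is then derived for each part. Equations (64) and (65) start by ignoring the hyperparameter β and assuming the parameter Φ is known. This Φ is the K*V matrix that holds the probability of each word under each topic. Then (66) integrates Φ out, which gives the first part p(w|z,β) as expression (68). The integration from (66) to (68) keeps reusing the result of the Dirichlet integral; in fact the whole paper applies that one integral over and over. n⃗z is a V-dimensional vector: for topic z, it counts how many times each word is assigned to that topic. The reasoning in (69)-(72) is exactly the same as in (64)-(68). n⃗m is a K-dimensional vector: for document m, it counts how many words of that document are assigned to each topic.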
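For reference, the result of the two integrations can be written compactly with the Dirichlet delta function. The block below is written out from the description above rather than copied from the paper, so treat the exact notation as an assumption:

```latex
\begin{align*}
p(\vec{w} \mid \vec{z}, \vec{\beta})
  &= \prod_{z=1}^{K} \frac{\Delta(\vec{n}_z + \vec{\beta})}{\Delta(\vec{\beta})},
  & \vec{n}_z &= \{ n_z^{(t)} \}_{t=1}^{V} \\
p(\vec{z} \mid \vec{\alpha})
  &= \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})},
  & \vec{n}_m &= \{ n_m^{(k)} \}_{k=1}^{K} \\
p(\vec{z}, \vec{w} \mid \vec{\alpha}, \vec{\beta})
  &= p(\vec{w} \mid \vec{z}, \vec{\beta}) \; p(\vec{z} \mid \vec{\alpha})
\end{align*}
```

Here $n_z^{(t)}$ is the number of times word t is assigned to topic z, and $n_m^{(k)}$ is the number of words in document m assigned to topic k. Both factors come from the same Dirichlet integral.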

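Finally, (78) gives the conditional probability that the Gibbs sampler needs. This expression is worth writing out so that it can be matched against the code; it is also restated in the Javadoc of sampleFullConditional in the code below.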
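The image of the expression did not survive in this copy. As a stand-in, here is the form the code below actually computes, which assumes symmetric scalar priors α and β; the paper states the general version with vector-valued priors:

```latex
p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \;\propto\;
\frac{n_{k,\neg i}^{(t)} + \beta}{\sum_{t=1}^{V} n_{k,\neg i}^{(t)} + V\beta}
\cdot
\frac{n_{m,\neg i}^{(k)} + \alpha}{\sum_{k=1}^{K} n_{m,\neg i}^{(k)} + K\alpha}
```

In the code, once the current assignment has been removed from the counts, $n_{k,\neg i}^{(t)}$ is `nw[t][k]`, its sum over t is `nwsum[k]`, $n_{m,\neg i}^{(k)}$ is `nd[m][k]`, and its sum over k is `ndsum[m]`. The last denominator does not depend on k, so it only rescales the unnormalised weights.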

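The concrete way to choose the next topic is this: compute for every topic the new probability p(zi=k|z¬i,w), which is p[k] in the code, and draw a new topic from those values. Suppose there are three topics and the computed values are {0.3,0.2,0.4}; note that these conditional probabilities are unnormalised and do not necessarily sum to one. To draw a new topic with these weights, I use the random function to draw a number r from the uniform distribution on [0, 0.9). If 0<=r<0.3, the new topic is 1; if 0.3<=r<0.5, the new topic is 2; if 0.5<=r<0.9, the new topic is 3.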
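The three-topic example can be sketched as a small function. This is an illustration only: `CumulativeSampling` and `pick` are hypothetical names, the uniform number is passed in so the behaviour is reproducible, and indices are zero-based, so topic 1 in the text is index 0 here.

```java
/** Sketch: draw an index with probability proportional to unnormalised weights. */
final class CumulativeSampling {

    /**
     * @param weights unnormalised, non-negative weights (for example {0.3, 0.2, 0.4})
     * @param u       a uniform random number in [0, 1), e.g. from Math.random()
     * @return the index whose cumulative interval contains u * (sum of weights)
     */
    static int pick(double[] weights, double u) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double r = u * total; // scaled threshold in [0, total)
        double cum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cum += weights[i];
            if (r < cum) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        double[] weights = {0.3, 0.2, 0.4};
        System.out.println(pick(weights, Math.random()));
    }
}
```

Scaling the uniform number by the total weight avoids having to normalise the weights first, which is the same trick the sampler uses with `Math.random() * p[K - 1]`.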

IV. Code

/*
* LdaGibbsSampler is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the Free
* Software Foundation; either version 2 of the License, or (at your option) any
* later version.
* LdaGibbsSampler is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
* details.
* You should have received a copy of the GNU General Public License along with
* this program; if not, write to the Free Software Foundation, Inc., 59 Temple
* Place, Suite 330, Boston, MA 02111-1307 USA
*/
import java.text.DecimalFormat;
import java.text.NumberFormat;
public class LdaGibbsSampler {
    /**
     * document data (term lists)
     */
    int[][] documents;
    /**
     * vocabulary size
     */
    int V;
    /**
     * number of topics
     */
    int K;
    /**
     * Dirichlet parameter (document--topic associations)
     */
    double alpha;
    /**
     * Dirichlet parameter (topic--term associations)
     */
    double beta;
    /**
     * topic assignments for each word.
     * M x N_m: the first index is the document, the second is the word position
     */
    int z[][];
    /**
     * nw[i][j] number of instances of word i (term?) assigned to topic j.
     */
    int[][] nw;
    /**
     * nd[i][j] number of words in document i assigned to topic j.
     */
    int[][] nd;
    /**
     * nwsum[j] total number of words assigned to topic j.
     */
    int[] nwsum;
    /**
     * ndsum[i] total number of words in document i.
     */
    int[] ndsum;
    /**
     * cumulative statistics of theta
     */
    double[][] thetasum;
    /**
     * cumulative statistics of phi
     */
    double[][] phisum;
    /**
     * size of statistics
     */
    int numstats;
    /**
     * sampling lag (?)
     */
    private static int THIN_INTERVAL = 20;
    /**
     * burn-in period
     */
    private static int BURN_IN = 100;
    /**
     * max iterations
     */
    private static int ITERATIONS = 1000;
    /**
     * sample lag (if -1 only one sample taken)
     */
    private static int SAMPLE_LAG;
    private static int dispcol = 0;
    /**
     * Initialise the Gibbs sampler with data.
     *
     * @param documents
     *            document data (term lists)
     * @param V
     *            vocabulary size
     */
    public LdaGibbsSampler(int[][] documents, int V) {
        this.documents = documents;
        this.V = V;
    }
    /**
     * Initialisation: Must start with an assignment of observations to topics.
     * Many alternatives are possible, I chose to perform random assignments
     * with equal probabilities. The assignments are stored in z.
     *
     * @param K
     *            number of topics
     */
    public void initialState(int K) {
        int i;
        int M = documents.length;
        // initialise count variables.
        nw = new int[V][K];
        nd = new int[M][K];
        nwsum = new int[K];
        ndsum = new int[M];
        // The z_i are initialised to values in [0,K-1] to determine the
        // initial state of the Markov chain.
        // For convenience, the author does not sample from the Dirichlet
        // parameters here but simply initialises at random.
        z = new int[M][];
        for (int m = 0; m < M; m++) {
            int N = documents[m].length;
            z[m] = new int[N];
            for (int n = 0; n < N; n++) {
                // random initialisation
                int topic = (int) (Math.random() * K);
                z[m][n] = topic;
                // number of instances of word i assigned to topic j
                // documents[m][n] is the n-th word of the m-th document
                nw[documents[m][n]][topic]++;
                // number of words in document i assigned to topic j.
                nd[m][topic]++;
                // total number of words assigned to topic j.
                nwsum[topic]++;
            }
            // total number of words in document i
            ndsum[m] = N;
        }
    }
    /**
     * Main method: Select initial state, then repeat a large number of times:
     * 1. Select an element 2. Update conditional on other elements. If
     * appropriate, output summary for each run.
     *
     * @param K
     *            number of topics
     * @param alpha
     *            symmetric prior parameter on document--topic associations
     * @param beta
     *            symmetric prior parameter on topic--term associations
     */
    private void gibbs(int K, double alpha, double beta) {
        this.K = K;
        this.alpha = alpha;
        this.beta = beta;
        // init sampler statistics
        if (SAMPLE_LAG > 0) {
            thetasum = new double[documents.length][K];
            phisum = new double[K][V];
            numstats = 0;
        }
        // initial state of the Markov chain:
        // the chain needs a starting state before it can run
        initialState(K);
        // one sweep over all words per iteration
        for (int i = 0; i < ITERATIONS; i++) {
            // for all z_i
            for (int m = 0; m < z.length; m++) {
                for (int n = 0; n < z[m].length; n++) {
                    // (z_i = z[m][n])
                    // sample from p(z_i|z_-i, w)
                    // core step: sample a new topic for the n-th word of
                    // document m using expression (78) of the paper
                    int topic = sampleFullConditional(m, n);
                    z[m][n] = topic;
                }
            }
            // get statistics after burn-in
            // if the iteration count is past the burn-in limit and falls on a
            // sample-lag boundary, the current state is added to the output
            // statistics; otherwise it is ignored and sampling continues
            if ((i > BURN_IN) && (SAMPLE_LAG > 0) && (i % SAMPLE_LAG == 0)) {
                updateParams();
            }
        }
    }
    /**
     * Sample a topic z_i from the full conditional distribution: p(z_i = j |
     * z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + V * beta) * (n_-i,j(d_i) +
     * alpha)/(n_-i,.(d_i) + K * alpha)
     *
     * @param m
     *            document
     * @param n
     *            word
     */
    private int sampleFullConditional(int m, int n) {
        // remove z_i from the count variables
        // the old topic z(m,n) must first be taken out of the current state
        int topic = z[m][n];
        nw[documents[m][n]][topic]--;
        nd[m][topic]--;
        nwsum[topic]--;
        ndsum[m]--;
        // do multinomial sampling via cumulative method:
        double[] p = new double[K];
        for (int k = 0; k < K; k++) {
            // nw[i][j] counts how often word i is assigned to topic j;
            // below, documents[m][n] is the word id and k is the topic;
            // nd[m][k] counts the words of document m assigned to topic k
            p[k] = (nw[documents[m][n]][k] + beta) / (nwsum[k] + V * beta)
                    * (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
        }
        // cumulate multinomial parameters
        for (int k = 1; k < p.length; k++) {
            p[k] += p[k - 1];
        }
        // scaled sample because of unnormalised p[]
        double u = Math.random() * p[K - 1];
        for (topic = 0; topic < p.length; topic++) {
            if (u < p[topic])
                break;
        }
        // add newly estimated z_i to count variables
        nw[documents[m][n]][topic]++;
        nd[m][topic]++;
        nwsum[topic]++;
        ndsum[m]++;
        return topic;
    }
    /**
     * Add to the statistics the values of theta and phi for the current state.
     */
    private void updateParams() {
        for (int m = 0; m < documents.length; m++) {
            for (int k = 0; k < K; k++) {
                thetasum[m][k] += (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            }
        }
        for (int k = 0; k < K; k++) {
            for (int w = 0; w < V; w++) {
                phisum[k][w] += (nw[w][k] + beta) / (nwsum[k] + V * beta);
            }
        }
        numstats++;
    }
    /**
     * Retrieve estimated document--topic associations. If sample lag > 0 then
     * the mean value of all sampled statistics for theta[][] is taken.
     *
     * @return theta multinomial mixture of document topics (M x K)
     */
    public double[][] getTheta() {
        double[][] theta = new double[documents.length][K];
        if (SAMPLE_LAG > 0) {
            for (int m = 0; m < documents.length; m++) {
                for (int k = 0; k < K; k++) {
                    theta[m][k] = thetasum[m][k] / numstats;
                }
            }
        } else {
            for (int m = 0; m < documents.length; m++) {
                for (int k = 0; k < K; k++) {
                    theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
                }
            }
        }
        return theta;
    }
    /**
     * Retrieve estimated topic--word associations. If sample lag > 0 then the
     * mean value of all sampled statistics for phi[][] is taken.
     *
     * @return phi multinomial mixture of topic words (K x V)
     */
    public double[][] getPhi() {
        double[][] phi = new double[K][V];
        if (SAMPLE_LAG > 0) {
            for (int k = 0; k < K; k++) {
                for (int w = 0; w < V; w++) {
                    phi[k][w] = phisum[k][w] / numstats;
                }
            }
        } else {
            for (int k = 0; k < K; k++) {
                for (int w = 0; w < V; w++) {
                    phi[k][w] = (nw[w][k] + beta) / (nwsum[k] + V * beta);
                }
            }
        }
        return phi;
    }
    /**
     * Configure the gibbs sampler
     *
     * @param iterations
     *            number of total iterations
     * @param burnIn
     *            number of burn-in iterations
     * @param thinInterval
     *            update statistics interval
     * @param sampleLag
     *            sample interval (-1 for just one sample at the end)
     */
    public void configure(int iterations, int burnIn, int thinInterval,
            int sampleLag) {
        ITERATIONS = iterations;
        BURN_IN = burnIn;
        THIN_INTERVAL = thinInterval;
        SAMPLE_LAG = sampleLag;
    }
    /**
     * Driver with example data.
     *
     * @param args
     */
    public static void main(String[] args) {
        // words in documents
        int[][] documents = { {1, 4, 3, 2, 3, 1, 4, 3, 2, 3, 1, 4, 3, 2, 3, 6},
                {2, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2},
                {1, 6, 5, 6, 0, 1, 6, 5, 6, 0, 1, 6, 5, 6, 0, 0},
                {5, 6, 6, 2, 3, 3, 6, 5, 6, 2, 2, 6, 5, 6, 6, 6, 0},
                {2, 2, 4, 4, 4, 4, 1, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 0},
                {5, 4, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2}};
        // vocabulary
        int V = 7;
        int M = documents.length;
        // # topics
        int K = 2;
        // good values alpha = 2, beta = .5
        double alpha = 2;
        double beta = .5;
        LdaGibbsSampler lda = new LdaGibbsSampler(documents, V);
        // sampling settings: run 10000 iterations with a burn-in of 2000; the
        // third parameter is not used by the sampler, it is only for display.
        // The fourth parameter is the sample lag, which matters: successive
        // states of the Markov chain are dependent on each other, so several
        // samples are skipped between two readouts.
        lda.configure(10000, 2000, 100, 10);
        // run it!
        lda.gibbs(K, alpha, beta);
        // read out the model parameters, equations (81) and (82) in the paper
        double[][] theta = lda.getTheta();
        double[][] phi = lda.getPhi();
    }
}
