1. Topic Models

  • Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated (Steyvers and Griffiths, 2007). Two general steps are taken to generate a new document:

    • Step 1, for each document, one chooses a distribution over topics.
    • Step 2, to generate each word in that document, one chooses a topic at random according to that distribution (the same word may belong to several topics with different probabilities). Then a word is drawn from the chosen topic by probabilistic sampling, e.g. as illustrated in Figures 1 and 2.
  • When fitting a generative model, the goal is to find the best set of latent variables that can explain the observed data (i.e., observed words in documents), assuming that the model actually generated the data.
  • Many different generative models have been proposed under the same assumption that a document is a mixture of topics, but they make slightly different statistical assumptions.
  • The number of topics will affect the interpretability of the results. A solution with too few topics will generally result in very broad topics, whereas a solution with too many topics will result in uninterpretable topics that pick out idiosyncratic word combinations. One approach is to choose the number of topics that leads to the best generalization performance on new tasks (e.g., held-out documents).
  • Notations:
    • P(z) is the probability distribution over topics z in a particular document
    • P(w|z) is the probability distribution over words w given topic z.
    • The model specifies the following distribution over words within a document:

      $$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$

      where T is the number of topics, P(z_i = j) is the probability that the j-th topic is chosen/sampled for the i-th word token, and P(w_i | z_i = j) is the probability of word w_i under topic j.

    • φ^(j) = P(w | z = j) is the multinomial distribution over words for topic j.
    • θ^(d) = P(z) is the multinomial distribution over topics for document d.
    • D is the number of documents; each document d consists of N_d words, and N = Σ_d N_d is the total number of word tokens.
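
As a small numeric illustration of this mixture (the numbers and variable names below are made up for illustration, not taken from the source), the word distribution of a document is simply the θ-weighted average of the topic word distributions φ:

```python
# P(w_i) = sum_j P(w_i | z_i = j) * P(z_i = j), with made-up numbers.
import numpy as np

phi = np.array([              # phi[j, w] = P(w | z = j): 2 topics over a 4-word vocabulary
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])
theta = np.array([0.8, 0.2])  # theta[j] = P(z = j) for this particular document

p_w = theta @ phi             # distribution over words for the document
print(p_w, p_w.sum())         # ≈ [0.42 0.26 0.16 0.16], sums to 1
```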

2. The LDA Model

  • Latent Dirichlet Allocation (LDA) is a generative probabilistic model (usually depicted as a graphical model) that allows sets of observations to be explained by unobserved latent variables which explain why some parts of the data are similar.
  • Unlike PLSA, LDA assumes that the topic distribution has a Dirichlet prior.
    • Specifically, each document has a Dirichlet prior distribution of topics, and each topic has a Dirichlet prior distribution of words.
    • In practice, this assumption results in more reasonable mixtures of topics in a document.
    • However, PLSA can be shown to be equivalent to the LDA model under a uniform Dirichlet prior distribution.

Toy Example

This example is from Edwin Chen's blog. Suppose you have the following set of sentences:
  • I like to eat broccoli and bananas.
  • I ate a banana and spinach smoothie for breakfast.
  • Chinchillas and kittens are cute.
  • My sister adopted a kitten yesterday.
  • Look at this cute hamster munching on a piece of broccoli.
Given these sentences, we look for two topics. LDA might produce something like:
  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
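
The original post gives no code; the following is a minimal sketch of fitting such a two-topic model with scikit-learn's LatentDirichletAllocation (the exact topic/word percentages will differ from the illustration above and depend on preprocessing and the random seed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                 # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                   # per-document topic mixtures (rows sum to 1)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):        # components_ holds topic-word weights
    top_words = [vocab[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {k}: {top_words}")
print(doc_topic.round(2))
```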

Dirichlet Prior

  • Blei et al. (2003) extend the PLSI model by introducing a Dirichlet prior (with hyperparameter α) on the per-document topic distributions θ; Griffiths and Steyvers (2003) later enrich it by also placing a Dirichlet prior (with hyperparameter β) on the per-topic word distributions φ. Good choices for the hyperparameters α and β depend on the number of topics and the vocabulary size. From previous research, α = 50/T and β = 0.01 have been found to work well with many different text collections (Steyvers and Griffiths, 2007).
  • Conjugate prior. If the posterior distribution is in the same family as the prior distribution, then the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood.
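
As a concrete instance (a worked equation added here for clarity, not spelled out in the original notes): the Dirichlet is the conjugate prior of the multinomial. If θ ~ Dir(α, …, α) and the observed topic counts in a document are n = (n_1, …, n_T), the posterior over θ is again a Dirichlet:

$$p(\theta \mid n) = \mathrm{Dir}(\alpha + n_1, \ldots, \alpha + n_T)$$

This conjugacy is what allows θ and φ to be integrated out analytically, which is the basis of the collapsed Gibbs sampler described below.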

Graphical Model

  • The graphical model is represented in plate notation as follows, where the shaded and unshaded variables indicate observed and latent (i.e., unobserved) variables, respectively.

    where arrows indicate conditional dependencies between variables while plates (the boxes in the figure) refer to repetitions of sampling steps with the variable in the lower right corner referring to the number of samples. For example, the inner plate over z and w illustrates the repeated sampling of topics and words until Nd words have been generated for document d.

  • The topic models can also be interpreted as matrix factorization, illustrated as follows together with the LSA interpretation. The word-document co-occurrence matrix is split into two matrices: a topic (word-topic) matrix and a document (topic-document) matrix. Note that the constraints in LDA are that the feature values (topic distributions) are non-negative and sum to one; the LSA decomposition (i.e., SVD factorization) does not have such constraints.

Gibbs Sampling

  • The challenge is to efficiently estimate the posterior distributions over φ and θ, given the large number of word tokens in the document collection.
  • Gibbs sampling (a.k.a alternating conditional sampling) is a specific form of Markov chain Monte Carlo, simulating a high-dimensional distribution by sampling on lower-dimensional subsets of variables where each subset is conditioned on the value of all others. The sampling is done sequentially and proceeds until the sampled values approximate the target distribution.
  • Markov chain Monte Carlo (MCMC) refers to a set of approximate iterative techniques designed to sample values from complex (often high-dimensional) distributions. Further lecture material on MCMC can be found here.
  • The procedure is as follows (assuming K topics): 
    • Go through each document, and randomly assign each word in the document to one of the K topics.
    • Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
    • So to improve on them, for each document d ...
      • Go through each word w in d, and for each topic t, compute two things:

        • p(t|d) = the proportion of words in document d that are currently assigned to topic t.
        • p(w|t) = the proportion of assignments to topic t over all documents that come from this word w.
        • Reassign w a new topic, where topic t is chosen with probability proportional to p(t|d) × p(w|t); according to the generative model, this is essentially the probability that topic t generated word w.
    • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
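
The notes stop at the verbal procedure; below is a minimal collapsed Gibbs sampling sketch in plain NumPy (function and variable names are illustrative, not from the original), where the counts ndk and nkw play the roles of p(t|d) and p(w|t) above:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.
    docs: list of lists of integer word ids; V: vocabulary size; K: number of topics."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # topic counts per document
    nkw = np.zeros((K, V))          # word counts per topic
    nk = np.zeros(K)                # total word count per topic
    z = []                          # current topic assignment of every token

    # Step 1: random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    # Step 2: repeatedly resample each token's topic given all other assignments
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # proportional to p(t|d) * p(w|t), smoothed by the Dirichlet hyperparameters
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # doc-topic mixtures
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)       # topic-word distributions
    return theta, phi
```

On a small corpus, a few hundred sweeps are usually enough for the assignments to settle, after which theta and phi can be read off from the smoothed counts as in the last two lines.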

Generative Process

  • Decide the number of words N that a document will have, according to a Poisson distribution.
  • Draw a topic distribution θ ~ Dir(α), i.e. a draw from a uniform (symmetric) Dirichlet distribution with scaling parameter α.
  • For each word in the document:
    • Draw a specific topic z ~ Multinomial(θ).
    • Draw a word w from the multinomial word distribution φ^(z) of the picked topic.

    Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Example. To generate some particular document D, you might:

  • Pick 5 to be the number of words in D.
  • Decide that D will be 1/2 about food and 1/2 about cute animals.
  • Pick the first word to come from the food topic, which then gives you the word “broccoli”.
  • Pick the second word to come from the cute animals topic, which gives you “panda”.
  • Pick the third word to come from the cute animals topic, giving you “adorable”.
  • Pick the fourth word to come from the food topic, giving you “cherries”.
  • Pick the fifth word to come from the food topic, giving you “eating”.

    So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).
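
A minimal sketch of this generative process in NumPy (the vocabulary, probabilities, and seed below are made up to mirror the example, not taken from the source):

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["broccoli", "bananas", "cherries", "eating",    # food-ish words
         "panda", "kitten", "adorable", "cute"]          # cute-animal-ish words
phi = np.array([                      # phi[k, v] = P(word v | topic k); each row sums to 1
    [0.3, 0.2, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0],            # topic 0: food
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.2, 0.3, 0.2],            # topic 1: cute animals
])
theta = np.array([0.5, 0.5])          # the document is 1/2 food, 1/2 cute animals

N = rng.poisson(5)                    # document length drawn from a Poisson
words = []
for _ in range(N):
    z = rng.choice(2, p=theta)        # pick a topic for this word
    w = rng.choice(8, p=phi[z])       # pick a word from that topic
    words.append(vocab[w])
print(" ".join(words))                # e.g. "broccoli adorable cute cherries eating"
```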

3. Computing Similarity

  • The derived topic probability distributions can be used to compute document or word similarity.
  • Two documents are similar to the extent that the same topics appear in these documents.
  • Two words are similar to the extent that they appear in the same topic.
  • My understanding: if we look at the matrix representation, it is easy to compute both document and word similarities according to topic distributions.

Document Similarity

  • Given the topic distributions θ^(d1) and θ^(d2) of two documents, the document similarity is measured as the similarity of these topic distributions.
  • A standard function to measure the difference or divergence between two distributions p and q is the Kullback-Leibler (KL) divergence:

    $$D(p, q) = \sum_{j=1}^{T} p_j \log_2 \frac{p_j}{q_j}$$

  • The KL divergence is asymmetric, and in many applications it is convenient to apply a symmetrized measure based on it:

    $$KL(p, q) = \frac{1}{2}\left[ D(p, q) + D(q, p) \right]$$

  • Another option is to apply the Jensen-Shannon (JS) divergence, which is symmetric by construction:

    $$JS(p, q) = \frac{1}{2}\left[ D\!\left(p, \frac{p+q}{2}\right) + D\!\left(q, \frac{p+q}{2}\right) \right]$$
  • If we treat the topic distributions as vectors, then other measures such as Euclidean distance or cosine similarity can also be applied.
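
A small sketch of these measures in NumPy (the two document-topic vectors are made up; the small eps is only there to avoid taking the log of zero):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p, q) in bits; eps avoids log of zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return np.sum(p * np.log2(p / q))

def symmetric_kl(p, q):
    """Symmetrized KL: 0.5 * [D(p, q) + D(q, p)]."""
    return 0.5 * (kl(p, q) + kl(q, p))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * (kl(p, m) + kl(q, m))

theta1 = [0.7, 0.2, 0.1]   # document 1: mostly topic 0
theta2 = [0.1, 0.3, 0.6]   # document 2: mostly topic 2
print(symmetric_kl(theta1, theta2), js(theta1, theta2))
```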

Word Similarity

  • Word similarity can be measured by the extent to which two words share the same topics. One word can appear in many topics, so word similarity can be regarded as the overlap between their topic distributions.
  • Given the topic distributions of two words, P(z | w1) and P(z | w2), either the symmetrized KL divergence or the JS divergence can be used to compute word similarity.
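
A self-contained sketch with made-up numbers (note that SciPy's jensenshannon returns the JS distance, i.e. the square root of the JS divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon   # JS distance = sqrt of JS divergence

phi = np.array([            # phi[k, v] = P(word v | topic k): 2 topics, 3 words
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
])
# P(z | w) is proportional to phi's column for w (assuming a uniform topic prior P(z))
p_z_given_w = phi / phi.sum(axis=0, keepdims=True)
print(jensenshannon(p_z_given_w[:, 0], p_z_given_w[:, 1]))   # words 0 and 1: small distance
print(jensenshannon(p_z_given_w[:, 0], p_z_given_w[:, 2]))   # words 0 and 2: larger distance
```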

References

  1. Edwin Chen, Introduction to Latent Dirichlet Allocation (blog post).
  2. Blei et al., 2003, Latent Dirichlet Allocation, Journal of Machine Learning Research.
  3. Griffiths and Steyvers, 2003, Prediction and semantic association, Advances in Neural Information Processing Systems.
  4. Steyvers and Griffiths, 2007, Probabilistic Topic Models, Handbook of Latent Semantic Analysis.
