Latent Semantic Analysis (LSA) Tutorial, Part 1

Translated from: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html

WangBen, 2011-09-16, Beijing

http://blog.csdn.net/yihucha166/article/details/6783212

Introduction to Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find their underlying meaning or concepts. If each word meant only one concept, and each concept were described by only one word, LSA would be easy, since there would be a simple one-to-one mapping from words to concepts.

Unfortunately, the mapping is not that simple: English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.

For example, the word "bank", when used together with "mortgage", "loans", and "rates", probably means a financial institution. However, when used together with "lures", "casting", and "fish", it probably means a stream or river bank.

How Latent Semantic Analysis Works

Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.

(Translator's note: in other words, LSA is a form of dimensionality reduction.)

Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.

(Translator's note: why the smallest set?)
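The difficulty described above, comparing words when what we want to compare is concepts, can be seen in a minimal sketch. The query, the two documents, and the helper name `cosine` are all made up for illustration; they are not part of the original tutorial. The sketch scores documents by cosine similarity over raw word counts, which is one common way to compare words directly, and shows that a document written with synonyms scores zero.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (hypothetical helper)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative data only: two authors describing the same thing with different words.
query = Counter("cheap car".split())
doc_same_words = Counter("cheap car for sale".split())
doc_synonyms = Counter("inexpensive automobile for sale".split())

sim_same = cosine(query, doc_same_words)  # shares both query words
sim_syn = cosine(query, doc_synonyms)     # shares no words with the query

print(f"same words: {sim_same:.3f}")
print(f"synonyms:   {sim_syn:.3f}")
```

Both documents are about the same concept, yet pure word matching treats the second one as completely unrelated to the query. This is the gap that a comparison in "concept" space is meant to close.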

In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.

1. Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document.

2. Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.

3. Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
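The first and third simplifications can be sketched in a few lines. The two example sentences and the names `docs`, `bags`, `vocab`, and `matrix` are assumptions made for illustration, not taken from the original tutorial. The sketch reduces each document to word counts and arranges them as a word-by-document count table.

```python
from collections import Counter

# Illustrative documents only: one about finance, one about fishing.
docs = {
    "d1": "the bank raised mortgage rates and loan rates",
    "d2": "casting lures from the river bank to catch fish",
}

# Simplification 1: keep only how often each word occurs; word order is discarded.
bags = {name: Counter(text.split()) for name, text in docs.items()}

# Word-by-document count table: one row per word, one column per document.
vocab = sorted(set().union(*bags.values()))
matrix = {word: [bags[name][word] for name in docs] for word in vocab}

print(matrix["rates"])  # [2, 0]
print(matrix["bank"])   # [1, 1]

# Reordering the words does not change the bag.
print(Counter("river bank".split()) == Counter("bank river".split()))  # True
```

Note that "bank" gets a single row even though it means a financial institution in `d1` and a river bank in `d2`. That is simplification 3 at work: the representation has no way to keep the two senses apart.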

To see a small example of LSA, take a look at the next section.

(Translator's note on simplification 3: what drawbacks does this simplification introduce?)
