An introduction to latent semantic analysis

I collected material on SVD from the blogs of several experts, then wrote my own Python version, which I am posting here to share.

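For an explanation of SVD itself, see the following blog, whose notice reads:

This article was published by LeftNotEasy at http://leftnoteasy.cnblogs.com. It may be reposted in full or used in part, but please credit the source. For questions, contact wheeleast@gmail.com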

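Python's extension packages numpy and scipy can both compute an SVD. I wrote a numpy-based program that applies SVD to documents: it first turns each document into a vector, then runs SVD on the vectorized document collection and takes the resulting matrix U (together with the truncated S and V) for analysis. Code first: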

 # coding=utf-8
 # Only numpy is used by the listing below.
 import numpy as np

 def f_file_open(trace_string):
     """Read the document set; each line of the file is one document."""
     with open(trace_string, 'r') as f:
         txt = f.readlines()
     return txt

 def f_vector_found(txt):
     """Collect every distinct word in the document set (build the word space)."""
     word_list = []
     for line in txt:
         for word in line.split():
             if word not in word_list:
                 word_list.append(word)
     return word_list

 def f_document_vector(document, word_list):
     """Turn one document into a vector of term counts over word_list."""
     document_clean = document.split()
     return [document_clean.count(word) for word in word_list]

 def f_svd_calculate(document_array):
     """Compute the SVD and return the three matrices (V is returned already transposed)."""
     U, S, V = np.linalg.svd(document_array)
     return (U, S, V)

 def f_process_matric_U(matric_U, Save_N_Singular_value):
     """Truncate U to the first N singular values by keeping its first N columns."""
     return [line[:Save_N_Singular_value] for line in matric_U]

 def f_process_matric_S(matric_S, Save_information_value):
     """Choose how many of the largest singular values to keep for the requested share of information."""
     matricS_new = []
     S_self = 0       # running sum of the singular values kept so far
     N_count = 0      # number of singular values kept
     Threshold = sum(matric_S) * float(Save_information_value)
     for value in matric_S:
         if S_self <= Threshold:
             matricS_new.append(value)
             S_self += value
             N_count += 1
         else:
             break
     print("the %d largest singular values keep %s of the information" % (N_count, Save_information_value))
     return (N_count, matricS_new)

 def f_process_matric_V(matric_V, Save_N_Singular_value):
     """Truncate V to the first N singular values by keeping its first N rows."""
     document_matric_V = matric_V[:Save_N_Singular_value]
     return document_matric_V

 def f_combine_U_S_V(matric_u, matric_s, matric_v):
     """Recompute the document matrix from the truncated U, S and V."""
     new_document_matric = np.dot(np.dot(matric_u, np.diag(matric_s)), matric_v)
     return new_document_matric

 def f_matric_to_document(document_matric, word_list_self):
     """Convert the reconstructed matrix back into documents, one text line per row."""
     new_document = []
     for line in document_matric:
         for count, value in enumerate(line):
             if float(value) >= 0.9:  # threshold for keeping a word in the converted document
                 new_document.append(word_list_self[count] + " ")
         new_document.append("\n")
     return new_document

 def f_save_file(trace, document):
     """Append the converted documents to the file at trace."""
     with open(trace, 'a') as f:
         for line in document:
             f.write(line)

 trace_open = "/home/alber/experiment/test.txt"
 trace_save = "/home/alber/experiment/20140715/svd_result1.txt"

 txt = f_file_open(trace_open)
 word_vector = f_vector_found(txt)
 print(len(word_vector))                      # vocabulary size

 document = []
 for line in txt:                             # transform the document set into a matrix
     document.append(f_document_vector(line, word_vector))
 print(len(document))                         # number of documents

 U, S, V = f_svd_calculate(document)
 print(sum(S))

 N_count, document_matric_S = f_process_matric_S(S, 0.9)
 document_matric_U = f_process_matric_U(U, N_count)
 document_matric_V = f_process_matric_V(V, N_count)
 print(len(document_matric_U[1]))
 print(len(document_matric_V))

 new_document_matric = f_combine_U_S_V(document_matric_U, document_matric_S, document_matric_V)
 print(sorted(new_document_matric[1], reverse=True))   # row at index 1, largest values first

 new_document = f_matric_to_document(new_document_matric, word_vector)
 f_save_file(trace_save, new_document)
 print("the new document has been saved in %s" % trace_save)

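The vector for the first document the script prints (the row at index 1 of the reconstructed matrix) is shown below, sorted in descending order and not listed in full: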

[1.0557039715196566, 1.0302828340480468, 1.0177955652284856, 1.0059864028992798, 0.99050787479103541, 0.93109816291875147, 0.70360233131357808, 0.22614603502510683, 0.10577134907675778, 0.098346889985350489, 0.091221506093784849, 0.085227549911874326, 0.052355994530275715, 0.049805639460153352, 0.046430974364203001, 0.046430974364203001, 0.045655634442695908, 0.043471974743277547, 0.041953839699628029, 0.041483792741663243, 0.039635143169293147, 0.03681955156197822, 0.034893319065413916, 0.0331697465114036, 0.029874818442883051, 0.029874818442883051, 0.028506042937487715, 0.028506042937487715, 0.027724455461901349, 0.026160357130229708, 0.023821284531034687, 0.023821284531034687, 0.017212073571417009, 0.016793815602261938, 0.016793815602261938, 0.016726955476865021, 0.015012207148054771, 0.013657280765244915, ...]

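With a result like this, the reconstructed matrix still has to be analyzed. As the output above shows, the larger a value, the more the word at that position contributes to the document, while a small value means the word carries little meaning for it. The next step is therefore to set a threshold and take the feature words of each document. There are many ways to set the threshold; one is to sort all the values and take the knee point. See the figure (not produced from the results above):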

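Clearly, only the values before the knee contribute much to the document, so the values after the knee are set to 0. In this way the document-term matrix is reduced in dimensionality through the SVD.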
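The knee-based cut can be sketched roughly as follows. This is only an illustration and not the method used to produce the figure: it assumes the knee sits at the largest drop between neighbouring sorted values, which is one simple heuristic among many. The helper names `knee_threshold` and `zero_below` are made up for this example, and the sample row is the first ten values of the output listed earlier, rounded to four decimals.

```python
import numpy as np

def knee_threshold(values):
    """Return a cut-off placed in the middle of the largest drop
    between neighbouring values after sorting in descending order."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    drops = ordered[:-1] - ordered[1:]   # gap between each value and the next
    i = int(np.argmax(drops))            # position of the steepest drop
    return (ordered[i] + ordered[i + 1]) / 2.0

def zero_below(matrix, threshold):
    """Keep entries at or above the threshold and set the rest to 0."""
    matrix = np.asarray(matrix, dtype=float)
    return np.where(matrix >= threshold, matrix, 0.0)

# First ten values of the sorted output above, rounded (illustrative input).
row = [1.0557, 1.0303, 1.0178, 1.0060, 0.9905,
       0.9311, 0.7036, 0.2261, 0.1058, 0.0983]

threshold = knee_threshold(row)
sparse_row = zero_below(row, threshold)
print(threshold)
print(sparse_row)
```

For this row the steepest drop lies between 0.7036 and 0.2261, so the seven largest values survive and the rest become 0. On noisier data a largest-gap rule can pick a poor knee, so the threshold is worth checking against a plot.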

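Two parameters in this process are set by hand. The first is the number of singular values to keep. As in the right panel of the figure above, the singular values fall off quickly, and the first N of them already carry most of the information in the whole matrix. In my program, the number of singular values is controlled by setting the share of information to retain (90%, 95%, or some other value).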
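A vectorized way to pick the number of singular values might look like the sketch below. It assumes "information" is measured by the plain sum of singular values, as in `f_process_matric_S`; the `squared` flag is an addition of this example that measures the share by squared singular values instead (the squared Frobenius norm of the matrix equals the sum of the squared singular values). The function name `choose_rank` is hypothetical, and the result can differ from the loop in the listing when a cumulative sum lands exactly on the threshold.

```python
import numpy as np

def choose_rank(singular_values, keep_ratio, squared=False):
    """Return the smallest k whose k largest singular values reach
    keep_ratio of the total (sum of values, or sum of squares)."""
    s = np.asarray(singular_values, dtype=float)
    weights = s ** 2 if squared else s
    share = np.cumsum(weights) / weights.sum()   # cumulative share, ends at 1.0
    k = int(np.searchsorted(share, keep_ratio, side="left")) + 1
    return min(k, len(s))

# Toy singular values in descending order (illustrative, not from the post).
S = [4.0, 3.0, 2.0, 1.0]
k_sum = choose_rank(S, 0.8)                  # share by sum of singular values
k_energy = choose_rank(S, 0.8, squared=True) # share by sum of squares
print(k_sum, k_energy)
```

With these toy values the cumulative shares by plain sum are 0.4, 0.7, 0.9, 1.0, so three values are needed to reach 80%; by squared values two are enough.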

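The other parameter concerns the left panel of the figure above: for the reconstructed matrix to stand in for the original document matrix, the words have to be filtered. Taking the value at the knee, as described above, is one option.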
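Once a threshold is chosen, selecting the feature words of each document takes only a few lines. The sketch below assumes a reconstructed matrix with one row per document and one column per word; the function name `feature_words`, the vocabulary, and the matrix values are all invented for illustration, and 0.9 simply mirrors the cut-off hard-coded in `f_matric_to_document`.

```python
import numpy as np

def feature_words(document_matrix, word_list, threshold):
    """For each row (document), return the words whose reconstructed
    weight is at or above the threshold."""
    matrix = np.asarray(document_matrix, dtype=float)
    words = np.asarray(word_list)
    return [words[row >= threshold].tolist() for row in matrix]

# Made-up vocabulary and reconstructed weights, two documents by four words.
vocab = ["svd", "matrix", "python", "the"]
recon = [[1.02, 0.95, 0.10, 0.03],
         [0.05, 0.20, 0.98, 0.01]]

features = feature_words(recon, vocab, 0.9)
print(features)
```

Returning a list of words per document, rather than one flat list with newline markers, keeps the result easy to reuse for later steps such as comparing documents.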

That is essentially all there is to the SVD of a term-document matrix. Corrections and complaints are welcome.
