models.doc2vec – Deep learning with paragraph2vec
参考:
用 Doc2Vec 得到文档/段落/句子的向量表达
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
基于gensim的Doc2Vec简析
Gensim进阶教程:训练word2vec与doc2vec模型
用gensim doc2vec计算文本相似度
转自:
gensim doc2vec + sklearn kmeans 做文本聚类
原文显示太乱 为方便看摘录过来。。
用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。 选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。 选择sklearn库中的KMeans类。 程序如下:
# coding:utf-8 import sys import gensim import numpy as np from gensim.models.doc2vec import Doc2Vec, LabeledSentence from sklearn.cluster import KMeans TaggededDocument = gensim.models.doc2vec.TaggedDocument def get_datasest(): with open("out/text_dict_cut.txt", 'r') as cf: docs = cf.readlines() print len(docs) x_train = [] #y = np.concatenate(np.ones(len(docs))) for i, text in enumerate(docs): word_list = text.split(' ') l = len(word_list) word_list[l-1] = word_list[l-1].strip() document = TaggededDocument(word_list, tags=[i]) x_train.append(document) return x_train def train(x_train, size=200, epoch_num=1): model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4) model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100) model_dm.save('model/model_dm') return model_dm def cluster(x_train): infered_vectors_list = [] print "load doc2vec model..." model_dm = Doc2Vec.load("model/model_dm") print "load train vectors..." i = 0 for text, label in x_train: vector = model_dm.infer_vector(text) infered_vectors_list.append(vector) i += 1 print "train kmean model..." kmean_model = KMeans(n_clusters=15) kmean_model.fit(infered_vectors_list) labels= kmean_model.predict(infered_vectors_list[0:100]) cluster_centers = kmean_model.cluster_centers_ with open("out/own_claasify.txt", 'w') as wf: for i in range(100): string = "" text = x_train[i][0] for word in text: string = string + word string = string + '\t' string = string + str(labels[i]) string = string + '\n' wf.write(string) return cluster_centers if __name__ == '__main__': x_train = get_datasest() model_dm = train(x_train) cluster_centers = cluster(x_train)
models.doc2vec – Deep learning with paragraph2vec的更多相关文章
- DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
- deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
- A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- 《Deep Learning》(深度学习)中文版 开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
- How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
- 深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
- 《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...
随机推荐
- Http Header Content-Typ
Http Header里的Content-Type一般有这三种:application/x-www-form-urlencoded:数据被编码为名称/值对.这是标准的编码格式.multipart/fo ...
- JSP开发Web应用系统
1.动态网站开发基础 1-1:动态网页 a.为什么需要动态网页(当我们需要修改网页内容的时候,都要重新上传一次覆盖原来的页面.而且,制作必须要通过专用的网页制作工具,比如:Dreamweaver.Fr ...
- io重定向打开关闭 Eclipse中c开发printf无法输出解决办法
if(freopen("e:\\lstm-comparec\\lstm\\lstm\\output.txt","a",stdout)==NULL)fprintf ...
- JS&Jquery中的循环/遍历
JavaScript中的遍历 for 遍历 var anArray = ['one','two']; for(var n = 0; n < anArray.length; n++) { //具体 ...
- Java 统计字母个数
原理: 将字符串转换成char字符数组 然后使用另一个数组存储 代码如下 public class CalChar { public static void main(String[] args) { ...
- Dev-cpp怎样去掉括号匹配?
很多编C/C++的同学在用Dev-cpp的时候,就感觉到括号匹配很烦,又不知道哪里去掉. 所以,让ljn告诉你怎样去掉括号匹配. 1.打开Dev-cpp. 2.在菜单栏上,点击“工具[T]”,选择“编 ...
- maven项目配置findbugs插件 使用git钩子控制代码的提交
maven项目配置findbugs插件对代码进行静态检测 当发现代码有bug时,就不让用户commit代码到远程仓库里 没有bug时才可以commit到远程仓库中 (1)新建maven项目 ,配置fi ...
- 四:(之四)基于已有镜像构建自己的Docker镜像
4构建自己的Docker镜像 4.1常用命令: 等同于docker commit 将一个被改变的容器创建成一个新的image 等同于docker build 通过Dockerfile创建一个image ...
- 十四. Python基础(14)--递归
十四. Python基础(14)--递归 1 ● 递归(recursion) 概念: recursive functions-functions that call themselves either ...
- Python的网络编程--思维导图
Python的网络编程--思维导图