models.doc2vec – Deep learning with paragraph2vec
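References:
Using Doc2Vec to obtain vector representations of documents/paragraphs/sentences
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
A brief analysis of Doc2Vec in gensim
Advanced Gensim tutorial: training word2vec and doc2vec models
Computing text similarity with gensim doc2vec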
Reposted from:
Text clustering with gensim doc2vec + sklearn kmeans
The original page is hard to read, so it is excerpted here for convenience.
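Doc2vec can be used for text similarity: the model finds the sentences most similar to an input sentence. When analyzing a large corpus, however, feeding in sentences one at a time is impractical, and the rough categories in the corpus are not known in advance. So I decided to cluster the texts, with k-means as the clustering method. Doc2vec produces a vector for each piece of text, after which k-means is straightforward to apply. The KMeans class from the sklearn library is used. The program is as follows: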
# coding: utf-8
# Updated for Python 3 and gensim >= 4.0 (the original targeted Python 2 and an older gensim).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans


def get_dataset():
    # Each line of the input file is one document, already segmented into
    # space-separated words. UTF-8 encoding is assumed.
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
        print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        # Remove the trailing newline from the last word.
        word_list[-1] = word_list[-1].strip()
        # Tag each document with its line index.
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)
    return x_train


def train(x_train, size=200, epoch_num=100):
    # Passing the corpus to the constructor builds the vocabulary and runs an
    # initial training pass. `vector_size` replaces the old `size` argument.
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4)
    # Continue training for `epoch_num` additional epochs.
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epoch_num)
    model_dm.save('model/model_dm')
    return model_dm


def cluster(x_train):
    inferred_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")

    print("load train vectors...")
    # TaggedDocument is a namedtuple of (words, tags).
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        inferred_vectors_list.append(vector)

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(inferred_vectors_list)
    # Predict cluster labels for the first 100 documents (or fewer).
    labels = kmean_model.predict(inferred_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    # Write each document (words joined without separators) and its label.
    with open("out/own_claasify.txt", 'w', encoding='utf-8') as wf:
        for i in range(len(labels)):
            text = x_train[i][0]
            wf.write(''.join(text) + '\t' + str(labels[i]) + '\n')

    return cluster_centers


if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
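To see the clustering step in isolation, here is a minimal sketch that runs KMeans on synthetic vectors instead of real doc2vec output. The array `doc_vectors`, the three group offsets, and the choice of 3 clusters are all assumptions made for illustration; in the program above the vectors would come from `infer_vector` and the cluster count is 15.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for inferred document vectors: three well-separated
# groups of 20 vectors each, in a 200-dimensional space.
rng = np.random.default_rng(0)
offsets = np.array([-5.0, 0.0, 5.0])
doc_vectors = np.vstack(
    [offset + 0.1 * rng.standard_normal((20, 200)) for offset in offsets]
)

# Fit k-means; n_init and random_state are set only to make the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(doc_vectors)

labels = kmeans.labels_                    # cluster index of every training vector
cluster_centers = kmeans.cluster_centers_  # one center per cluster
first_labels = kmeans.predict(doc_vectors[:5])  # labels for a subset of vectors

print(cluster_centers.shape)
print(first_labels)
```

Which integer is assigned to which group is arbitrary, so only the grouping itself is meaningful. With real document vectors the clusters are rarely this cleanly separated, and the number of clusters usually has to be chosen by inspection or with a criterion such as inertia.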