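References:

Using Doc2Vec to obtain vector representations of documents, paragraphs, and sentences

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

A brief analysis of Doc2Vec in gensim

Advanced Gensim tutorial: training word2vec and doc2vec models

Computing text similarity with gensim doc2vec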

Reposted from:

Text clustering with gensim doc2vec + sklearn kmeans

The original page is badly formatted, so the content is excerpted here for easier reading.

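Doc2vec can be used for text similarity: given an input sentence, the model finds the sentences most similar to it. When analyzing a large corpus, however, feeding in sentences one at a time is not practical, and it is not known in advance how the corpus roughly divides into categories. So I decided to cluster the texts.
I chose k-means as the clustering method. Doc2vec already produces a vector for each piece of text, so applying k-means to those vectors is straightforward.
The implementation uses the KMeans class from the sklearn library.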
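
As a quick illustration of the idea, here is a minimal sketch of k-means on document vectors. The vectors are synthetic stand-ins generated with NumPy (an assumption made so the snippet runs without a trained doc2vec model), and the names `doc_vectors`, `group_a`, and `group_b` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for doc2vec output: two well-separated groups
# of 4-dimensional vectors (illustration only, not real document vectors).
rng = np.random.RandomState(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 4))
group_b = rng.normal(loc=5.0, scale=0.1, size=(5, 4))
doc_vectors = np.vstack([group_a, group_b])

# Fit k-means and get one cluster label per vector.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)

print(labels)                         # one cluster id per row of doc_vectors
print(kmeans.cluster_centers_.shape)  # (n_clusters, vector dimension)
```

With real data, `doc_vectors` would be replaced by the vectors that doc2vec produces for each text.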

The program is as follows:
# coding:utf-8
# Adapted to Python 3. Parameter names assume a recent gensim release
# (vector_size replaces the older size argument).

import os

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans


def get_dataset():
    # The corpus file holds one pre-segmented document per line, words separated by spaces.
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
        print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        # Drop the trailing newline, then split into words.
        word_list = text.rstrip().split(' ')
        # Tag each document with its line index.
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train


def train(x_train, size=200, epoch_num=100):
    # Create the model without a corpus so that training happens exactly once,
    # in the explicit train() call below.
    model_dm = Doc2Vec(min_count=1, window=3, vector_size=size, sample=1e-3, negative=5, workers=4)
    model_dm.build_vocab(x_train)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epoch_num)
    os.makedirs('model', exist_ok=True)
    model_dm.save('model/model_dm')

    return model_dm


def cluster(x_train):
    inferred_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")
    print("load train vectors...")
    for text, label in x_train:
        # Infer a vector for each document from its word list.
        vector = model_dm.infer_vector(text)
        inferred_vectors_list.append(vector)

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(inferred_vectors_list)
    # Only the first 100 documents (or fewer, for a small corpus) are labeled and written out.
    n_out = min(100, len(x_train))
    labels = kmean_model.predict(inferred_vectors_list[0:n_out])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w', encoding='utf-8') as wf:
        for i in range(n_out):
            # Rejoin the segmented words without spaces, then append the cluster label.
            text = x_train[i][0]
            wf.write("".join(text) + '\t' + str(labels[i]) + '\n')

    return cluster_centers


if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
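
The program fixes n_clusters=15, although the number of categories in the corpus is not known beforehand. One common way to choose it, which is not part of the original post, is to compare silhouette scores across several candidate values. The sketch below is only an assumption about how that could be done: `pick_n_clusters` is a hypothetical helper, and the vectors are synthetic stand-ins for inferred doc2vec vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_n_clusters(vectors, candidates):
    """Return (best_k, scores), where best_k has the highest silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic data: three tight, well-separated blobs in 2-D (illustration only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
vectors = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

best_k, scores = pick_n_clusters(vectors, range(2, 6))
print(best_k)
print(scores)
```

On real document vectors the scores are usually much closer together than on this toy data, so the result is better treated as a hint than as a definitive answer.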
