参考：

转自：

gensim doc2vec + sklearn kmeans 做文本聚类

原文显示太乱为方便看摘录过来。。

用doc2vec做文本相似度，模型可以找到输入句子最相似的句子，然而分析大量的语料时，不可能一句一句的输入，语料数据大致怎么分类也不能知晓。于是决定做文本聚类。
选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来，然后用kmeans就很好操作了。
选择sklearn库中的KMeans类。

程序如下：

# coding:utf-8

import sys
import gensim
import numpy as np

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    with open("out/text_dict_cut.txt", 'r') as cf:
        docs = cf.readlines()
        print len(docs)

    x_train = []
    #y = np.concatenate(np.ones(len(docs)))
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l-1] = word_list[l-1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm

def cluster(x_train):
    infered_vectors_list = []
    print "load doc2vec model..."
    model_dm = Doc2Vec.load("model/model_dm")
    print "load train vectors..."
    i = 0
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)
        i += 1

    print "train kmean model..."
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    labels= kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w') as wf:
        for i in range(100):
            string = ""
            text = x_train[i][0]
            for word in text:
                string = string + word
            string = string + '\t'
            string = string + str(labels[i])
            string = string + '\n'
            wf.write(string)

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
《Deep Learning》（深度学习）中文版开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...

随机推荐

span 超出部分换行
span{ word-break: normal; width: auto; display: block; white-space: pre-wrap; word-wrap: break-word; ...
js 鼠标滚动禁用启用
function disabledMouseWheel() { var div = document.getElementById('divid'); if (div.addEventListener ...
LY.JAVA.DAY12.String类
2018-07-24 14:06:03 String类概述字符串是由多个字符组成一串数据(字符序列) 字符串可以看成字符数组一旦被赋值就不能被改变值不能变 1.过程概述: 方法区---字符 ...
C++解析七-重载运算符和重载函数
重载运算符和重载函数C++ 允许在同一作用域中的某个函数和运算符指定多个定义,分别称为函数重载和运算符重载.重载声明是指一个与之前已经在该作用域内声明过的函数或方法具有相同名称的声明,但是它们的参数列 ...
跨域 jsonp 和 CORS 资料
http://mp.weixin.qq.com/s/iAShnqvsOyV-Xd0Ft7Nl2Q HTML5安全:CORS(跨域资源共享)简介...ie67不要想了... http://www.cnb ...
webpack分离打包css和less
github仓库:https://github.com/llcMite/webpack.git 为什么要分离打包? 答:刚开始学webpack的时候就很郁闷,明明没几个文件,打包出来体积特 ...
vue常见开发问题整理
1.(webpack)vue-cli构建的项目如何设置每个页面的title 在路由里每个都添加一个meta [{ path:'/login', meta: { title: '登录页面' }, com ...
EF-记录程序自动生成并执行的sql语句日志
在EntityFramework的CodeFirst模式中,我们想将程序自动生成的sql语句和执行过程记录到日志中,方便以后查看和分析. 在EF的6.x版本中,在DbContext中有一个Databa ...
redis、memcache、mongoDB 对比
从以下几个维度,对 redis.memcache.mongoDB 做了对比. 1.性能都比较高,性能对我们来说应该都不是瓶颈. 总体来讲,TPS 方面 redis 和 memcache 差不多,要大 ...
uitableviewcell textlabel detailtextLabel 换行的位置及尺寸问题
我们在使用uitableView的时候,一些简单的cell样式其实是不需要自定义的,但是系统的方法又似乎又无法满足需要,这时候我们就需要在系统上做一些改变来达到我们的需求: 像这种cell,简单分析下 ...

models.doc2vec – Deep learning with paragraph2vec

用 Doc2Vec 得到文档／段落／句子的向量表达

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

基于gensim的Doc2Vec简析

Gensim进阶教程：训练word2vec与doc2vec模型

用gensim doc2vec计算文本相似度

gensim doc2vec + sklearn kmeans 做文本聚类

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

随机推荐

热门专题