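Preface

In the previous article we gave a rough introduction to intent recognition. Abstracted, it is a classification problem. Structurally, we use an LSTM to extract features and a softmax layer for the final multi-class classification. Because our corpus is limited, we currently consider only three intents: radio station, music, and question answering. Supporting more intents mostly comes down to adding corpora for the new categories and changing the number of softmax output classes. The goal is a classification accuracy of at least 90% on these three classes.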
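To make the role of softmax concrete, here is a minimal sketch in plain Python 3, independent of Keras. The function name `softmax`, the label list, and the example scores are illustrative assumptions and do not come from the training script. It turns one raw score per intent into probabilities that sum to 1, so supporting more intents simply means producing more scores.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and scores, for illustration only.
labels = ["station", "music", "question"]
probs = softmax([0.5, 2.0, -1.0])
predicted = labels[probs.index(max(probs))]
print(predicted)
```

The predicted intent is the label with the highest probability; in the Keras model this computation is what the final `Dense` layer with `activation='softmax'` performs.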

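We will use Keras (strictly speaking, it is only an interface) to implement intent recognition.

Overall pipeline

Figure 1: Training pipeline for intent classification

Figure 1 shows the overall pipeline. First, the corpus is preprocessed, which includes removing punctuation, removing stop words, and so on. The cleaned corpus is then used to generate word vectors with word2vec, an LSTM extracts features from those vectors, and finally softmax performs the intent classification. The pipeline as a whole is straightforward.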
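The preprocessing step can be sketched as follows in plain Python 3. This is only an illustration under assumptions: the helper names and the tiny stop-word set are hypothetical, and a real system would load a full stop-word list.

```python
import unicodedata

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"的", "了", "吗"}


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


cleaned = strip_punctuation("我想听，汪峰的歌曲！")
tokens = remove_stop_words(["汪峰", "的", "歌曲"])
print(cleaned, tokens)
```

Using the Unicode category rather than a fixed character list means both ASCII and full-width Chinese punctuation are removed.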

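Data description

Our data consists of three files: question.txt, music.txt, and station.txt. The samples below show the format, with one query per line; organize your training data the same way. Adding more categories works the same way.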

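music.txt

我想听千千阙歌 (English: I want to listen to "千千阙歌")

汪峰的歌曲 (English: songs by 汪峰)

question.txt

天为甚么这么蓝 (English: Why is the sky so blue?)

中国有多大 (English: How big is China?)

station.txt

我要听郭德纲的相声 (English: I want to listen to 郭德纲's crosstalk)

交通广播电台 (English: traffic radio station)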

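Corpus preprocessing

Our preprocessing is currently very rough: we only sample the corpus so that the three classes are used for training in a 1:1:1 ratio. Here is a question worth thinking about: why should we try to keep the classes balanced at 1:1:1 during training?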
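One common reason for balancing is that with a heavily skewed corpus a classifier can reach high accuracy simply by favoring the majority class. A minimal sketch of 1:1:1 sampling in plain Python 3 is shown below; the function name `balance_classes` and the toy corpus are hypothetical and only illustrate downsampling every class to the size of the smallest one.

```python
import random


def balance_classes(samples_by_class, seed=1337):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    smallest = min(len(samples) for samples in samples_by_class.values())
    return {
        label: rng.sample(samples, smallest)
        for label, samples in samples_by_class.items()
    }


# Toy corpus with unequal class sizes, for illustration only.
corpus = {
    "music": ["m%d" % i for i in range(50)],
    "question": ["q%d" % i for i in range(30)],
    "station": ["s%d" % i for i in range(20)],
}
balanced = balance_classes(corpus)
sizes = {label: len(samples) for label, samples in balanced.items()}
print(sizes)
```

Downsampling discards data from the larger classes; oversampling the smaller classes is an alternative when the corpus is small.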

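Generating word vectors

Generating word vectors converts the corpus from text into numbers so that later stages can process it. We train the vectors directly with word2vec; the principles behind word2vec are not covered here. For this training step we use all 15,000 samples.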

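# -*- coding: UTF-8 -*-
import os

import numpy as np
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary


class Embedding(object):
    """Iterate over every file in a directory, yielding one tokenized line at a time."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # Tokens are expected to be separated by whitespace already.
                yield line.split()


if __name__ == '__main__':
    # Train the word2vec model
    sentences = Embedding('../data/')  # a memory-friendly iterator
    # The original listing ends here; the Word2Vec training call and the
    # step that saves the model are not shown.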
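The point of the iterator class is that sentences are streamed from disk instead of being loaded into memory at once, and that the stream can be restarted. The sketch below reproduces that pattern in plain Python 3 on a temporary directory. The class name `SentenceStream` and the sample file are hypothetical, and it assumes the text is already segmented with spaces.

```python
import os
import tempfile


class SentenceStream(object):
    """Yield one list of tokens per non-empty line of every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:
                        yield tokens


# Build a tiny corpus directory, for illustration only.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "music.txt"), "w", encoding="utf-8") as handle:
    handle.write("我 想 听 千千阙歌\n汪峰 的 歌曲\n")

stream = SentenceStream(tmpdir)
first_pass = list(stream)
second_pass = list(stream)  # __iter__ starts again from the first file
print(first_pass)
```

Because `__iter__` reopens the files on every call, the same object can be iterated several times, which is what a trainer that makes multiple passes over the corpus needs.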

The architecture of the code is as follows:

Figure 2: Multi-layer LSTM for feature extraction, followed by a three-class softmax

# -*- coding: utf-8 -*-
# NOTE: this script targets Python 2 and the Keras 1.x API
# (print statements, reload(sys), nb_epoch, output_dim).
import yaml
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

from sklearn.cross_validation import train_test_split
import multiprocessing
import numpy as np
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout, Activation
from keras.models import model_from_yaml
from sklearn.preprocessing import LabelEncoder
np.random.seed(1337)  # For reproducibility
import jieba
import pandas as pd
sys.setrecursionlimit(1000000)

# Parameters
vocab_dim = 100
maxlen = 100
n_iterations = 1  # ideally more..
n_exposures = 10
window_size = 7
batch_size = 32
n_epoch = 15
input_length = 100
cpu_count = multiprocessing.cpu_count()


def read_lines(path):
    # Read one query per line
    with open(path, 'r') as fopen:
        return [line for line in fopen]


# Load the training files
def loadfile():
    question = read_lines('data/question_query.txt')
    music = read_lines('data/music_query.txt')
    station = read_lines('data/station_query.txt')

    # The labels below must follow the same order as the concatenated data
    combined = np.concatenate((station, music, question))
    station_array = np.array([0] * len(station), dtype=int)
    music_array = np.array([1] * len(music), dtype=int)
    question_array = np.array([-1] * len(question), dtype=int)
    y = np.hstack((station_array, music_array, question_array))
    print "y is:"
    print y.size
    print "combined is:"
    print combined.size
    return combined, y


# Segment each sentence into words and strip the trailing newline
def tokenizer(document):
    ''' Segment every sentence with jieba, join the tokens with single
        spaces, and strip surrounding whitespace (including the newline)
    '''
    result_list = []
    for text in document:
        result_list.append(' '.join(jieba.cut(text)).encode('utf-8').strip())
    return result_list


# Build the word dictionary; return the index of each word, its word vector,
# and the word indices of every sentence
def create_dictionaries(model=None,
                        combined=None):
    ''' Function does a number of jobs:
        1- Creates a word to index mapping
        2- Creates a word to vector mapping
        3- Transforms the sentences into padded index sequences
        4- Returns the two mappings together with the transformed data
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        # Index of every word with frequency above 10 (indices start at 1)
        w2indx = {v: k + 1 for k, v in gensim_dict.items()}
        # Word vector of every word with frequency above 10
        w2vec = {word: model[word] for word in w2indx.keys()}

        def parse_dataset(combined):
            ''' Words become integers
            '''
            data = []
            for sentence in combined:
                new_txt = []
                sentences = sentence.split(' ')
                for word in sentences:
                    try:
                        word = unicode(word, errors='ignore')
                        new_txt.append(w2indx[word])
                    except Exception:
                        # Words missing from the vocabulary get index 0
                        new_txt.append(0)
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        # Word indices of each sentence, padded to maxlen; words with
        # frequency below 10 have index 0
        combined = sequence.pad_sequences(combined, maxlen=maxlen)
        return w2indx, w2vec, combined
    else:
        print 'No data provided...'


# Load the pretrained word2vec model and build the dictionaries from it
def word2vec_train(combined):
    model = Word2Vec.load('lstm_data/model/Word2vec_model.pkl')
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined


def get_data(index_dict, word_vectors, combined, y):
    # Number of indices; words with frequency below 10 share index 0, hence +1
    n_symbols = len(index_dict) + 1
    # The vector for index 0 stays all zeros
    embedding_weights = np.zeros((n_symbols, vocab_dim))
    # Starting from index 1, fill in the vector of each word
    for word, index in index_dict.items():
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2)
    # Encode class values as integers; fit on the training labels only and
    # reuse the same mapping for the test labels
    encoder = LabelEncoder()
    encoded_y_train = encoder.fit_transform(y_train)
    encoded_y_test = encoder.transform(y_test)
    # Convert integers to dummy variables (one-hot encoding)
    y_train = np_utils.to_categorical(encoded_y_train)
    y_test = np_utils.to_categorical(encoded_y_test)
    print x_train.shape, y_train.shape
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test


# Define the network structure
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    nb_classes = 3
    print 'Defining a Simple Keras Model...'
    # Basic network structure
    model = Sequential()  # or Graph or whatever
    # The Embedding layer maps every word index to a vocab_dim-dimensional
    # vector, initialized from the word2vec weights
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
    # Single LSTM layer: output dimension 50, input dimension vocab_dim,
    # relu activation
    model.add(LSTM(output_dim=50, activation='relu', inner_activation='hard_sigmoid'))
    model.add(Dropout(0.5))
    # Softmax output layer for the final 3-class classification
    model.add(Dense(output_dim=nb_classes, input_dim=50, activation='softmax'))
    print 'Compiling the Model...'
    # The optimizer is adam
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    print "Train..."
    print y_train
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch, verbose=1, validation_data=(x_test, y_test))
    print "Evaluate..."
    score = model.evaluate(x_test, y_test,
                           batch_size=batch_size)
    yaml_string = model.to_yaml()
    with open('lstm_data/lstm_koubei.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm_koubei.h5')
    print 'Test score:', score


# Train the model and save it
# (the original listing also contained an identical copy named train())
def self_train():
    print 'Loading Data...'
    combined, y = loadfile()
    print len(combined), len(y)
    print 'Tokenising...'
    combined = tokenizer(combined)
    print 'Loading the Word2vec model...'
    index_dict, word_vectors, combined = word2vec_train(combined)
    print 'Setting up Arrays for Keras Embedding Layer...'
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined, y)
    print x_train.shape, y_train.shape
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)


# Convert a single input string into a padded index sequence
def input_transform(string):
    words = ' '.join(jieba.cut(string)).encode('utf-8').strip()
    tmp_list = []
    tmp_list.append(words)
    model = Word2Vec.load('lstm_data/model/Word2vec_model.pkl')
    _, _, combined = create_dictionaries(model, tmp_list)
    return combined


if __name__ == '__main__':
    self_train()
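The data preparation in the script boils down to three steps: map each word to an integer index (with 0 reserved for unknown words), pad every sentence to a fixed length, and one-hot encode the labels. The sketch below shows these steps in plain Python 3 without Keras or scikit-learn. All helper names are hypothetical; `pad_left` imitates the default behavior of Keras `pad_sequences`, which pads and truncates at the front.

```python
def build_index(vocabulary):
    """Map each word to an index starting at 1; 0 means unknown or padding."""
    return {word: i + 1 for i, word in enumerate(vocabulary)}


def to_indices(tokens, word_index):
    """Replace each token by its index, using 0 for out-of-vocabulary words."""
    return [word_index.get(token, 0) for token in tokens]


def pad_left(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(labels):
    """Encode labels as one-hot rows, with columns in sorted label order."""
    classes = sorted(set(labels))
    return [[1 if label == c else 0 for c in classes] for label in labels]


word_index = build_index(["我", "想", "听", "歌曲"])
ids = to_indices(["我", "想", "听", "相声"], word_index)
padded = pad_left(ids, 6)
targets = one_hot([-1, 0, 1, 0])
print(padded, targets)
```

Sorting the classes mirrors what a label encoder does with the labels -1, 0, and 1: they become columns 0, 1, and 2 of the one-hot matrix.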

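Modifying the network structure

With a single-layer LSTM, the training accuracy already exceeds 96% after 15 epochs. Thinking further: can stacking LSTM layers reach a higher training accuracy? Everything else stays the same; we only change the network definition.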

# Define the network structure
# NOTE: Flatten must also be imported (from keras.layers.core import Flatten).
# NonMasking is a custom layer that the original listing does not define;
# presumably it drops the mask created by mask_zero=True so that Flatten
# can be applied.
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    nb_classes = 3
    print 'Defining a Simple Keras Model...'
    model = Sequential()  # or Graph or whatever
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
    print vocab_dim
    print n_symbols
    #model.add(LSTM(output_dim=50, activation='relu',inner_activation='hard_sigmoid'))
    #model.add(LSTM(output_dim=25, activation='relu', return_sequences=True))
    # Two stacked LSTM layers; both return the full output sequence
    model.add(LSTM(64, input_dim=vocab_dim, activation='relu', return_sequences=True))
    model.add(LSTM(32, return_sequences=True))
    model.add(Dropout(0.5))
    #model.add(Dense(nb_classes))
    #model.add(Activation('softmax'))
    model.summary()  # prints the layers added so far
    model.add(NonMasking())
    model.add(Flatten())
    model.add(Dense(output_dim=nb_classes, activation='softmax'))
    print 'Compiling the Model...'
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    print "Train..."
    print y_train
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch, verbose=1, validation_data=(x_test, y_test))
    print "Evaluate..."
    score = model.evaluate(x_test, y_test,
                           batch_size=batch_size)
    yaml_string = model.to_yaml()
    with open('lstm_data/lstm_koubei.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm_koubei.h5')
    print 'Test score:', score

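With the same 15 epochs, the training accuracy reaches about 97%. This suggests that stacking LSTM layers does help and captures the features of the training corpus better.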
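When comparing the two networks it helps to know how much larger the stacked one is. The sketch below, in plain Python 3, counts trainable parameters with the standard formula for an LSTM layer, 4 * ((input_dim + units) * units + units), using the layer sizes from the two listings. The function names are hypothetical, and the Embedding layer is left out because it is identical in both models.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


vocab_dim, maxlen, nb_classes = 100, 100, 3

# Single LSTM(50) followed by Dense(3)
single = lstm_params(vocab_dim, 50) + dense_params(50, nb_classes)

# LSTM(64) -> LSTM(32), both returning sequences, then Flatten -> Dense(3);
# Flatten feeds maxlen * 32 values into the Dense layer
stacked = (lstm_params(vocab_dim, 64)
           + lstm_params(64, 32)
           + dense_params(maxlen * 32, nb_classes))

print(single, stacked)
```

The stacked variant has roughly twice as many parameters, and its output layer sees every time step rather than only the last one, so the gain in training accuracy may not come from depth alone.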

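Reflections and summary

So far we can only say that we have built a demo of intent recognition. It already reaches a fairly high training accuracy, but there is much to improve. First, and most obviously, our training corpus is still small and covers few classes; we would like to use more data and more classes while keeping the training accuracy. Second, corpus preprocessing is very rough: we did not remove stop words or punctuation, because our training corpus is already fairly clean. Third, our word segmentation is crude: we use jieba with its default dictionary, which does not match our domain vocabulary. Fourth, we would like to try a CNN and compare its feature extraction against the LSTM.

Still, this shows how powerful deep learning is in natural language processing. We no longer need to painstakingly hand-craft features such as unigrams and bigrams; describing text with embeddings saves a great deal of manual work, and the training accuracy far exceeded our expectations.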
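For contrast, this is roughly what hand-crafted unigram and bigram features look like. The sketch is plain Python 3 with a hypothetical helper name and a toy sentence; it is not part of the model above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["我", "想", "听", "歌曲"]
bigrams = ngrams(tokens, 2)
# Bag of unigram and bigram counts for one sentence
features = Counter(ngrams(tokens, 1) + bigrams)
print(len(features))
```

Each distinct n-gram becomes its own feature, so the feature space grows quickly with the vocabulary, whereas an embedding keeps a fixed number of dimensions per word.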

