用Keras搞一个阅读理解机器人
catalogue
. 训练集
. 数据预处理
. 神经网络模型设计(对话集 <-> 问题集)
. 神经网络模型设计(问题集 <-> 回答集)
. RNN神经网络
. 训练
. 效果验证
1. 训练集
Mary moved to the bathroom.
John went to the hallway.
Where is Mary? bathroom
Daniel went back to the hallway.
Sandra moved to the garden.
Where is Daniel? hallway
John moved to the office.
Sandra journeyed to the bathroom.
Where is Daniel? hallway
Mary moved to the hallway.
Daniel travelled to the office.
Where is Daniel? office
John went back to the garden.
John moved to the bedroom.
Where is Sandra? bathroom
Sandra travelled to the office.
Sandra went to the bathroom.
Where is Sandra? bathroom
Mary went to the bedroom.
Daniel moved to the hallway.
Where is Sandra? bathroom
John went to the garden.
John travelled to the office.
Where is Sandra? bathroom
Daniel journeyed to the bedroom.
Daniel travelled to the hallway.
Where is John? office
训练集以对话集合[2] + 问题[1] + 回答[1]的形式组成
Relevant Link:
https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz
2. 数据预处理
0x1: 词汇表
词汇表是语料向量化的一个基础,用来将单词映射到向量空间
Here's what a "story" tuple looks like (input, query, answer):
([u'Mary', u'moved', u'to', u'the', u'bathroom', u'.', u'John', u'went', u'to', u'the', u'hallway', u'.'], [u'Where', u'is', u'Mary', u'?'], u'bathroom')
0x2: 训练集编码
根据词汇表将对话 + 问题 + 答案进行数字向量化编码
Vectorizing the word sequences...
inputs_train[]
[ ]
queries_train[]
[ ]
answers_train[]
[ . . . . . . . . . . . . . . . . . .
. . . .] //训练集和测试机同时都需要编码
inputs_test[]
[ ]
queries_test[]
[ ]
answers_test[]
[ . . . . . . . . . . . . . . . . . .
. . . .]
0x3: size归一化Padding
根据预处理过程中找到的最长对话、最长问题、最长答案,将所有的对话/问题/答案向量全部前置Padding到一个等长的size chunk
def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
X = []
Xq = []
Y = []
for story, query, answer in data:
x = [word_idx[w] for w in story]
xq = [word_idx[w] for w in query]
y = np.zeros(len(word_idx) + ) # let's not forget that index 0 is reserved
y[word_idx[answer]] =
X.append(x)
Xq.append(xq)
Y.append(y)
return (pad_sequences(X, maxlen=story_maxlen),
pad_sequences(Xq, maxlen=query_maxlen), np.array(Y))
Relevant Link:
3. 神经网络模型设计(对话集 <-> 问题集)
总体来说,我们把对话集和问题集分成2路神经网络,分别迭代计算,最后合并到一个activation function中(softmax activation)。之所以将对话集和问题集分成2路然后合并为一组,是因为对话和问题之间内在是由关联性的,阅读理解题目不会平白无故提出一个和对话无关的问题
0x1: 对话集训练
1. 嵌入层 Embedding
嵌入层将正整数(下标)转换为具有固定大小的向量,如[[4],[20]]->[[0.25,0.1],[0.6,-0.2]]
Embedding层只能作为模型的第一层,我们这里把它作为模型的第一层,用来接收对话词化向量
keras.layers.embeddings.Embedding(
input_dim,
output_dim,
init='uniform',
input_length=None,
W_regularizer=None,
activity_regularizer=None,
W_constraint=None,
mask_zero=False,
weights=None,
dropout=0.0
) . input_dim:大或等于0的整数,字典长度,即输入数据最大下标+
. output_dim:大于0的整数,代表全连接嵌入的维度
. init:初始化方法,为预定义初始化方法名的字符串,或用于初始化权重的Theano函数。该参数仅在不传递weights参数时有意义。
. weights:权值,为numpy array的list。该list应仅含有一个如(input_dim,output_dim)的权重矩阵
. W_regularizer:施加在权重上的正则项,为WeightRegularizer对象
. W_constraints:施加在权重上的约束项,为Constraints对象
. mask_zero:布尔值,确定是否将输入中的‘’看作是应该被忽略的‘填充’(padding)值,该参数在使用递归层处理变长输入时有用。设置为True的话,模型中后续的层必须都支持masking,否则会抛出异常
. input_length:当输入序列的长度固定时,该值为其长度。如果要在该层后接Flatten层,然后接Dense层,则必须指定该参数,否则Dense层的输出维度无法自动推断。
dropout:~1的浮点数,代表要断开的嵌入比例, 输入shape
形如(samples,sequence_length)的2D张量 输出shape
形如(samples, sequence_length, output_dim)的3D张量
code slice
# embed the input sequence into a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,
output_dim=,
input_length=story_maxlen))
2. Dropout防止过拟合
input_encoder_m.add(Dropout(0.3))
0x2: 问题集训练
和对话集模型结构一样
# embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
output_dim=,
input_length=query_maxlen))
question_encoder.add(Dropout(0.3))
0x3: 合并对话集和问题集
Merge层根据给定的模式,将一个张量列表中的若干张量合并为一个单独的张量
keras.engine.topology.Merge(
layers=None,
mode='sum',
concat_axis=-,
dot_axes=-,
output_shape=None,
node_indices=None,
tensor_indices=None,
name=None
) . layers:该参数为Keras张量的列表,或Keras层对象的列表。该列表的元素数目必须大于1。
. mode:合并模式,为预定义合并模式名的字符串或lambda函数或普通函数,如果为lambda函数或普通函数,则该函数必须接受一个张量的list作为输入,并返回一个张量。如果为字符串,则必须是下列值之一:“sum”,“mul”,“concat”,“ave”,“cos”,“dot”
. concat_axis:整数,当mode=concat时指定需要串联的轴
. dot_axes:整数或整数tuple,当mode=dot时,指定要消去的轴
. output_shape:整数tuple或lambda函数/普通函数(当mode为函数时)。如果output_shape是函数时,该函数的输入值应为一一对应于输入shape的list,并返回输出张量的shape。
. node_indices:可选,为整数list,如果有些层具有多个输出节点(node)的话,该参数可以指定需要merge的那些节点的下标。如果没有提供,该参数的默认值为全0向量,即合并输入层0号节点的输出值。
. tensor_indices:可选,为整数list,如果有些层返回多个输出张量的话,该参数用以指定需要合并的那些张量
对于Embedding层来说,输入的是2D,输出的3D,第三维度对我们价值不大,因此在Merge的时候消去
match = Sequential()
match.add(Merge([input_encoder_m, question_encoder],
mode='dot',
dot_axes=[, ]))
0x4: 对话集和问题集激活层
match.add(Activation('softmax'))
Relevant Link:
https://keras-cn.readthedocs.io/en/latest/layers/embedding_layer/
https://keras-cn.readthedocs.io/en/latest/layers/core_layer/
4. 神经网络模型设计(问题集 <-> 回答集)
同样道理,问题集和回答集之间也是有关联关系的,因此也就它们分成2路嵌套网络,最后合并到一起
# embed the input into a single vector with size = story_maxlen:
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,
output_dim=query_maxlen,
input_length=story_maxlen))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)
# sum the match vector with the input vector:
response = Sequential()
response.add(Merge([match, input_encoder_c], mode='sum'))
# output: (samples, story_maxlen, query_maxlen)
response.add(Permute((, ))) # output: (samples, query_maxlen, story_maxlen) # concatenate the match vector with the question vector,
# and do logistic regression on top
answer = Sequential()
answer.add(Merge([response, question_encoder], mode='concat', concat_axis=-))
5. RNN神经网络
有很多算法模型适合做这种推测型的神经网络,这里选择了RNN LSTM,RNN 是包含循环的网络,允许信息的持久化。传统的BP在面对层数较多时,能反馈到之前层的能力就会很弱了,但是RNN它能够较好地利用之前的历史训练数据,相当于一个长期记忆的能力
keras.layers.recurrent.LSTM(
output_dim,
init='glorot_uniform',
inner_init='orthogonal',
forget_bias_init='one',
activation='tanh',
inner_activation='hard_sigmoid',
W_regularizer=None,
U_regularizer=None,
b_regularizer=None,
dropout_W=0.0,
dropout_U=0.0
) . output_dim:内部投影和输出的维度
. init:初始化方法,为预定义初始化方法名的字符串,或用于初始化权重的Theano函数。
. inner_init:内部单元的初始化方法
. forget_bias_init:遗忘门偏置的初始化函数.建议初始化为全1元素
. activation:激活函数,为预定义的激活函数名
. inner_activation:内部单元激活函数
. W_regularizer:施加在权重上的正则项,为WeightRegularizer对象
. U_regularizer:施加在递归权重上的正则项,为WeightRegularizer对象
. b_regularizer:施加在偏置向量上的正则项,为WeightRegularizer对象
. dropout_W:~1之间的浮点数,控制输入单元到输入门的连接断开比例
. dropout_U:~1之间的浮点数,控制输入单元到递归连接的断开比例
code slice
# the original paper uses a matrix multiplication for this reduction step.
# we choose to use a RNN instead.
answer.add(LSTM())
# one regularization layer -- more would probably be needed.
answer.add(Dropout(0.3))
answer.add(Dense(vocab_size))
# we output a probability distribution over the vocabulary
answer.add(Activation('softmax'))
Relevant Link:
http://www.jianshu.com/p/9dc9f41f0b29
https://keras-cn.readthedocs.io/en/latest/layers/recurrent_layer/
6. 训练
fit(
self,
x,
y,
batch_size=,
nb_epoch=,
verbose=,
callbacks=[],
validation_split=0.0,
validation_data=None,
shuffle=True,
class_weight=None,
sample_weight=None
) . x:输入数据。如果模型只有一个输入,那么x的类型是numpy array,如果模型有多个输入,那么x的类型应当为list,list的元素是对应于各个输入的numpy array。如果模型的每个输入都有名字,则可以传入一个字典,将输入名与其输入数据对应起来。
. y:标签,numpy array。如果模型有多个输出,可以传入一个numpy array的list。如果模型的输出拥有名字,则可以传入一个字典,将输出名与其标签对应起来。
. batch_size:整数,指定进行梯度下降时每个batch包含的样本数。训练时一个batch的样本会被计算一次梯度下降,使目标函数优化一步。
. nb_epoch:整数,训练的轮数,训练数据将会被遍历nb_epoch次。Keras中nb开头的变量均为"number of"的意思
. verbose:日志显示,0为不在标准输出流输出日志信息,1为输出进度条记录,2为每个epoch输出一行记录
. callbacks:list,其中的元素是keras.callbacks.Callback的对象。这个list中的回调函数将会在训练过程中的适当时机被调用,参考回调函数
. validation_split:~1之间的浮点数,用来指定训练集的一定比例数据作为验证集。验证集将不参与训练,并在每个epoch结束后测试的模型的指标,如损失函数、精确度等。
. validation_data:形式为(X,y)或(X,y,sample_weights)的tuple,是指定的验证集。此参数将覆盖validation_spilt。
. shuffle:布尔值,表示是否在训练过程中每个epoch前随机打乱输入样本的顺序。
. class_weight:字典,将不同的类别映射为不同的权值,该参数用来在训练过程中调整损失函数(只能用于训练)。该参数在处理非平衡的训练数据(某些类的训练样本数很少)时,可以使得损失函数对样本数不足的数据更加关注。
. sample_weight:权值的numpy array,用于在训练时调整损失函数(仅用于训练)。可以传递一个1D的与样本等长的向量用于对样本进行1对1的加权,或者在面对时序数据时,传递一个的形式为(samples,sequence_length)的矩阵来为每个时间步上的样本赋不同的权。这种情况下请确定在编译模型时添加了sample_weight_mode='temporal'。
fit函数返回一个History的对象,其History.history属性记录了损失函数和其他指标的数值随epoch变化的情况,如果有验证集的话,也包含了验证集的这些指标变化情况
0x1: 保存模型结构、训练出来的权重、及优化器状态
keras的callback参数可以帮助我们实现在训练过程中的适当时机被调用。实现实时保存训练模型以及训练参数
keras.callbacks.ModelCheckpoint(
filepath,
monitor='val_loss',
verbose=,
save_best_only=False,
save_weights_only=False,
mode='auto',
period=
) . filename:字符串,保存模型的路径
. monitor:需要监视的值
. verbose:信息展示模式,0或1
. save_best_only:当设置为True时,将只保存在验证集上性能最好的模型
. mode:‘auto’,‘min’,‘max’之一,在save_best_only=True时决定性能最佳模型的评判准则,例如,当监测值为val_acc时,模式应为max,当检测值为val_loss时,模式应为min。在auto模式下,评价准则由被监测值的名字自动推断。
. save_weights_only:若设置为True,则只保存模型权重,否则将保存整个模型(包括模型结构,配置信息等)
. period:CheckPoint之间的间隔的epoch数
0x2: 当监测值不再改善时中止训练
keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=,
verbose=,
mode='auto'
) . monitor:需要监视的量
. patience:当early stop被激活(如发现loss相比上一个epoch训练没有下降),则经过patience个epoch后停止训练。
. verbose:信息展示模式
. mode:‘auto’,‘min’,‘max’之一,在min模式下,如果检测值停止下降则中止训练。在max模式下,当检测值不再上升则停止训练。
0x3:学习率动态调整
keras.callbacks.LearningRateScheduler(schedule) schedule:函数,该函数以epoch号为参数(从0算起的整数),返回一个新学习率(浮点数)
也可以让keras自动调整学习率
keras.callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.1,
patience=,
verbose=,
mode='auto',
epsilon=0.0001,
cooldown=,
min_lr=
) . monitor:被监测的量
. factor:每次减少学习率的因子,学习率将以lr = lr*factor的形式被减少
. patience:当patience个epoch过去而模型性能不提升时,学习率减少的动作会被触发
. mode:‘auto’,‘min’,‘max’之一,在min模式下,如果检测值触发学习率减少。在max模式下,当检测值不再上升则触发学习率减少。
. epsilon:阈值,用来确定是否进入检测值的“平原区”
. cooldown:学习率减少后,会经过cooldown个epoch才重新进行正常操作
. min_lr:学习率的下限
当学习停滞时,减少2倍或10倍的学习率常常能获得较好的效果
0x4: code
'''Trains a memory network on the bAbI dataset. References:
- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
http://arxiv.org/abs/1502.05698 - Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
"End-To-End Memory Networks",
http://arxiv.org/abs/1503.08895 Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after epochs.
Time per epoch: 3s on CPU (core i7).
''' from __future__ import print_function
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Activation, Dense, Merge, Permute, Dropout
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from functools import reduce
import tarfile
import numpy as np
import re
import os
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau def tokenize(sent):
'''Return the tokens of a sentence including punctuation. >>> tokenize('Bob dropped the apple. Where is the apple?')
['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
'''
return [x.strip() for x in re.split('(\W+)?', sent) if x.strip()] def parse_stories(lines, only_supporting=False):
'''Parse stories provided in the bAbi tasks format If only_supporting is true, only the sentences that support the answer are kept.
'''
data = []
story = []
for line in lines:
line = line.decode('utf-8').strip()
nid, line = line.split(' ', )
nid = int(nid)
if nid == :
story = []
if '\t' in line:
q, a, supporting = line.split('\t')
q = tokenize(q)
substory = None
if only_supporting:
# Only select the related substory
supporting = map(int, supporting.split())
substory = [story[i - ] for i in supporting]
else:
# Provide all the substories
substory = [x for x in story if x]
data.append((substory, q, a))
story.append('')
else:
sent = tokenize(line)
story.append(sent)
return data def get_stories(f, only_supporting=False, max_length=None):
'''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story. If max_length is supplied, any stories longer than max_length tokens will be discarded.
'''
data = parse_stories(f.readlines(), only_supporting=only_supporting)
flatten = lambda data: reduce(lambda x, y: x + y, data)
data = [(flatten(story), q, answer) for story, q, answer in data if not max_length or len(flatten(story)) < max_length]
return data def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
X = []
Xq = []
Y = []
for story, query, answer in data:
x = [word_idx[w] for w in story]
xq = [word_idx[w] for w in query]
y = np.zeros(len(word_idx) + ) # let's not forget that index 0 is reserved
y[word_idx[answer]] =
X.append(x)
Xq.append(xq)
Y.append(y)
return (pad_sequences(X, maxlen=story_maxlen),
pad_sequences(Xq, maxlen=query_maxlen), np.array(Y)) path = ''
try:
if not os.path.isfile("babi_tasks_1-20_v1-2.tar.gz"):
path = get_file('babi_tasks_1-20_v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
else:
path = 'babi_tasks_1-20_v1-2.tar.gz'
except:
print('Error downloading dataset, please download it manually:\n'
'$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
'$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
raise
if not os.path.isfile(path):
print("babi_tasks_1-20_v1-2.tar.gz downlaod faild")
exit()
tar = tarfile.open(path) challenges = {
# QA1 with , samples
'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
# QA2 with , samples
'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
}
challenge_type = 'single_supporting_fact_10k'
challenge = challenges[challenge_type] print('Extracting stories for the challenge:', challenge_type)
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test'))) vocab = sorted(reduce(lambda x, y: x | y, (set(story + q + [answer]) for story, q, answer in train_stories + test_stories)))
print('vocab')
print(vocab) # Reserve for masking via pad_sequences
vocab_size = len(vocab) +
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories))) print('-')
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-')
print('Here\'s what a "story" tuple looks like (input, query, answer):')
print(train_stories[])
print('-')
print('Vectorizing the word sequences...') word_idx = dict((c, i + ) for i, c in enumerate(vocab))
inputs_train, queries_train, answers_train = vectorize_stories(train_stories, word_idx, story_maxlen, query_maxlen)
inputs_test, queries_test, answers_test = vectorize_stories(test_stories, word_idx, story_maxlen, query_maxlen) print('inputs_train[0]')
print(inputs_train[])
print('queries_train[0]')
print(queries_train[])
print('answers_train[0]')
print(answers_train[]) print ('inputs_test[0]')
print(inputs_test[])
print('queries_test[0]')
print(queries_test[])
print('answers_test[0]')
print(answers_test[]) print('-')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', inputs_train.shape)
print('inputs_test shape:', inputs_test.shape)
print('-')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', queries_train.shape)
print('queries_test shape:', queries_test.shape)
print('-')
print('answers: binary (1 or 0) tensor of shape (samples, vocab_size)')
print('answers_train shape:', answers_train.shape)
print('answers_test shape:', answers_test.shape)
print('-')
print('Compiling...') ######################################## story - question_encoder ########################################
# embed the input sequence into a sequence of vectors
input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size,
output_dim=,
input_length=story_maxlen))
input_encoder_m.add(Dropout(0.3))
# output: (samples, story_maxlen, embedding_dim) # embed the question into a sequence of vectors
question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size,
output_dim=,
input_length=query_maxlen))
question_encoder.add(Dropout(0.3))
# output: (samples, query_maxlen, embedding_dim)
# compute a 'match' between input sequence elements (which are vectors)
# and the question vector sequence
match = Sequential()
match.add(Merge([input_encoder_m, question_encoder],
mode='dot',
dot_axes=[, ]))
match.add(Activation('softmax'))
# output: (samples, story_maxlen, query_maxlen)
######################################## story - question_encoder ######################################## # embed the input into a single vector with size = story_maxlen:
input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size,
output_dim=query_maxlen,
input_length=story_maxlen))
input_encoder_c.add(Dropout(0.3))
# output: (samples, story_maxlen, query_maxlen)
# sum the match vector with the input vector:
response = Sequential()
response.add(Merge([match, input_encoder_c], mode='sum'))
# output: (samples, story_maxlen, query_maxlen)
response.add(Permute((, ))) # output: (samples, query_maxlen, story_maxlen) # concatenate the match vector with the question vector,
# and do logistic regression on top
answer = Sequential()
answer.add(Merge([response, question_encoder], mode='concat', concat_axis=-)) # the original paper uses a matrix multiplication for this reduction step.
# we choose to use a RNN instead.
answer.add(LSTM())
# one regularization layer -- more would probably be needed.
answer.add(Dropout(0.3))
answer.add(Dense(vocab_size))
# we output a probability distribution over the vocabulary
answer.add(Activation('softmax')) # checkpoint
checkpointer = ModelCheckpoint(filepath="./checkpoint.hdf5", verbose=)
# learning rate adjust dynamic
lrate = ReduceLROnPlateau(min_lr=0.00001) answer.compile(optimizer='rmsprop', loss='categorical_crossentropy',
metrics=['accuracy'])
# Note: you could use a Graph model to avoid repeat the input twice
answer.fit(
[inputs_train, queries_train, inputs_train], answers_train,
batch_size=,
nb_epoch=,
validation_data=([inputs_test, queries_test, inputs_test], answers_test),
callbacks=[checkpointer, lrate]
)
Relevant Link:
https://keras-cn.readthedocs.io/en/latest/models/model/
7. 效果验证
predict的网络结构需要和train的时候保持一样,我们可以使用model.load将训练时保存的模型以及参数载入进来
0x1; 题目: target.story
little go back to the alibaba.
hann join to the jiangnan university.
Where is little?
0x2: code
'''Trains a memory network on the bAbI dataset. References:
- Jason Weston, Antoine Bordes, Sumit Chopra, Tomas Mikolov, Alexander M. Rush,
"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks",
http://arxiv.org/abs/1502.05698 - Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus,
"End-To-End Memory Networks",
http://arxiv.org/abs/1503.08895 Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after epochs.
Time per epoch: 3s on CPU (core i7).
''' from __future__ import print_function
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Activation, Dense, Merge, Permute, Dropout
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from functools import reduce
import tarfile
import numpy as np
import re
import os
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.models import load_model def tokenize(sent):
'''Return the tokens of a sentence including punctuation. >>> tokenize('Bob dropped the apple. Where is the apple?')
['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
'''
return [x.strip() for x in re.split('(\W+)?', sent) if x.strip()] def parse_stories(lines, only_supporting=False):
'''Parse stories provided in the bAbi tasks format If only_supporting is true, only the sentences that support the answer are kept.
'''
data = []
story = []
for line in lines:
line = line.decode('utf-8').strip()
nid, line = line.split(' ', )
nid = int(nid)
if nid == :
story = []
if '\t' in line:
q, a, supporting = line.split('\t')
q = tokenize(q)
substory = None
if only_supporting:
# Only select the related substory
supporting = map(int, supporting.split())
substory = [story[i - ] for i in supporting]
else:
# Provide all the substories
substory = [x for x in story if x]
data.append((substory, q, a))
story.append('')
else:
sent = tokenize(line)
story.append(sent)
return data def parse_stories_target(lines, only_supporting=False):
'''Parse stories provided in the bAbi tasks format If only_supporting is true, only the sentences that support the answer are kept.
'''
data = []
story = []
for line in lines:
line = line.decode('utf-8').strip()
nid, line = line.split(' ', )
nid = int(nid)
if nid == :
story = []
print(line)
sent = tokenize(line)
story.append(sent)
substory = story[:len(story) - ]
query = story[-]
data.append((substory, query))
return data def get_stories(f, only_supporting=False, max_length=None):
'''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story. If max_length is supplied, any stories longer than max_length tokens will be discarded.
'''
data = parse_stories(f.readlines(), only_supporting=only_supporting)
flatten = lambda data: reduce(lambda x, y: x + y, data)
data = [(flatten(story), q, answer) for story, q, answer in data if not max_length or len(flatten(story)) < max_length]
return data def get_stories_target(f, only_supporting=False, max_length=None):
'''Given a file name, read the file, retrieve the stories, and then convert the sentences into a single story. If max_length is supplied, any stories longer than max_length tokens will be discarded.
'''
data = parse_stories_target(f.readlines(), only_supporting=only_supporting)
print('parse_stories_target')
print(data)
flatten = lambda data: reduce(lambda x, y: x + y, data)
data = [(flatten(story), q) for story, q in data if not max_length or len(flatten(story)) < max_length]
return data def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
X = []
Xq = []
for story, query in data:
x = [word_idx[w] for w in story]
xq = [word_idx[w] for w in query]
X.append(x)
Xq.append(xq)
return (
pad_sequences(X, maxlen=story_maxlen),
pad_sequences(Xq, maxlen=query_maxlen)
) path = 'babi_tasks_1-20_v1-2.tar.gz'
if not os.path.isfile(path):
print("babi_tasks_1-20_v1-2.tar.gz not exist")
exit()
tar = tarfile.open(path) challenges = {
# QA1 with , samples
'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
# QA2 with , samples
'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
}
challenge_type = 'single_supporting_fact_10k'
challenge = challenges[challenge_type] print('Extracting stories for the challenge:', challenge_type)
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test'))) # get the vocab, which is same as the train
vocab = sorted(reduce(lambda x, y: x | y, (set(story + q + [answer]) for story, q, answer in train_stories + test_stories)))
print('vocab')
print(vocab) # Reserve for masking via pad_sequences
vocab_size = len(vocab) +
story_maxlen = max(map(len, (x for x, _, _ in train_stories + test_stories)))
query_maxlen = max(map(len, (x for _, x, _ in train_stories + test_stories))) print('-')
print('Vocab size:', vocab_size, 'unique words')
print('Story max length:', story_maxlen, 'words')
print('Query max length:', query_maxlen, 'words')
print('Number of training stories:', len(train_stories))
print('Number of test stories:', len(test_stories))
print('-')
print('Here\'s what a "story" tuple looks like (input, query, answer):')
print(train_stories[])
print('-')
print('Vectorizing the word sequences...') word_idx = dict((c, i + ) for i, c in enumerate(vocab))
# get the vec of the target story we input
with open("./target.story") as fs:
target_input = get_stories_target(fs)
target_storys, target_queries = vectorize_stories(target_input, word_idx, story_maxlen, query_maxlen) print('target_storys[0]')
print(target_storys[])
print('target_queries[0]')
print(target_queries[]) print('-')
print('inputs: integer tensor of shape (samples, max_length)')
print('inputs_train shape:', target_storys.shape)
print('-')
print('queries: integer tensor of shape (samples, max_length)')
print('queries_train shape:', target_queries.shape)
print('-')
print('Compiling...') # laod the model
answer_model = load_model('./checkpoint.hdf5')
answer_output = answer_model.predict([target_storys], batch_size=, verbose=)
print(answer_output)
实做的时候发现几问题
. 如果我们用于训练的语料库不够完备,就会造成我的词汇表不够完备,从而直接导致在进行词向量化的时候解析失败,从而后续的归类也无法进行了
. 在递归下降的过程中,我们可能遇到2种情况
) 小型的凹槽
) 真正的谷底(可能很陡峭,是突然下降的那种)
这2种情况造成的现象都可能是在连续几个spoch中,loss不再减少或者降低很微小,这个时候我们要慎重使用learning rate自动调整(实际上是降低,例如常见的10%)策略,因为这有可能导致我们过早地收敛在一个假的小凹槽中,而无法下降都真正的最佳谷底。但是反过来将,如果遇到的是那种很深且突然下降的谷底,我们的learning rate太大可能导致我们始终无法收敛到最佳值,而一直在最佳值左右晃动
Relevant Link:
Copyright (c) 2017 LittleHann All rights reserved
用Keras搞一个阅读理解机器人的更多相关文章
- 第一讲 从头开始做一个web qq 机器人,第一步获取smart qq二维码
新手教程: 前言:最近在看了一下很久很久以前做的qq机器人失效了,最近也在换工作目前还在职,时间很挺宽裕的.就决定从新搞一个web qq机器人 PC的协议解析出来有点费时间以后再做. 准备工作: 编译 ...
- 机器阅读理解(看各类QA模型与花式Attention)
目录 简介 经典模型概述 Model 1: Attentive Reader and Impatient Reader Model 2: Attentive Sum Reader Model 3: S ...
- 机器阅读理解(看各类QA模型与花式Attention)(转载)
目录 简介 经典模型概述 Model 1: Attentive Reader and Impatient Reader Attentive Reader Impatient Reader Model ...
- Tensorflow做阅读理解与完形填空
catalogue . 前言 . 使用的数据集 . 数据预处理 . 训练 . 测试模型运行结果: 进行实际完形填空 0. 前言 开始写这篇文章的时候是晚上12点,突然想到几点新的理解,赶紧记下来.我们 ...
- Trie树【P3879】 [TJOI2010]阅读理解
Description 英语老师留了N篇阅读理解作业,但是每篇英文短文都有很多生词需要查字典,为了节约时间,现在要做个统计,算一算某些生词都在哪几篇短文中出现过. Input 第一行为整数N,表示短文 ...
- Keras 文档阅读笔记(不定期更新)
目录 Keras 文档阅读笔记(不定期更新) 模型 Sequential 模型方法 Model 类(函数式 API) 方法 层 关于 Keras 网络层 核心层 卷积层 池化层 循环层 融合层 高级激 ...
- HTTPS强制安全策略-HSTS协议阅读理解
https://developer.mozilla.org/en-US/docs/Web/Security/HTTP_strict_transport_security [阅读理解式翻译,非严格遵循原 ...
- Codeforces#543 div2 A. Technogoblet of Fire(阅读理解)
题目链接:http://codeforces.com/problemset/problem/1121/A 真·阅读理解 题意就是 有n个人 pi表示他们的强度 si表示他们来自哪个学校 现在Arkad ...
- P3879 [TJOI2010]阅读理解 题解
P3879 [TJOI2010]阅读理解 题解 题目描述 英语老师留了N篇阅读理解作业,但是每篇英文短文都有很多生词需要查字典,为了节约时间,现在要做个统计,算一算某些生词都在哪几篇短文中出现过. 输 ...
随机推荐
- 了解AutoCAD对象层次结构 —— 3 ——数据库
数据库的结构是什么样的?对象是如何存储在数据库中的?这些问题我们需要搞明白.在此我们可以借助工具ArxDbg或MgdDbg来查看数据库结构及其内容.下图就是利用MgdDbg工具查看到的内容,我们可以看 ...
- Spring 使用介绍(五)—— AOP(一)
一.简单使用:Hello World实例 1.定义目标类 public interface Hello { void sayHello(); } public class HelloImpl impl ...
- 【BZOJ4555】【TJOI2016】【HEOI2016】求和 第二类斯特林数 NTT
题目大意 求\(f(n)=\sum_{i=0}^n\sum_{j=0}^i2^j\times j!\times S(i,j)\\\) 对\(998244353\)取模 \(n\leq 100000\) ...
- python学习日记(join,range)
join方法 join方法用于将序列里的字符串以指定的字符串连接成一个新的字符串 s = 'fasfw123' s1 = '-'.join(s) print(s1) str = '&ooooo ...
- Hyper-V
docker与vm同时安装,会导致虚拟机无法启动及网络不通等问题,解决方法分为两步. 1.在网络连接中禁用vEthernet连接. 2.控制面板>程序>程序和功能中 >启用或关闭wi ...
- 全文检索 -- Solr从概念到实战(一)
全文检索: 将整个文本进行“分词”处理,在索引库中为分词得到的每一个词都建立索引,和用户搜索的关键词进行匹配.实现快速查找效果. 传统sql语句实现的局限性: select song_id,song_ ...
- SetCapture() & ReleaseCapture() 捕获窗口外的【松开左键事件】: WM_LBUTTONUP
今天在窗口上绘图的时候,遇到一个问题:在特殊情况下,当用户在窗口中按下鼠标左键,然后移动到窗口外松开鼠标左键,这时程序中只能捕获到 WM_LBUTTONDOWN(按下) 和 WM_MOUSEMOVE( ...
- 第一篇 - bsp抓取python中文开发者社区中的所有高级教程
工具:python3.6 pycharm 库:bs4 + urllib 第一步:读取html源码 from bs4 import BeautifulSoup import urllib.reques ...
- mybatis 插入一条记录 参数为map的写法
xml: <insert id="insertUser" parameterType="java.util.Map"> insert into us ...
- Ubuntu下使用sublime text进行py开发
放弃了,半天弄不出一个方便的开放环境,还是现成的iDE方便 下载方法 https://www.sublimetext.com/docs/3/linux_repositories.html 首先安装Pa ...