本教程转载至:TensorFlow练习7: 基于RNN生成古诗词

使用的数据集是全唐诗,首先提供一下数据集的下载链接:https://pan.baidu.com/s/13pNWfffr5HSN79WNb3Y0_w              提取码:koss

RNN不像传统的神经网络-它们的输出输出是固定的,而RNN允许我们输入输出向量序列。RNN是为了对序列数据进行建模而产生的。本帖代码移植自char-rnn,它是基于Torch的洋文模型,稍加修改即可应用于中文。char-rnn使用文本文件做为输入、训练RNN模型,然后使用它生成和训练数据类似的文本。

下边代码有修改,以适应TensorFlow1.4和GPU平台

 #coding=utf-8
import collections
import numpy as np
import tensorflow as tf
import io
import sys
import os
reload(sys)
sys.setdefaultencoding('utf-8')
#-------------------------------数据预处理---------------------------# poetry_file ='poetry.txt' # 诗集
poetrys = []
with io.open(poetry_file, "r", encoding='utf-8',) as f:
for line in f:
# print line
try:
title, content = line.strip().split(':')
content = content.replace(' ','')
if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content:
continue
if len(content) < 5 or len(content) > 79:
continue
content = '[' + content + ']'
poetrys.append(content)
except Exception as e:
pass #按诗的字数排序
poetrys = sorted(poetrys,key=lambda line: len(line))
print(u"唐诗总数: ")
print(len(poetrys))
print(u"测试") # 统计每个字出现次数
all_words = []
for poetry in poetrys:
all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs) # 取前多少个常用字
words = words[:len(words)] + (' ',)
# 每个字映射为一个数字ID
word_num_map = dict(zip(words, range(len(words))))
# 把诗转换为向量形式,参考TensorFlow练习1
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [ list(map(to_num, poetry)) for poetry in poetrys]
#[[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
#[339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
#....] # 每次取64首诗进行训练
batch_size = 64
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
start_index = i * batch_size
end_index = start_index + batch_size batches = poetrys_vector[start_index:end_index]
length = max(map(len,batches))
xdata = np.full((batch_size,length), word_num_map[' '], np.int32)
for row in range(batch_size):
xdata[row,:len(batches[row])] = batches[row]
ydata = np.copy(xdata)
ydata[:,:-1] = xdata[:,1:]
"""
xdata ydata
[6,2,4,6,9] [2,4,6,9,9]
[1,4,2,8,5] [4,2,8,5,5]
"""
x_batches.append(xdata)
y_batches.append(ydata) #---------------------------------------RNN--------------------------------------# input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])
# 定义RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
if model == 'rnn':
cell_fun = tf.nn.rnn_cell.BasicRNNCell
elif model == 'gru':
cell_fun = tf.nn.rnn_cell.GRUCell
elif model == 'lstm':
cell_fun = tf.nn.rnn_cell.BasicLSTMCell cell = cell_fun(rnn_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True) initial_state = cell.zero_state(batch_size, tf.float32) with tf.variable_scope('rnnlm'):
softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words)+1])
softmax_b = tf.get_variable("softmax_b", [len(words)+1])
with tf.device("/gpu:0"):
embedding = tf.get_variable("embedding", [len(words)+1, rnn_size])
inputs = tf.nn.embedding_lookup(embedding, input_data) outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
output = tf.reshape(outputs,[-1, rnn_size]) logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
return logits, last_state, probs, cell, initial_state ckpt_dir="./ckpt_dir"
if not os.path.exists(ckpt_dir):
os.makedirs(ckpt_dir) #训练
def train_neural_network():
logits, last_state, _, _, _ = neural_network()
targets = tf.reshape(output_targets, [-1])
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [targets], [tf.ones_like(targets, dtype=tf.float32)], len(words))
cost = tf.reduce_mean(loss)
learning_rate = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, tvars)) with tf.Session() as sess:
sess.run(tf.initialize_all_variables()) saver = tf.train.Saver(tf.all_variables()) for epoch in range(295):
sess.run(tf.assign(learning_rate, 0.002 * (0.97 ** epoch)))
n = 0
for batche in range(n_chunk):
train_loss, _ , _ = sess.run([cost, last_state, train_op], feed_dict={input_data: x_batches[n], output_targets: y_batches[n]})
n += 1
print(epoch, batche, train_loss)
if epoch % 7 == 0:
saver.save(sess, ckpt_dir+'/poetry.module', global_step=epoch) train_neural_network()

这里我只说自己对bug调试和调优的一些想法,具体代码理解,请联系作者本人。

首先是#coding=utf-8的问题,这里是告诉python环境,当前python脚本的文字编码是utf-8,这里如果不调整的话,默认的ansii环境极有可能报告编码错误。

之后是数据集的utf-8编码问题,这里在encoding的时候,用了utf-8 的选项,但是却没有告诉python环境,字符集编码是utf-8,会导致每次解析到的content和title都会报错,最终处理完的数据集大小为0,设置sys的默认编码可以解决。

同时,默认的open函数没有encoding选项,这个是在io.open中的选项,这个地方需要修改。

还有一点是一些接口使用问题,比如saver.save现在需要一个parent directory

之后是预测的代码

 #coding=utf-8
import collections
import numpy as np
import tensorflow as tf
import io
import sys
import os
import pdb
import time
reload(sys)
sys.setdefaultencoding('utf-8')
#-------------------------------数据预处理---------------------------# poetry_file ='poetry.txt' # 诗集
poetrys = []
with io.open(poetry_file, "r", encoding='utf-8',) as f:
for line in f:
try:
title, content = line.strip().split(':')
content = content.replace(' ','')
if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content:
continue
if len(content) < 5 or len(content) > 79:
continue
content = '[' + content + ']'
poetrys.append(content)
except Exception as e:
pass # 按诗的字数排序
poetrys = sorted(poetrys,key=lambda line: len(line))
print(u'唐诗总数: ', len(poetrys)) # 统计每个字出现次数
all_words = []
for poetry in poetrys:
all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs) # 取前多少个常用字
words = words[:len(words)] + (' ',)
# 每个字映射为一个数字ID
word_num_map = dict(zip(words, range(len(words))))
# 把诗转换为向量形式
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [ list(map(to_num, poetry)) for poetry in poetrys]
#[[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
#[339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
#....] batch_size = 1
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
start_index = i * batch_size
end_index = start_index + batch_size batches = poetrys_vector[start_index:end_index]
length = max(map(len,batches))
xdata = np.full((batch_size,length), word_num_map[' '], np.int32)
for row in range(batch_size):
xdata[row,:len(batches[row])] = batches[row]
ydata = np.copy(xdata)
ydata[:,:-1] = xdata[:,1:]
"""
xdata ydata
[6,2,4,6,9] [2,4,6,9,9]
[1,4,2,8,5] [4,2,8,5,5]
"""
x_batches.append(xdata)
y_batches.append(ydata) #---------------------------------------RNN--------------------------------------# input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])
# 定义RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
if model == 'rnn':
cell_fun = tf.nn.rnn_cell.BasicRNNCell
elif model == 'gru':
cell_fun = tf.nn.rnn_cell.GRUCell
elif model == 'lstm':
cell_fun = tf.nn.rnn_cell.BasicLSTMCell cell = cell_fun(rnn_size, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True) initial_state = cell.zero_state(batch_size, tf.float32) with tf.variable_scope('rnnlm'):
softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words)+1])
softmax_b = tf.get_variable("softmax_b", [len(words)+1])
with tf.device("/gpu:0"):
embedding = tf.get_variable("embedding", [len(words)+1, rnn_size])
inputs = tf.nn.embedding_lookup(embedding, input_data) outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
output = tf.reshape(outputs,[-1, rnn_size]) logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
return logits, last_state, probs, cell, initial_state #-------------------------------生成古诗---------------------------------#
# 使用训练完成的模型 def gen_poetry():
def to_word(weights):
t = np.cumsum(weights)
s = np.sum(weights)
sample = int(np.searchsorted(t, np.random.rand(1)*s))
return words[sample] _, last_state, probs, cell, initial_state = neural_network() with tf.Session() as sess:
sess.run(tf.initialize_all_variables()) saver = tf.train.Saver(tf.all_variables())
saver.restore(sess, './ckpt_dir/poetry.module-294') state_ = sess.run(cell.zero_state(1, tf.float32)) x = np.array([list(map(word_num_map.get, '['))])
[probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
word = to_word(probs_)
#word = words[np.argmax(probs_)]
poem = ''
while word != ']':
poem += word
x = np.zeros((1,1))
x[0,0] = word_num_map[word]
[probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
word = to_word(probs_)
#word = words[np.argmax(probs_)]
return poem def gen_poetry_with_head(head):
def to_word(weights):
t = np.cumsum(weights)
s = np.sum(weights)
sample = int(np.searchsorted(t, np.random.rand(1)*s))
return words[sample] _, last_state, probs, cell, initial_state = neural_network() with tf.Session() as sess:
sess.run(tf.initialize_all_variables()) saver = tf.train.Saver(tf.all_variables())
saver.restore(sess, './ckpt_dir/poetry.module-294') state_ = sess.run(cell.zero_state(1, tf.float32))
poem = ''
i = 0
# print head
# pdb.set_trace()
for word in head:
while word != ',' and word != '。':
poem += word
# print poem
# print head
# print word
x = np.array([list(map(word_num_map.get, word))])
[probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
word = to_word(probs_)
time.sleep(1)
if i % 2 == 0:
poem += ','
else:
poem += '。'
i += 1
return poem print(gen_poetry())
# print(gen_poetry_with_head(u'一二三四'))

这个藏头诗的代码用法有问题,不建议使用,我调了很久才调好,这次还是先列原作者的代码,下次单独说这块的调整和调优问题。

结果:

有那么点意思,但仔细看问题还是很大,胡言乱语,模型的调优远远不行。

TensorFlow教程使用RNN生成唐诗的更多相关文章

  1. Tensorflow生成唐诗和歌词(下)

    整个工程使用的是Windows版pyCharm和tensorflow. 源码地址:https://github.com/Irvinglove/tensorflow_poems/tree/master ...

  2. Tensorflow生成唐诗和歌词(上)

    整个工程使用的是Windows版pyCharm和tensorflow. 源码地址:https://github.com/Irvinglove/tensorflow_poems/tree/master ...

  3. TensorFlow练习7: 基于RNN生成古诗词

      http://blog.topspeedsnail.com/archives/10542 主题 TensorFlow RNN不像传统的神经网络-它们的输出输出是固定的,而RNN允许我们输入输出向量 ...

  4. MVC5+EF6 入门完整教程13 -- 动态生成多级菜单

    稍微有一定复杂性的系统,多级菜单都是一个必备组件. 本篇专题讲述如何生成动态多级菜单的通用做法. 我们不用任何第三方的组件,完全自己构建灵活通用的多级菜单. 需要达成的效果:容易复用,可以根据mode ...

  5. Pytorch基础——使用 RNN 生成简单序列

    一.介绍 内容 使用 RNN 进行序列预测 今天我们就从一个基本的使用 RNN 生成简单序列的例子中,来窥探神经网络生成符号序列的秘密. 我们首先让神经网络模型学习形如 0^n 1^n 形式的上下文无 ...

  6. Pytorch系列教程-使用字符级RNN生成姓名

    前言 本系列教程为pytorch官网文档翻译.本文对应官网地址:https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutor ...

  7. tensorflow教程:tf.contrib.rnn.DropoutWrapper

    tf.contrib.rnn.DropoutWrapper Defined in tensorflow/python/ops/rnn_cell_impl.py. def __init__(self, ...

  8. Tensorflow学习教程------tfrecords数据格式生成与读取

    首先是生成tfrecords格式的数据,具体代码如下: #coding:utf-8 import os import tensorflow as tf from PIL import Image cw ...

  9. Tensorflow-3-使用RNN生成中文小说

    https://blog.csdn.net/heisejiuhuche/article/details/73010638 这篇文章不涉及RNN的基本原理,只是从选择数据集开始,到最后生成文本,展示一个 ...

随机推荐

  1. vlc命令行: 转码 流化 推流

    vlc命令行: 转码 流化 推流 写在命令行之前的话: VLC不仅仅可以通过界面进行播放,转码,流化,也可以通过命令行进行播放,转码和流化.还可以利用里面的SDK进行二次开发. vlc命令行使用方法: ...

  2. 通过Xshell 执行压缩.zip文件出现unzip: command not found错误的解决办法

    利用unzip命令解压缩的时候,出现 -bash: unzip: command not found 的错误. unzip——命令没有找到,其原因肯定是没有安装unzip. 利用一句命令就可以解决了. ...

  3. Python3的编译安装

    Linux环境自带了Python 2.x版本,但是如果要更新到3.x的版本,可以在Python的官方网站下载Python的源代码并通过源代码构建安装的方式进行安装,具体的步骤如下所示. 1. 安装依赖 ...

  4. 【Linux】CentOS7.0安装jdk

    JDK安装 设置hostname [root@bigdata111 ~]# vi /etc/hostname 设置机器hosts [root@bigdata111 ~]# vi /etc/hosts ...

  5. Elasticsearch全文检索引擎。什么是elasticsearch? 有什么特点? 怎么使用?

    什么是ElasticSearch? Elasticsearch是一个基于Lucene的搜索引擎.它提供了具有HTTPWeb界面和无架构JSON文档的分布式,多租户能力的全文搜索引擎.Elasticse ...

  6. ubantu使用小结

    一.root账户问题 1.初始登录的时候root密码是随机的,自己改一个. 2.登录界面没有root选项 解决: #gedit /usr/share/lightdm/lightdm.conf.d/50 ...

  7. 【并行计算-CUDA开发】CUDA线程、线程块、线程束、流多处理器、流处理器、网格概念的深入理解

    GPU的硬件结构,也不是具体的硬件结构,就是与CUDA相关的几个概念:thread,block,grid,warp,sp,sm. sp: 最基本的处理单元,streaming processor  最 ...

  8. pmap 命令

    NAME pmap - report memory map of a process SYNOPSIS pmap [ -x | -d ] [ -q ] pids... pmap -V 常用参数: -x ...

  9. spring_mvc入门项目的小总结

    1.先搭建一个maven的web项目 ,然后把文件夹完善一下,创建一个java的文件夹和resource的问件夹,并指定他们各自的功能. 导入pom.xml文件的依赖 <properties&g ...

  10. [Agc036D]Do Not Duplicate_链表_贪心_数论

    Do Not Duplicate 题目链接:https://atcoder.jp/contests/agc036/tasks/agc036_b 题解: 首先最后肯定至多只有$n$个数. 我们想处理出来 ...