Packing variable-length sequence data (text) into TFRecord files with TensorFlow

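In a lab setting, the whole dataset is usually loaded into memory in one go and then split by a hand-written mini-batch function. With very large datasets this approach breaks down: 1) there is not enough memory to load all the data at once; 2) data splitting and model training cannot run asynchronously, so training is easily blocked by the time spent cutting mini-batches; 3) it cannot be deployed to a distributed environment.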

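The code below uses the TFRecord file format and supports variable-length sequences with dynamic padding, which covers most of what sequence tasks such as NLP need. It is written against the TensorFlow 1.x API (tf.Session, tf.python_io).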
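To make "dynamic padding" concrete, here is a minimal pure-Python sketch of the idea: each batch is padded only to the length of its own longest sequence, not to a global maximum. The helper names `pad_batch` and `padded_batches` are hypothetical and are not part of TensorFlow. They only imitate what `padded_batch` does with a padded shape of `[None]` and the default padding value 0.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def padded_batches(sequences, batch_size, pad_value=0):
    """Yield consecutive batches, each padded independently of the others."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size], pad_value)


if __name__ == "__main__":
    toy = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5, 5]]
    for batch in padded_batches(toy, batch_size=2):
        print(batch)
    # [[1, 0], [2, 2]]
    # [[3, 3, 3, 0], [4, 4, 4, 4]]
    # [[5, 5, 5, 5, 5]]
```

Because the padded width is decided per batch, short sequences that happen to be batched together waste very little space.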

import tensorflow as tf  # written against the TensorFlow 1.x API

def generate_tfrecords(tfrecord_filename):
    """Write a few toy variable-length sequences to a TFRecord file."""
    sequences = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5, 5],
                 [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]]
    labels = [1, 2, 3, 4, 5, 1, 2, 3, 4]

    with tf.python_io.TFRecordWriter(tfrecord_filename) as f:
        for feature, label in zip(sequences, labels):
            # One Feature per time step; the FeatureList holds the whole sequence.
            frame_feature = [
                tf.train.Feature(int64_list=tf.train.Int64List(value=[token_id]))
                for token_id in feature
            ]
            example = tf.train.SequenceExample(
                # Fixed-size, per-example data goes in `context`.
                context=tf.train.Features(feature={
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))}),
                # Variable-length data goes in `feature_lists`.
                feature_lists=tf.train.FeatureLists(feature_list={
                    'sequence': tf.train.FeatureList(feature=frame_feature)
                })
            )
            f.write(example.SerializeToString())

def single_example_parser(serialized_example):
    """Parse one serialized SequenceExample into (sequence, label)."""
    # Context features have a fixed shape; the label is a scalar.
    context_features = {
        "label": tf.FixedLenFeature([], dtype=tf.int64)
    }
    # Each step of the sequence is a scalar; the number of steps may vary.
    sequence_features = {
        "sequence": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }

    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )

    labels = context_parsed['label']
    sequences = sequence_parsed['sequence']
    return sequences, labels

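def batched_data(tfrecord_filename, single_example_parser, batch_size, padded_shapes, num_epochs=1, buffer_size=1000):
    """Build the input pipeline and return the tensors for the next batch."""
    # shuffle is applied before padded_batch so that individual examples,
    # rather than whole batches, are shuffled.
    dataset = tf.data.TFRecordDataset(tfrecord_filename)\
        .map(single_example_parser)\
        .shuffle(buffer_size)\
        .padded_batch(batch_size, padded_shapes=padded_shapes)\
        .repeat(num_epochs)
    return dataset.make_one_shot_iterator().get_next()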

if __name__ == "__main__":
    def model(features, labels):
        # Placeholder model: returns its inputs so the batches can be inspected.
        return features, labels

    tfrecord_filename = 'test.tfrecord'
    generate_tfrecords(tfrecord_filename)

    # batch_size=2; sequences are padded to the longest one in each batch ([None]),
    # labels are scalars ([]).
    out = model(*batched_data(tfrecord_filename, single_example_parser, 2, ([None], [])))

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        init_op = tf.group(tf.global_variables_initializer(),
                           tf.local_variables_initializer())
        sess.run(init_op)
        # A tf.data iterator does not rely on queue runners, so no Coordinator
        # or start_queue_runners call is needed here.
        try:
            while True:
                print(sess.run(out))
        except tf.errors.OutOfRangeError:
            # Raised once the dataset has been consumed num_epochs times.
            print("done training")
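One detail worth checking in a pipeline like `batched_data` is where `shuffle` sits relative to `padded_batch`. Shuffling after batching only reorders whole batches, so the same examples always end up together. Shuffling before batching mixes individual examples. The pure-Python sketch below illustrates the difference. The function names are hypothetical, and it uses a full shuffle, whereas `tf.data` shuffles within a buffer of `buffer_size` elements.

```python
import random


def batch(items, batch_size):
    """Split items into consecutive batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def batch_then_shuffle(items, batch_size, rng):
    """Shuffle whole batches: batch membership never changes."""
    batches = batch(items, batch_size)
    rng.shuffle(batches)
    return batches


def shuffle_then_batch(items, batch_size, rng):
    """Shuffle individual examples first, then group them into batches."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return batch(shuffled, batch_size)


if __name__ == "__main__":
    examples = list(range(1, 10))  # stand-ins for the nine toy records
    rng = random.Random(0)
    print(batch_then_shuffle(examples, 2, rng))
    print(shuffle_then_batch(examples, 2, rng))
```

With `batch_then_shuffle`, examples 1 and 2 are always in the same batch in every epoch. With `shuffle_then_batch`, the pairing changes each time the data is reshuffled.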
