Data Formats and Data Loading Options in PDNN
As the saying goes, beggars can't be choosers: if you are using someone else's tool, don't grumble about it; just follow its conventions.
Training and validation data are specified on the command line with arguments of the following form:
--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
--valid-data "valid.pfile,stream=False,random=True"
The part before the first comma (if any) specifies the file name.
Glob-style wildcards may be used to specify multiple files (currently not supported for Kaldi data files).
Data files may also be compressed with gzip or bz2; in that case, the original extension is followed by an additional ".gz" or ".bz2" extension.
After the file name, you may specify any number of data loading options in the format "key=value". The functions of these options are described in the sections below.
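For illustration, such a specification can be thought of as a file name followed by a dictionary of options. A minimal sketch of this interpretation (illustrative Python only, not PDNN's actual parser):

spec = "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
parts = spec.split(',')
filename = parts[0]                                   # "train.pfile"
options = dict(kv.split('=', 1) for kv in parts[1:])  # {'context': '5', ...}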
Supported Data Formats
PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.
PFiles
The PFile is the ICSI feature file archive format. PFiles have the extension ".pfile". A PFile can store multiple sentences, each of which is a sequence of frames.
Each frame is associated with a feature vector and one or more labels. Below is an example of the contents of a PFile:
Sentence ID | Frame ID | Feature Vector                 | Class Label
0           | 0        | [0.2, 0.3, 0.5, 1.4, 1.8, 2.5] | 10
0           | 1        | [1.3, 2.1, 0.3, 0.1, 1.4, 0.9] | 179
1           | 0        | [0.3, 0.5, 0.5, 1.4, 0.8, 1.4] | 32
For speech processing, sentences and frames correspond to utterances and frames in the speech, respectively. Frames are indexed within each sentence.
For other applications, you can use made-up sentence and frame indices.
For example, if you have N instances, you may set all the sentence indices to 0 and let the frame indices run from 0 through N-1.
A standard toolkit for PFiles is pfile_utils-v0_51. If you are running Linux, this script will install it automatically. HTK users can use this Python script to convert HTK features and labels into PFiles. For more information, see the notes above.
Python Pickle Files
Python pickle files have the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file holds exactly one sentence. feature is a 2-D numpy array, where each row is the feature vector of one instance; label is a 1-D numpy array, where each element is the class label of one instance.
To read a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = cPickle.load(f)
To create a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype = 'float32')
label = numpy.array([2, 0, 1])
with gzip.open('filename.pkl.gz', 'wb') as f:
    cPickle.dump((feature, label), f)
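PDNN itself targets Python 2, where cPickle is available. If you are reading the same file from Python 3, cPickle has been merged into the standard pickle module; a minimal equivalent sketch (assuming the file was written by the Python 2 code above):

import pickle, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = pickle.load(f, encoding='latin1')  # 'latin1' decodes Python 2 numpy pickles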
Kaldi Files
Kaldi data files are specified in PDNN by Kaldi script files, which have the extension ".scp". These files contain pointers to the actual feature data, which is stored in Kaldi archive files with the extension ".ark". Each line of a Kaldi script file specifies the name of an utterance (equivalent to a sentence in PFiles) and its offset in the Kaldi archive, as follows:
utt01 train.ark:15213
The labels corresponding to the features are stored in "alignment files" with the extension ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, in which each line specifies the name of an utterance, followed by the label of every frame in that utterance. For example:
utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
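Putting the pieces together, a Kaldi training corpus might be specified like this (the file names are placeholders; as explained below, Kaldi files must be loaded in the stream mode):

--train-data "train.scp,label=train.ali,stream=True,random=True"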
On-the-Fly Context Padding and Label Manipulation
Often we want the feature vector of each frame to include the features of neighboring frames as well. You can, of course, do this when preparing the data files, but this bloats their size. A smarter way is to perform the context padding on the fly; PDNN provides the option "context" for this purpose. Specifying "context=5" pads each frame with 5 frames on either side of it, so that each feature vector becomes 11 times its original dimensionality. Specifying "context=5:1" pads each frame with 5 frames to its left and 1 frame to its right; this may also be written as "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries: at the beginning and end of each sentence, the first and last frames are repeated when the context reaches beyond the sentence boundary.
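Conceptually, context padding of a single sentence behaves like the following numpy sketch (an illustration of the semantics only, not PDNN's actual implementation):

import numpy as np

def pad_context(feats, lcxt, rcxt):
    # feats: (n_frames, dim) feature matrix of ONE sentence
    n = feats.shape[0]
    # repeat the first / last frame at the sentence boundaries
    padded = np.vstack([np.repeat(feats[:1], lcxt, axis=0),
                        feats,
                        np.repeat(feats[-1:], rcxt, axis=0)])
    # concatenate each frame with its lcxt left and rcxt right neighbors
    return np.hstack([padded[i:i + n] for i in range(lcxt + rcxt + 1)])

# a sentence of 3 frames with 2-dimensional features
sent = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype='float32')
print(pad_context(sent, 1, 1).shape)  # (3, 6): lcxt=1, rcxt=1 triples the dimensionality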
Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded; but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
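In numpy terms, ignoring labels amounts to masking frames after context padding, roughly as in this sketch (illustrative only; the variable contents are made up):

import numpy as np
label = np.array([0, 5, 2, 1, 8])                 # example labels
feature = np.random.rand(5, 4).astype('float32')  # example context-padded features
keep = ~np.in1d(label, [0, 2, 7, 8, 9])           # mask for "ignore-label=0:2:7-9"
feature, label = feature[keep], label[keep]       # only the frames labeled 5 and 1 remain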
When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels is separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
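The mapping semantics can be sketched the same way (illustrative only). Because the whole mapping is applied in a single pass over the original labels, no label is mapped twice:

import numpy as np
mapping = {1: 0, 3: 1, 4: 2, 5: 3, 6: 4}       # "map-label=1:0/3:1/4:2/5:3/6:4"
label = np.array([1, 3, 6, 4])
label = np.array([mapping[l] for l in label])  # -> [0, 1, 4, 2]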
Partitions, Streaming and Shuffling
The training / validation corpus may be too large to fit in the CPU or GPU memory. Therefore it is broken down into several levels of units: files, partitions, and minibatches. Such division happens after context padding and label manipulation, and the concept of "sentences" is no longer relevant. As a result, a sentence may be broken up into multiple partitions or minibatches.
Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any point in time, at most one file is held in the CPU memory. This means if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if they can fit in the CPU memory.
A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=True" or "stream=False". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training / validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.
It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, you may specify the option "random=True". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make a minibatch are discarded.
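As a worked example with made-up numbers, with 1,000,000 instances in a partition and a minibatch size of 256:

n_instances = 1000000
batch_size = 256
n_minibatches = n_instances // batch_size  # 3906 full minibatches
discarded = n_instances % batch_size       # the last 64 instances are discarded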