Data Formats and Data Loading Options in PDNN
A note up front: since we are using someone else's software, let's not complain and simply follow its conventions.
The training and validation data are specified on the command line as variables in the following way:
--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
--valid-data "valid.pfile,stream=False,random=True"
The part before the first comma (if any) specifies the file name.
Glob-style wildcards may be used to specify multiple files (currently not supported for Kaldi data files).
Data files may also be compressed with gzip or bz2, in which case the original extension is followed by an additional ".gz" or ".bz2".
After the file name, you may specify any number of data loading options in the format "key=value". The functions of these options are described in the sections below.
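As a quick illustration of the "key=value" syntax, a specification string could be split as sketched below. parse_data_spec is a hypothetical helper for illustration only, not part of PDNN's actual API:

```python
# Hypothetical sketch: split a data specification string into the file
# name and a dict of loading options. Not PDNN's actual parser.
def parse_data_spec(spec):
    parts = spec.split(',')
    filename = parts[0]
    options = {}
    for part in parts[1:]:
        key, _, value = part.partition('=')
        options[key] = value
    return filename, options

name, opts = parse_data_spec("train.pfile,context=5,partition=1000m")
# name -> "train.pfile"; opts -> {"context": "5", "partition": "1000m"}
```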
Supported Data Formats
PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.
PFiles
A PFile is an ICSI feature file archive, with the extension ".pfile". A PFile can store multiple sentences, each of which is a sequence of frames.
Each frame is associated with a feature vector and one or more class labels. Below are the contents of an example PFile:
Sentence ID | Frame ID | Feature Vector | Class Label |
0 | 0 | [0.2, 0.3, 0.5, 1.4, 1.8, 2.5] | 10 |
0 | 1 | [1.3, 2.1, 0.3, 0.1, 1.4, 0.9] | 179 |
1 | 0 | [0.3, 0.5, 0.5, 1.4, 0.8, 1.4] | 32 |
For speech processing, sentences and frames correspond to utterances and frames, respectively. Frames are indexed within each sentence.
For other applications, fake sentence and frame indices can be used.
For example, if you have N instances, you may set all the sentence indices to 0, and let the frame indices run from 0 through N-1.
A standard toolkit for PFiles is pfile_utils-v0_51. This script will install it automatically if you are running on Linux. HTK users can use this Python script to convert HTK features and labels into PFiles. Refer to the comments above for more information.
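With the fake-index convention above, the contents of a PFile can be pictured as parallel arrays. This is a sketch for illustration only, not the actual binary PFile layout:

```python
import numpy

# With N standalone instances (no real sentence structure), all
# sentence indices are 0 and the frame indices run from 0 to N-1.
N = 3
sentence_ids = numpy.zeros(N, dtype=numpy.int32)
frame_ids = numpy.arange(N, dtype=numpy.int32)
features = numpy.random.rand(N, 6).astype(numpy.float32)  # one 6-dim vector per frame
labels = numpy.array([10, 179, 32], dtype=numpy.int32)    # one class label per frame
```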
Python Pickle Files
Python pickle files carry the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file stores exactly one sentence. feature is a 2-D numpy array in which each row is the feature vector of one instance; label is a 1-D numpy array in which each element is the class label of one instance.
To read a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = cPickle.load(f)
To create a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype = 'float32')
label = numpy.array([2, 0, 1])
with gzip.open('filename.pkl.gz', 'wb') as f:
    cPickle.dump((feature, label), f)
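Under Python 3 (an assumption beyond this Python 2 era text), cPickle was merged into the standard pickle module, and the same round trip can be sketched as:

```python
import gzip
import pickle
import numpy

# Write and read back a (feature, label) tuple with Python 3's pickle.
feature = numpy.array([[0.2, 0.3], [1.3, 2.1]], dtype='float32')
label = numpy.array([0, 1])
with gzip.open('example.pkl.gz', 'wb') as f:
    pickle.dump((feature, label), f)
with gzip.open('example.pkl.gz', 'rb') as f:
    feature2, label2 = pickle.load(f)
```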
Kaldi Files
The Kaldi data files consumed by PDNN are Kaldi script files, with the extension ".scp". These files contain pointers to the actual feature data, which are stored in Kaldi archive files with the extension ".ark". Each line of a Kaldi script file specifies the name of an utterance (equivalent to a sentence in PFiles), and its offset in a Kaldi archive file, as follows:
utt01 train.ark:15213
The labels corresponding to the features are stored in "alignment files" with the extension ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, in which each line specifies the name of an utterance, followed by the label of each frame in that utterance. For example:
utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
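A line in this format could be parsed as sketched below; parse_ali_line is a hypothetical helper, not a function from PDNN or Kaldi:

```python
# Hypothetical sketch: the utterance name comes first, followed by one
# integer class label per frame.
def parse_ali_line(line):
    fields = line.split()
    return fields[0], [int(x) for x in fields[1:]]

utt, frame_labels = parse_ali_line("utt01 0 51 51 48")
```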
On-the-Fly Context Padding and Label Manipulation
Often we want to include the features of neighboring frames in the feature vector of the current frame. You can, of course, do this when preparing the data files, but that makes the files much larger. A smarter way is to perform the context padding on the fly; PDNN provides the option "context" for this purpose. Specifying "context=5" pads each frame with the 5 frames on either side, so that the feature vector becomes 11 times its original dimensionality. Specifying "context=5:1" pads each frame with 5 frames on the left and 1 frame on the right; alternatively, you may specify "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries. At the beginning and end of each sentence, where the context reaches beyond the sentence boundary, the first or last frame is repeated.
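The padding rule just described (windows that repeat the first or last frame at sentence boundaries) can be sketched as follows. pad_context is a hypothetical illustration, not PDNN's actual implementation:

```python
import numpy

# Hypothetical sketch of on-the-fly context padding for one sentence.
def pad_context(frames, lcxt, rcxt):
    """frames: (T, D) array -> (T, D * (lcxt + 1 + rcxt)) array."""
    T = frames.shape[0]
    padded = []
    for t in range(T):
        # Clamp out-of-range indices so boundary frames are repeated.
        window = [frames[min(max(t + off, 0), T - 1)]
                  for off in range(-lcxt, rcxt + 1)]
        padded.append(numpy.concatenate(window))
    return numpy.stack(padded)

x = numpy.arange(6, dtype=numpy.float32).reshape(3, 2)  # 3 frames, 2-dim features
y = pad_context(x, lcxt=1, rcxt=1)                      # each frame becomes 6-dim
```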
Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded; but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
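The "ignore-label" syntax (colon-separated labels, dashes for contiguous ranges) could be expanded into a set as sketched below; parse_label_set is a hypothetical helper:

```python
# Hypothetical sketch: expand a spec like "0:2:7-9" into a label set.
def parse_label_set(spec):
    labels = set()
    for entry in spec.split(':'):
        if '-' in entry:
            lo, hi = entry.split('-')
            labels.update(range(int(lo), int(hi) + 1))
        else:
            labels.add(int(entry))
    return labels

ignored = parse_label_set("0:2:7-9")
```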
When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels are separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
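The "map-label" syntax could be expanded similarly; note that the sketch applies all pairs in a single pass, matching the simultaneous-mapping rule above (parse_label_map is a hypothetical helper):

```python
# Hypothetical sketch: expand "1:0/3:1/4-6:2" into a mapping dict and
# apply it in one simultaneous pass over the labels.
def parse_label_map(spec):
    mapping = {}
    for pair in spec.split('/'):
        src, dst = pair.split(':')
        if '-' in src:
            lo, hi = src.split('-')
            for s in range(int(lo), int(hi) + 1):
                mapping[s] = int(dst)
        else:
            mapping[int(src)] = int(dst)
    return mapping

m = parse_label_map("1:0/3:1/4-6:2")
remapped = [m[label] for label in [1, 3, 4, 5, 6]]
```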
Partitions, Streaming and Shuffling
The training / validation corpus may be too large to fit in the CPU or GPU memory. Therefore it is broken down into several levels of units: files, partitions, and minibatches. Such division happens after context padding and label manipulation, so the concept of "sentences" is no longer relevant. As a result, a sentence may be broken up across partition or minibatch boundaries.
Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any point in time, at most one file is held in the CPU memory. This means if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if they can fit in the CPU memory.
A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
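The size arithmetic can be sketched as below; partition_bytes is a hypothetical helper, with the 600 MB default and optional "m" suffix following the description above:

```python
# Hypothetical sketch: convert a "partition" option value to bytes.
def partition_bytes(value="600m"):
    value = value.rstrip('m')          # the "m" suffix is optional
    return int(value) * (2 ** 20)      # megabytes, i.e. 2^20 bytes
```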
Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=True" or "stream=False". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training / validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.
It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, you may specify the option "random=True". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
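Whatever the unit being shuffled (a partition or a whole file), the features and labels must be permuted with the same index order. A minimal sketch, not PDNN's actual code:

```python
import numpy

# Hypothetical sketch: shuffle instances held in CPU memory in unison.
def shuffle_in_unison(features, labels, seed=0):
    rng = numpy.random.RandomState(seed)
    perm = rng.permutation(len(labels))
    return features[perm], labels[perm]

feats = numpy.arange(8).reshape(4, 2)   # rows [0,1], [2,3], [4,5], [6,7]
labs = numpy.array([0, 1, 2, 3])
sf, sl = shuffle_in_unison(feats, labs)
```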
A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make a minibatch are discarded.
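The remainder rule can be sketched with simple integer arithmetic; split_minibatches is a hypothetical illustration:

```python
# Hypothetical sketch: a partition is consumed in whole minibatches,
# and leftover instances that cannot fill a minibatch are discarded.
def split_minibatches(n_instances, batch_size):
    n_batches = n_instances // batch_size
    n_discarded = n_instances - n_batches * batch_size
    return n_batches, n_discarded
```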