Data Formats and Data Loading Options in PDNN
As the saying goes, beggars can't be choosers: if you are using someone else's tool, don't grumble about it; just follow its conventions.
Training and validation data are specified on the command line with arguments of the following form:
--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
--valid-data "valid.pfile,stream=False,random=True"
The part before the first comma (if any) specifies the file name.
Glob-style wildcards may be used to specify multiple files (currently not supported for Kaldi data files).
Data files may also be compressed with gzip or bz2; in that case, the original extension is followed by an additional ".gz" or ".bz2" extension.
After the file name, you may specify any number of data loading options in the format "key=value". The functions of these options are described in the sections below.
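For illustration, such a specification can be thought of as a file name followed by a dictionary of options. A minimal sketch of this interpretation (illustrative Python only, not PDNN's actual parser):

spec = "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
parts = spec.split(',')
filename = parts[0]                                   # "train.pfile"
options = dict(kv.split('=', 1) for kv in parts[1:])  # {'context': '5', ...}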
Supported Data Formats
PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.
PFiles
The PFile is the ICSI feature file archive format. PFiles have the extension ".pfile". A PFile can store multiple sentences, each of which is a sequence of frames.
Each frame is associated with a feature vector and one or more labels. Below is an example of the contents of a PFile:
Sentence ID | Frame ID | Feature Vector                 | Class Label
0           | 0        | [0.2, 0.3, 0.5, 1.4, 1.8, 2.5] | 10
0           | 1        | [1.3, 2.1, 0.3, 0.1, 1.4, 0.9] | 179
1           | 0        | [0.3, 0.5, 0.5, 1.4, 0.8, 1.4] | 32
For speech processing, sentences and frames correspond to utterances and frames in the speech, respectively. Frames are indexed within each sentence.
For other applications, you can use made-up sentence and frame indices.
For example, if you have N instances, you may set all the sentence indices to 0 and let the frame indices run from 0 through N-1.
A standard toolkit for PFiles is pfile_utils-v0_51. If you are running Linux, this script will install it automatically. HTK users can use this Python script to convert HTK features and labels into PFiles. For more information, see the notes above.
Python Pickle Files
Python pickle files have the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file holds exactly one sentence. feature is a 2-D numpy array, where each row is the feature vector of one instance; label is a 1-D numpy array, where each element is the class label of one instance.
To read a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = cPickle.load(f)
To create a (gzip-compressed) pickle file in Python:
import cPickle, numpy, gzip
feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype = 'float32')
label = numpy.array([2, 0, 1])
with gzip.open('filename.pkl.gz', 'wb') as f:
    cPickle.dump((feature, label), f)
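PDNN itself targets Python 2, where cPickle is available. If you are reading the same file from Python 3, cPickle has been merged into the standard pickle module; a minimal equivalent sketch (assuming the file was written by the Python 2 code above):

import pickle, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = pickle.load(f, encoding='latin1')  # 'latin1' decodes Python 2 numpy pickles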
Kaldi Files
Kaldi data files are specified in PDNN by Kaldi script files, which have the extension ".scp". These files contain pointers to the actual feature data, which is stored in Kaldi archive files with the extension ".ark". Each line of a Kaldi script file specifies the name of an utterance (equivalent to a sentence in PFiles) and its offset in the Kaldi archive, as follows:
utt01 train.ark:15213
The labels corresponding to the features are stored in "alignment files" with the extension ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, in which each line specifies the name of an utterance, followed by the label of every frame in that utterance. For example:
utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
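Putting the pieces together, a Kaldi training corpus might be specified like this (the file names are placeholders; as explained below, Kaldi files must be loaded in the stream mode):

--train-data "train.scp,label=train.ali,stream=True,random=True"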
On-the-Fly Context Padding and Label Manipulation
Often we want the feature vector of each frame to include the features of neighboring frames as well. You can, of course, do this when preparing the data files, but this bloats their size. A smarter way is to perform the context padding on the fly; PDNN provides the option "context" for this purpose. Specifying "context=5" pads each frame with 5 frames on either side of it, so that each feature vector becomes 11 times its original dimensionality. Specifying "context=5:1" pads each frame with 5 frames to its left and 1 frame to its right; this may also be written as "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries: at the beginning and end of each sentence, the first and last frames are repeated when the context reaches beyond the sentence boundary.
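Conceptually, context padding of a single sentence behaves like the following numpy sketch (an illustration of the semantics only, not PDNN's actual implementation):

import numpy as np

def pad_context(feats, lcxt, rcxt):
    # feats: (n_frames, dim) feature matrix of ONE sentence
    n = feats.shape[0]
    # repeat the first / last frame at the sentence boundaries
    padded = np.vstack([np.repeat(feats[:1], lcxt, axis=0),
                        feats,
                        np.repeat(feats[-1:], rcxt, axis=0)])
    # concatenate each frame with its lcxt left and rcxt right neighbors
    return np.hstack([padded[i:i + n] for i in range(lcxt + rcxt + 1)])

# a sentence of 3 frames with 2-dimensional features
sent = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype='float32')
print(pad_context(sent, 1, 1).shape)  # (3, 6): lcxt=1, rcxt=1 triples the dimensionality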
Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded; but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
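In numpy terms, ignoring labels amounts to masking frames after context padding, roughly as in this sketch (illustrative only; the variable contents are made up):

import numpy as np
label = np.array([0, 5, 2, 1, 8])                 # example labels
feature = np.random.rand(5, 4).astype('float32')  # example context-padded features
keep = ~np.in1d(label, [0, 2, 7, 8, 9])           # mask for "ignore-label=0:2:7-9"
feature, label = feature[keep], label[keep]       # only the frames labeled 5 and 1 remain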
When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels is separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
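The mapping semantics can be sketched the same way (illustrative only). Because the whole mapping is applied in a single pass over the original labels, no label is mapped twice:

import numpy as np
mapping = {1: 0, 3: 1, 4: 2, 5: 3, 6: 4}       # "map-label=1:0/3:1/4:2/5:3/6:4"
label = np.array([1, 3, 6, 4])
label = np.array([mapping[l] for l in label])  # -> [0, 1, 4, 2]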
Partitions, Streaming and Shuffling
The training / validation corpus may be too large to fit in the CPU or GPU memory. Therefore it is broken down into several levels of units: files, partitions, and minibatches. Such division happens after context padding and label manipulation, and the concept of "sentences" is no longer relevant. As a result, a sentence may be broken up into multiple partitions or minibatches.
Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any point in time, at most one file is held in the CPU memory. This means if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if they can fit in the CPU memory.
A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=True" or "stream=False". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training / validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.
It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, you may specify the option "random=True". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make a minibatch are discarded.
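As a worked example with made-up numbers, with 1,000,000 instances in a partition and a minibatch size of 256:

n_instances = 1000000
batch_size = 256
n_minibatches = n_instances // batch_size  # 3906 full minibatches
discarded = n_instances % batch_size       # the last 64 instances are discarded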