The MNIST Dataset and the IDX File Format

The MNIST Dataset

The MNIST dataset was compiled by Yann LeCun and his collaborators.

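NIST is short for the National Institute of Standards and Technology, a United States agency. NIST compiled two datasets, Special Database 3 (SD-3) and Special Database 1 (SD-1). SD-3 was collected from Census Bureau employees and SD-1 from high-school students; SD-3 is cleaner and easier to recognize (the Census Bureau employees wrote more carefully than the students). Yann LeCun merged the two: 30,000 samples from each dataset make up the 60,000 training examples, and 5,000 samples from each make up the 10,000 test examples. Preparing the data included resizing the images and adjusting the position of the digits within them.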

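As early as 1998, Yann LeCun tried a wide range of machine learning methods on MNIST, pushing the error rate down to 0.52%. He also applied convolutional neural networks to this dataset (convolutional networks had been tried long before; they only took off about a decade later, riding the wave of deep neural networks).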

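The official MNIST home page lists the results of many classifiers along with the corresponding papers. MNIST is excellent material for learning machine learning and deep learning, and the home page itself is worth reading.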

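MNIST uses its own data format. The format is so simple that it hardly needs a dedicated library to read it. It stores multidimensional arrays and is called IDX: a file holding a 3-dimensional array is labelled idx3, and a file holding a 1-dimensional array is labelled idx1. An IDX file is laid out as follows: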

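The first 2 bytes are always 0x0000; they can be read as the version number of the format.
The next byte gives the data type of each array element (so at most 256 types can be represented); it corresponds to a.dtype.
The next byte gives the number of dimensions of the array (so an array can have at most 255 dimensions); it corresponds to len(a.shape).
Next come that many 4-byte big-endian integers, one per dimension, giving the length of each dimension; they correspond to a.shape.
Last comes the data. Its element type is already known from the header, so the size of each element is fixed. If the number of elements matches the product of the dimension lengths, the file was parsed correctly; otherwise the file is corrupted.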
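The header layout described above can be sketched with the struct module. This is a minimal illustration and not part of the original script: the function name parse_idx_header and the hand-built sample buffer are assumptions made for the example.

```python
import struct


def parse_idx_header(buf):
    """Parse an IDX header and return (type_code, shape, data_offset).

    Hypothetical helper, written for illustration only.
    """
    # Bytes 0-1 are always zero, byte 2 is the element type code,
    # byte 3 is the number of dimensions.
    zeros, type_code, ndim = struct.unpack_from('>HBB', buf, 0)
    if zeros != 0:
        raise ValueError('not an IDX file: the first two bytes must be 0')
    # Each dimension length is a 4-byte big-endian integer.
    shape = struct.unpack_from('>%di' % ndim, buf, 4)
    # The data section starts right after the dimension lengths.
    data_offset = 4 + 4 * ndim
    return type_code, shape, data_offset


# A hand-built header shaped like an MNIST image file:
# type 0x08 (unsigned byte), 3 dimensions, 60000 x 28 x 28.
sample = struct.pack('>HBBiii', 0, 0x08, 3, 60000, 28, 28)
type_code, shape, data_offset = parse_idx_header(sample)
print(hex(type_code), shape, data_offset)
```

For this sample the data section starts at byte 16, which is where the first pixel sits in the MNIST image files.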

The type byte takes the following values:

0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)
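One way to use these type codes is to map each one to a struct format character. That gives the size of a single element, and from it the length the data section should have. The sketch below is only an illustration: the names IDX_TYPE_FORMATS, expected_data_size and decode_data are made up for this example, and treating multi-byte elements as big-endian is an assumption based on the header being stored MSB first.

```python
import struct

# Type code -> struct format character (hypothetical table for illustration).
IDX_TYPE_FORMATS = {
    0x08: 'B',  # unsigned byte
    0x09: 'b',  # signed byte
    0x0B: 'h',  # short (2 bytes)
    0x0C: 'i',  # int (4 bytes)
    0x0D: 'f',  # float (4 bytes)
    0x0E: 'd',  # double (8 bytes)
}


def expected_data_size(type_code, shape):
    """Return the number of bytes the data section should occupy."""
    # '>' selects big-endian with standard sizes, so 'h' is always 2 bytes, etc.
    item_size = struct.calcsize('>' + IDX_TYPE_FORMATS[type_code])
    count = 1
    for dim in shape:
        count *= dim
    return item_size * count


def decode_data(type_code, shape, data):
    """Decode the data section into a flat tuple, checking its length first."""
    if len(data) != expected_data_size(type_code, shape):
        raise ValueError('data length does not match the header; the file may be corrupted')
    fmt_char = IDX_TYPE_FORMATS[type_code]
    count = len(data) // struct.calcsize('>' + fmt_char)
    return struct.unpack('>%d%s' % (count, fmt_char), data)


# Six shorts arranged as a 2 x 3 array.
payload = struct.pack('>6h', 1, 2, 3, -4, 5, 6)
print(expected_data_size(0x0B, (2, 3)))
print(decode_data(0x0B, (2, 3), payload))
```

The length check is the integrity test mentioned earlier: a data section that is too short or too long for the declared shape is rejected before any element is decoded.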

Below is code that parses the MNIST data. It uses Python's struct module, which makes reading raw bytes very convenient and is well worth learning.

# encoding: utf-8
"""
@author: monitor1379
@contact: yy4f5da2@hotmail.com
@site: www.monitor1379.com
@version: 1.0
@license: Apache Licence
@file: mnist_decoder.py
@time: 2016/8/16 20:03

Decodes the MNIST handwritten-digit data files into arrays that can be displayed as images.
The dataset can be downloaded from http://yann.lecun.com/exdb/mnist.
See the official site and the code comments for details of the format.

========================
Parsing rules for the IDX file format:
========================
THE IDX FILE FORMAT

the IDX file format is a simple format for vectors and multidimensional matrices of various numerical types.
The basic format is

magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data

The magic number is an integer (MSB first). The first 2 bytes are always 0.
The third byte codes the type of the data:
0x08: unsigned byte
0x09: signed byte
0x0B: short (2 bytes)
0x0C: int (4 bytes)
0x0D: float (4 bytes)
0x0E: double (8 bytes)

The 4-th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices....
The sizes in each dimension are 4-byte integers (MSB first, high endian, like in most non-Intel processors).
The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
"""

import struct

import matplotlib.pyplot as plt
import numpy as np

# Training set image file
train_images_idx3_ubyte_file = '../../data/mnist/bin/train-images.idx3-ubyte'
# Training set label file
train_labels_idx1_ubyte_file = '../../data/mnist/bin/train-labels.idx1-ubyte'
# Test set image file
test_images_idx3_ubyte_file = '../../data/mnist/bin/t10k-images.idx3-ubyte'
# Test set label file
test_labels_idx1_ubyte_file = '../../data/mnist/bin/t10k-labels.idx1-ubyte'

def decode_idx3_ubyte(idx3_ubyte_file):
    """
    Generic parser for idx3 files.
    :param idx3_ubyte_file: path to the idx3 file
    :return: the decoded dataset
    """
    # Read the whole file as binary data
    with open(idx3_ubyte_file, 'rb') as f:
        bin_data = f.read()
    # Parse the header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of images: %d, image size: %d*%d' % (magic_number, num_images, num_rows, num_cols))
    # Parse the image data, one image per iteration
    image_size = num_rows * num_cols
    offset += struct.calcsize(fmt_header)
    fmt_image = '>' + str(image_size) + 'B'
    images = np.empty((num_images, num_rows, num_cols))
    for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print('parsed %d images' % (i + 1))
        images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape((num_rows, num_cols))
        offset += struct.calcsize(fmt_image)
    return images

def decode_idx1_ubyte(idx1_ubyte_file):
    """
    Generic parser for idx1 files.
    :param idx1_ubyte_file: path to the idx1 file
    :return: the decoded dataset
    """
    # Read the whole file as binary data
    with open(idx1_ubyte_file, 'rb') as f:
        bin_data = f.read()
    # Parse the header: magic number and number of labels
    offset = 0
    fmt_header = '>ii'
    magic_number, num_labels = struct.unpack_from(fmt_header, bin_data, offset)
    print('magic number: %d, number of labels: %d' % (magic_number, num_labels))
    # Parse the label data, one label per iteration
    offset += struct.calcsize(fmt_header)
    fmt_label = '>B'
    labels = np.empty(num_labels)
    for i in range(num_labels):
        if (i + 1) % 10000 == 0:
            print('parsed %d labels' % (i + 1))
        labels[i] = struct.unpack_from(fmt_label, bin_data, offset)[0]
        offset += struct.calcsize(fmt_label)
    return labels

def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
    """
    TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  60000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n, rows, cols), where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)

def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
    """
    TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  60000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The labels values are 0 to 9.

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n,), where n is the number of labels
    """
    return decode_idx1_ubyte(idx_ubyte_file)

def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
    """
    TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000803(2051) magic number
    0004     32 bit integer  10000            number of images
    0008     32 bit integer  28               number of rows
    0012     32 bit integer  28               number of columns
    0016     unsigned byte   ??               pixel
    0017     unsigned byte   ??               pixel
    ........
    xxxx     unsigned byte   ??               pixel
    Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n, rows, cols), where n is the number of images
    """
    return decode_idx3_ubyte(idx_ubyte_file)

def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
    """
    TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
    [offset] [type]          [value]          [description]
    0000     32 bit integer  0x00000801(2049) magic number (MSB first)
    0004     32 bit integer  10000            number of items
    0008     unsigned byte   ??               label
    0009     unsigned byte   ??               label
    ........
    xxxx     unsigned byte   ??               label
    The labels values are 0 to 9.

    :param idx_ubyte_file: path to the idx file
    :return: np.array of shape (n,), where n is the number of labels
    """
    return decode_idx1_ubyte(idx_ubyte_file)

def run():
    train_images = load_train_images()
    train_labels = load_train_labels()
    # test_images = load_test_images()
    # test_labels = load_test_labels()
    # Show the first ten images with their labels to check that decoding worked
    for i in range(10):
        print(train_labels[i])
        plt.imshow(train_images[i], cmap='gray')
        plt.show()
    print('done')

if __name__ == '__main__':
    run()
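
The per-image loop in decode_idx3_ubyte calls struct.unpack_from once for every image. As an alternative sketch (not the original author's method), NumPy's np.frombuffer can interpret the whole data section in one step. The function name decode_idx3_ubyte_fast and the synthetic two-image buffer are assumptions made for this example, and unlike the original function it returns a uint8 array rather than a float one.

```python
import struct

import numpy as np


def decode_idx3_ubyte_fast(bin_data):
    """Decode the bytes of an idx3-ubyte file into an (n, rows, cols) uint8 array.

    Hypothetical variant of decode_idx3_ubyte, written for illustration only.
    """
    fmt_header = '>iiii'
    magic_number, num_images, num_rows, num_cols = struct.unpack_from(fmt_header, bin_data, 0)
    if magic_number != 2051:  # 0x00000803: unsigned bytes, 3 dimensions
        raise ValueError('unexpected magic number: %d' % magic_number)
    offset = struct.calcsize(fmt_header)
    # Read exactly the number of pixels the header promises, all at once.
    count = num_images * num_rows * num_cols
    pixels = np.frombuffer(bin_data, dtype=np.uint8, count=count, offset=offset)
    return pixels.reshape((num_images, num_rows, num_cols))


# Synthetic file contents: two 2 x 3 "images" with pixel values 0..11.
fake_file = struct.pack('>iiii', 2051, 2, 2, 3) + struct.pack('>12B', *range(12))
images = decode_idx3_ubyte_fast(fake_file)
print(images.shape)
print(images[1])
```

The array returned by np.frombuffer is a read-only view of the input bytes, so call .copy() on it (or .astype(float)) before modifying the pixels.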

Reference: monitor1379
