Creating an LMDB database in Python

LMDB is the database of choice when using Caffe with large datasets. This is a tutorial of how to create an LMDB database from Python. First, let’s look at the pros and cons of using LMDB over HDF5.

Reasons to use HDF5:

Simple format to read/write.

Reasons to use LMDB:

LMDB uses memory-mapped files, giving much better I/O performance.
Works well with really large datasets. The HDF5 files are always read entirely into memory, so you can’t have any HDF5 file exceed your memory capacity. You can easily split your data into several HDF5 files though (just put several paths to h5files in your text file). Then again, compared to LMDB’s page caching the I/O performance won’t be nearly as good.

LMDB from Python

You will need the Python package lmdb as well as Caffe’s python package (make pycaffe in Caffe). LMDB provides key-value storage, where each <key, value> pair will be a sample in our dataset. The key will simply be a string version of an ID value, and the value will be a serialized version of the Datum class in Caffe (which are built using protobuf).

import numpy as np

import lmdb

import caffe

N = 1000

# Let's pretend this is interesting data

X = np.zeros((N, 3, 32, 32), dtype=np.uint8)

y = np.zeros(N, dtype=np.int64)

# We need to prepare the database for the size. We'll set it 10 times

# greater than what we theoretically need. There is little drawback to

# setting this too big. If you still run into problem after raising

# this, you might want to try saving fewer entries in a single

# transaction.

map_size = X.nbytes * 10

env = lmdb.open('mylmdb', map_size=map_size)

with env.begin(write=True) as txn:

    # txn is a Transaction object

    for i in range(N):

        datum = caffe.proto.caffe_pb2.Datum()

        datum.channels = X.shape[1]

        datum.height = X.shape[2]

        datum.width = X.shape[3]

        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9

        datum.label = int(y[i])

        str_id = '{:08}'.format(i)

        # The encode is only essential in Python 3

        txn.put(str_id.encode('ascii'), datum.SerializeToString())

You can also open up and inspect an existing LMDB database from Python:

import numpy as np

import lmdb

import caffe

env = lmdb.open('mylmdb', readonly=True)

with env.begin() as txn:

    raw_datum = txn.get(b'00000000')

datum = caffe.proto.caffe_pb2.Datum()

datum.ParseFromString(raw_datum)

flat_x = np.fromstring(datum.data, dtype=np.uint8)

x = flat_x.reshape(datum.channels, datum.height, datum.width)

y = datum.label

Iterating <key, value> pairs is also easy:

with env.begin() as txn:

    cursor = txn.cursor()

    for key, value in cursor:

        print(key, value)

Creating an LMDB database in Python的更多相关文章

Initialization of deep networks
Initialization of deep networks 24 Feb 2015Gustav Larsson As we all know, the solution to a non-conv ...
非图片格式如何转成lmdb格式--caffe
链接 LMDB is the database of choice when using Caffe with large datasets. This is a tutorial of how to ...
Movidius的深度学习入门
1.Ubuntu虚拟机上安装NC SDK cd /home/shine/Downloads/ mkdir NC_SDK git clone https://github.com/movidius/nc ...
Python框架、库以及软件资源汇总
转自:http://developer.51cto.com/art/201507/483510.htm 很多来自世界各地的程序员不求回报的写代码为别人造轮子.贡献代码.开发框架.开放源代码使得分散在世 ...
Awesome Python
Awesome Python A curated list of awesome Python frameworks, libraries, software and resources. Insp ...
Machine and Deep Learning with Python
Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...
Huge CSV and XML Files in Python, Error: field larger than field limit (131072)
Huge CSV and XML Files in Python January 22, 2009. Filed under python twitter facebook pinterest lin ...
（原）caffe中通过图像生成lmdb格式的数据
转载请注明出处: http://www.cnblogs.com/darkknightzh/p/5909121.html 参考网址: http://www.cnblogs.com/wangxiaocvp ...
Caffe︱构建lmdb数据集、binaryproto均值文件及各类难辨的文件路径名设置细解
Lmdb生成的过程简述 1.整理并约束尺寸,文件夹.图片放在不同的文件夹之下,注意图片的size需要规约到统一的格式,不然计算均值文件的时候会报错. 2.将内容生成列表放入txt文件中.两个txt文件 ...

随机推荐

python写csv文件
name=['lucy','jacky','eric','man','san'] place=['chongqing','guangzhou','beijing','shanghai','shenzh ...
train_test_split数据切分
train_test_split 数据切分格式: X_train,X_test, y_train, y_test =cross_validation.train_test_split(train_d ...
parted 分区命令
fdisk 是针对 MBR的分区 ,因为MBR分区空间最大不能超过2T 最多分4个主分区 , 所以parted可以修改磁盘为GPT 可以支持更大的分区,更多的分区 1 查看分区 : #part ...
Practice| 流程控制
若整数a除以非零整数b,商为整数,且余数为零, 我们就说a能被b整除(或说b能整除a),a为被除数,b为除数,即b|a("|"是整除符号),读作"b整除a"或& ...
IDEA创建Web项目（maven）
第一步:创建项目第二步:使用maven创建,并选择jdk 第三步:修改项目名称第四步:选择自动导入依赖(很重要!!) 第五步:添加核心依赖和打包第六步:编译一下第七步:配置web容器(这里是用 ...
day70 cookie & session 前后端交互分页显示
本文转载自qimi博客,cnblog.liwenzhou.com 概要: 我们的cookie是保存在浏览器中的键值对为什么要有cookie? 我们在访问浏览器的时候,千万个人访问同一个页面,我们只要 ...
day24 面向对象，交互，组合，命名空间，初始继承
面向对象的命名空间: #属性:静态属性 (直接和类名关联或者直接定义在class下的变量) # 对象属性 (在类内和self关联,在类外和对象名关联的变量) # 动态属性(函数) class Foo: ...
HDU-2032解题报告
Hdu-2032解题报告题意:实现给定行数的杨辉三角的输出. 杨辉三角的特点:每一行数据的开头和结尾是1,然后其他的数据是由其上一个数据与其左上角的数据之和组成11 11 2 11 3 3 11 4 ...
CodeForces - 862C Mahmoud and Ehab and the xor(构造)【异或】
<题目链接> 题目大意: 给出n.m,现在需要你输出任意n个不相同的数(n,m<1e5),使他们的异或结果为m,如果不存在n个不相同的数异或结果为m,则输出"NO" ...
vue笔记-模板,计算属性,class与style,data属性
数据和方法 1:只有当实例被创建时 data 中存在的属性才是响应式的,也可以预定义一些空的属性,唯一的意外就是Object.freeze(obj),这会阻止修改现有的属性;也就是说一个数据在放到根实 ...

Creating an LMDB database in Python

LMDB from Python

Creating an LMDB database in Python的更多相关文章

随机推荐

热门专题