Open Data for Deep Learning
Here you’ll find an organized list of interesting, high-quality datasets for machine learning research. We welcome your contributions for curating this list! You can find other lists of such datasets on Wikipedia, for example.
Recent Additions
- Open Source Biometric Recognition Data
- Google AudioSet: An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
- Uber 2B trip data: Slow rollout of access to ride data for 2Bn trips.
Natural-Image Datasets
- MNIST: handwritten digits: The most commonly used sanity check. Dataset of 28x28, centered, B&W handwritten digits. It is an easy task: just because something works on MNIST doesn’t mean it works in general. (A minimal loading sketch appears after this list.)
- CIFAR10 / CIFAR100: 32x32 color images with 10 / 100 categories. Not commonly used anymore, though once again, can be an interesting sanity check.
- Caltech 101: Pictures of objects belonging to 101 categories.
- Caltech 256: Pictures of objects belonging to 256 categories.
- STL-10 dataset: An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.
- The Street View House Numbers (SVHN): House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
- NORB: Binocular images of toy figurines under various illumination and pose.
- Pascal VOC: Generic image segmentation / classification. Not terribly useful for building real-world image annotation systems, but great for baselines.
- Labelme: A large dataset of annotated images.
- ImageNet: The de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000-category WordNet hierarchy from ImageNet.
- LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
- MS COCO: Generic image understanding / captioning, with an associated competition.
- COIL-20: Different objects imaged at every angle in a 360-degree rotation.
- COIL-100: Different objects imaged at every angle in a 360-degree rotation.
- Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
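For a quick sanity check, several of these image datasets can be pulled down with a standard loader. The sketch below is one possible approach using torchvision (an assumption; any framework's dataset utilities work equally well) to fetch MNIST and CIFAR-10 and produce mini-batches.

```python
# Hedged sketch: loading MNIST and CIFAR-10 as sanity-check datasets with torchvision.
# "./data" is an arbitrary download directory chosen for this example.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # converts PIL images to [0, 1] float tensors

# MNIST: 28x28 grayscale digits, 60,000 training / 10,000 test images.
mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)

# CIFAR-10: 32x32 color images in 10 classes, 50,000 training / 10,000 test images.
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)

loader = DataLoader(mnist_train, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])
```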
Geospatial data
- OpenStreetMap: Vector data for the entire planet under a free license. It contains (an older version of) the US Census Bureau’s data.
- Landsat8: Satellite shots of the entire Earth surface, updated every several weeks.
- NEXRAD: Doppler radar scans of atmospheric conditions in the US.
Artificial Datasets
- Arcade Universe: An artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux’s bugland dataset generator.
- A collection of datasets inspired by the ideas from BabyAISchool:
- BabyAIShapesDatasets: distinguishing between 3 simple shapes
- BabyAIImageAndQuestionDatasets: a question-image-answer dataset
- Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007):
- MnistVariations: introducing controlled variations in MNIST
- RectanglesData: discriminating between wide and tall rectangles (see the sketch after this list)
- ConvexNonConvex: discriminating between convex and nonconvex shapes
- BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds.
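To give a flavor of what these synthetic benchmarks look like, here is a minimal sketch (not the original generator) of a RectanglesData-style task in NumPy: binary images labeled by whether the rectangle is wider than it is tall.

```python
# Minimal sketch of a RectanglesData-style synthetic task (not the original generator):
# 28x28 binary images containing one rectangle, labeled 1 if wider than tall, else 0.
import numpy as np

def make_rectangles(n_samples, size=28, rng=None):
    rng = np.random.default_rng(rng)
    images = np.zeros((n_samples, size, size), dtype=np.float32)
    labels = np.zeros(n_samples, dtype=np.int64)
    for i in range(n_samples):
        h = rng.integers(2, size // 2)   # rectangle height
        w = rng.integers(2, size // 2)   # rectangle width
        while w == h:                    # avoid ambiguous square labels
            w = rng.integers(2, size // 2)
        top = rng.integers(0, size - h)
        left = rng.integers(0, size - w)
        images[i, top:top + h, left:left + w] = 1.0
        labels[i] = int(w > h)           # 1 = wide, 0 = tall
    return images, labels

X, y = make_rectangles(1000, rng=0)
print(X.shape, y.mean())  # (1000, 28, 28) and roughly half positive labels
```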
Facial Datasets
- Labelled Faces in the Wild: 13,000 facial regions cropped using Viola-Jones and labeled with a name identifier. A subset of the people present have two images in the dataset, and it’s quite common for people to train facial matching systems here. (A minimal loading sketch appears after this list.)
- UMD Faces: Annotated dataset of 367,920 faces of 8,501 subjects.
- CASIA WebFace: Facial dataset of 453,453 images over 10,575 identities after face detection. Requires some filtering for quality.
- MS-Celeb-1M: 1 million images of celebrities from around the world. Requires some filtering for best results on deep networks.
- Olivetti: A few images of several different people.
- Multi-Pie: The CMU Multi-PIE Face Database
- Face-in-Action
- JACFEE: Japanese and Caucasian Facial Expressions of Emotion
- FERET: The Facial Recognition Technology Database
- mmifacedb: MMI Facial Expression Database
- IndianFaceDatabase
- The Yale Face Database and The Yale Face Database B.
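Labelled Faces in the Wild ships with scikit-learn's dataset helpers, which makes it easy to get a working face-recognition baseline. A minimal sketch, assuming scikit-learn is installed:

```python
# Hedged sketch: fetching Labeled Faces in the Wild via scikit-learn's built-in helper.
from sklearn.datasets import fetch_lfw_people

# Keep only people with at least 70 images so each class has enough examples.
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

print(lfw.images.shape)   # (n_samples, height, width) grayscale face crops
print(lfw.target_names)   # identities with many photos in the dataset
print(lfw.target[:10])    # integer class labels aligned with target_names
```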
Video Datasets
- Youtube-8M: A large and diverse labeled video dataset for video understanding research.
Text Datasets
- 20 newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm. (A minimal bag-of-words baseline appears after this list.)
- Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
- Penn Treebank: Used for next word prediction or next character prediction.
- UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
- Broadcast News: Large text dataset, classically used for next word prediction.
- Text Classification Datasets: From Zhang et al., 2015, an extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sample sizes range from 120K to 3.6M, covering binary to 14-class problems. Datasets are drawn from DBPedia, Amazon, Yelp, Yahoo! and AG.
- WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
- SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a segment of text.
- Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
- Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset since it is a crawl of the WWW.
- Google Books Ngrams: Successive words from Google books. Offers a simple method to explore when a word first entered wide usage.
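Several of these text corpora come bundled with common libraries. As one example, the sketch below (assuming scikit-learn) pulls 20 Newsgroups and builds a simple bag-of-words baseline.

```python
# Hedged sketch: 20 Newsgroups as a bag-of-words classification baseline with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Strip headers/footers/quotes so the model can't cheat on metadata.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000).fit(X_train, train.target)
print("test accuracy:", clf.score(X_test, test.target))
```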
Question answering
- Maluuba News QA Dataset: 120K Q&A pairs on CNN news articles.
- Quora Question Pairs: The first dataset release from Quora, containing duplicate / semantic similarity labels.
- CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
- Maluuba goal-oriented dialogue: Procedural conversational dataset where the dialogue aims at accomplishing a task or taking a decision. Often used to work on chat bots.
- bAbi: Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
- The Children’s Book Test: Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering (reading comprehension) and factoid look-up.
Sentiment
- Multidomain sentiment analysis dataset: An older, academic dataset.
- IMDB: An older, relatively small dataset for binary sentiment classification. Has fallen out of favor as a benchmark in the literature, with larger datasets now preferred. (A minimal loading sketch appears after this list.)
- Stanford Sentiment Treebank: Standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree.
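The IMDB reviews mentioned above are also packaged with Keras as pre-tokenized integer sequences, which is convenient for a quick baseline. A minimal sketch, assuming TensorFlow/Keras is installed:

```python
# Hedged sketch: loading the Keras-packaged IMDB sentiment dataset (pre-tokenized reviews).
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Reviews arrive as lists of integer word indices; labels are 0 (negative) or 1 (positive).
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

# Pad/truncate to a fixed length so the sequences can be batched.
x_train = pad_sequences(x_train, maxlen=256)
x_test = pad_sequences(x_test, maxlen=256)

print(x_train.shape, y_train[:5])  # (25000, 256) and binary labels
```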
Recommendation and ranking systems
- Movielens: Movie ratings dataset from the Movielens website, in various sizes ranging from demo to mid-size (see the sketch after this list).
- Million Song Dataset: Large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
- Last.fm: Music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
- Book-Crossing dataset: From the Book-Crossing community. Contains 278,858 users providing 1,149,780 ratings of 271,379 books.
- Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
- Netflix Prize: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies. The first major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.
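MovieLens is the usual starting point for recommendation experiments, and its ratings come as a simple CSV, so a few lines of pandas give you a user-item matrix. The file and column names below follow the ml-latest-small release and are an assumption about which version you download.

```python
# Hedged sketch: turning MovieLens ratings into a user-item matrix with pandas.
# Assumes the ml-latest-small release, whose ratings.csv has columns
# userId, movieId, rating, timestamp; adjust names for other versions.
import pandas as pd

ratings = pd.read_csv("ml-latest-small/ratings.csv")

# Pivot: rows are users, columns are movies, values are ratings (NaN = unrated).
user_item = ratings.pivot_table(index="userId", columns="movieId", values="rating")

print(ratings.shape, user_item.shape)
print("mean rating:", ratings["rating"].mean())
```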
Networks and Graphs
- Amazon Co-Purchasing: Amazon Reviews crawled data from “the users who bought this also bought…” section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommendation systems in networks.
- Friendster Social Network Dataset: Before its pivot to a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.
Speech Datasets
- 2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
- LibriSpeech: Audiobook dataset of text and speech. Nearly 500 hours of clean speech from various audiobooks read by multiple speakers, organized by book chapters and containing both the text and the speech. (A minimal loading sketch appears after this list.)
- VoxForge: Clean speech dataset of accented English. Useful when you expect to need robustness to different accents or intonations.
- TIMIT: English-only speech recognition dataset.
- CHiME: Noisy speech recognition challenge dataset. The dataset contains real, simulated, and clean voice recordings: real consists of actual recordings of 4 speakers across nearly 9,000 recordings in 4 noisy locations; simulated is generated by combining multiple environments with speech utterances; and clean consists of non-noisy recordings.
- TED-LIUM: Audio transcriptions of TED talks. 1,495 TED talk audio recordings along with full text transcriptions of those recordings.
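LibriSpeech in particular has ready-made loaders. A minimal sketch using torchaudio (an assumption; the archive can equally be downloaded and parsed by hand) is shown below.

```python
# Hedged sketch: loading LibriSpeech utterances with torchaudio's built-in dataset class.
import torchaudio

# Downloads the "train-clean-100" split (roughly 6 GB) into ./data on first use.
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = dataset[0]
print(waveform.shape, sample_rate)  # mono audio at 16 kHz
print(transcript)
```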
Symbolic Music Datasets
- Piano-midi.de: classical piano pieces
- Nottingham: over 1000 folk tunes
- MuseData: electronic library of classical music scores
- JSB Chorales: set of four-part harmonized chorales
Miscellaneous Datasets
- CMU Motion Capture Database
- Brodatz dataset: texture modeling
- 300 terabytes of high-quality data from the Large Hadron Collider (LHC) at CERN
- NYC Taxi dataset: NYC taxi data obtained via a FOIA request; it led to privacy issues.
- Uber FOIL dataset: Data for 4.5M pickups in NYC from an Uber FOIL request.
- Criteo click stream dataset: Large Internet advertisement dataset from a major EU retargeter.
Health & Biology Data
- EU Surveillance Atlas of Infectious Diseases
- Merck Molecular Activity Challenge
- Musk dataset: The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property.
Government & statistics data
- Data USA: The most comprehensive visualization of US public data
- EU Gender statistics database
- The Netherlands’ Nationaal Georegister (Dutch)
- United Nations Development Programme Projects
Thanks to deeplearning.net and Luke de Oliveira for many of these links and dataset descriptions. Any suggestions of open data sets we should include for the Deeplearning4j community are welcome!
https://deeplearning4j.org/opendata