[Repost] NLP Datasets
nlp-datasets
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the data here is raw, unstructured text; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.
Datasets
Apache Software Foundation Public Mail Archives: all publicly available Apache Software Foundation mail archives as of July 11, 2011 (200 GB)
Blog Authorship Corpus: consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 681,288 posts and over 140 million words. (298 MB)
Amazon Fine Food Reviews [Kaggle]: consists of 568,454 food reviews Amazon users left up to October 2012. Paper. (240 MB)
Amazon Reviews: Stanford collection of 35 million amazon reviews. (11 GB)
ArXiv: all the papers on arXiv as full text (270 GB) + source files (190 GB).
ASAP Automated Essay Scoring [Kaggle]: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. (100 MB)
ASAP Short Answer Scoring [Kaggle]: Each of the data sets was generated from a single prompt. Selected responses have an average length of 50 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students primarily in Grade 10. All responses were hand graded and were double-scored. (35 MB)
Classification of political social media: Social media messages from politicians classified by content. (4 MB)
CLiPS Stylometry Investigation (CSI) Corpus: a yearly expanded corpus of student texts in two genres: essays and reviews. The purpose of this corpus lies primarily in stylometric research, but other applications are possible. (on request)
ClueWeb09 FACC: ClueWeb09 with Freebase annotations (72 GB)
ClueWeb12 FACC: ClueWeb12 with Freebase annotations (92 GB)
Common Crawl Corpus: web crawl data composed of over 5 billion web pages (541 TB)
Cornell Movie Dialog Corpus: contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, 617 movies (9.5 MB)
Corporate messaging: A data categorization job concerning what corporations actually talk about on social media. Contributors were asked to classify statements as information (objective statements about the company or its activities), dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.). (600 KB)
Crosswikis: English-phrase-to-associated-Wikipedia-article database. Paper. (11 GB)
DBpedia: a community effort to extract structured information from Wikipedia and to make this information available on the Web (17 GB)
Death Row: the last words of every inmate executed since 1984, published online (HTML table)
Del.icio.us: 1.25 million bookmarks on delicious.com (170 MB)
Disasters on social media: 10,000 tweets annotated as to whether the tweet referred to a disaster event. (2 MB)
Economic News Article Tone and Relevance: News articles judged for relevance to the US economy and, if relevant, for the tone of the article. Dates range from 1951 to 2014. (12 MB)
Enron Email Data: consists of 1,227,255 emails with 493,384 attachments covering 151 custodians (210 GB)
Event Registry: free tool that gives real-time access to news articles from 100,000 news publishers worldwide. Has an API. (query tool)
Examiner.com - Spam Clickbait News Headlines [Kaggle]: 3 million crowdsourced news headlines published by the now-defunct clickbait website The Examiner from 2010 to 2015. (200 MB)
Federal Contracts from the Federal Procurement Data Center (USASpending.gov): data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov (180 GB)
Flickr Personal Taxonomies: Tree dataset of personal tags (40 MB)
Freebase Data Dump: data dump of all the current facts and assertions in Freebase (26 GB)
Freebase Simple Topic Dump: data dump of the basic identifying facts about every topic in Freebase (5 GB)
Freebase Quad Dump: data dump of all the current facts and assertions in Freebase (35 GB)
German Political Speeches Corpus: collection of recent speeches given by top German representatives (25 MB, 11 MTokens)
GigaOM Wordpress Challenge [Kaggle]: blog posts, meta data, user likes (1.5 GB)
Google Books Ngrams: also available in Hadoop format on Amazon S3 (2.2 TB). A minimal loading sketch appears after this list.
Google Web 5gram: contains English word n-grams and their observed frequency counts (24 GB)
Gutenberg Ebook List: annotated list of ebooks (2 MB)
Hansards text chunks of Canadian Parliament: 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. (82 MB)
Harvard Library: over 12 million bibliographic records for materials held by the Harvard Library, including books, journals, electronic resources, manuscripts, archival materials, scores, audio, video and other materials. (4 GB)
Hate speech identification: Contributors viewed short texts and identified whether each a) contained hate speech, b) was offensive but without hate speech, or c) was not offensive at all. Contains nearly 15K rows with three contributor judgments per text string. (3 MB)
Hillary Clinton Emails [Kaggle]: nearly 7,000 pages of Clinton's heavily redacted emails (12 MB)
Historical Newspapers Yearly N-grams and Entities Dataset: Yearly time series for the usage of the 1,000,000 most frequent 1-, 2-, and 3-grams from a subset of the British Newspaper Archive corpus, along with yearly time series for the 100,000 most frequent named entities linked to Wikipedia and a list of all articles and newspapers contained in the dataset (3.1 GB)
Historical Newspapers Daily Word Time Series Dataset: Time series of daily word usage for the 25,000 most frequent words in 87 years of UK and US historical newspapers between 1836 and 1922. (2.7 GB)
Home Depot Product Search Relevance [Kaggle]: contains a number of products and real customer search terms from Home Depot's website. The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters. (65 MB)
Identifying key phrases in text: Question/answer pairs + context; the context was judged for relevance to the question/answer. (8 MB)
Jeopardy: archive of 216,930 past Jeopardy questions (53 MB)
200k English plaintext jokes: archive of 208,000 plaintext jokes from various sources.
Material Safety Datasheets: 230,000 Material Safety Data Sheets. (3 GB)
Million News Headlines - ABC Australia [Kaggle]: 1.3 million news headlines published by ABC News Australia from 2003 to 2017. (56 MB)
Millions of News Article URLs: 2.3 million URLs for news articles from the front pages of over 950 English-language news outlets in the six-month period between October 2014 and April 2015. (101 MB)
MCTest: a freely available set of 660 stories and associated questions intended for research on the machine comprehension of text; for question answering (1 MB)
NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free for all Universities and non-profit organizations. Need to sign and send form to obtain. (on request)
News Headlines of India - Times of India [Kaggle]: 2.7 million news headlines with categories, published by the Times of India from 2001 to 2017. (185 MB)
News article / Wikipedia page pairings: Contributors read a short article and were asked which of two Wikipedia articles it matched most closely. (6 MB)
NIPS2015 Papers (version 2) [Kaggle]: full text of all NIPS2015 papers (335 MB)
NYTimes Facebook Data: all the NYTimes facebook posts (5 MB)
One Week of Global News Feeds [Kaggle]: news event dataset of 1.4 million articles published globally in 20 languages over one week of August 2017. (115 MB)
Objective truths of sentences/concept pairs: Contributors read a sentence with two concepts. For example “a dog is a kind of animal” or “captain can have the same meaning as master.” They were then asked if the sentence could be true and ranked it on a 1-5 scale. (700 KB)
Open Library Data Dumps: dump of all revisions of all the records in Open Library. (16 GB)
Personae Corpus: collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays by 145 different students. (on request)
Reddit Comments: every publicly available Reddit comment as of July 2015; 1.7 billion comments. A loading sketch appears after this list. (250 GB)
Reddit Comments (May ‘15) [Kaggle]: subset of above dataset (8 GB)
Reddit Submission Corpus: all publicly available Reddit submissions from January 2006 to August 31, 2015. (42 GB)
Reuters Corpus: a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community. Need to sign an agreement and send it by post to obtain. (2.5 GB)
SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)
SMS Spam Collection: 5,574 real, non-encoded English SMS messages, tagged as legitimate (ham) or spam. A loading sketch appears after this list. (200 KB)
SouthparkData: .csv files containing script information, including season, episode, character, and line. (3.6 MB)
Stanford Question Answering Dataset (SQuAD 2.0): a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (span) from the corresponding reading passage, or the question may be unanswerable. A parsing sketch appears after this list.
Stackoverflow: 7.3 million Stack Overflow questions + other Stack Exchange sites (query tool)
Twitter Cheng-Caverlee-Lee Scrape: Tweets from September 2009 - January 2010, geolocated. (400 MB)
Twitter New England Patriots Deflategate sentiment: Before the 2015 Super Bowl, there was a great deal of chatter around deflated footballs and whether the Patriots cheated. This data set looks at Twitter sentiment on important days during the scandal to gauge public sentiment about the whole ordeal. (2 MB)
Twitter Progressive issues sentiment analysis: tweets regarding a variety of left-leaning issues like legalization of abortion, feminism, Hillary Clinton, etc., classified as to whether the tweet was for, against, or neutral on the issue (with an option for none of the above). (600 KB)
Twitter Sentiment140: Tweets related to brands/keywords. Website includes papers and research ideas. (77 MB)
Twitter sentiment analysis: Self-driving cars: contributors read tweets and classified them as very positive, slightly positive, neutral, slightly negative, or very negative. They were also asked to mark if the tweet was not relevant to self-driving cars. (1 MB)
Twitter Tokyo Geolocated Tweets: 200K tweets from Tokyo. (47 MB)
Twitter UK Geolocated Tweets: 170K tweets from the UK. (47 MB)
Twitter USA Geolocated Tweets: 200K tweets from the US. (45 MB)
Twitter US Airline Sentiment [Kaggle]: A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize negative reasons (such as "late flight" or "rude service"). (2.5 MB)
U.S. economic performance based on news articles: News article headlines and excerpts ranked by relevance to the U.S. economy. (5 MB)
Urban Dictionary Words and Definitions [Kaggle]: Cleaned CSV corpus of 2.6 million Urban Dictionary words, definitions, authors, and votes as of May 2016. (238 MB)
Wesbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)
Wesbury Lab Wikipedia Corpus: snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.); the corpus is untagged, raw text. Used by Stanford NLP. (1.8 GB)
WorldTree Corpus of Explanation Graphs for Elementary Science Questions: a corpus of manually-constructed explanation graphs, explanatory role ratings, and associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)
Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)
Yahoo! Answers consisting of questions asked in French: Subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)
Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)
Yahoo! HTML Forms Extracted from Publicly Available Webpages: a small sample of pages containing complex HTML forms; 2.67 million complex forms in total. (50+ GB)
Yahoo! Metadata Extracted from Publicly Available Web Pages: 100 million triples of RDF data (2 GB)
Yahoo N-Gram Representations: This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for word- and sentence-similarity tasks, which are common in NLP research. (2.6 GB)
Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12,000 news-oriented sites (12 GB)
Yahoo! Search Logs with Relevance Judgments: Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)
Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated from 2006-11-04 processed with a number of publicly-available NLP tools. 1,490,688 entries. (6 GB)
Yelp: including restaurant rankings and 2.2M reviews (on request)
Youtube: 1.7 million YouTube video descriptions (torrent)
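The Google Books Ngrams entry above points at very large tab-separated files. A minimal streaming sketch in Python, assuming the v2 column layout (ngram, year, match_count, volume_count) and a hypothetical local filename for one downloaded 1-gram shard:

```python
import csv
import gzip
from collections import defaultdict

# Hypothetical local 1-gram shard; the shards are distributed as gzipped TSV.
PATH = "googlebooks-eng-all-1gram-a.gz"  # assumed filename, adjust to your download

# Aggregate total match counts per 1-gram across all years.
totals = defaultdict(int)
with gzip.open(PATH, "rt", encoding="utf-8") as fh:
    reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    for ngram, year, match_count, volume_count in reader:
        totals[ngram] += int(match_count)

# Print the ten most frequent 1-grams in this shard.
for word, count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{word}\t{count}")
```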
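The Reddit Comments dump listed above is distributed as compressed newline-delimited JSON, one comment object per line. A minimal streaming sketch, assuming a bzip2-compressed monthly file; both the filename and the field names (subreddit, score, body) are assumptions based on the public dumps, not guaranteed by this list:

```python
import bz2
import json

PATH = "RC_2015-05.bz2"  # hypothetical monthly dump filename

# Stream the dump without decompressing it to disk; each line is one JSON object.
with bz2.open(PATH, "rt", encoding="utf-8") as fh:
    for i, line in enumerate(fh):
        comment = json.loads(line)
        # Fields assumed from the public dumps: subreddit, score, body.
        print(comment.get("subreddit"), comment.get("score"), comment.get("body", "")[:80])
        if i >= 9:  # peek at the first ten comments only
            break
```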
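The SMS Spam Collection mentioned above ships as a plain-text file with one tab-separated label/message pair per line. A minimal loading sketch under that assumption (treat the filename as an assumption as well):

```python
import csv
from collections import Counter

PATH = "SMSSpamCollection"  # assumed filename of the distributed text file

# Each line is expected to be: label (ham/spam) TAB message text.
messages = []
with open(PATH, encoding="utf-8") as fh:
    reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    for label, text in reader:
        messages.append((label, text))

# Quick class-balance check over the ham/spam labels.
print(Counter(label for label, _ in messages))
```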
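SQuAD 2.0, listed above, is a single JSON file whose nested structure (data → paragraphs → qas) carries the contexts, questions, answer spans, and an is_impossible flag for unanswerable questions. A minimal traversal sketch, assuming the published training-split filename:

```python
import json

PATH = "train-v2.0.json"  # assumed filename of the SQuAD 2.0 training split

with open(PATH, encoding="utf-8") as fh:
    squad = json.load(fh)

examples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            examples.append({
                "question": qa["question"],
                "context": context,
                # Unanswerable questions are flagged with is_impossible=True.
                "is_impossible": qa.get("is_impossible", False),
                "answers": [a["text"] for a in qa.get("answers", [])],
            })

print(len(examples), "question/context pairs loaded")
```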
Sources
- Awesome public datasets/NLP (includes more lists)
- AWS Public Datasets
- CrowdFlower: Data for Everyone (lots of little surveys they conducted and data obtained by crowdsourcing for a specific task)
- Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
- Open Library
- Quora (mainly annotated corpora)
- /r/datasets (endless list of datasets; most are scraped by amateurs, though, and not properly documented or licensed)
- rs.io (another big list)
- Stackexchange: Opendata
- Stanford NLP group (mainly annotated corpora and TreeBanks or actual NLP tools)
- Yahoo! Webscope (also includes papers that use the data that is provided)