SA: Sentiment Analysis Resources (Corpus, Dictionary)
First, mostly excerpted from a Chinese-language survey:
http://wenku.baidu.com/view/0c33af946bec0975f465e277.html
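4.2 Resource Construction for Sentiment Analysis
4.2.1 Corpora for Sentiment Analysis
Besides the corpora provided by the three international and domestic evaluations covered in Section 4.1, many research groups and individuals have also released corpora of considerable size.
1. The movie review dataset from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/): made up of movie reviews, with 1,000 positive and 1,000 negative documents. It also includes 5,331 positive and 5,331 negative sentences labeled for polarity, plus 5,000 subjective and 5,000 objective sentences. This collection is widely used in sentiment analysis research at every granularity: word, sentence, and document level.
2. The product review corpus from Hu and Liu at the University of Illinois at Chicago (UIC): mainly online reviews of five electronic products downloaded from Amazon and Cnet (two brands of digital camera, a cell phone, an MP3 player, and a DVD player). The corpus is annotated sentence by sentence with the opinion target and with the polarity and strength of each opinionated sentence. It is therefore suited to research on opinion target extraction, sentence-level subjectivity detection, and sentiment classification. Liu has also contributed a corpus for the study of comparative sentences [74].
3. The MPQA (Multi-Perspective Question Answering) corpus developed by Janyce Wiebe et al.: 535 news articles written from different perspectives, with deep annotation. For every clause, annotators manually marked sentiment information such as the opinion holder, the opinion target, the subjective expression, and its polarity and strength. Reference [75] describes the full annotation process. The MPQA corpus is suited to tasks in the news commentary domain.
4. The multi-aspect restaurant review corpus built by Barzilay et al. at the Massachusetts Institute of Technology (MIT): 4,488 reviews, each rated on a scale of 1 to 5 for five aspects (food, ambience, service, price, and overall experience). This corpus provides a research platform for single-document, aspect-based sentiment summarization.
5. The fairly large Chinese hotel review corpus provided by Dr. Songbo Tan (谭松波) of the Institute of Computing Technology, Chinese Academy of Sciences: about 10,000 reviews labeled as positive or negative, which can serve as a platform for Chinese document-level sentiment classification.
4.2.2 Lexicon Resources for Sentiment Analysis
As sentiment analysis has matured, earlier researchers have compiled a number of sentiment resources, most of them in the form of opinion-word lexicons.
1. The GI (General Inquirer) opinion lexicon (English, http://www.wjh.harvard.edu/~inquirer/). It collects 1,914 positive words and 2,293 negative words and tags each word for polarity, strength, part of speech, and so on, which makes it flexible to use in sentiment analysis tasks.
2. The NTU opinion lexicon (Traditional Chinese). Compiled by National Taiwan University, it contains 2,812 positive words and 8,276 negative words [76].
3. The subjectivity lexicon (English, http://www.cs.pitt.edu/mpqa/). Its subjective words come from the OpinionFinder system. It contains 8,221 subjective words, each annotated with part of speech, lemmatization information, and sentiment polarity.
4. The HowNet opinion lexicon (Simplified Chinese and English, http://www.keenage.com/html/e_index.html). It contains 9,193 Chinese opinion words/phrases and 9,142 English opinion words/phrases, divided into positive and negative classes. Because it also includes opinion phrases, it offers a richer sentiment resource than word-only lexicons.
Plus what I summarized last time: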
http://site.douban.com/204776/widget/notes/12599608/note/284723117/
##Datasets for SA:
###Lexicons:
[1]
The General Inquirer Lexicon
•Homepage: http://www.wjh.harvard.edu/~inquirer
•Categories
–Positive (1,915 words) and Negative (2,291 words)
–Strong vs Weak, Active vs Passive, Overstated versus Understated
–Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
•Free for research use
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
[2]
LIWC (Linguistic Inquiry and Word Count)
•Homepage: http://www.liwc.net/
•2,300 words, > 70 classes
–Affective Processes
•negative emotion (bad, weird, hate, problem, tough)
•positive emotion (love, nice, sweet)
–Cognitive Processes
•Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
–Pronouns, Negation (no, never), Quantifiers (few, many)
•$30 or $90 fee
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007.
[3]
MPQA Subjectivity Cues Lexicon
•Homepage: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
•6,885 words from 8,221 lemmas
–2,718 positive
–4,912 negative
•Each word annotated for intensity (strong, weak)
•GNU GPL
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003
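To use this lexicon in code, each entry has to be turned into a lookup keyed by word and part of speech. The sketch below assumes the file stores one clue per line as space-separated `key=value` pairs (`type`, `len`, `word1`, `pos1`, `stemmed1`, `priorpolarity`); check the README shipped with the download before relying on that layout. The two sample lines are illustrative, not copied from the file, and `parse_clue` and `build_lexicon` are hypothetical helper names.

```python
def parse_clue(line):
    """Parse one 'key=value key=value ...' line into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def build_lexicon(lines):
    """Map (word, part of speech) -> (prior polarity, intensity type)."""
    lexicon = {}
    for line in lines:
        f = parse_clue(line)
        lexicon[(f["word1"], f["pos1"])] = (f["priorpolarity"], f["type"])
    return lexicon


# Illustrative entries in the assumed layout (not taken from the real file).
SAMPLE = [
    "type=strongsubj len=1 word1=hate pos1=verb stemmed1=y priorpolarity=negative",
    "type=weaksubj len=1 word1=nice pos1=adj stemmed1=n priorpolarity=positive",
]

lexicon = build_lexicon(SAMPLE)
print(lexicon[("hate", "verb")])
```

Keying on the (word, part of speech) pair keeps entries apart when the same word form is listed under more than one part of speech.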
[4]
Opinion Lexicon
•Homepage: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
•6,786 words
–2,006 positive
–4,783 negative
•Bing Liu's Page on Opinion Mining
Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004
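A positive/negative word list like this one is typically used for simple lexicon-based scoring: count the hits from each list and compare. The following is a minimal sketch of that idea, not the method of the cited paper. The two small sets are stand-ins chosen for illustration (they are not read from the lexicon files), and `polarity_score` and `classify` are hypothetical names.

```python
import re

# Tiny stand-in word lists; the real lexicon has thousands of entries.
POSITIVE = {"love", "nice", "sweet", "good", "great"}
NEGATIVE = {"bad", "weird", "hate", "problem", "tough"}


def polarity_score(text):
    """Return (number of positive hits) - (number of negative hits)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg


def classify(text):
    """Label text by the sign of its polarity score."""
    score = polarity_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this camera, the picture is nice"))
```

Plain counting ignores negation and context ("not good" still counts as positive), which is why it is usually treated as a baseline.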
[5]
SentiWordNet
•Homepage: http://sentiwordnet.isti.cnr.it/
•All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness
–[estimable(J,3)] “may be computed or estimated”
•Pos 0 Neg 0 Obj 1
–[estimable(J,1)] “deserving of respect or high regard”
•Pos .75 Neg 0 Obj .25
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010
Sentiment Classification of Reviews Using SentiWordNet
http://arrow.dit.ie/cgi/viewcontent.cgi?article=1000&context=ittpapnin
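Because SentiWordNet scores synsets rather than words, a word with several senses needs some aggregation step before it can be used as a single polarity value. The sketch below uses only the two senses of "estimable" listed above and takes the plain average of Pos minus Neg over senses. Averaging is an assumption made here for illustration; weighting by sense rank or disambiguating first are equally plausible choices. `SenseScore` and `word_polarity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseScore:
    pos: float
    neg: float

    @property
    def obj(self):
        # The three scores of a synset sum to 1, so Obj is implied.
        return 1.0 - self.pos - self.neg


# The two senses of "estimable" shown in the list above.
ESTIMABLE = [
    SenseScore(pos=0.0, neg=0.0),   # estimable(J,3): "may be computed or estimated"
    SenseScore(pos=0.75, neg=0.0),  # estimable(J,1): "deserving of respect or high regard"
]


def word_polarity(senses):
    """Average (Pos - Neg) over the given senses; 0.0 if there are none."""
    if not senses:
        return 0.0
    return sum(s.pos - s.neg for s in senses) / len(senses)


print(word_polarity(ESTIMABLE))
```

The objective sense pulls the average toward zero, which shows why sense-level scores can differ a lot from any word-level summary.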
###Corpus and Reviews:
[1]
Movie reviews
–Internet Movie Database (IMDb)
•http://www.cs.cornell.edu/people/pabo/movie-review-data/
•http://reviews.imdb.com/Reviews/
–700 positive / 700 negative
[2]
MOVIEREVIEWSET (Pang and Lee 2004)
[3]
MPQACORPUS (Wiebe et al. 2005)
[4]
PRODUCTREVIEWSET (Yi et al. 2003)
[2]-[4]
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://www.cs.pitt.edu/mpqa/
http://ai.stanford.edu/~amaas/data/sentiment
http://people.csail.mit.edu/jrennie/20Newsgroups
[5]
BOOKREVIEWSET (Aue and Gamon, 2005)
[6]
SENTENCESET (Kim and Hovy 2004)
[7]
The J.D. Power and Associates Sentiment Corpus
http://verbs.colorado.edu/jdpacorpus/
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. They have been manually annotated for named, nominal, and pronominal mentions of entities. Entities are marked with the aggregate sentiment expressed toward them in the document. Mentions of each entity are marked as co-referential. Mentions are assigned semantic types consisting of the Automatic Content Extraction (ACE) mention types and additional domain-specific types. Meronymy (part-of and feature-of) and instance relations are also annotated. Expressions which convey sentiment toward an entity are annotated with the polarity of their prior and contextual sentiments as well as the mentions they target. The following modifiers are annotated. These may target other modifiers or sentiment expressions:
negators (expressions which invert the polarity of a sentiment expression or modifier)
neutralizers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier)
committers (expressions which shift the commitment of the speaker toward the truth of a sentiment expression or modifier)
intensifiers (expressions which shift the intensity of a sentiment expression or modifier)
Additionally, we have annotated when the opinion holder of a sentiment expression is someone other than the author of the blog by linking the expression to the holder. We also annotate when two entities are compared on a particular dimension.
The data, organized into training and testing sets, consists of 515 documents (blog posts) covering 330,762 tokens which make up 19,322 sentences. 87,532 mentions, 15,637 sentiment expressions, and 22,662 relations between entities (co-reference groups) are annotated.
Please see the included README file for more information about this data. For a more detailed explanation of the preparation of the corpus, please read The JDPA Sentiment Corpus Annotation Guidelines or The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain.
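The distinction between prior and contextual polarity, and the role of negators and intensifiers described above, can be illustrated with a toy rule. In the corpus these links are manual annotations; the sketch below instead assumes that a modifier applies only to the sentiment word it directly precedes, which is a deliberate simplification. Neutralizers and committers are left out. All word lists, weights, and the name `contextual_scores` are hypothetical.

```python
# Hypothetical prior polarities and modifier lists, for illustration only.
PRIOR = {"good": 1.0, "reliable": 1.0, "bad": -1.0}
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}


def contextual_scores(tokens):
    """Return [(word, contextual score)] for each sentiment word.

    Negators flip the sign and intensifiers scale the magnitude of the
    next sentiment word; any other token resets pending modifiers.
    """
    results = []
    scale, flip = 1.0, False
    for tok in tokens:
        t = tok.lower()
        if t in NEGATORS:
            flip = not flip
        elif t in INTENSIFIERS:
            scale *= INTENSIFIERS[t]
        elif t in PRIOR:
            score = PRIOR[t] * scale
            results.append((t, -score if flip else score))
            scale, flip = 1.0, False
        else:
            scale, flip = 1.0, False
    return results


print(contextual_scores("the engine is not very good".split()))
```

The adjacency rule breaks on phrases such as "not a bad car", where an intervening word separates the negator from its target. Cases like that are what the corpus's explicit modifier-to-target annotations capture.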
##Packages and APIs for SA:
http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
https://sites.google.com/site/miningtwitter/questions/sentiment
##Apps for SA:
Twitrratr
Tweetfeel
Twitter sentiment / Sentiment140