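Project description: This is a tutorial on sentiment analysis. Google's Word2Vec is a deep-learning-inspired method for learning vector representations of text that aims to capture the meaning of words. Word2Vec tries to model the meaning of words and the semantic relationships between them. It works in a way similar to deep approaches such as recurrent neural nets or deep neural nets, but is computationally more efficient. Sentiment analysis is a challenging task in machine learning: people express their feelings in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can mislead humans and computers alike. This tutorial focuses on applying Word2Vec to sentiment analysis.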

Project timeline: 2014/12/9-2015/6/30

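Tutorial overview: This tutorial helps us get familiar with applying Word2Vec to natural language processing. It has two main goals:

     Basic natural language processing: Part 1 of the tutorial covers some basic NLP techniques to help beginners get started;

     Deep learning for text understanding: Part 2 and Part 3 explain how to train a model with Word2Vec and how to use the resulting word vectors for sentiment analysis.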

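This tutorial uses the IMDB sentiment analysis dataset [2], which contains 100,000 movie reviews. The processing pipeline consists mainly of the following steps:

Read the data with pd.read_csv --> strip HTML tags from the reviews with the BeautifulSoup package --> remove punctuation from the reviews with regular expressions (re) --> convert all uppercase letters in the reviews to lowercase -->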

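Part 1: For Beginners - Bag of Words

1.1 Reading the data

The figure below shows part of the training data. The training set is named labeledTrainData.tsv (a csv file separates values with commas; a tsv file separates them with tabs). It has three columns, id, sentiment, and review: the ID of the review, the ground-truth sentiment label of the review (0 or 1; in this dataset 1 marks a positive review and 0 a negative one), and the text of the review itself.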

# Import the pandas package, then use the "read_csv" function to read the labeled training data
import pandas as pd
# Load the data; quoting=3 (csv.QUOTE_NONE) makes pandas treat double quotes as ordinary characters
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

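Kaggle competition data usually comes as .csv or .tsv files, both of which can be read with the read_csv() function in pandas. The first argument is the file name and is required; many optional arguments control other behavior. header gives the row number (or a list of row numbers) to use as the column names, so header=0 means the first row of the file holds the column names. delimiter is the separator between fields; delimiter="\t" says the file is tab-separated. The return value train is a DataFrame. Its shape attribute gives the size of the data (25000x3), so the training set has 25,000 rows. Each column of a DataFrame is identified by its column name, and that name can be used to extract the column: either as an attribute, as in train.id, or by indexing, as in train['id']. For example, train['id'][0] returns the first value of the 'id' column ('"5814_8"'), and train['id'][0:3] returns its first three values. See the session below: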

In [11]: train['id'][0]
Out[11]: '"5814_8"'
In [12]: train.id[0:3]
Out[12]:
0 "5814_8"
1 "2381_9"
2 "7759_3"
Name: id, dtype: object
In [13]: train.id[0]
Out[13]: '"5814_8"'
In [14]: train['id'][0:3]
Out[14]:
0 "5814_8"
1 "2381_9"
2 "7759_3"
Name: id, dtype: object
In [15]: train.shape
Out[15]: (25000, 3)
In [16]: train.columns.values
Out[16]: array(['id', 'sentiment', 'review'], dtype=object)
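The id values above keep their literal double quotes because of quoting=3, which is the value of the csv.QUOTE_NONE constant from Python's csv module. The sketch below illustrates the effect with the standard csv module rather than pandas; the two data rows and the helper read_rows are made up for illustration and are not part of the original tutorial.

```python
import csv
import io

# Two made-up rows laid out like labeledTrainData.tsv (hypothetical data).
SAMPLE_TSV = 'id\tsentiment\treview\n"1_8"\t1\t"Fun."\n"2_3"\t0\t"Dull."\n'


def read_rows(text, quoting):
    """Parse tab-separated text; return (header, data_rows)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    rows = list(reader)
    return rows[0], rows[1:]


# quoting=3 is csv.QUOTE_NONE: quote characters are kept as part of the field.
header, raw_rows = read_rows(SAMPLE_TSV, csv.QUOTE_NONE)

# The default, csv.QUOTE_MINIMAL, strips the surrounding quotes instead.
_, unquoted_rows = read_rows(SAMPLE_TSV, csv.QUOTE_MINIMAL)

print(header)
print(raw_rows[0][0], unquoted_rows[0][0])
```

Keeping the quotes avoids misparsing reviews that themselves contain double quotes, at the cost of leaving quote characters in the strings for later cleaning.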

The first review looks like this:

train['review'][0]
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'

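Looking at the data, we can see that besides ordinary words the reviews contain HTML tags such as <br />, abbreviations (for example, Michael Jackson shortened to MJ), and all kinds of punctuation. The next section describes how to clean the data.

1.2 Data cleaning and text preprocessing

1.2.1 Removing HTML tags from the reviews: the BeautifulSoup package

First we remove the HTML tags from the text, using the BeautifulSoup package. BeautifulSoup is a Python library mainly used, for example when writing crawlers, to extract data from HTML or XML files; here we use it only to strip the HTML tags from the reviews. For more on the package, see its documentation [4]. If BeautifulSoup is not installed on your machine, install it with the following command: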

$ sudo pip install BeautifulSoup4

BeautifulSoup is a class with many methods. Let's see how it handles the HTML tags in the text:

# Import the pandas package, then use the "read_csv" function to read the labeled training data
import pandas as pd
from bs4 import BeautifulSoup
# Load the training dataset
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
# Initialize a BeautifulSoup object on a single movie review
example1 = BeautifulSoup(train['review'][0], "html.parser")
# Print the raw review and then the output of get_text(), for comparison
print(train['review'][0])
print(example1.get_text())

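The output of example1.get_text() is shown below, with the HTML tags removed. BeautifulSoup is far more powerful than this dataset requires. Regular expressions could strip the tags as well, but they are not recommended for HTML: even for an application as simple as this one, it is better to use BeautifulSoup.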

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
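To see what tag stripping does on a small input, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup. It is a stand-in for illustration only, not the tutorial's method: the class TextExtractor and the function strip_tags are hypothetical names, and the sample string is a short excerpt of the review above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = "drugs are bad m'kay.<br /><br />Visually impressive"
text = strip_tags(sample)
print(text)
```

As in the BeautifulSoup output above, the text on either side of the <br /> tags is joined with no space in between, which is worth remembering when splitting into words later.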

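1.2.2 Handling punctuation, numbers, and stop words: NLTK and regular expressions

Note that text processing does not always call for removing punctuation, digits, and similar characters; whether to remove them depends on the task. In sentiment analysis, for example, symbols such as "!!!" or ":-(" carry sentiment and should be treated as words. For simplicity, this tutorial removes all punctuation. Likewise, it removes all numbers, although there are other ways to handle them that keep them meaningful: we could treat them as words, or replace every number with the placeholder string "NUM". To remove punctuation and numbers we use regular expressions through Python's re module, which is built in and needs no separate installation. See the official documentation of the re module for details.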

import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub('[^a-zA-Z]',          # the pattern to search for
                      ' ',                  # the replacement (a space)
                      example1.get_text())  # the text to search
print(letters_only)

The result is shown below. Every character other than a-z and A-Z, such as digits and punctuation, has been replaced with a space.

With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  
with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter
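The alternative mentioned earlier, replacing numbers with the placeholder string "NUM" instead of deleting them, can be sketched as follows. The function name replace_numbers and the sample sentence are assumptions for illustration; the tutorial itself simply removes the digits.

```python
import re


def replace_numbers(text, placeholder="NUM"):
    """Turn each run of digits into a placeholder token, then keep letters only."""
    # Surround the placeholder with spaces so it becomes a separate token.
    with_placeholder = re.sub(r"\d+", " " + placeholder + " ", text)
    # Same letters-only substitution as above; the placeholder is all letters.
    letters_only = re.sub(r"[^a-zA-Z]", " ", with_placeholder)
    return letters_only.split()


tokens = replace_numbers("only on for 20 minutes or so.")
print(tokens)
```

The substitution has to run before any lowercasing if the placeholder is meant to stay distinguishable from ordinary words.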

Regular-expression syntax is not covered here; look it up as needed. The text still contains uppercase letters, which we convert to lowercase:

lower_case = letters_only.lower()  # Convert to lower case
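words = lower_case.split() # Split into words

Calling the lower() method of letters_only converts the uppercase letters in the text to lowercase; calling the split() method of lower_case then extracts each word of the paragraph into words, a list.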

Figure 2: Some of the variables
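To make the intermediate variables concrete, here is a small sketch that runs the same letters-only, lowercase, and split steps on one short sentence. The helper name to_words and the sample sentence are made up for illustration.

```python
import re


def to_words(text):
    """Keep letters only, lowercase the text, and split it into a list of words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    lower_case = letters_only.lower()
    return lower_case.split()


words = to_words("Moonwalker is part biography, part feature film!")
print(words)
```

Because split() with no arguments collapses runs of whitespace, the extra spaces left behind by the substitution do not produce empty strings in the list.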

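Finally, we need to decide how to deal with words that occur frequently but carry little meaning, such as a, and, the, and is. Such words are called "stop words". Although stop words are the most common words in a language, there is no single stop-word list used by all natural language processing tools, and a single tool may even use several lists. The NLTK package (Natural Language Toolkit) includes stop-word lists. After installing nltk, call .download() to install its data packages; running the command brings up the interface shown below, and the download may take a long time.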

>>> import nltk
>>> nltk.download()

Once the data packages are installed, nltk can be used to view the list of stop words:
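A minimal sketch of loading the list and filtering a word list with it is given below, assuming the NLTK stopwords corpus has been downloaded. The helper names load_stop_words and remove_stop_words are hypothetical, and the four-word fallback set is only a stand-in so the sketch still runs when NLTK or its data is missing.

```python
def load_stop_words():
    """Return NLTK's English stop words, or a tiny stand-in set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # Hypothetical fallback for illustration; NLTK's real list is much longer.
        return {"a", "and", "the", "is"}


def remove_stop_words(words, stop_words):
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w not in stop_words]


stop_words = load_stop_words()
words = ["moonwalker", "is", "part", "biography", "and", "the", "film"]
meaningful = remove_stop_words(words, stop_words)
print(meaningful)
```

Converting the list to a set first makes each membership test fast, which matters when the same filter is applied to thousands of reviews.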

References:

[1] Bag of Words Meets Bags of Popcorn: https://www.kaggle.com/c/word2vec-nlp-tutorial

[2]Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

[3] https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors

[4] Beautiful Soup 4.2.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
