Sentiment analysis in NLP
The goal of this program is to decide whether an article headline is sarcastic or not. I use TensorFlow 2.5 to solve this problem.
Dataset download url: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/home
A sample from the dataset:
{
"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5",
"headline": "former versace store clerk sues over secret 'black code' for minority shoppers",
"is_sarcastic": 0
}
We want to use the headline to predict is_sarcastic, where 1 means True and 0 means False.
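Each line of the dataset file is a standalone JSON object like the sample above. As a small illustration using only the standard library (the variable names are just for this sketch), one such line can be turned into the input text and the label like this:

```python
import json

# The sample record from above; json.dumps gives what one line of the file looks like.
sample = {
    "article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5",
    "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",
    "is_sarcastic": 0,
}
line = json.dumps(sample)

# Parse the line and pick out the model input (headline) and the target (label).
record = json.loads(line)
text = record["headline"]
label = bool(record["is_sarcastic"])  # 1 -> True, 0 -> False

print(text)
print(label)  # False
```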
preprocessing
Use pandas to read the JSON file.
import pandas as pd
# lines=True means each line of the file is parsed as a separate JSON object
df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)
df
'''
is_sarcastic headline article_link
0 1 thirtysomething sci... https://www.theonion.co...
1 0 dem rep. totally ... https://www.huffingtonpos..
'''
build a list for each column
labels = []
sentences = []
urls = []
# a tip for converting a Series to a list
'''
type(df['is_sarcastic'])
# Series
type(df['is_sarcastic'].values)
# ndarray
type(df['is_sarcastic'].values.tolist())
# list
'''
labels = df['is_sarcastic'].values.tolist()
sentences = df['headline'].values.tolist()
urls = df['article_link'].values.tolist()
len(labels) # 28619
len(sentences) # 28619
split the dataset into a train set and a test set
# the train set is the first 2/3 of the whole dataset
train_size = int(len(labels) / 3 * 2)
train_sentences = sentences[0: train_size]
test_sentences = sentences[train_size:]
train_y = labels[0:train_size]
test_y = labels[train_size:]
init some parameters
# some parameters
# vocabulary size (input dimension of the embedding layer)
vocab_size = 10000
# dimension of each word's embedding vector
embedding_dim = 16
# length of every input sequence after padding/truncating
max_length = 100
# truncate and pad at the end of the sequence
trunc_type='post'
padding_type='post'
# token that replaces out-of-vocabulary words
oov_tok = "<OOV>"
preprocessing on train set and test set
# processing on train set and test set
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
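# keep only the vocab_size most frequent words, so every index fits the Embedding layer
tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)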
tokenizer.fit_on_texts(train_sentences)
train_X = tokenizer.texts_to_sequences(train_sentences)
# padding the data
train_X = pad_sequences(train_X,
maxlen = max_length,
truncating = trunc_type,
padding = padding_type)
train_X[:2]
# convert the list to a NumPy array
train_y = np.array(train_y)
# apply the same steps to the test set
test_X = tokenizer.texts_to_sequences(test_sentences)
test_X = pad_sequences(test_X ,
maxlen = max_length,
truncating = trunc_type,
padding = padding_type)
test_y = np.array(test_y)
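To make the tokenizing and padding steps more concrete, here is a simplified plain-Python sketch of the idea. It is not the Keras implementation (for example, it does not strip punctuation), and the helper names fit_vocab and to_padded_sequence are made up for this illustration. It shows how words become integer ids, how unknown words fall back to the OOV id, and what 'post' truncating and padding do.

```python
from collections import Counter

def fit_vocab(texts, num_words, oov_token="<OOV>"):
    """Map words to integer ids, most frequent first; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    # keep only ids below num_words so they fit an embedding table of that size
    return {w: i for w, i in word_index.items() if i < num_words}

def to_padded_sequence(text, word_index, max_length, oov_token="<OOV>"):
    """Words -> ids (unknown words -> OOV id), then post-truncate and post-pad with 0."""
    oov_id = word_index[oov_token]
    ids = [word_index.get(word, oov_id) for word in text.lower().split()]
    ids = ids[:max_length]                       # 'post' truncating: cut the end
    return ids + [0] * (max_length - len(ids))   # 'post' padding: zeros at the end

toy_sentences = ["the cat sat", "the dog sat down"]
word_index = fit_vocab(toy_sentences, num_words=100)
padded = to_padded_sequence("the bird sat", word_index, max_length=5)

print(word_index)
print(padded)  # [2, 1, 3, 0, 0] -> "bird" was never seen, so it becomes the OOV id 1
```

The vocabulary is built from the training sentences only, which is why the same fitted tokenizer is reused for the test set.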
build the model
Some important functions and arguments:
tf.keras.layers.Dense # A densely connected NN layer. It implements the operation: output = activation(dot(input, kernel) + bias)
activation # Activation function to use. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
use_bias # Boolean, whether the layer uses a bias vector.
tf.keras.Sequential # Groups a linear stack of layers into a tf.keras.Model.
tf.keras.Model # Groups layers into an object that can be trained and used for prediction.
Configure the model with losses and metrics using model.compile(args):
optimizer # The optimizer to use, for example Adam, RMSprop, SGD or Adagrad.
loss # The loss function. If the model has multiple outputs, the loss value that will be minimized by the model will then be the sum of all individual losses.
metrics # List of metrics to be evaluated by the model during training and testing.
Train the model with model.fit(x=None, y=None):
batch_size # Number of samples per gradient update. If unspecified, batch_size will default to 32.
epochs # Number of epochs to train the model.
verbose # Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. verbose=2 is recommended when not running interactively.
validation_data # A tuple (valid_X, valid_y) on which the loss and metrics are evaluated at the end of each epoch.
tf.keras.layers.Embedding # Turns positive integers (indexes) into dense vectors of fixed size, as shown in the following figure.
The purpose of the embedding is to map each 1-dim integer index to a multi-dim vector, so that the vectors of a sentence can be added together. The model can then find hidden features and connect them to the labels it predicts. In this program, the sentiment direction of every word is adjusted many times during training.
tf.keras.layers.GlobalAveragePooling1D # Averages all the multi-dim vectors over the sequence dimension. If the input shape is (32, 10, 64), the shape after pooling is (32, 64), as shown in the following figure.
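The three layer types described above can be sketched for a single sentence in plain Python. This is only an illustration of the arithmetic, not how Keras implements the layers, and the weights below are toy values picked by hand (a real model learns them during training):

```python
import math

def embed(token_ids, embedding_table):
    """Look up one vector per token id: (seq_len,) -> (seq_len, dim)."""
    return [embedding_table[i] for i in token_ids]

def global_average_pool(vectors):
    """Average over the sequence axis: (seq_len, dim) -> (dim,)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dense(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias) for one input vector."""
    return [
        activation(sum(x * w for x, w in zip(inputs, column)) + b)
        for column, b in zip(zip(*kernel), bias)
    ]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy embedding table: 3 token ids, 2 dimensions each.
embedding_table = [
    [0.0, 0.0],  # id 0
    [1.0, 2.0],  # id 1
    [3.0, 4.0],  # id 2
]
kernel = [[0.5], [-0.5]]  # shape (2, 1): 2 inputs, 1 output unit
bias = [0.0]

vectors = embed([1, 2], embedding_table)      # [[1.0, 2.0], [3.0, 4.0]]
pooled = global_average_pool(vectors)         # [2.0, 3.0]
score = dense(pooled, kernel, bias, sigmoid)  # [sigmoid(2.0*0.5 + 3.0*(-0.5))]

print(pooled)
print(score)
```

Whatever the sentence length, pooling always produces one vector of size dim, which is what lets a fixed-size Dense layer follow it.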
The code is simpler than the theory:
# build the model
import tensorflow as tf
model = tf.keras.Sequential(
[
# turn each word index into an embedding_dim-dimensional (16-dim) vector
tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
# average all the word vectors of a sentence
tf.keras.layers.GlobalAveragePooling1D(),
# fully connected layers
tf.keras.layers.Dense(24, activation = 'relu'),
tf.keras.layers.Dense(1, activation = 'sigmoid')
]
)
model.compile(loss = 'binary_crossentropy', optimizer = 'adam' , metrics = ['accuracy'])
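Because the last Dense layer uses a sigmoid, the model outputs a probability p that the headline is sarcastic, and binary_crossentropy compares p with the 0/1 label. A minimal sketch of that loss for a single example, in plain Python (Keras additionally clips the probabilities for numerical safety and averages over the batch, so treat this only as the core formula):

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(binary_crossentropy(1, 0.9))  # confident and correct: small loss
print(binary_crossentropy(1, 0.1))  # confident and wrong: large loss
print(binary_crossentropy(0, 0.5))  # undecided: log(2)
```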
train the model
num_epochs = 30
history = model.fit(train_X, train_y, epochs = num_epochs,
validation_data = (test_X, test_y),
verbose = 2)
After 30 epochs:
Epoch 30/30
597/597 - 8s - loss: 1.8816e-04 - accuracy: 1.0000 - val_loss: 1.2858 - val_accuracy: 0.8216
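The 597 at the start of the log line is the number of batches per epoch. Assuming the default batch_size of 32 and the 2/3 split used earlier, it can be reproduced with a little arithmetic:

```python
import math

total = 28619                    # number of headlines, from len(labels)
train_size = int(total / 3 * 2)  # same formula as in the split step
batch_size = 32                  # default of model.fit when unspecified
steps_per_epoch = math.ceil(train_size / batch_size)

print(train_size)       # 19079
print(steps_per_epoch)  # 597
```

The same log line shows a training accuracy of 1.0000 while val_accuracy stays at 0.8216 and val_loss is far above the training loss, which suggests the model is overfitting the train set by epoch 30.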
predict our sentence
mytest_sentence = ["you are so cute", "you are so cute but looks like stupid"]
mytest_X = tokenizer.texts_to_sequences(mytest_sentence)
mytest_X = pad_sequences(mytest_X ,
maxlen = max_length,
truncating = trunc_type,
padding = padding_type)
mytest_y = model.predict(mytest_X)
# if the result is greater than 0.5, the headline is classified as sarcastic
print(mytest_y > 0.5)
'''
[[False]
[ True]]
'''
reference:
tensorflow API: https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
colab: bit.ly/tfw-sarcembed