Amazon评论数据的预处理代码,用于情感分析,代码改自

https://github.com/PaddlePaddle/Paddle/tree/develop/demo/quick_start/data

Amazon商品评论数据网址:

http://jmcauley.ucsd.edu/data/amazon/

Bash脚本文件

get_data.sh:

#!/bin/bash

# 1. size of pos : neg = 1:1.
# 2. size of testing set = min(25k, len(all_data) * 0.1), others is traning set.
# 3. distinct train set and test set. set -e # Download data
echo "Downloading Amazon Electronics reviews data..."
# http://jmcauley.ucsd.edu/data/amazon/
# wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
# wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz
echo "Downloading mosesdecoder..."
# https://github.com/moses-smt/mosesdecoder
# wget https://github.com/moses-smt/mosesdecoder/archive/master.zip # unzip master.zip
# rm master.zip ##################
# Preprocess data
echo "Preprocess data..."
export LC_ALL=C
UNAME_STR=`uname` if [ ${UNAME_STR} == 'Linux' ]; then
SHUF_PROG='shuf'
else
SHUF_PROG='gshuf'
fi mkdir -p tmp
# python preprocess.py -i reviews_Electronics_5.json.gz
python preprocess.py -i reviews_Digital_Music_5.json.gz
# uniq and shuffle
cd tmp
echo 'Uniq and shuffle...'
cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed min_len=`sed -n '$=' neg.shuffed`
echo `sed -n '$=' neg.shuffed`
test_num=$((min_len/10))
if [ $test_num -gt 12500 ];then
test_num=12500
fi
train_num=$((min_len-test_num)) head -n$train_num pos.shuffed >train.pos
head -n$train_num neg.shuffed >train.neg
tail -n$test_num pos.shuffed >test.pos
tail -n$test_num neg.shuffed >test.neg cat train.pos train.neg | ${SHUF_PROG} >../train.txt
cat test.pos test.neg | ${SHUF_PROG} >../test.txt cd -
echo 'train.txt' > train.list
echo 'test.txt' > test.list # use 30k dict
# rm -rf tmp
mv dict.txt dict_all.txt
cat dict_all.txt | head -n 30001 > dict.txt
echo 'Done.'

数据处理文件:preprocess.py:

# -*- coding: UTF-8 -*-

"""
1. Tokenize the words and punctuation
Usage:
python preprocess.py -i data_file [random seed]
""" import sys
import os
import operator
import gzip
from subprocess import Popen, PIPE
from optparse import OptionParser
import json
from multiprocessing import Queue
from multiprocessing import Pool
import multiprocessing batch_size = 5000
word_count = {}
num_tokenize = max(1,
multiprocessing.cpu_count() - 2) # parse + tokenize + save
max_queue_size = 8
parse_queue = Queue(maxsize=max_queue_size + num_tokenize)
tokenize_queue = Queue(maxsize=max_queue_size + num_tokenize) def create_dict(data):
"""
Create dictionary based on data, and saved in data_dir/dict.txt.
The first line is unk \t -1.
data: list, input data by batch.
"""
for seq in data:
try:
for w in seq.lower().split():
if w not in word_count:
word_count[w] = 1
else:
word_count[w] += 1
except:
sys.stderr.write(seq + "\tERROR\n") def parse(path):
"""
Open .gz file.
"""
sys.stderr.write(path)
g = gzip.open(path, 'r')
for l in g:
yield json.loads(l)
g.close() def tokenize(sentences):
"""
Use tokenizer.perl to tokenize input sentences.
tokenizer.perl is tool of Moses.
sentences : a list of input sentences.
return: a list of processed text.
"""
dir = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
if not os.path.exists(dir):
sys.exit(
"The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exists."
)
tokenizer_cmd = [dir, '-l', 'en', '-q', '-']
assert isinstance(sentences, list)
text = "\n".join(sentences)
tokenizer = Popen(tokenizer_cmd, stdin=PIPE, stdout=PIPE)
tok_text, _ = tokenizer.communicate(text)
toks = tok_text.split('\n')[:-1]
return toks def save_data(instance, data_dir, pre_fix, batch_num):
"""
save data by batch
"""
label = ['1' if pre_fix == 'pos' else '0' for i in range(len(instance))]
lines = ['%s\t%s' % (label[i], instance[i]) for i in range(len(label))]
file_name = os.path.join(data_dir, "%s_%s.txt" % (pre_fix, batch_num))
file(file_name, 'w').write('\n'.join(lines) + '\n') def tokenize_batch(id):
"""
tokenize data by batch
"""
while True:
num_batch, instance, pre_fix = parse_queue.get()
if num_batch == -1: ### parse_queue finished
tokenize_queue.put((-1, None, None))
sys.stderr.write("Thread %s finish\n" % (id))
break
tokenize_instance = tokenize(instance)
tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
sys.stderr.write('.') def save_batch(data_dir, num_tokenize, data_dir_dict):
"""
save data by batch
build dict.txt
"""
token_count = 0
while True:
num_batch, instance, pre_fix = tokenize_queue.get()
if num_batch == -1:
token_count += 1
if token_count == num_tokenize: #### tokenize finished.
break
else:
continue
save_data(instance, data_dir, pre_fix, num_batch)
create_dict(instance) ## update dict sys.stderr.write("save file finish\n")
f = open(data_dir_dict, 'w')
f.write('%s\t%s\n' % ('unk', '-1'))
for k, v in sorted(word_count.items(), key=operator.itemgetter(1), \
reverse=True):
f.write('%s\t%s\n' % (k, v))
f.close()
sys.stderr.write("build dict finish\n") def parse_batch(data, num_tokenize):
"""
parse data by batch
parse -> tokenize -> save
"""
raw_txt = parse(data)
neg, pos = [], []
count = 0
sys.stderr.write("extract raw data\n")
for l in raw_txt:
rating = l["overall"]
text = l["reviewText"].lower() # # convert words to lower case
if rating == 5.0 and text:
pos.append(text)
if rating < 3.0 and text:
neg.append(text)
if len(pos) == batch_size or len(neg) == batch_size:
if len(pos) == batch_size:
batch = pos
pre_fix = 'pos'
else:
batch = neg
pre_fix = 'neg' parse_queue.put((count, batch, pre_fix))
count += 1
if pre_fix == 'pos':
pos = []
else:
neg = [] if len(pos) > 0:
parse_queue.put((count, pos, 'pos'))
count += 1
if len(neg) > 0:
parse_queue.put((count, neg, 'neg'))
count += 1
for i in range(num_tokenize):
parse_queue.put((-1, None, None)) #### for tokenize's input finished
sys.stderr.write("parsing finish\n") def option_parser():
parser = OptionParser(usage="usage: python preprcoess.py "\
"-i data_path [options]")
parser.add_option(
"-i", "--data", action="store", dest="input", help="Input data path.")
parser.add_option(
"-s",
"--seed",
action="store",
dest="seed",
default=1024,
help="Set random seed.")
return parser.parse_args() def main():
reload(sys)
sys.setdefaultencoding('utf-8')
options, args = option_parser()
data = options.input
seed = options.seed
data_dir_dict = os.path.join(os.path.dirname(data), 'dict.txt')
data_dir = os.path.join(os.path.dirname(data), 'tmp')
pool = Pool(processes=num_tokenize + 2)
pool.apply_async(parse_batch, args=(data, num_tokenize))
for i in range(num_tokenize):
pool.apply_async(tokenize_batch, args=(str(i), ))
pool.apply_async(save_batch, args=(data_dir, num_tokenize, data_dir_dict))
pool.close()
pool.join() file(os.path.join(os.path.dirname(data), 'labels.list'),
'w').write('neg\t0\npos\t1\n') if __name__ == '__main__':
main()

Amazon评论数据的预处理代码(Positive & Negative)的更多相关文章

  1. 1294 - Positive Negative Sign(规律)

    1294 - Positive Negative Sign   PDF (English) Statistics Forum Time Limit: 2 second(s) Memory Limit: ...

  2. light oj 1294 - Positive Negative Sign

    1294 - Positive Negative Sign   PDF (English) Statistics Forum Time Limit: 2 second(s) Memory Limit: ...

  3. js小功能合集:计算指定时间距今多久、评论树核心代码、字符串替换和去除。

    1.计算指定时间距今多久 var date1=new Date('2017/02/08 17:00'); //开始时间 var date2=new Date(); //当前时间 var date3=d ...

  4. 阳/阴性预测值Positive/negative Predictive Value(推荐AA)

    sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频教程) https://study.163.com/course/introduction.htm?courseId=1005269003&am ...

  5. 智课雅思短语---二、exert positive/ negative effects on…

    智课雅思短语---二.exert positive/ negative effects on… 一.总结 一句话总结:对…产生有利/不利的影响 1.the advantages far outweig ...

  6. LightOJ - 1294 - Positive Negative Sign(规律)

    链接: https://vjudge.net/problem/LightOJ-1294 题意: Given two integers: n and m and n is divisible by 2m ...

  7. 零宽断言 -- Lookahead/Lookahead Positive/Negative

    http://www.vaikan.com/regular-expression-to-match-string-not-containing-a-word/ 经常我们会遇到想找出不包含某个字符串的文 ...

  8. ACM 中 矩阵数据的预处理 && 求子矩阵元素和问题

            我们考虑一个$N\times M$的矩阵数据,若要对矩阵中的部分数据进行读取,比如求某个$a\times b$的子矩阵的元素和,通常我们可以想到$O(ab)$的遍历那个子矩阵,对它的各 ...

  9. 利用 pandas 进行数据的预处理——离散数据哑编码、连续数据标准化

    数据的标准化 数据标准化就是将不同取值范围的数据,在保留各自数据相对大小顺序不变的情况下,整体映射到一个固定的区间中.根据具体的实现方法不同,有的时候会映射到 [ 0 ,1 ],有时映射到 0 附近的 ...

随机推荐

  1. git 远程库命令

    git 常用命令在这里就不在说了,初学者点击http://www.cnblogs.com/Vdiao/p/5267250.html Git是分布式版本控制系统,同一个Git仓库,可以分布到不同的机器上 ...

  2. 采用重写tostring方法使ComboBox显示对象属性

    当ComboBox中添加的是对象集合的时候,如果运行就会发现显示是的命令空间.类名,而如果我们想显示对象属性名的时候,我们就可以在对象类中重写object基类中的tostring方法.

  3. angularjs的三目运算

    前言:前几天写代码的时候遇到一个问题,有一个按钮,有"已关注"和"+关注"两种状态,需要对这两种状态的按钮的背景颜色进行区分,单后点击"已关注&quo ...

  4. MongoDB 可视化工具RoboMongo --- windows

    去官网下载安装包https://robomongo.org/download随便找一个目录进行安装(当然不要在c盘,和mongo安装路径无关) 安装完成后,启动MongoDB MongoDB的安装和使 ...

  5. easyui from 缓存问题处理

    1 这是ie低版本,缓存了easyui form load事件获取的服务器端数据,给ajax时间加上清除缓存就ok. 找到easyui 中的form load事件  添加cache:false, /* ...

  6. JAVA基础知识之Annotation

    基本Annotation Annotation必须使用工具(APT, Annotation tool)才能处理,Annotation可以在编译,类加载,运行时被读取,并执行相应处理. 下面介绍一些常用 ...

  7. oracle杀用户建用户改密码脚本

    # ******************************** # * dba_oracle_awr.sh # ******************************** # Usage: ...

  8. 2017年1月4日 星期三 --出埃及记 Exodus 21:30

    2017年1月4日 星期三 --出埃及记 Exodus 21:30 However, if payment is demanded of him, he may redeem his life by ...

  9. 如何在win7上安装ant-design

    1.首先要安装务必确认 Node.js 已经升级到 v4.x 或以上. 2.打开cmd,输入"npm install antd-init -g",安装antd(可以自己先指定安装目 ...

  10. NPOI分层导出

    using NPOI.HSSF.UserModel; using NPOI.POIFS.FileSystem; using org.in2bits.MyXls; using System; using ...