Preprocessing Code for Amazon Review Data (Positive & Negative)

Preprocessing code for Amazon review data, intended for sentiment analysis. The code is adapted from:

https://github.com/PaddlePaddle/Paddle/tree/develop/demo/quick_start/data

The Amazon product review data is available at:

http://jmcauley.ucsd.edu/data/amazon/

Bash script

get_data.sh:

#!/bin/bash
# 1. Size of pos : neg = 1 : 1.
# 2. Size of the test set = min(25k, len(all_data) * 0.1); the rest is the training set.
# 3. The training set and the test set are disjoint.
set -e

# Download data (the wget lines are commented out; uncomment one to fetch a dataset)
echo "Downloading Amazon Electronics reviews data..."
# http://jmcauley.ucsd.edu/data/amazon/
# wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
# wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz

echo "Downloading mosesdecoder..."
# https://github.com/moses-smt/mosesdecoder
# wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
# unzip master.zip
# rm master.zip

##################
# Preprocess data
echo "Preprocess data..."
export LC_ALL=C
UNAME_STR=$(uname)

# Use shuf on Linux and gshuf elsewhere (e.g. GNU coreutils on macOS)
if [ "${UNAME_STR}" = 'Linux' ]; then
  SHUF_PROG='shuf'
else
  SHUF_PROG='gshuf'
fi

mkdir -p tmp
# python preprocess.py -i reviews_Electronics_5.json.gz
python preprocess.py -i reviews_Digital_Music_5.json.gz

# uniq and shuffle
cd tmp
echo 'Uniq and shuffle...'
cat pos_* | sort | uniq | ${SHUF_PROG} > pos.shuffed
cat neg_* | sort | uniq | ${SHUF_PROG} > neg.shuffed

# Number of lines in neg.shuffed; both classes are sized from the negative set
min_len=$(sed -n '$=' neg.shuffed)
echo "${min_len}"

# Test set: 10% of each class, capped at 12500 per class (25k in total)
test_num=$((min_len / 10))
if [ "$test_num" -gt 12500 ]; then
  test_num=12500
fi
train_num=$((min_len - test_num))

# Training data comes from the head and test data from the tail of each file
head -n "$train_num" pos.shuffed > train.pos
head -n "$train_num" neg.shuffed > train.neg
tail -n "$test_num" pos.shuffed > test.pos
tail -n "$test_num" neg.shuffed > test.neg

cat train.pos train.neg | ${SHUF_PROG} > ../train.txt
cat test.pos test.neg | ${SHUF_PROG} > ../test.txt
cd -

echo 'train.txt' > train.list
echo 'test.txt' > test.list

# Use a 30k dictionary: keep the leading 'unk' line plus the 30000 most frequent words
# rm -rf tmp
mv dict.txt dict_all.txt
head -n 30001 dict_all.txt > dict.txt
echo 'Done.'
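
The split rule stated in the comments at the top of get_data.sh can also be sketched in Python. This is only an illustration under stated assumptions, not part of the original pipeline: the function name `split_balanced` and its parameters are hypothetical, and it sizes both classes from the smaller one, whereas the shell script reads the size from neg.shuffed only.

```python
import random


def split_balanced(pos, neg, max_test_per_class=12500, seed=1024):
    """Return (train, test) lists of (label, text) pairs with pos : neg = 1 : 1.

    Hypothetical helper mirroring the head/tail split in get_data.sh.
    """
    rng = random.Random(seed)

    # De-duplicate and shuffle each class, like `sort | uniq | shuf`.
    pos = sorted(set(pos))
    neg = sorted(set(neg))
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Assumption: size both classes from the smaller one.
    min_len = min(len(pos), len(neg))
    test_num = min(min_len // 10, max_test_per_class)
    train_num = min_len - test_num

    # Training data comes from the head, test data from the tail,
    # so the two sets never share an item.
    train = [("1", t) for t in pos[:train_num]]
    train += [("0", t) for t in neg[:train_num]]
    test = [("1", t) for t in pos[len(pos) - test_num:]]
    test += [("0", t) for t in neg[len(neg) - test_num:]]

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


pos_reviews = ["great product %d" % i for i in range(100)]
neg_reviews = ["bad product %d" % i for i in range(40)]
train, test = split_balanced(pos_reviews, neg_reviews)
print(len(train), len(test))  # 72 8
```

With 100 positive and 40 negative reviews, each class contributes 36 training and 4 test items, so the excess positive reviews are simply left unused.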

Data processing script: preprocess.py:

# -*- coding: UTF-8 -*-
"""
1. Tokenize the words and punctuation
Usage:
    python preprocess.py -i data_file [-s random_seed]

Note: this script targets Python 2 (it relies on reload(sys) and
sys.setdefaultencoding).
"""
import sys
import os
import operator
import gzip
from subprocess import Popen, PIPE
from optparse import OptionParser
import json
from multiprocessing import Queue
from multiprocessing import Pool
import multiprocessing

batch_size = 5000
word_count = {}
num_tokenize = max(1,
                   multiprocessing.cpu_count() - 2)  # parse + tokenize + save
max_queue_size = 8
parse_queue = Queue(maxsize=max_queue_size + num_tokenize)
tokenize_queue = Queue(maxsize=max_queue_size + num_tokenize)

def create_dict(data):
    """
    Update the global word counts from one batch of data. The counts are
    later saved in data_dir/dict.txt, whose first line is unk \t -1.
    data: list, input data by batch.
    """
    for seq in data:
        try:
            for w in seq.lower().split():
                if w not in word_count:
                    word_count[w] = 1
                else:
                    word_count[w] += 1
        except Exception:
            sys.stderr.write(seq + "\tERROR\n")

def parse(path):
    """
    Open .gz file and yield one parsed JSON record per line.
    """
    sys.stderr.write(path)
    g = gzip.open(path, 'r')
    for l in g:
        yield json.loads(l)
    g.close()

def tokenize(sentences):
    """
    Use tokenizer.perl to tokenize input sentences.
    tokenizer.perl is a tool from Moses.
    sentences : a list of input sentences.
    return: a list of processed text.
    """
    # Renamed from `dir`, which shadowed the built-in of the same name.
    tokenizer_path = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
    if not os.path.exists(tokenizer_path):
        sys.exit(
            "The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exist."
        )
    tokenizer_cmd = [tokenizer_path, '-l', 'en', '-q', '-']
    assert isinstance(sentences, list)
    text = "\n".join(sentences)
    tokenizer = Popen(tokenizer_cmd, stdin=PIPE, stdout=PIPE)
    tok_text, _ = tokenizer.communicate(text)
    toks = tok_text.split('\n')[:-1]
    return toks

def save_data(instance, data_dir, pre_fix, batch_num):
    """
    save data by batch
    """
    # Every line in a batch shares one label: 1 for pos, 0 for neg.
    label = '1' if pre_fix == 'pos' else '0'
    lines = ['%s\t%s' % (label, sentence) for sentence in instance]
    file_name = os.path.join(data_dir, "%s_%s.txt" % (pre_fix, batch_num))
    with open(file_name, 'w') as f:
        f.write('\n'.join(lines) + '\n')

def tokenize_batch(id):
    """
    tokenize data by batch
    """
    while True:
        num_batch, instance, pre_fix = parse_queue.get()
        if num_batch == -1:  # parse_queue finished
            tokenize_queue.put((-1, None, None))
            sys.stderr.write("Thread %s finish\n" % (id))
            break
        tokenize_instance = tokenize(instance)
        tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
        sys.stderr.write('.')

def save_batch(data_dir, num_tokenize, data_dir_dict):
    """
    save data by batch
    build dict.txt
    """
    token_count = 0
    while True:
        num_batch, instance, pre_fix = tokenize_queue.get()
        if num_batch == -1:
            token_count += 1
            if token_count == num_tokenize:  # all tokenizers finished
                break
            else:
                continue
        save_data(instance, data_dir, pre_fix, num_batch)
        create_dict(instance)  # update dict
    sys.stderr.write("save file finish\n")
    # Write 'unk' first, then the words sorted by descending frequency.
    with open(data_dir_dict, 'w') as f:
        f.write('%s\t%s\n' % ('unk', '-1'))
        for k, v in sorted(word_count.items(), key=operator.itemgetter(1),
                           reverse=True):
            f.write('%s\t%s\n' % (k, v))
    sys.stderr.write("build dict finish\n")

def parse_batch(data, num_tokenize):
    """
    parse data by batch
    parse -> tokenize -> save
    """
    raw_txt = parse(data)
    neg, pos = [], []
    count = 0
    sys.stderr.write("extract raw data\n")
    for l in raw_txt:
        rating = l["overall"]
        text = l["reviewText"].lower()  # convert words to lower case
        # 5-star reviews are positive, reviews below 3 stars are negative;
        # 3- and 4-star reviews are skipped.
        if rating == 5.0 and text:
            pos.append(text)
        if rating < 3.0 and text:
            neg.append(text)
        if len(pos) == batch_size or len(neg) == batch_size:
            if len(pos) == batch_size:
                batch = pos
                pre_fix = 'pos'
            else:
                batch = neg
                pre_fix = 'neg'
            parse_queue.put((count, batch, pre_fix))
            count += 1
            if pre_fix == 'pos':
                pos = []
            else:
                neg = []
    # Flush the last, partially filled batches.
    if len(pos) > 0:
        parse_queue.put((count, pos, 'pos'))
        count += 1
    if len(neg) > 0:
        parse_queue.put((count, neg, 'neg'))
        count += 1
    for i in range(num_tokenize):
        parse_queue.put((-1, None, None))  # signal each tokenizer that input is finished
    sys.stderr.write("parsing finish\n")

def option_parser():
    parser = OptionParser(usage="usage: python preprocess.py "
                                "-i data_path [options]")
    parser.add_option(
        "-i", "--data", action="store", dest="input", help="Input data path.")
    parser.add_option(
        "-s",
        "--seed",
        action="store",
        dest="seed",
        default=1024,
        help="Set random seed.")
    return parser.parse_args()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    options, args = option_parser()
    data = options.input
    seed = options.seed  # parsed but not used below
    data_dir_dict = os.path.join(os.path.dirname(data), 'dict.txt')
    data_dir = os.path.join(os.path.dirname(data), 'tmp')
    # One parser, num_tokenize tokenizers and one saver, connected by queues.
    pool = Pool(processes=num_tokenize + 2)
    pool.apply_async(parse_batch, args=(data, num_tokenize))
    for i in range(num_tokenize):
        pool.apply_async(tokenize_batch, args=(str(i), ))
    pool.apply_async(save_batch, args=(data_dir, num_tokenize, data_dir_dict))
    pool.close()
    pool.join()
    with open(os.path.join(os.path.dirname(data), 'labels.list'), 'w') as f:
        f.write('neg\t0\npos\t1\n')

if __name__ == '__main__':
    main()
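
For readers on Python 3, the labeling rule and the dictionary format used by preprocess.py can be sketched without the multiprocessing and Moses parts. This is a simplified sketch, not the original implementation: the helper names `label_for`, `read_labeled_reviews`, and `build_dict_lines` are hypothetical, the words are split on whitespace rather than tokenized with tokenizer.perl, and the input file is assumed to be UTF-8 encoded.

```python
import gzip
import json
from collections import Counter


def label_for(rating):
    """Map a star rating to 'pos', 'neg', or None (review is skipped)."""
    if rating == 5.0:
        return "pos"
    if rating < 3.0:
        return "neg"
    return None


def read_labeled_reviews(path):
    """Yield (label, lower-cased text) for each usable review in a .json.gz file."""
    # Assumption: one JSON object per line, UTF-8 encoded.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            label = label_for(record["overall"])
            text = record["reviewText"].lower()
            if label and text:
                yield label, text


def build_dict_lines(texts):
    """Return dict.txt lines: the 'unk' entry first, then words by descending count."""
    counts = Counter(word for text in texts for word in text.split())
    lines = ["unk\t-1"]
    lines.extend("%s\t%d" % (word, n) for word, n in counts.most_common())
    return lines


print(label_for(5.0), label_for(2.0), label_for(4.0))  # pos neg None
for line in build_dict_lines(["good good sound", "bad sound"]):
    print(line)
```

`read_labeled_reviews` could be pointed at a file such as reviews_Digital_Music_5.json.gz, and the first 30001 lines of the `build_dict_lines` result correspond to the truncated dict.txt that get_data.sh produces.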
