使用 NLTK 对文本进行清洗，索引工具

EN_WHITELIST = '0123456789abcdefghijklmnopqrstuvwxyz ' # space is included in whitelist
EN_BLACKLIST = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\''
FILENAME = 'data/chat.txt'
limit = {
        'maxq' : 20,
        'minq' : 0,
        'maxa' : 20,
        'mina' : 3
        }
UNK = 'unk'
VOCAB_SIZE = 6000
import random
import sys
import nltk
import itertools
from collections import defaultdict
import numpy as np
import pickle
def ddefault():
    return 1
'''
 read lines from file
     return [list of lines]
'''
def read_lines(filename):
    return open(filename).read().split('\n')[:-1]
'''
 split sentences in one line
  into multiple lines
    return [list of lines]
'''
def split_line(line):
    return line.split('.')
'''
 remove anything that isn't in the vocabulary
    return str(pure ta/en)
'''
def filter_line(line, whitelist):
    return ''.join([ ch for ch in line if ch in whitelist ])
'''
 read list of words, create index to word,
  word to index dictionaries
    return tuple( vocab->(word, count), idx2w, w2idx )
'''
def index_(tokenized_sentences, vocab_size):
    # get frequency distribution
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    # get vocabulary of 'vocab_size' most used words
    vocab = freq_dist.most_common(vocab_size)
    # index2word
    index2word = ['_'] + [UNK] + [ x[0] for x in vocab ]
    # word2index
    word2index = dict([(w,i) for i,w in enumerate(index2word)] )
    return index2word, word2index, freq_dist
'''
 filter too long and too short sequences
    return tuple( filtered_ta, filtered_en )
'''
def filter_data(sequences):
    filtered_q, filtered_a = [], []
    raw_data_len = len(sequences)//2
    for i in range(0, len(sequences), 2):
        qlen, alen = len(sequences[i].split(' ')), len(sequences[i+1].split(' '))
        if qlen >= limit['minq'] and qlen <= limit['maxq']:
            if alen >= limit['mina'] and alen <= limit['maxa']:
                filtered_q.append(sequences[i])
                filtered_a.append(sequences[i+1])
    # print the fraction of the original data, filtered
    filt_data_len = len(filtered_q)
    filtered = int((raw_data_len - filt_data_len)*100/raw_data_len)
    print(str(filtered) + '% filtered from original data')
    return filtered_q, filtered_a
'''
 create the final dataset :
  - convert list of items to arrays of indices
  - add zero padding
      return ( [array_en([indices]), array_ta([indices]) )
'''
def zero_pad(qtokenized, atokenized, w2idx):
    # num of rows
    data_len = len(qtokenized)
    # numpy arrays to store indices
    idx_q = np.zeros([data_len, limit['maxq']], dtype=np.int32)
    idx_a = np.zeros([data_len, limit['maxa']], dtype=np.int32)
    for i in range(data_len):
        q_indices = pad_seq(qtokenized[i], w2idx, limit['maxq'])
        a_indices = pad_seq(atokenized[i], w2idx, limit['maxa'])
        #print(len(idx_q[i]), len(q_indices))
        #print(len(idx_a[i]), len(a_indices))
        idx_q[i] = np.array(q_indices)
        idx_a[i] = np.array(a_indices)
    return idx_q, idx_a
'''
 replace words with indices in a sequence
  replace with unknown if word not in lookup
    return [list of indices]
'''
def pad_seq(seq, lookup, maxlen):
    indices = []
    for word in seq:
        if word in lookup:
            indices.append(lookup[word])
        else:
            indices.append(lookup[UNK])
    return indices + [0]*(maxlen - len(seq))
def process_data():
    print('\n>> Read lines from file')
    lines = read_lines(filename=FILENAME)
    # change to lower case (just for en)
    lines = [ line.lower() for line in lines ]
    print('\n:: Sample from read(p) lines')
    print(lines[121:125])
    # filter out unnecessary characters
    print('\n>> Filter lines')
    lines = [ filter_line(line, EN_WHITELIST) for line in lines ]
    print(lines[121:125])
    # filter out too long or too short sequences
    print('\n>> 2nd layer of filtering')
    qlines, alines = filter_data(lines)
    print('\nq : {0} ; a : {1}'.format(qlines[60], alines[60]))
    print('\nq : {0} ; a : {1}'.format(qlines[61], alines[61]))
    # convert list of [lines of text] into list of [list of words ]
    print('\n>> Segment lines into words')
    qtokenized = [ wordlist.split(' ') for wordlist in qlines ]
    atokenized = [ wordlist.split(' ') for wordlist in alines ]
    print('\n:: Sample from segmented list of words')
    print('\nq : {0} ; a : {1}'.format(qtokenized[60], atokenized[60]))
    print('\nq : {0} ; a : {1}'.format(qtokenized[61], atokenized[61]))
    # indexing -> idx2w, w2idx : en/ta
    print('\n >> Index words')
    idx2w, w2idx, freq_dist = index_( qtokenized + atokenized, vocab_size=VOCAB_SIZE)
    print('\n >> Zero Padding')
    idx_q, idx_a = zero_pad(qtokenized, atokenized, w2idx)
    print('\n >> Save numpy arrays to disk')
    # save them
    np.save('idx_q.npy', idx_q)
    np.save('idx_a.npy', idx_a)
    # let us now save the necessary dictionaries
    metadata = {
            'w2idx' : w2idx,
            'idx2w' : idx2w,
            'limit' : limit,
            'freq_dist' : freq_dist
                }
    # write to disk : data control dictionaries
    with open('metadata.pkl', 'wb') as f:
        pickle.dump(metadata, f)
def load_data(PATH=''):
    # read data control dictionaries
    with open(PATH + 'metadata.pkl', 'rb') as f:
        metadata = pickle.load(f)
    # read numpy arrays
    idx_ta = np.load(PATH + 'idx_q.npy')
    idx_en = np.load(PATH + 'idx_a.npy')
    return metadata, idx_q, idx_a
if __name__ == '__main__':
    process_data()

使用 NLTK 对文本进行清洗，索引工具的更多相关文章

【NLP】Python NLTK获取文本语料和词汇资源
Python NLTK 获取文本语料和词汇资源作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...
bash文本查看及处理工具
文本查看及处理工具: wc [OPTION] FILE... -c: 字节数 -l:行数 -w: 单词数 who | w ...
js实现去文本换行符小工具
js实现去文本换行符小工具一.总结一句话总结: 1.vertical属性使用的时候注意看清定义,也注意父元素的基准线问题.vertical-align:top; 2.获取textareaEleme ...
基于COCA词频表的文本词汇分布测试工具v0.1
美国语言协会对美国人日常使用的英语单词做了一份详细的统计,按照日常使用的频率做成了一张表,称为COCA词频表.排名越低的单词使用频率越高,该表可以用来统计词汇量. 如果你的词汇量约为6000,那么这张 ...
MySQL检查重复索引工具-pt-duplicate-key-checker
在MySQL中是允许在同一个列上创建多个索引的,示例如下: mysql --socket=/tmp/mysql5173.sock -uroot -p mysql> SELECT VERSION( ...
Linux Shell处理文本最常用的工具大盘点
导读本文将介绍Linux下使用Shell处理文本时最常用的工具:find.grep.xargs.sort.uniq.tr.cut.paste.wc.sed.awk:提供的例子和参数都是最常用和最为实 ...
NLTK和Stanford NLP两个工具的安装配置
这里安装的是两个自然语言处理工具,NLTK和Stanford NLP. 声明:笔者操作系统是Windows10,理论上Windows都可以: 版本号:NLTK 3.2 Stanford NLP 3.6 ...
谈谈开发文本转URL小工具的思路
URL提供了一种定位互联网上任意资源的手段,由于采用HTTP协议的URL能在互联网上自由传播和使用,所以能大行其道.在软件开发.测试甚至部署的环节,URL几乎可以说无处不再,其中用来定位文本的URL数 ...
nltk处理文本
nltk(Natural Language Toolkit)是处理文本的利器. 安装 pip install nltk 进入python命令行,键入nltk.download()可以下载nltk需要的 ...

随机推荐

MUI使用H5+Api调取系统相册多图选择及转base64码
伟大的哲学家曾说过"写代码,一定要翻文档" 这次我们需要用到的是调取系统相册进行多图上传,先奉上html5+api关于系统相册的文档链接链接:HTML5+ API Referenc ...
vue+element tree(树形控件)组件(1)
最近做了第一个组内可以使用的组件,虽然是最简版,也废了不少力.各位前辈帮我解决问题,才勉强搞定.让我来记录这个树形组件的编写过程和期间用到的知识点. 首先说说需求,就是点击出现弹窗+蒙板,弹窗内容是一 ...
JS基础入门篇（十八）—日期对象
1.日期对象日期对象: 通过new Date()就能创建一个日期对象,这个对象中有当前系统时间的所有详细信息. 以下代码可以获取当前时间: <script> var t = new Da ...
Windows下利用virtualenvwrapper指定python版本创建虚拟环境
默认已安装virtualenvwrapper 一.添加环境变量(可选) 在系统环境变量中添加 WORKON_HOME ,用来指定新建的虚拟环境的存储位置,如过未添加,默认位置为 %USERPROFIL ...
agent判断用户请求设备
linux命令行界面如何安装图形化界面
linux命令行界面如何安装图形化界面目录问题描述解决方案安装包测试是否安装成功如何卸载图形化界面遭遇问题问题描述当我们在安装Linux系统时,我们一开始可能安装的是非图形界面的系统 ...
什么是FHS，Linux的文件系统目录标准是怎样的
Filesystem Hierarchy Standard(文件系统目录标准)的缩写,多数Linux版本采用这种文件组织形式,类似于Windows操作系统中c盘的文件目录,FHS采用树形结构组织文件. ...
Fortify Audit Workbench 笔记 SQL Injection SQL注入
SQL Injection SQL注入 Abstract 通过不可信来源的输入构建动态 SQL 指令,攻击者就能够修改指令的含义或者执行任意 SQL 命令. Explanation SQL injec ...
JVM01——JVM内存区域的构成
从本文开始将为各位带来JVM方面的知识点,关注我的公众号「Java面典」了解更多Java相关知识点. JVM内存主要分为三部分线程私有(Thread Local).线程共享(Thread Shared ...

使用 NLTK 对文本进行清洗，索引工具

使用 NLTK 对文本进行清洗，索引工具的更多相关文章

随机推荐

热门专题