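I made a few small changes here (thanks to the Google engineers for their help) so that the same code handles both dense data and sparse input such as text-classification features. I plan to keep studying and optimizing it, for example by adding multi-threaded processing.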

Sparse input is handled mainly with embedding_lookup_sparse; see:

https://github.com/tensorflow/tensorflow/issues/342
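To make the idea concrete: with combiner="sum", embedding_lookup_sparse returns, for each input row, the weighted sum of the rows of the weight matrix selected by that row's feature ids. That is the same result a dense X * w would give if every missing feature were zero. The following is a plain-Python sketch of that arithmetic, not TensorFlow code; the function names and the toy data are made up for illustration.

```python
def sparse_matmul_sum(rows, w):
    """Weighted sum of selected rows of w, one output vector per input row.

    rows: list of rows, each a list of (feature_id, weight) pairs.
    w:    dense weight matrix as nested lists, shape [num_features][num_outputs].
    """
    num_outputs = len(w[0])
    out = []
    for row in rows:
        acc = [0.0] * num_outputs
        for feature_id, weight in row:
            for j in range(num_outputs):
                acc[j] += weight * w[feature_id][j]
        out.append(acc)
    return out


def dense_matmul(x, w):
    """Ordinary matrix product of nested lists, for comparison."""
    return [[sum(x_i[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for x_i in x]


# Toy weight matrix: 4 features, 2 outputs.
w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# The same two instances, once as (id, weight) pairs and once as dense rows.
sparse_rows = [[(0, 0.5), (3, 2.0)], [(2, 1.0)]]
dense_rows = [[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 1.0, 0.0]]

sparse_out = sparse_matmul_sum(sparse_rows, w)
dense_out = dense_matmul(dense_rows, w)
print(sparse_out)  # [[14.5, 17.0], [5.0, 6.0]]
print(sparse_out == dense_out)  # True
```

The sparse version only touches the non-zero features of each row, which is what makes a 4.7-million-column input practical.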

There are two files:

melt.py

binary_classification.py

The code and data have been uploaded to https://github.com/chenghuige/tensorflow-example . For the sparse handling, start with sparse_tensor.py.

Run:

python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method mlp --num_epochs 1000

... loading dataset: corpus/feature.trate.0_2.normed.txt

0

10000

20000

30000

40000

50000

60000

70000

finish loading train set corpus/feature.trate.0_2.normed.txt

... loading dataset: corpus/feature.trate.1_2.normed.txt

0

10000

finish loading test set corpus/feature.trate.1_2.normed.txt

num_features: 4762348

trainSet size: 70968

testSet size: 17742

batch_size: 200 learning_rate: 0.001 num_epochs: 1000

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

0 auc: 0.503701159392 cost: 0.69074464019

1 auc: 0.574863035489 cost: 0.600787888115

2 auc: 0.615858601208 cost: 0.60036152958

3 auc: 0.641573172518 cost: 0.599917832685

4 auc: 0.657326531323 cost: 0.599433459447

5 auc: 0.666575623414 cost: 0.598856064529

6 auc: 0.671990014639 cost: 0.598072590816

7 auc: 0.675956442936 cost: 0.596850153855

8 auc: 0.681129512174 cost: 0.594744671454

9 auc: 0.689568680575 cost: 0.591011970184

10 auc: 0.70265083004 cost: 0.584730529957

11 auc: 0.720751242654 cost: 0.575319047846

12 auc: 0.740525668112 cost: 0.563041782476

13 auc: 0.756397606412 cost: 0.548790696159

14 auc: 0.76745782664 cost: 0.533633556673

15 auc: 0.776115284883 cost: 0.518648754985

16 auc: 0.783683301767 cost: 0.504702218341

17 auc: 0.79058754946 cost: 0.492255532423

18 auc: 0.796831772334 cost: 0.481419827863

19 auc: 0.802349672543 cost: 0.472143309749

20 auc: 0.807102186144 cost: 0.464346827091

21 auc: 0.811092646634 cost: 0.457953127862

22 auc: 0.814318813594 cost: 0.452874061637

23 auc: 0.816884839449 cost: 0.449003176388

24 auc: 0.818881302313 cost: 0.446225956373

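Judging from these results, a simple MLP easily beats LinearSVM: its test AUC reaches 0.819 after 24 epochs, versus 0.798 for the LinearSVM run logged below.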
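For reference, the MLP behind these numbers (the Mlp class in binary_classification.py) is one sigmoid hidden layer followed by a linear output, trained with sigmoid cross-entropy on the raw output (the logit). Below is a minimal plain-Python sketch of that forward pass and loss for a single dense instance. It is written for Python 3, the helper names and toy weights are made up for illustration, and the loss uses the usual numerically stable rewriting rather than anything taken from the author's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_logit(x, w_h, w_o):
    """x: dense feature vector; w_h: [num_features][hidden]; w_o: [hidden]."""
    hidden = [sigmoid(sum(x[k] * w_h[k][j] for k in range(len(x))))
              for j in range(len(w_o))]
    return sum(h * w for h, w in zip(hidden, w_o))


def sigmoid_cross_entropy(logit, label):
    """Stable form of -label*log(p) - (1-label)*log(1-p), with p = sigmoid(logit)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Toy instance with 3 features and a hidden layer of size 2.
x = [1.0, 0.0, 2.0]
w_h = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_o = [1.5, -1.0]

logit = mlp_logit(x, w_h, w_o)
loss = sigmoid_cross_entropy(logit, 1.0)
print(logit, sigmoid(logit), loss)
```

The script sums this loss over the batch and divides by the number of test instances when printing, so the logged cost is a per-instance average.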

mlt feature.trate.0_2.normed.txt -c tt -test feature.trate.1_2.normed.txt --iter 1000000

I1130 20:03:36.485967 18502 Melt.h:59] _cmd.randSeed --- [4281910087]

I1130 20:03:36.486151 18502 Melt.h:1209] omp_get_num_procs() --- [24]

I1130 20:03:36.486706 18502 Melt.h:1221] get_num_threads() --- [22]

I1130 20:03:36.486742 18502 Melt.h:1224] commandStr --- [tt]

I1130 20:03:36.486760 18502 time_util.h:102] TrainTest! started

I1130 20:03:36.486789 18502 time_util.h:102] ParseInputDataFile started

I1130 20:03:36.785362 18502 time_util.h:113] ParseInputDataFile finished using: [298.557 ms] (0.298551 s)

I1130 20:03:36.785481 18502 TrainerFactory.cpp:99] Creating LinearSVM trainer

I1130 20:03:36.785524 18502 time_util.h:102] Train started

MinMaxNormalizer prepare [ 70968 ] (0.193283 s)100% |******************************************|

I1130 20:03:37.064959 18502 time_util.h:102] Normalize started

I1130 20:03:37.096940 18502 time_util.h:113] Normalize finished using: [31.945 ms] (0.031939 s)

LinearSVM training [ 1000000 ] (1.14643 s)100% |******************************************|

Sigmoid/PlattCalibrator calibrating [ 70968 ] (0.139669 s)100% |******************************************|

I1130 20:03:38.383231 18502 Trainer.h:65] Param: [numIterations:1000000 learningRate:0.001 trainerTyper:peagsos loopType:stochastic sampleSize:1 performProjection:0 ]

I1130 20:03:38.457448 18502 time_util.h:113] Train finished using: [1671.9 ms] (1.6719 s)

I1130 20:03:38.506352 18502 time_util.h:102] ParseInputDataFile started

I1130 20:03:38.579484 18502 time_util.h:113] ParseInputDataFile finished using: [73.094 ms] (0.073092 s)

I1130 20:03:38.579563 18502 Melt.h:603] Test feature.trate.1_2.normed.txt and writting instance predict file to ./result/0.inst.txt

TEST POSITIVE RATIO:        0.2876 (5103/(5103+12639))

Confusion table:
         ||===============================||
         ||           PREDICTED           ||
  TRUTH  ||    positive    |   negative   || RECALL
         ||===============================||
 positive||      3195      |     1908     || 0.6261 (3195/5103)
 negative||      2137      |    10502     || 0.8309 (10502/12639)
         ||===============================||
 PRECISION 0.5992 (3195/5332) 0.8463(10502/12410)

LOG-LOSS/instance:                0.4843

LOG-LOSS-PROB/instance:                0.6256

TEST-SET ENTROPY (prior LL/in):        0.6000

LOG-LOSS REDUCTION (RIG):        -4.2637%

OVERALL 0/1 ACCURACY:        0.7720 (13697/17742)

POS.PRECISION:                0.5992

POS.RECALL:                0.6261

NEG.PRECISION:                0.8463

NEG.RECALL:                0.8309

F1.SCORE:                 0.6124

OuputAUC: 0.7984

AUC: [0.7984]

----------------------------------------------------------------------------------------

I1130 20:03:38.729507 18502 time_util.h:113] TrainTest! finished using: [2242.72 ms] (2.24272 s)
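The summary metrics in the log above follow directly from the four counts in the confusion table. As a quick check, this short Python 3 sketch recomputes them; the variable names are mine, the counts are copied from the log.

```python
# Counts from the confusion table (rows are truth, columns are prediction).
tp, fn = 3195, 1908    # truth positive: predicted positive / negative
fp, tn = 2137, 10502   # truth negative: predicted positive / negative

precision = tp / (tp + fp)                           # 3195 / 5332
recall = tp / (tp + fn)                              # 3195 / 5103
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)           # 13697 / 17742

print("precision %.4f recall %.4f f1 %.4f accuracy %.4f"
      % (precision, recall, f1, accuracy))
# precision 0.5992 recall 0.6261 f1 0.6124 accuracy 0.7720
```

These match POS.PRECISION, POS.RECALL, F1.SCORE and OVERALL 0/1 ACCURACY in the log. AUC cannot be recomputed this way because it needs the per-instance scores rather than the thresholded counts.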

#---------------------melt.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file melt.py
# \author chenghuige
# \date 2015-11-30 13:40:19.506009
# \Description
# ==============================================================================

import numpy as np
import os

#---------------------------melt load data
#Now support melt dense and sparse input file format, for sparse input no header
#for dense input will ignore header
#also support libsvm format @TODO
def guess_file_format(line):
    is_dense = True
    has_header = False
    if line.startswith('#'):
        has_header = True
        return is_dense, has_header
    elif line.find(':') > 0:
        is_dense = False
    return is_dense, has_header

def guess_label_index(line):
    label_idx = 0
    if line.startswith('_'):
        label_idx = 1
    return label_idx

#@TODO implement [a:b] so we can use [a:b] in application code
class Features(object):
    def __init__(self):
        self.data = []

    def mini_batch(self, start, end):
        return self.data[start: end]

    def full_batch(self):
        return self.data

class SparseFeatures(object):
    def __init__(self):
        self.sp_indices = []
        self.start_indices = [0]
        self.sp_ids_val = []
        self.sp_weights_val = []
        self.sp_shape = None

    def mini_batch(self, start, end):
        batch = SparseFeatures()
        start_ = self.start_indices[start]
        end_ = self.start_indices[end]
        batch.sp_ids_val = self.sp_ids_val[start_: end_]
        batch.sp_weights_val = self.sp_weights_val[start_: end_]
        row_idx = 0
        max_len = 0
        #@TODO better way to construct sp_indices for each mini batch ?
        for i in xrange(start + 1, end + 1):
            len_ = self.start_indices[i] - self.start_indices[i - 1]
            if len_ > max_len:
                max_len = len_
            for j in xrange(len_):
                batch.sp_indices.append([i - start - 1, j])
            row_idx += 1
        batch.sp_shape = [end - start, max_len]
        return batch

    def full_batch(self):
        if len(self.sp_indices) == 0:
            row_idx = 0
            max_len = 0
            for i in xrange(1, len(self.start_indices)):
                len_ = self.start_indices[i] - self.start_indices[i - 1]
                if len_ > max_len:
                    max_len = len_
                for j in xrange(len_):
                    self.sp_indices.append([i - 1, j])
                row_idx += 1
            self.sp_shape = [len(self.start_indices) - 1, max_len]
        return self

class DataSet(object):
    def __init__(self):
        self.labels = []
        self.features = None
        self.num_features = 0

    def num_instances(self):
        return len(self.labels)

    def full_batch(self):
        return self.features.full_batch(), self.labels

    def mini_batch(self, start, end):
        # a negative end counts back from the last instance
        if end < 0:
            end = self.num_instances() + end
        return self.features.mini_batch(start, end), self.labels[start: end]

def load_dense_dataset(lines):
    dataset_x = []
    dataset_y = []
    nrows = 0
    label_idx = guess_label_index(lines[0])
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        dataset_x.append([float(x) for x in l[label_idx + 1:]])
    dataset_x = np.array(dataset_x)
    dataset_y = np.array(dataset_y)
    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = dataset_x.shape[1]
    features = Features()
    features.data = dataset_x
    dataset.features = features
    return dataset

def load_sparse_dataset(lines):
    dataset_x = []
    dataset_y = []
    label_idx = guess_label_index(lines[0])
    num_features = int(lines[0].split()[label_idx + 1])
    features = SparseFeatures()
    nrows = 0
    start_idx = 0
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        start_idx += (len(l) - label_idx - 2)
        features.start_indices.append(start_idx)
        for item in l[label_idx + 2:]:
            id, val = item.split(':')
            features.sp_ids_val.append(int(id))
            features.sp_weights_val.append(float(val))
    dataset_y = np.array(dataset_y)
    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = num_features
    dataset.features = features
    return dataset

def load_dataset(dataset, has_header=False):
    print '... loading dataset:',dataset
    lines = open(dataset).readlines()
    if has_header:
        return load_dense_dataset(lines[1:])
    is_dense, has_header = guess_file_format(lines[0])
    if is_dense:
        return load_dense_dataset(lines[has_header:])
    else:
        return load_sparse_dataset(lines)

#-----------------------------------------melt for tensorflow
import tensorflow as tf

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev = 0.01))

def matmul(X, w):
    if type(X) == tf.Tensor:
        return tf.matmul(X,w)
    else:
        return tf.nn.embedding_lookup_sparse(w, X[0], X[1], combiner = "sum")

class BinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features
        self.X = tf.placeholder("float", [None, self.num_features])
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.X: trX, self.Y: trY}

class SparseBinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features
        self.sp_indices = tf.placeholder(tf.int64)
        self.sp_shape = tf.placeholder(tf.int64)
        self.sp_ids_val = tf.placeholder(tf.int64)
        self.sp_weights_val = tf.placeholder(tf.float32)
        self.sp_ids = tf.SparseTensor(self.sp_indices, self.sp_ids_val, self.sp_shape)
        self.sp_weights = tf.SparseTensor(self.sp_indices, self.sp_weights_val, self.sp_shape)
        self.X = (self.sp_ids, self.sp_weights)
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.Y: trY, self.sp_indices: trX.sp_indices, self.sp_shape: trX.sp_shape, self.sp_ids_val: trX.sp_ids_val, self.sp_weights_val: trX.sp_weights_val}

def gen_binary_classification_trainer(dataset):
    if type(dataset.features) == Features:
        return BinaryClassificationTrainer(dataset)
    else:
        return SparseBinaryClassificationTrainer(dataset)
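The sparse path in melt.py stores all instances in flat id and weight lists plus a start_indices offset array, then rebuilds the [row, column] index pairs for each mini-batch. The following standalone sketch shows that bookkeeping on three toy lines. It is written for Python 3 (the listing above is Python 2), it assumes the label is the first column followed by the feature count and then id:value pairs, and the function names and sample data are made up for illustration.

```python
def parse_sparse_lines(lines):
    """Parse lines of the form '<label> <num_features> <id>:<value> ...'."""
    labels, ids, weights, start_indices = [], [], [], [0]
    for line in lines:
        fields = line.split()
        labels.append(float(fields[0]))
        for item in fields[2:]:
            feature_id, value = item.split(':')
            ids.append(int(feature_id))
            weights.append(float(value))
        # Offset where the next instance starts in the flat lists.
        start_indices.append(len(ids))
    return labels, ids, weights, start_indices


def mini_batch(ids, weights, start_indices, start, end):
    """Return (sp_indices, sp_ids_val, sp_weights_val, sp_shape) for rows [start, end)."""
    lo, hi = start_indices[start], start_indices[end]
    sp_indices, max_len = [], 0
    for row in range(start, end):
        row_len = start_indices[row + 1] - start_indices[row]
        max_len = max(max_len, row_len)
        # Row numbers restart at 0 inside the batch; columns count non-zeros.
        sp_indices.extend([row - start, col] for col in range(row_len))
    return sp_indices, ids[lo:hi], weights[lo:hi], [end - start, max_len]


lines = ["1 10 0:0.5 3:2.0", "0 10 2:1.0", "1 10 1:0.1 4:0.2 9:0.3"]
labels, ids, weights, start_indices = parse_sparse_lines(lines)
batch = mini_batch(ids, weights, start_indices, 1, 3)
print(start_indices)  # [0, 2, 3, 6]
print(batch)
```

For the batch covering the last two lines, the index pairs are [[0, 0], [1, 0], [1, 1], [1, 2]] and the shape is [2, 3]: two rows, with the widest row holding three non-zero features. Note that the column index is the position within the row, not the feature id; the feature ids travel separately as the values of the id tensor.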

#------------------------- binary_classification.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file binary_classification.py
# \author chenghuige
# \date 2015-11-30 16:06:52.693026
# \Description
# ==============================================================================

import sys

import tensorflow as tf
import numpy as np
from sklearn.metrics import roc_auc_score

import melt

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_float('learning_rate', 0.001, 'Initial learning rate.')
flags.DEFINE_integer('num_epochs', 120, 'Number of epochs to run trainer.')
flags.DEFINE_integer('batch_size', 500, 'Batch size. Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train', './corpus/feature.normed.rand.12000.0_2.txt', 'train file')
flags.DEFINE_string('test', './corpus/feature.normed.rand.12000.1_2.txt', 'test file')
flags.DEFINE_string('method', 'logistic', 'currently support logistic/mlp')
#----for mlp
flags.DEFINE_integer('hidden_size', 20, 'Hidden unit size')

trainset_file = FLAGS.train
testset_file = FLAGS.test
learning_rate = FLAGS.learning_rate
num_epochs = FLAGS.num_epochs
batch_size = FLAGS.batch_size
method = FLAGS.method

trainset = melt.load_dataset(trainset_file)
print "finish loading train set ",trainset_file
testset = melt.load_dataset(testset_file)
print "finish loading test set ", testset_file

assert(trainset.num_features == testset.num_features)
num_features = trainset.num_features
print 'num_features: ', num_features
print 'trainSet size: ', trainset.num_instances()
print 'testSet size: ', testset.num_instances()
print 'batch_size:', batch_size, ' learning_rate:', learning_rate, ' num_epochs:', num_epochs

trainer = melt.gen_binary_classification_trainer(trainset)

class LogisticRegression:
    def model(self, X, w):
        return melt.matmul(X,w)

    def run(self, trainer):
        w = melt.init_weights([trainer.num_features, 1])
        py_x = self.model(trainer.X, w)
        return py_x

class Mlp:
    def model(self, X, w_h, w_o):
        h = tf.nn.sigmoid(melt.matmul(X, w_h)) # this is a basic mlp, think 2 stacked logistic regressions
        return tf.matmul(h, w_o) # note that we don't apply the sigmoid at the end because our cost fn does that for us

    def run(self, trainer):
        w_h = melt.init_weights([trainer.num_features, FLAGS.hidden_size]) # create symbolic variables
        w_o = melt.init_weights([FLAGS.hidden_size, 1])
        py_x = self.model(trainer.X, w_h, w_o)
        return py_x

def gen_algo(method):
    if method == 'logistic':
        return LogisticRegression()
    elif method == 'mlp':
        return Mlp()
    else:
        print method, ' is not supported right now'
        exit(-1)

algo = gen_algo(method)
py_x = algo.run(trainer)
Y = trainer.Y

cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) # construct optimizer
predict_op = tf.nn.sigmoid(py_x)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

teX, teY = testset.full_batch()
num_train_instances = trainset.num_instances()
for i in range(num_epochs):
    # evaluate on the full test set before each epoch of training
    predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
    print i, 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
    # NOTE: zip stops at the shorter range, so the last batch of each epoch is skipped
    for start, end in zip(range(0, num_train_instances, batch_size), range(batch_size, num_train_instances, batch_size)):
        trX, trY = trainset.mini_batch(start, end)
        sess.run(train_op, feed_dict = trainer.gen_feed_dict(trX, trY))

predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
print 'final ', 'auc:', roc_auc_score(teY, predicts),'cost:', cost_ / len(teY)
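One detail of the training loop worth spelling out is how it pairs batch boundaries: it zips a range of start offsets with a range of end offsets, and the end range stops before the dataset size, so the last batch of every epoch is never visited. The Python 3 sketch below reproduces that pairing for the run shown earlier (70968 training instances, batch size 200) and shows one possible variant that keeps the tail. The function names are mine, and the variant is an illustration, not part of the author's code.

```python
def batch_ranges(num_instances, batch_size):
    """Same pairing as the training loop: zip of start and end ranges."""
    return list(zip(range(0, num_instances, batch_size),
                    range(batch_size, num_instances, batch_size)))


def batch_ranges_with_tail(num_instances, batch_size):
    """Variant that also yields the final, possibly shorter, batch."""
    return [(start, min(start + batch_size, num_instances))
            for start in range(0, num_instances, batch_size)]


pairs = batch_ranges(70968, 200)
print(len(pairs), pairs[-1])                   # 354 (70600, 70800)
print(70968 - pairs[-1][1])                    # 168 instances are skipped
print(batch_ranges_with_tail(70968, 200)[-1])  # (70800, 70968)

# Even when the size divides evenly, the last full batch is dropped.
print(batch_ranges(1000, 200))
# [(0, 200), (200, 400), (400, 600), (600, 800)]
```

With 168 of 70968 instances left out per epoch the effect on the reported AUC is probably small, but for small datasets or large batch sizes the skipped share can be significant.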
