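TensorFlow binary classification with dense or sparse (text classification) input data
I made a few small changes here, with thanks to the Google engineers for their help, so that the same code can handle either dense data or sparse input such as text classification data. I plan to study and optimize this further, for example how to process the data with multiple threads.
Sparse input is handled mainly with embedding_lookup_sparse; see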
https://github.com/tensorflow/tensorflow/issues/342
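To make the role of embedding_lookup_sparse more concrete, here is a minimal pure-Python sketch (not the TensorFlow implementation) of what a weighted lookup with combiner = "sum" computes, as it is used in melt.py: for each instance, the rows of the weight matrix selected by the feature ids are scaled by the feature values and summed, which should give the same result as multiplying the equivalent dense feature matrix by the weights. The helper names lookup_sparse_sum and dense_matmul and the toy numbers are assumptions made up for illustration.

```python
def lookup_sparse_sum(w, rows):
    """Weighted sum of the rows of w selected by each instance's feature ids.

    w: list of num_features rows, each a list of hidden_size floats.
    rows: one (ids, weights) pair per instance.
    """
    dim = len(w[0])
    out = []
    for ids, weights in rows:
        acc = [0.0] * dim
        for feature_id, value in zip(ids, weights):
            for k in range(dim):
                acc[k] += value * w[feature_id][k]
        out.append(acc)
    return out


def dense_matmul(x, w):
    """Plain dense matrix product, used only as a reference."""
    return [[sum(x_row[i] * w[i][k] for i in range(len(w)))
             for k in range(len(w[0]))]
            for x_row in x]


# Toy weight matrix: 4 features, 2 hidden units.
w = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0], [2.0, 0.0]]

# Two sparse instances: feature ids and their values.
rows = [([0, 3], [1.0, 0.5]), ([2], [2.0])]

# The same two instances written out densely.
x_dense = [[1.0, 0.0, 0.0, 0.5], [0.0, 0.0, 2.0, 0.0]]

sparse_out = lookup_sparse_sum(w, rows)
dense_out = dense_matmul(x_dense, w)
print(sparse_out)
print(dense_out)
```

With 4762348 features, as in the run shown in this post, the dense matrix would be impractical to build, which is why the sparse lookup is the useful form.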
There are two files:
melt.py
binary_classification.py
The code and data have been uploaded to https://github.com/chenghuige/tensorflow-example . For the sparse handling, sparse_tensor.py is a good place to start.
Run:
python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method mlp --num_epochs 1000
... loading dataset: corpus/feature.trate.0_2.normed.txt
0
10000
20000
30000
40000
50000
60000
70000
finish loading train set corpus/feature.trate.0_2.normed.txt
... loading dataset: corpus/feature.trate.1_2.normed.txt
0
10000
finish loading test set corpus/feature.trate.1_2.normed.txt
num_features: 4762348
trainSet size: 70968
testSet size: 17742
batch_size: 200 learning_rate: 0.001 num_epochs: 1000
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24
0 auc: 0.503701159392 cost: 0.69074464019
1 auc: 0.574863035489 cost: 0.600787888115
2 auc: 0.615858601208 cost: 0.60036152958
3 auc: 0.641573172518 cost: 0.599917832685
4 auc: 0.657326531323 cost: 0.599433459447
5 auc: 0.666575623414 cost: 0.598856064529
6 auc: 0.671990014639 cost: 0.598072590816
7 auc: 0.675956442936 cost: 0.596850153855
8 auc: 0.681129512174 cost: 0.594744671454
9 auc: 0.689568680575 cost: 0.591011970184
10 auc: 0.70265083004 cost: 0.584730529957
11 auc: 0.720751242654 cost: 0.575319047846
12 auc: 0.740525668112 cost: 0.563041782476
13 auc: 0.756397606412 cost: 0.548790696159
14 auc: 0.76745782664 cost: 0.533633556673
15 auc: 0.776115284883 cost: 0.518648754985
16 auc: 0.783683301767 cost: 0.504702218341
17 auc: 0.79058754946 cost: 0.492255532423
18 auc: 0.796831772334 cost: 0.481419827863
19 auc: 0.802349672543 cost: 0.472143309749
20 auc: 0.807102186144 cost: 0.464346827091
21 auc: 0.811092646634 cost: 0.457953127862
22 auc: 0.814318813594 cost: 0.452874061637
23 auc: 0.816884839449 cost: 0.449003176388
24 auc: 0.818881302313 cost: 0.446225956373
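Judging from these results, a simple MLP easily beats LinearSVM: its test AUC is already 0.8189 at epoch 24 in the log above, while LinearSVM reaches 0.7984 on the same test set: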
mlt feature.trate.0_2.normed.txt -c tt -test feature.trate.1_2.normed.txt --iter 1000000
I1130 20:03:36.485967 18502 Melt.h:59] _cmd.randSeed --- [4281910087]
I1130 20:03:36.486151 18502 Melt.h:1209] omp_get_num_procs() --- [24]
I1130 20:03:36.486706 18502 Melt.h:1221] get_num_threads() --- [22]
I1130 20:03:36.486742 18502 Melt.h:1224] commandStr --- [tt]
I1130 20:03:36.486760 18502 time_util.h:102] TrainTest! started
I1130 20:03:36.486789 18502 time_util.h:102] ParseInputDataFile started
I1130 20:03:36.785362 18502 time_util.h:113] ParseInputDataFile finished using: [298.557 ms] (0.298551 s)
I1130 20:03:36.785481 18502 TrainerFactory.cpp:99] Creating LinearSVM trainer
I1130 20:03:36.785524 18502 time_util.h:102] Train started
MinMaxNormalizer prepare [ 70968 ] (0.193283 s)100% |******************************************|
I1130 20:03:37.064959 18502 time_util.h:102] Normalize started
I1130 20:03:37.096940 18502 time_util.h:113] Normalize finished using: [31.945 ms] (0.031939 s)
LinearSVM training [ 1000000 ] (1.14643 s)100% |******************************************|
Sigmoid/PlattCalibrator calibrating [ 70968 ] (0.139669 s)100% |******************************************|
I1130 20:03:38.383231 18502 Trainer.h:65] Param: [numIterations:1000000 learningRate:0.001 trainerTyper:peagsos loopType:stochastic sampleSize:1 performProjection:0 ]
I1130 20:03:38.457448 18502 time_util.h:113] Train finished using: [1671.9 ms] (1.6719 s)
I1130 20:03:38.506352 18502 time_util.h:102] ParseInputDataFile started
I1130 20:03:38.579484 18502 time_util.h:113] ParseInputDataFile finished using: [73.094 ms] (0.073092 s)
I1130 20:03:38.579563 18502 Melt.h:603] Test feature.trate.1_2.normed.txt and writting instance predict file to ./result/0.inst.txt
TEST POSITIVE RATIO: 0.2876 (5103/(5103+12639))
Confusion table:
         ||===============================||
         ||           PREDICTED           ||
  TRUTH  ||    positive    |   negative   || RECALL
         ||===============================||
 positive||      3195      |     1908     || 0.6261 (3195/5103)
 negative||      2137      |    10502     || 0.8309 (10502/12639)
         ||===============================||
 PRECISION 0.5992 (3195/5332) 0.8463(10502/12410)
LOG-LOSS/instance: 0.4843
LOG-LOSS-PROB/instance: 0.6256
TEST-SET ENTROPY (prior LL/in): 0.6000
LOG-LOSS REDUCTION (RIG): -4.2637%
OVERALL 0/1 ACCURACY: 0.7720 (13697/17742)
POS.PRECISION: 0.5992
POS.RECALL: 0.6261
NEG.PRECISION: 0.8463
NEG.RECALL: 0.8309
F1.SCORE: 0.6124
OuputAUC: 0.7984
AUC: [0.7984]
----------------------------------------------------------------------------------------
I1130 20:03:38.729507 18502 time_util.h:113] TrainTest! finished using: [2242.72 ms] (2.24272 s)
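As a sanity check on the LinearSVM log above, the summary metrics can be recomputed from the four counts in the confusion table. This is only a small sketch; the variable names and the safe_div helper are mine, not part of melt.

```python
# Counts copied from the confusion table in the log above.
tp, fn = 3195, 1908    # truth positive: predicted positive / predicted negative
fp, tn = 2137, 10502   # truth negative: predicted positive / predicted negative


def safe_div(a, b):
    """Float division that returns 0.0 instead of failing on a zero denominator."""
    return float(a) / b if b else 0.0


pos_precision = safe_div(tp, tp + fp)
pos_recall = safe_div(tp, tp + fn)
neg_precision = safe_div(tn, tn + fn)
neg_recall = safe_div(tn, tn + fp)
accuracy = safe_div(tp + tn, tp + fn + fp + tn)
f1 = safe_div(2 * pos_precision * pos_recall, pos_precision + pos_recall)

print("pos precision %.4f recall %.4f" % (pos_precision, pos_recall))
print("neg precision %.4f recall %.4f" % (neg_precision, neg_recall))
print("f1 %.4f accuracy %.4f" % (f1, accuracy))
```

The values agree with the log: 0.5992 and 0.6261 for the positive class, 0.8463 and 0.8309 for the negative class, F1 0.6124 and accuracy 0.7720.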
#---------------------melt.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file melt.py
# \author chenghuige
# \date 2015-11-30 13:40:19.506009
# \Description
# ==============================================================================
import numpy as np
import os
#---------------------------melt load data
# Supports the melt dense and sparse input file formats.
# Sparse input has no header; for dense input the header line is ignored.
# @TODO also support the libsvm format
def guess_file_format(line):
    is_dense = True
    has_header = False
    if line.startswith('#'):
        has_header = True
        return is_dense, has_header
    elif line.find(':') > 0:
        is_dense = False
    return is_dense, has_header

def guess_label_index(line):
    label_idx = 0
    if line.startswith('_'):
        label_idx = 1
    return label_idx
#@TODO implement [a:b] so we can use [a:b] in application code
class Features(object):
    def __init__(self):
        self.data = []

    def mini_batch(self, start, end):
        return self.data[start: end]

    def full_batch(self):
        return self.data
class SparseFeatures(object):
    def __init__(self):
        self.sp_indices = []
        self.start_indices = [0]
        self.sp_ids_val = []
        self.sp_weights_val = []
        self.sp_shape = None

    def mini_batch(self, start, end):
        batch = SparseFeatures()
        start_ = self.start_indices[start]
        end_ = self.start_indices[end]
        batch.sp_ids_val = self.sp_ids_val[start_: end_]
        batch.sp_weights_val = self.sp_weights_val[start_: end_]
        row_idx = 0
        max_len = 0
        #@TODO better way to construct sp_indices for each mini batch ?
        for i in xrange(start + 1, end + 1):
            len_ = self.start_indices[i] - self.start_indices[i - 1]
            if len_ > max_len:
                max_len = len_
            for j in xrange(len_):
                batch.sp_indices.append([i - start - 1, j])
            row_idx += 1
        batch.sp_shape = [end - start, max_len]
        return batch

    def full_batch(self):
        if len(self.sp_indices) == 0:
            row_idx = 0
            max_len = 0
            for i in xrange(1, len(self.start_indices)):
                len_ = self.start_indices[i] - self.start_indices[i - 1]
                if len_ > max_len:
                    max_len = len_
                for j in xrange(len_):
                    self.sp_indices.append([i - 1, j])
                row_idx += 1
            self.sp_shape = [len(self.start_indices) - 1, max_len]
        return self
class DataSet(object):
    def __init__(self):
        self.labels = []
        self.features = None
        self.num_features = 0

    def num_instances(self):
        return len(self.labels)

    def full_batch(self):
        return self.features.full_batch(), self.labels

    def mini_batch(self, start, end):
        # a negative end counts back from the last instance
        if end < 0:
            end = self.num_instances() + end
        return self.features.mini_batch(start, end), self.labels[start: end]
def load_dense_dataset(lines):
    dataset_x = []
    dataset_y = []

    nrows = 0
    label_idx = guess_label_index(lines[0])
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        dataset_x.append([float(x) for x in l[label_idx + 1:]])

    dataset_x = np.array(dataset_x)
    dataset_y = np.array(dataset_y)

    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = dataset_x.shape[1]
    features = Features()
    features.data = dataset_x
    dataset.features = features
    return dataset
def load_sparse_dataset(lines):
    dataset_x = []
    dataset_y = []

    label_idx = guess_label_index(lines[0])
    num_features = int(lines[0].split()[label_idx + 1])
    features = SparseFeatures()
    nrows = 0
    start_idx = 0
    for i in xrange(len(lines)):
        if nrows % 10000 == 0:
            print nrows
        nrows += 1
        line = lines[i]
        l = line.rstrip().split()
        dataset_y.append([float(l[label_idx])])
        start_idx += (len(l) - label_idx - 2)
        features.start_indices.append(start_idx)
        for item in l[label_idx + 2:]:
            id, val = item.split(':')
            features.sp_ids_val.append(int(id))
            features.sp_weights_val.append(float(val))

    dataset_y = np.array(dataset_y)

    dataset = DataSet()
    dataset.labels = dataset_y
    dataset.num_features = num_features
    dataset.features = features
    return dataset
def load_dataset(dataset, has_header=False):
    print '... loading dataset:',dataset
    lines = open(dataset).readlines()
    if has_header:
        return load_dense_dataset(lines[1:])
    is_dense, has_header = guess_file_format(lines[0])
    if is_dense:
        return load_dense_dataset(lines[has_header:])
    else:
        return load_sparse_dataset(lines)
#-----------------------------------------melt for tensorflow
import tensorflow as tf
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev = 0.01))

def matmul(X, w):
    if type(X) == tf.Tensor:
        return tf.matmul(X,w)
    else:
        return tf.nn.embedding_lookup_sparse(w, X[0], X[1], combiner = "sum")
class BinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features

        self.X = tf.placeholder("float", [None, self.num_features])
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.X: trX, self.Y: trY}

class SparseBinaryClassificationTrainer(object):
    def __init__(self, dataset):
        self.labels = dataset.labels
        self.features = dataset.features
        self.num_features = dataset.num_features

        self.sp_indices = tf.placeholder(tf.int64)
        self.sp_shape = tf.placeholder(tf.int64)
        self.sp_ids_val = tf.placeholder(tf.int64)
        self.sp_weights_val = tf.placeholder(tf.float32)
        self.sp_ids = tf.SparseTensor(self.sp_indices, self.sp_ids_val, self.sp_shape)
        self.sp_weights = tf.SparseTensor(self.sp_indices, self.sp_weights_val, self.sp_shape)

        self.X = (self.sp_ids, self.sp_weights)
        self.Y = tf.placeholder("float", [None, 1])

    def gen_feed_dict(self, trX, trY):
        return {self.Y: trY,
                self.sp_indices: trX.sp_indices,
                self.sp_shape: trX.sp_shape,
                self.sp_ids_val: trX.sp_ids_val,
                self.sp_weights_val: trX.sp_weights_val}

def gen_binary_classification_trainer(dataset):
    if type(dataset.features) == Features:
        return BinaryClassificationTrainer(dataset)
    else:
        return SparseBinaryClassificationTrainer(dataset)
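The sparse loader in melt.py stores all instances in a CSR-like layout: start_indices[i] is the offset of instance i's first feature in the flat sp_ids_val / sp_weights_val lists, and mini_batch rebuilds the [row, column] index pairs and the [batch_size, max_len] shape that tf.SparseTensor needs. The following standalone sketch mirrors that index construction in plain Python; the function name batch_indices and the toy offsets are assumptions for illustration and do not appear in melt.py.

```python
def batch_indices(start_indices, start, end):
    """Build (sp_indices, sp_shape) for the instances in [start, end).

    start_indices[i] is the offset of instance i's first nonzero feature,
    so instance i owns start_indices[i + 1] - start_indices[i] entries.
    """
    sp_indices = []
    max_len = 0
    for row, i in enumerate(range(start, end)):
        length = start_indices[i + 1] - start_indices[i]
        max_len = max(max_len, length)
        # Column j is just the position of the feature within its instance.
        sp_indices.extend([row, col] for col in range(length))
    return sp_indices, [end - start, max_len]


# Three instances with 2, 1 and 3 nonzero features respectively.
start_indices = [0, 2, 3, 6]

# Mini batch made of the second and third instance.
indices, shape = batch_indices(start_indices, 1, 3)
print(indices)
print(shape)
```

Note that the column index here is only a position inside the instance, not a feature id; the feature ids travel separately as the values of the ids SparseTensor.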
#------------------------- binary_classification.py
#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file binary_classification.py
# \author chenghuige
# \date 2015-11-30 16:06:52.693026
# \Description
# ==============================================================================
import sys
import tensorflow as tf
import numpy as np
from sklearn.metrics import roc_auc_score
import melt
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.001, 'Initial learning rate.')
flags.DEFINE_integer('num_epochs', 120, 'Number of epochs to run trainer.')
flags.DEFINE_integer('batch_size', 500, 'Batch size. Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train', './corpus/feature.normed.rand.12000.0_2.txt', 'train file')
flags.DEFINE_string('test', './corpus/feature.normed.rand.12000.1_2.txt', 'test file')
flags.DEFINE_string('method', 'logistic', 'currently support logistic/mlp')
#----for mlp
flags.DEFINE_integer('hidden_size', 20, 'Hidden unit size')
trainset_file = FLAGS.train
testset_file = FLAGS.test
learning_rate = FLAGS.learning_rate
num_epochs = FLAGS.num_epochs
batch_size = FLAGS.batch_size
method = FLAGS.method
trainset = melt.load_dataset(trainset_file)
print "finish loading train set ",trainset_file
testset = melt.load_dataset(testset_file)
print "finish loading test set ", testset_file
assert(trainset.num_features == testset.num_features)
num_features = trainset.num_features
print 'num_features: ', num_features
print 'trainSet size: ', trainset.num_instances()
print 'testSet size: ', testset.num_instances()
print 'batch_size:', batch_size, ' learning_rate:', learning_rate, ' num_epochs:', num_epochs
trainer = melt.gen_binary_classification_trainer(trainset)
class LogisticRegresssion:
    def model(self, X, w):
        return melt.matmul(X,w)

    def run(self, trainer):
        w = melt.init_weights([trainer.num_features, 1])
        py_x = self.model(trainer.X, w)
        return py_x

class Mlp:
    def model(self, X, w_h, w_o):
        h = tf.nn.sigmoid(melt.matmul(X, w_h)) # this is a basic mlp, think 2 stacked logistic regressions
        return tf.matmul(h, w_o) # note that we don't apply the sigmoid at the end because our cost fn does that for us

    def run(self, trainer):
        w_h = melt.init_weights([trainer.num_features, FLAGS.hidden_size]) # create symbolic variables
        w_o = melt.init_weights([FLAGS.hidden_size, 1])
        py_x = self.model(trainer.X, w_h, w_o)
        return py_x

def gen_algo(method):
    if method == 'logistic':
        return LogisticRegresssion()
    elif method == 'mlp':
        return Mlp()
    else:
        print method, ' is not supported right now'
        exit(-1)
algo = gen_algo(method)
py_x = algo.run(trainer)
Y = trainer.Y
cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) # construct optimizer
predict_op = tf.nn.sigmoid(py_x)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
teX, teY = testset.full_batch()
num_train_instances = trainset.num_instances()
for i in range(num_epochs):
    predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
    print i, 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
    for start, end in zip(range(0, num_train_instances, batch_size), range(batch_size, num_train_instances, batch_size)):
        trX, trY = trainset.mini_batch(start, end)
        sess.run(train_op, feed_dict = trainer.gen_feed_dict(trX, trY))

predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
print 'final ', 'auc:', roc_auc_score(teY, predicts),'cost:', cost_ / len(teY)
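The cost in the script is the sum of tf.nn.sigmoid_cross_entropy_with_logits over the batch, and the printed value divides it by the number of test instances. As far as I understand TensorFlow's documentation, that op uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) for a logit x and label z. The sketch below is a plain-Python illustration of that formula next to the naive definition; it is not the TensorFlow code, and the function names are mine.

```python
import math


def stable_sigmoid_xent(logit, label):
    """Per-instance cross entropy computed directly from the raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_sigmoid_xent(logit, label):
    """Textbook definition; breaks down once the sigmoid saturates to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


# The two forms agree for moderate logits.
for logit, label in [(-3.0, 1.0), (0.5, 0.0), (2.0, 1.0)]:
    print(logit, label, stable_sigmoid_xent(logit, label), naive_sigmoid_xent(logit, label))

# With a zero logit the loss is log(2), about 0.6931, whatever the label is.
print(stable_sigmoid_xent(0.0, 1.0))

# The stable form still works for a huge logit, where the naive form would fail.
print(stable_sigmoid_xent(1000.0, 0.0))
```

This also explains the first line of the MLP log: with weights initialised near zero the logits are near zero, so the cost at epoch 0 (0.6907) is close to log(2).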