Kaggle in Practice: Click-Through Rate Prediction

https://blog.csdn.net/chengcheng1394/article/details/78940565

Original article. If you repost it, please credit the source: http://blog.csdn.net/chengcheng1394/article/details/78940565

Requires TensorFlow 1.0 and Python 3.5.
Project repository:
https://github.com/chengstone/kaggle_criteo_ctr_challenge-

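Preface
Click-through rate (CTR) prediction estimates the probability that a user will click on a given ad. By scoring every ad impression, it finds the ads a user is most likely to click, which makes it one of the most important algorithms in advertising technology.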

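Downloading the dataset

This time we use the Criteo dataset from the Display Advertising Challenge on Kaggle.
To download the dataset, run the following commands in a terminal (script path: ./data/download.sh):
wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz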
tar zxf dac.tar.gz
rm -f dac.tar.gz
mkdir raw
mv ./*.txt raw/

After extraction, train.txt is 11.7 GB and test.txt is 1.35 GB.
That is too much data to work with here, so we only use the first 1,000,000 rows of each file.
head -n 1000000 test.txt > test_sub100w.txt
head -n 1000000 train.txt > train_sub100w.txt
Then rename the two files back to train.txt and test.txt, keeping them in the same location.

Data fields
Label
Target variable that indicates if an ad was clicked (1) or not (0).

I1-I13
A total of 13 columns of integer features (mostly count features).

C1-C26
A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.

The data contains a Label field indicating whether the ad was clicked, 13 numerical features I1-I13 (Dense Input), and 26 categorical features C1-C26 (Sparse Input).

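Network model

The model consists of three sub-networks: an FFM (Field-aware Factorization Machine), an FM (Factorization Machine), and a DNN. The FM sub-network itself has two components, GBDT and FM. Normally the preprocessing stage involves feature engineering, such as crossing and combining features to find the ones that help prediction, and doing that well takes real skill.

This time we skip feature engineering, combine these components with a deep neural network, and leave feature selection to the model. The FFM uses LibFFM, the FM uses LibFM, and the GBDT uses LightGBM, although you could use xgboost instead.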

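GBDT
Given the training data, GBDT trains a number of trees. What we use is the leaf node that each tree outputs for a sample: these leaf nodes are fed to the FM as categorical features. For how decision trees are used this way, see Facebook's paper Practical Lessons from Predicting Clicks on Ads at Facebook.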

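FM
FM addresses the problem of feature combination when the data is large and the features are sparse. Here is the formula (second-order polynomial only):

$$y(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij} x_i x_j$$

where n is the number of features of a sample, $x_i$ is the value of the i-th feature, and $w_0$, $w_i$, and $w_{ij}$ are the model parameters.

The formula shows that this is a linear model with the feature combinations $x_i x_j$ added, and a combination only matters when $x_i$ and $x_j$ are both non-zero. In practice, however, the parameters of the combined features are hard to train. Input data is usually sparse, so $x_i$ and $x_j$ are zero most of the time, while the parameter $w_{ij}$ can only be learned from samples in which both features are non-zero.

Take shopping-related features as an example: women may pay more attention to cosmetics or jewelry, while men may pay more attention to sporting goods or electronics, which shows that training feature combinations is worthwhile. A product feature, however, may have hundreds or thousands of categories. Categorical features are usually one-hot encoded, so a single feature turns into hundreds of dimensions. Add the other categorical features and the input feature space grows explosively, so data sparsity is an unavoidable challenge in real problems.

To make the second-order parameters trainable, matrix factorization is introduced. The previous article discussed a movie recommendation system, in which we built user feature vectors and movie feature vectors and obtained a user's rating of a movie as the dot product of the two. Multiplying the user feature matrix by the movie feature matrix gives the rating matrix of all users for all movies.

Viewed in reverse, the rating matrix can be factorized into a user matrix and a movie matrix, and each entry of the rating matrix plays the role of the combined-feature parameter $w_{ij}$ discussed above.

We factorize the parameter matrix W in the same way, writing each parameter $w_{ij}$ as the dot product of two vectors (called latent vectors). The matrix then factorizes as $W = V^T V$, with each parameter $w_{ij} = \langle v_i, v_j \rangle$, where $v_i$ is the latent vector of the i-th feature. The second-order FM formula becomes:

$$y(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$

This is the idea behind the FM model.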
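
As a rough illustration of the factorized second-order term (the sum over pairs i < j of ⟨v_i, v_j⟩ x_i x_j), here is a minimal pure-Python sketch. It is not LibFM's implementation; the function names and the list-of-lists layout of V (V[i] is the latent vector of feature i) are assumptions made for this example. It also shows the well-known rewrite of the pairwise sum that runs in O(k·n) instead of O(k·n²).

```python
import random


def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, computed pair by pair."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, V):
    """Same quantity as 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total


def fm_predict(x, w0, w, V):
    """Second-order FM score: bias + linear term + factorized pairwise term."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    return w0 + linear + fm_pairwise_fast(x, V)


if __name__ == "__main__":
    random.seed(0)
    x = [random.random() for _ in range(6)]
    V = [[random.random() for _ in range(4)] for _ in range(6)]
    # The two ways of computing the pairwise term agree up to rounding.
    print(abs(fm_pairwise_naive(x, V) - fm_pairwise_fast(x, V)) < 1e-9)
```

Because a feature with value 0 contributes nothing to either sum, sparse inputs only need to iterate over their non-zero entries, which is what makes FM practical on one-hot encoded data.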

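The leaf nodes output by the GBDT serve as the training input of the FM model, so our FM sub-network requires training both a GBDT and an FM. Clearly this click-through prediction network is considerably more complex, with more factors and hyperparameters affecting the final result. Training of the FM and GBDT components is described later in this article.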

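FFM
Next we need to train the FFM model. FFM extends FM with the concept of a Field. For example, a product column is one categorical variable that expands into many different features, but all of those features belong to the same Field. Put differently, all features derived from the same categorical variable can be placed in the same Field.

This is a one-to-many relationship. Take an occupation column: it is a single variable, but after one-hot encoding it becomes N features. Those N features all describe occupation, so occupation is one Field.

We train the latent vectors through feature combinations, so each feature $x_i$ learns a separate latent vector $v_{i,f_j}$ for every Field $f_j$ of the other features. In other words, a latent vector depends not only on the feature but also on the Field. The model's formula:

$$y(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i,f_j}, v_{j,f_i} \rangle x_i x_j$$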
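
To make the field-aware interaction concrete, here is a small pure-Python sketch of the pairwise FFM term only (bias and linear terms are left out). The dictionary layout of V, keyed by (feature index, field), and the function name are assumptions for illustration; this is not how LibFFM stores its model.

```python
def ffm_pairwise(features, V):
    """Field-aware pairwise term.

    features: list of (field, feature_index, value) triples for one sample.
    V: dict mapping (feature_index, field) to the latent vector that the
       feature uses when it interacts with a feature from that field.
    """
    total = 0.0
    for a in range(len(features)):
        field_i, i, x_i = features[a]
        for b in range(a + 1, len(features)):
            field_j, j, x_j = features[b]
            v_i = V[(i, field_j)]  # vector of feature i towards field of j
            v_j = V[(j, field_i)]  # vector of feature j towards field of i
            total += sum(p * q for p, q in zip(v_i, v_j)) * x_i * x_j
    return total


if __name__ == "__main__":
    # Two one-hot features: feature 0 in field 0, feature 1 in field 1.
    sample = [(0, 0, 1.0), (1, 1, 1.0)]
    V = {(0, 1): [1.0, 2.0], (1, 0): [3.0, 4.0]}
    print(ffm_pairwise(sample, V))
```

Compared with FM, each feature now holds one latent vector per field instead of a single vector, so the number of parameters grows with the number of fields.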

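DNN
Now for the DNN part. The input is split into two parts: numerical features (Dense Input) and categorical features (Sparse Input). We still do not use one-hot encoding. Instead, the categorical features go through an embedding layer to produce several embedding vectors, which are concatenated with the numerical features and passed through three fully connected layers with ReLU activations. The output of the third fully connected layer is then concatenated with the outputs of the fully connected layers of the FFM and FM and passed to the final fully connected layer.

The target Label indicates whether the ad was clicked and has only two states, 1 (clicked) and 0 (not clicked). The last layer of the network therefore performs logistic regression: the final fully connected layer uses a sigmoid activation to produce the probability that the ad is clicked.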
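
For readers who want to see the last step in isolation, the following is a minimal sketch of a sigmoid output and the log loss computed on it. It is plain Python rather than the TensorFlow ops used by the project, and the function names and the clipping constant eps are assumptions for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.0]
    labels = [1, 0, 1]
    probs = [sigmoid(z) for z in logits]
    print(log_loss(labels, probs))
```

A model that always predicts 0.5 scores ln 2 (about 0.693) on this metric, which is a useful baseline when reading the training logs later in the article.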

LogLoss is the loss function and FTRL is the optimization algorithm.
Paper on FTRL: Ad Click Prediction: a View from the Trenches
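
The project relies on TensorFlow's FtrlOptimizer, but the per-coordinate update from the paper is short enough to sketch by hand. The class below is an illustrative pure-Python version of FTRL-Proximal for logistic regression on sparse inputs; the class name, the dict-based input format, and the default hyperparameters are assumptions, and it is not the code TensorFlow runs.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (sketch)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        """Closed-form weight; L1 keeps small coordinates at exactly zero."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        """x is a dict {feature_index: value} holding the non-zero features."""
        s = sum(self.weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """One online step on example (x, y); returns the prediction made."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log loss for coordinate i
            w = self.weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p


if __name__ == "__main__":
    model = FTRLProximal(dim=3)
    for _ in range(20):
        model.update({0: 1.0, 2: 1.0}, 1)
    print(model.predict({0: 1.0, 2: 1.0}))
```

The per-coordinate learning rate shrinks only for features that actually appear, which suits sparse click data where most features are rare.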

Note: I modified the LibFFM and LibFM code, so please use the versions in my repository.

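Preprocessing the dataset
Generating the neural network input
Generating the FFM input
Generating the GBDT input
First we preprocess the inputs for the DNN, FFM, and GBDT. The numerical features I1-I13 are scaled to decimals between 0 and 1. For categorical features, values that occur fewer than cutoff times (a hyperparameter) within a field are ignored. The frequent values that remain become that field's features, and they are numbered separately within each field.

For example, suppose there are two categorical fields, C1 and C2. C1 has values a (more than cutoff occurrences), b (fewer than cutoff), and c (more than cutoff), and C2 has values x and y (both more than cutoff). The features kept are then C1: a, c and C2: x, y. Numbering within each field, a and c get feature ids 0 and 1 for C1, and x and y also get 0 and 1 for C2.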
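
The cutoff rule can be sketched in a few lines of standalone Python. The helper name build_field_vocab is hypothetical. Unlike the simplified 0/1 numbering in the example above, this sketch follows the convention of the preprocessing code listed later in this article, which reserves id 0 for rare or unseen values (`<unk>`) and numbers the kept values from 1 in order of decreasing frequency.

```python
from collections import Counter


def build_field_vocab(values, cutoff):
    """Map the frequent values of one categorical field to integer ids.

    Values seen fewer than `cutoff` times are dropped; id 0 is reserved
    for them and for values never seen during training.
    """
    counts = Counter(v for v in values if v != '')
    kept = sorted((v for v, c in counts.items() if c >= cutoff),
                  key=lambda v: (-counts[v], v))
    vocab = {'<unk>': 0}
    for idx, v in enumerate(kept, start=1):
        vocab[v] = idx
    return vocab


def lookup(vocab, value):
    """Return the id of `value`, falling back to the <unk> id."""
    return vocab.get(value, vocab['<unk>'])


if __name__ == "__main__":
    c1_column = ['a', 'a', 'a', 'b', 'c', 'c']
    vocab = build_field_vocab(c1_column, cutoff=2)
    print(vocab)
    print([lookup(vocab, v) for v in ['a', 'b', 'c', 'z']])
```

Each field gets its own vocabulary, so ids from different fields overlap until an offset is added, which is what the FFM and DNN inputs described below do.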

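The categorical inputs are processed differently for FFM and GBDT, so we cover them separately.

GBDT
GBDT preprocessing is simpler: the per-field feature ids of C1-C26 are used directly as input. The GBDT input format is: Label I1-I13 C1-C26. An actual row might therefore look like this: 0 decimal1 decimal2 ~ decimal13 1 (C1 feature id) 0 (C2 feature id) ~ C26 feature id. Here the C1 feature id is 1, meaning the C1 feature is c, and the C2 feature id is 0, meaning the C2 feature is x.

Here is a row of real generated data: 0 0.05 0.004983 0.05 0 0.021594 0.008 0.15 0.04 0.362 0.166667 0.2 0 0.04 2 3 0 0 1 1 0 3 1 0 0 0 0 3 0 0 1 4 1 3 0 0 2 0 1 0

Sorry, I am not good at putting this into words. If the paragraph above is confusing, just read the code :)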

FFM
The FFM input is more involved. See the README on the official GitHub for details; the relevant excerpt is:

It is important to understand the difference between field and feature. For example, if we have a raw data like this:

Click Advertiser Publisher
===== ========== =========
0 Nike CNN
1 ESPN BBC
Here, we have

* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC
Usually you will need to build two dictionares, one for field and one for features, like this:

DictField[Advertiser] -> 0
DictField[Publisher] -> 1

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
Then, you can generate FFM format data:

0 0:0:1 1:1:1
1 0:2:1 1:3:1
Note that because these features are categorical, the values here are all ones.

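Fields should be easy to understand. Features are numbered differently from the GBDT case. In the GBDT preprocessing above, each field is numbered independently: C1 has features 0~n and C2 has features 0~n. For FFM, all features share one numbering. In the README example, C1 is Advertiser with two features and C2 is Publisher with two features, so the shared numbering runs 0~3. With the independent numbering used for GBDT, it would look like this: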

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Publisher-CNN] -> 0
DictFeature[Publisher-BBC] -> 1
Now suppose there is a third row, and let's see how the FFM input is constructed:

Click Advertiser Publisher
===== ========== =========
0 Nike CNN
1 ESPN BBC
0 Lining CNN
Following the rules, the dictionary should look like this:

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
DictFeature[Advertiser-Lining] -> 4
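Our FFM input processing differs slightly from the above: after all features of one field are numbered, numbering continues into the next field, so the final feature numbering is: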

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Advertiser-Lining] -> 2
DictFeature[Publisher-CNN] -> 3
DictFeature[Publisher-BBC] -> 4
In our data, numbering starts from I1 and covers I1-I13 first, so the C1 ids start after an offset of 13.
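
The numbering scheme can be checked with a small standalone sketch. The helper names and the space separator are assumptions for readability (the preprocessing code listed later writes tab-separated fields); the arithmetic mirrors the description: each categorical field's ids are shifted by the combined vocabulary size of the fields before it, and then by the number of dense features.

```python
def field_offsets(vocab_sizes):
    """Starting global id of each categorical field (cumulative vocabulary sizes)."""
    offsets = [0]
    for size in vocab_sizes[:-1]:
        offsets.append(offsets[-1] + size)
    return offsets


def to_ffm_line(label, dense_vals, cat_ids, offsets, num_dense=13):
    """Build one 'label field:feature:value ...' row in LibFFM text format."""
    parts = [str(label)]
    # Dense features: field i, feature i, value = normalized number.
    for i, v in enumerate(dense_vals):
        parts.append('{}:{}:{}'.format(i, i, v))
    # Categorical features: one-hot, so the value is always 1.
    for j, local_id in enumerate(cat_ids):
        field = num_dense + j
        feature = num_dense + offsets[j] + local_id
        parts.append('{}:{}:1'.format(field, feature))
    return ' '.join(parts)


if __name__ == "__main__":
    # Toy setup: 2 dense features and 2 categorical fields with 3 and 2 ids.
    offsets = field_offsets([3, 2])
    print(to_ffm_line(0, [0.5, 0.25], [1, 0], offsets, num_dense=2))
```

Dropping the num_dense shift from the feature id gives the globally numbered categorical ids without the offset of 13.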

This is a real row of FFM input data:
0 0:0:0.05 1:1:0.004983 2:2:0.05 3:3:0 4:4:0.021594 5:5:0.008 6:6:0.15 7:7:0.04 8:8:0.362 9:9:0.166667 10:10:0.2 11:11:0 12:12:0.04 13:15:1 14:29:1 15:64:1 16:76:1 17:92:1 18:101:1 19:107:1 20:122:1 21:131:1 22:133:1 23:143:1 24:166:1 25:179:1 26:209:1 27:216:1 28:243:1 29:260:1 30:273:1 31:310:1 32:317:1 33:318:1 34:333:1 35:340:1 36:348:1 37:368:1 38:381:1

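DNN
The DNN input is less complicated: it is again the I1-I13 decimals followed by the C1-C26 ids numbered globally as for the FFM, except without the offset of 13, and the Label comes last.
Real data looks like this: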
0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,305,320,327,335,355,368,0

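That is all the explanation needed, so let's look at the code. Because it generates the training, validation, and test data in one pass, it takes a while to run.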

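Walkthrough of the core code
See the project repository for the complete code.
The following code is from preprocess.py in Baidu's deep_fm example, with a few additions. No need to reinvent the wheel :)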

import collections
import os
import random
import sys

import numpy as np

# There are 13 integer features and 26 categorical features
continous_features = range(1, 14)
categorial_features = range(14, 40)

# Clip integer features. The clip point for each integer feature
# is derived from the 95% quantile of the total values in each feature
continous_clip = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]

class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        # Scan the file once to find the clipped min and max of every column
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        # Missing values are mapped to 0.0
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])

class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        # Count how often every value occurs in every categorical field
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        # Keep values seen at least `cutoff` times and number them from 1,
        # most frequent first; id 0 is reserved for '<unk>'
        for i in range(0, self.num_feature):
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())

            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))
def preprocess(datadir, outdir):
    """
    All the 13 integer features are normalized to continuous values and these
    continuous features are combined into one vector with dimension 13.

    Each of the 26 categorical features are one-hot encoded and all the one-hot
    vectors are combined into one sparse binary vector.
    """
    dists = ContinuousFeatureGenerator(len(continous_features))
    dists.build(os.path.join(datadir, 'train.txt'), continous_features)

    dicts = CategoryDictGenerator(len(categorial_features))
    dicts.build(
        os.path.join(datadir, 'train.txt'), categorial_features, cutoff=200)#200 50

    dict_sizes = dicts.dicts_sizes()
    categorial_feature_offset = [0]
    for i in range(1, len(categorial_features)):
        offset = categorial_feature_offset[i - 1] + dict_sizes[i - 1]
        categorial_feature_offset.append(offset)

    random.seed(0)

    # 90% of the data are used for training, and 10% of the data are used
    # for validation.
    train_ffm = open(os.path.join(outdir, 'train_ffm.txt'), 'w')
    valid_ffm = open(os.path.join(outdir, 'valid_ffm.txt'), 'w')

    train_lgb = open(os.path.join(outdir, 'train_lgb.txt'), 'w')
    valid_lgb = open(os.path.join(outdir, 'valid_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'train.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid.txt'), 'w') as out_valid:
            with open(os.path.join(datadir, 'train.txt'), 'r') as f:
                for line in f:
                    features = line.rstrip('\n').split('\t')
                    continous_feats = []
                    continous_vals = []
                    for i in range(0, len(continous_features)):

                        val = dists.gen(i, features[continous_features[i]])
                        continous_vals.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                        continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                    categorial_vals = []
                    categorial_lgb_vals = []
                    for i in range(0, len(categorial_features)):
                        # Globally numbered id (DNN/FFM) and per-field id (GBDT)
                        val = dicts.gen(i, features[categorial_features[i]]) + categorial_feature_offset[i]
                        categorial_vals.append(str(val))
                        val_lgb = dicts.gen(i, features[categorial_features[i]])
                        categorial_lgb_vals.append(str(val_lgb))

                    continous_vals = ','.join(continous_vals)
                    categorial_vals = ','.join(categorial_vals)
                    label = features[0]
                    if random.randint(0, 9999) % 10 != 0:
                        out_train.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        train_ffm.write('\t'.join(label) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        train_lgb.write('\t'.join(label) + '\t')
                        train_lgb.write('\t'.join(continous_feats) + '\t')
                        train_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

                    else:
                        out_valid.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        valid_ffm.write('\t'.join(label) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        valid_lgb.write('\t'.join(label) + '\t')
                        valid_lgb.write('\t'.join(continous_feats) + '\t')
                        valid_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    train_ffm.close()
    valid_ffm.close()

    train_lgb.close()
    valid_lgb.close()

    test_ffm = open(os.path.join(outdir, 'test_ffm.txt'), 'w')
    test_lgb = open(os.path.join(outdir, 'test_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'test.txt'), 'w') as out:
        with open(os.path.join(datadir, 'test.txt'), 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')

                continous_feats = []
                continous_vals = []
                for i in range(0, len(continous_features)):
                    # The test file has no label column, so indices shift by one
                    val = dists.gen(i, features[continous_features[i] - 1])
                    continous_vals.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                    continous_feats.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))#('{0}'.format(val))

                categorial_vals = []
                categorial_lgb_vals = []
                for i in range(0, len(categorial_features)):
                    val = dicts.gen(i,
                                    features[categorial_features[i] -
                                             1]) + categorial_feature_offset[i]
                    categorial_vals.append(str(val))

                    val_lgb = dicts.gen(i, features[categorial_features[i] - 1])
                    categorial_lgb_vals.append(str(val_lgb))

                continous_vals = ','.join(continous_vals)
                categorial_vals = ','.join(categorial_vals)

                out.write(','.join([continous_vals, categorial_vals]) + '\n')

                test_ffm.write('\t'.join(['{}:{}:{}'.format(ii, ii, val) for ii,val in enumerate(continous_vals.split(','))]) + '\t')
                test_ffm.write('\t'.join(
                    ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                test_lgb.write('\t'.join(continous_feats) + '\t')
                test_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    test_ffm.close()
    test_lgb.close()
    return dict_sizes
Training the FFM
With the data ready, we call LibFFM to train the FFM model.
The learning rate is 0.1 with 32 iterations, and the trained model is saved to the file model_ffm.

cmd = './libffm/libffm/ffm-train --auto-stop -r 0.1 -t 32 -s {nr_thread} -p ./data/valid_ffm.txt ./data/train_ffm.txt model_ffm'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
Training output:

['First check if the text file has already been converted to binary format (1.3 seconds)\n',
'Binary file found. Skip converting text to binary\n',
'First check if the text file has already been converted to binary format (0.2 seconds)\n',
'Binary file found. Skip converting text to binary\n',
'iter tr_logloss va_logloss tr_time\n',
' 1 0.49339 0.48196 12.8\n',
' 2 0.47621 0.47651 25.9\n',
' 3 0.47149 0.47433 39.0\n',
' 4 0.46858 0.47277 51.2\n',
' 5 0.46630 0.47168 63.0\n',
' 6 0.46447 0.47092 74.7\n',
' 7 0.46269 0.47038 86.4\n',
' 8 0.46113 0.47000 98.0\n',
' 9 0.45960 0.46960 109.6\n',
' 10 0.45811 0.46940 121.2\n',
' 11 0.45660 0.46913 132.5\n',
' 12 0.45509 0.46899 144.3\n',
' 13 0.45366 0.46903\n',
'Auto-stop. Use model at 12th iteration.\n']
With the FFM model trained, we feed the training, validation, and test data to the FFM to obtain the FFM layer's output. The output files are named *.out.logit.

cmd = './libffm/libffm/ffm-predict ./data/train_ffm.txt model_ffm tr_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/valid_ffm.txt model_ffm va_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/test_ffm.txt model_ffm te_ffm.out true'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
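Training the GBDT
Now we call LightGBM to train the GBDT model. Because decision trees overfit easily, we set the number of trees to 32 and the number of leaves to 30, leave the depth unset, and use a learning rate of 0.05.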

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

def lgb_pred(tr_path, va_path, _sep = '\t', iter_num = 32):
    # load or create your dataset
    print('Load data...')
    df_train = pd.read_csv(tr_path, header=None, sep=_sep)
    df_test = pd.read_csv(va_path, header=None, sep=_sep)

    # column 0 is the label, the remaining columns are I1-I13 and C1-C26
    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values

    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    # specify your configurations as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': {'l2', 'auc', 'logloss'},
        'num_leaves': 30,
        # 'max_depth': 7,
        'num_trees': 32,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    print('Start training...')
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=iter_num,
                    valid_sets=lgb_eval,
                    feature_name=["I1","I2","I3","I4","I5","I6","I7","I8","I9","I10","I11","I12","I13","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    categorical_feature=["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    early_stopping_rounds=5)

    print('Save model...')
    # save model to file
    gbm.save_model('lgb_model.txt')

    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

    return gbm,y_pred,X_train,y_train
Training results:

[1] valid_0's l2: 0.241954 valid_0's auc: 0.70607
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 0.234704 valid_0's auc: 0.715608
[3] valid_0's l2: 0.228139 valid_0's auc: 0.717791
[4] valid_0's l2: 0.222168 valid_0's auc: 0.72273
[5] valid_0's l2: 0.216728 valid_0's auc: 0.724065
[6] valid_0's l2: 0.211819 valid_0's auc: 0.725036
[7] valid_0's l2: 0.207316 valid_0's auc: 0.727427
[8] valid_0's l2: 0.203296 valid_0's auc: 0.728583
[9] valid_0's l2: 0.199582 valid_0's auc: 0.730092
[10] valid_0's l2: 0.196185 valid_0's auc: 0.730792
[11] valid_0's l2: 0.193063 valid_0's auc: 0.732316
[12] valid_0's l2: 0.190268 valid_0's auc: 0.733773
[13] valid_0's l2: 0.187697 valid_0's auc: 0.734782
[14] valid_0's l2: 0.185351 valid_0's auc: 0.735636
[15] valid_0's l2: 0.183215 valid_0's auc: 0.736346
[16] valid_0's l2: 0.181241 valid_0's auc: 0.737393
[17] valid_0's l2: 0.179468 valid_0's auc: 0.737709
[18] valid_0's l2: 0.177829 valid_0's auc: 0.739096
[19] valid_0's l2: 0.176326 valid_0's auc: 0.740135
[20] valid_0's l2: 0.174948 valid_0's auc: 0.741065
[21] valid_0's l2: 0.173675 valid_0's auc: 0.742165
[22] valid_0's l2: 0.172499 valid_0's auc: 0.742672
[23] valid_0's l2: 0.171471 valid_0's auc: 0.743246
[24] valid_0's l2: 0.17045 valid_0's auc: 0.744415
[25] valid_0's l2: 0.169582 valid_0's auc: 0.744792
[26] valid_0's l2: 0.168746 valid_0's auc: 0.745478
[27] valid_0's l2: 0.167966 valid_0's auc: 0.746282
[28] valid_0's l2: 0.167264 valid_0's auc: 0.74675
[29] valid_0's l2: 0.166582 valid_0's auc: 0.747429
[30] valid_0's l2: 0.16594 valid_0's auc: 0.748392
[31] valid_0's l2: 0.165364 valid_0's auc: 0.748986
[32] valid_0's l2: 0.164844 valid_0's auc: 0.749362
Did not meet early stopping. Best iteration is:
[32] valid_0's l2: 0.164844 valid_0's auc: 0.749362
Save model...
Start predicting...
The rmse of prediction is: 0.406009502303
Let's rank the features by importance:
def ret_feat_impt(gbm):
    gain = gbm.feature_importance("gain").reshape(-1, 1) / sum(gbm.feature_importance("gain"))
    col = np.array(gbm.feature_name()).reshape(-1, 1)
    return sorted(np.column_stack((col, gain)),key=lambda x: x[1],reverse=True)
[array(['I6', '0.1978774213012332'],
dtype='<U32'), array(['I11', '0.1892171073393491'],
dtype='<U32'), array(['C13', '0.09876586224832032'],
dtype='<U32'), array(['I7', '0.09328723289667494'],
dtype='<U32'), array(['C15', '0.07837089393651243'],
dtype='<U32'), array(['I1', '0.06896606612740637'],
dtype='<U32'), array(['C18', '0.03397325870627491'],
dtype='<U32'), array(['C4', '0.03194220375573926'],
dtype='<U32'), array(['I13', '0.027751948092299045'],
dtype='<U32'), array(['C14', '0.022884477973766117'],
dtype='<U32'), array(['C17', '0.01758709018584479'],
dtype='<U32'), array(['I3', '0.01745531293913725'],
dtype='<U32'), array(['C24', '0.015748415135270675'],
dtype='<U32'), array(['C7', '0.014203757070472703'],
dtype='<U32'), array(['I8', '0.013413268591324624'],
dtype='<U32'), array(['C11', '0.012366386458128355'],
dtype='<U32'), array(['C10', '0.011022221770323784'],
dtype='<U32'), array(['I5', '0.01042866903792042'],
dtype='<U32'), array(['C16', '0.010389410428237439'],
dtype='<U32'), array(['I9', '0.009918639946598076'],
dtype='<U32'), array(['C2', '0.006787009911825981'],
dtype='<U32'), array(['C12', '0.005168884905437884'],
dtype='<U32'), array(['I4', '0.00468917800335175'],
dtype='<U32'), array(['C26', '0.003364625407413743'],
dtype='<U32'), array(['C23', '0.0031263193710805628'],
dtype='<U32'), array(['C21', '0.0008737398560005959'],
dtype='<U32'), array(['C19', '0.00042059860405565207'],
dtype='<U32'), array(['I2', '0.0'],
dtype='<U32'), array(['I10', '0.0'],
dtype='<U32'), array(['I12', '0.0'],
dtype='<U32'), array(['C1', '0.0'],
dtype='<U32'), array(['C3', '0.0'],
dtype='<U32'), array(['C5', '0.0'],
dtype='<U32'), array(['C6', '0.0'],
dtype='<U32'), array(['C8', '0.0'],
dtype='<U32'), array(['C9', '0.0'],
dtype='<U32'), array(['C20', '0.0'],
dtype='<U32'), array(['C22', '0.0'],
dtype='<U32'), array(['C25', '0.0'],
dtype='<U32')]
Analyzing the model with eli5
import eli5

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import csv
import numpy as np

with open('./data/train_eli5.csv', 'rt') as f:
    data = list(csv.DictReader(f))

_all_xs = [{k: v for k, v in row.items() if k != 'clicked'} for row in data]
_all_ys = np.array([int(row['clicked']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
899991 items total, 25.5% true
# from xgboost import XGBClassifier
import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=UserWarning)
class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf = lgb.LGBMClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys) # so that parts of the original pipeline are fitted

evaluate(pipeline)
Accuracy: 0.776 ± 0.003
booster = clf.booster_ # if this raises an error, use clf.booster() instead
original_feature_names = booster.feature_name
booster.feature_names = vec.get_feature_names()
# recover original feature names
booster.feature_names = original_feature_names
from eli5 import show_weights
show_weights(clf, vec=vec)

from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)
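
Generating FM data from the LightGBM output
For the data format, see the description in the libFM 1.4.2 manual.

With the GBDT trained, we pass the leaf nodes it outputs to the FM as the input data X. There are 30 leaf nodes in total, and the data fed to the FM is formatted as index:value pairs for the non-zero entries of X.

A row of real data looks like this: 0 0:31 1:61 2:93 3:108 4:149 5:182 6:212 7:242 8:277 9:310 10:334 11:365 12:401 13:434 14:465 15:491 16:527 17:552 18:589 19:619 20:648 21:678 22:697 23:744 24:770 25:806 26:826 27:862 28:899 29:928 30:955 31:988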
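
The conversion from per-tree leaf indices to such a row can be sketched without LightGBM. The helper below is hypothetical and takes the leaf indices as a plain list; it applies the same arithmetic as the function listed next (leaf index plus num_leaves of the tree times the tree number, plus 1), so that each tree's leaves land in their own id range.

```python
def leaves_to_libfm_line(label, leaf_indices, leaves_per_tree):
    """Turn one sample's per-tree leaf indices into a libFM text row.

    leaf_indices[i] is the leaf that tree i assigned to the sample;
    leaves_per_tree[i] is the number of leaves in tree i.
    """
    parts = [str(label)]
    for tree_idx, leaf in enumerate(leaf_indices):
        # Shift by the tree number so ids from different trees do not
        # collide; the extra +1 keeps every value non-zero.
        value = leaf + leaves_per_tree[tree_idx] * tree_idx + 1
        parts.append('{}:{}'.format(tree_idx, value))
    return ' '.join(parts)


if __name__ == "__main__":
    # Three trees with 30 leaves each; the sample falls in leaves 4, 0 and 29.
    print(leaves_to_libfm_line(0, [4, 0, 29], [30, 30, 30]))
```

Note that this layout stores the shifted leaf id as the value of feature tree_idx, so each row has exactly one entry per tree.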

def generat_lgb2fm_data(outdir, gbm, dump, tr_path, va_path, te_path, _sep = '\t'):
    with open(os.path.join(outdir, 'train_lgb2fm.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid_lgb2fm.txt'), 'w') as out_valid:
            with open(os.path.join(outdir, 'test_lgb2fm.txt'), 'w') as out_test:
                df_train_ = pd.read_csv(tr_path, header=None, sep=_sep)
                df_valid_ = pd.read_csv(va_path, header=None, sep=_sep)
                df_test_= pd.read_csv(te_path, header=None, sep=_sep)

                y_train_ = df_train_[0].values
                y_valid_ = df_valid_[0].values

                X_train_ = df_train_.drop(0, axis=1).values
                X_valid_ = df_valid_.drop(0, axis=1).values
                X_test_= df_test_.values

                train_leaves= gbm.predict(X_train_, num_iteration=gbm.best_iteration, pred_leaf=True)
                valid_leaves= gbm.predict(X_valid_, num_iteration=gbm.best_iteration, pred_leaf=True)
                test_leaves= gbm.predict(X_test_, num_iteration=gbm.best_iteration, pred_leaf=True)

                tree_info = dump['tree_info']
                tree_counts = len(tree_info)
                for i in range(tree_counts):
                    train_leaves[:, i] = train_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    valid_leaves[:, i] = valid_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    test_leaves[:, i] = test_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    # print(train_leaves[:, i])
                    # print(tree_info[i]['num_leaves'])

                for idx in range(len(y_train_)):
                    out_train.write((str(y_train_[idx]) + '\t'))
                    out_train.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(train_leaves[idx]) if float(val) != 0 ]) + '\n')

                for idx in range(len(y_valid_)):
                    out_valid.write((str(y_valid_[idx]) + '\t'))
                    out_valid.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(valid_leaves[idx]) if float(val) != 0 ]) + '\n')

                for idx in range(len(X_test_)):
                    out_test.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii,val in enumerate(test_leaves[idx]) if float(val) != 0 ]) + '\n')
Training the FM
The data for training the FM is now ready, so we call LibFM to train it.
We run 64 iterations with SGD and a learning rate of 0.00000001, and save the trained model to the file fm_model.
In the training log, the Train and Test values are accuracy, not loss.

cmd = './libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim ’1,1,8’ -iter 64 -method sgd -learn_rate 0.00000001 -regular ’0,0,0.01’ -init_stdev 0.1 -save_model fm_model'
os.popen(cmd).readlines()
Training output:

['----------------------------------------------------------------------------\n',
'libFM\n',
' Version: 1.4.4\n',
' Author: Steffen Rendle, srendle@libfm.org\n',
' WWW: http://www.libfm.org/\n',
'This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.\n',
'This is free software, and you are welcome to redistribute it under certain\n',
'conditions; for details see license.txt.\n',
'----------------------------------------------------------------------------\n',
'Loading train...\t\n',
'has x = 1\n',
'has xt = 0\n',
'num_rows=899991\tnum_values=28799712\tnum_features=32\tmin_target=0\tmax_target=1\n',
'Loading test... \t\n',
'has x = 1\n',
'has xt = 0\n',
'num_rows=100009\tnum_values=3200288\tnum_features=32\tmin_target=0\tmax_target=1\n',
'#relations: 0\n',
'Loading meta data...\t\n',
'learnrate=1e-08\n',
'learnrates=1e-08,1e-08,1e-08\n',
'#iterations=64\n',
"SGD: DON'T FORGET TO SHUFFLE THE ROWS IN TRAINING DATA TO GET THE BEST RESULTS.\n",
'#Iter= 0\tTrain=0.625438\tTest=0.619484\n',
'#Iter= 1\tTrain=0.636596\tTest=0.632013\n',
'#Iter= 2\tTrain=0.627663\tTest=0.623114\n',
'#Iter= 3\tTrain=0.609776\tTest=0.606605\n',
'#Iter= 4\tTrain=0.563581\tTest=0.56092\n',
'#Iter= 5\tTrain=0.497907\tTest=0.495655\n',
'#Iter= 6\tTrain=0.461677\tTest=0.461408\n',
'#Iter= 7\tTrain=0.453666\tTest=0.452639\n',
'#Iter= 8\tTrain=0.454026\tTest=0.453419\n',
'#Iter= 9\tTrain=0.456836\tTest=0.455919\n',
'#Iter= 10\tTrain=0.46032\tTest=0.459339\n',
'#Iter= 11\tTrain=0.466546\tTest=0.465358\n',
'#Iter= 12\tTrain=0.473565\tTest=0.472317\n',
'#Iter= 13\tTrain=0.481726\tTest=0.480967\n',
'#Iter= 14\tTrain=0.492357\tTest=0.491216\n',
'#Iter= 15\tTrain=0.504419\tTest=0.502935\n',
'#Iter= 16\tTrain=0.517793\tTest=0.516214\n',
'#Iter= 17\tTrain=0.533604\tTest=0.532102\n',
'#Iter= 18\tTrain=0.552926\tTest=0.5515\n',
'#Iter= 19\tTrain=0.575645\tTest=0.573198\n',
'#Iter= 20\tTrain=0.59418\tTest=0.590887\n',
'#Iter= 21\tTrain=0.610691\tTest=0.607815\n',
'#Iter= 22\tTrain=0.626138\tTest=0.623384\n',
'#Iter= 23\tTrain=0.640751\tTest=0.637923\n',
'#Iter= 24\tTrain=0.65393\tTest=0.652141\n',
'#Iter= 25\tTrain=0.666099\tTest=0.6641\n',
'#Iter= 26\tTrain=0.677933\tTest=0.675419\n',
'#Iter= 27\tTrain=0.689539\tTest=0.687108\n',
'#Iter= 28\tTrain=0.700177\tTest=0.697397\n',
'#Iter= 29\tTrain=0.709265\tTest=0.706156\n',
'#Iter= 30\tTrain=0.716553\tTest=0.713266\n',
'#Iter= 31\tTrain=0.723218\tTest=0.719635\n',
'#Iter= 32\tTrain=0.729163\tTest=0.726065\n',
'#Iter= 33\tTrain=0.734428\tTest=0.731354\n',
'#Iter= 34\tTrain=0.738863\tTest=0.735844\n',
'#Iter= 35\tTrain=0.74284\tTest=0.740323\n',
'#Iter= 36\tTrain=0.746316\tTest=0.743793\n',
'#Iter= 37\tTrain=0.749123\tTest=0.746333\n',
'#Iter= 38\tTrain=0.751573\tTest=0.748493\n',
'#Iter= 39\tTrain=0.753264\tTest=0.750292\n',
'#Iter= 40\tTrain=0.754803\tTest=0.751642\n',
'#Iter= 41\tTrain=0.756011\tTest=0.753062\n',
'#Iter= 42\tTrain=0.756902\tTest=0.753892\n',
'#Iter= 43\tTrain=0.757642\tTest=0.754872\n',
'#Iter= 44\tTrain=0.758293\tTest=0.755372\n',
'#Iter= 45\tTrain=0.758855\tTest=0.755782\n',
'#Iter= 46\tTrain=0.759293\tTest=0.756322\n',
'#Iter= 47\tTrain=0.759695\tTest=0.756652\n',
'#Iter= 48\tTrain=0.760084\tTest=0.756982\n',
'#Iter= 49\tTrain=0.760343\tTest=0.757252\n',
'#Iter= 50\tTrain=0.76055\tTest=0.757332\n',
'#Iter= 51\tTrain=0.760706\tTest=0.757582\n',
'#Iter= 52\tTrain=0.760944\tTest=0.757842\n',
'#Iter= 53\tTrain=0.761035\tTest=0.757952\n',
'#Iter= 54\tTrain=0.761173\tTest=0.758152\n',
'#Iter= 55\tTrain=0.761291\tTest=0.758382\n',
'#Iter= 56\tTrain=0.76142\tTest=0.758412\n',
'#Iter= 57\tTrain=0.761541\tTest=0.758452\n',
'#Iter= 58\tTrain=0.761677\tTest=0.758572\n',
'#Iter= 59\tTrain=0.76175\tTest=0.758692\n',
'#Iter= 60\tTrain=0.761829\tTest=0.758822\n',
'#Iter= 61\tTrain=0.761855\tTest=0.758862\n',
'#Iter= 62\tTrain=0.761918\tTest=0.759002\n',
'#Iter= 63\tTrain=0.761988\tTest=0.758972\n',
'Final\tTrain=0.761988\tTest=0.758972\n',
'Writing FM model to fm_model\n']
With the FM model trained, we feed the training, validation, and test data to the FM to obtain the FM layer's output. The output files are named *.fm.logits.

cmd = './libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim ’1,1,8’ -iter 32 -method sgd -learn_rate 0.00000001 -regular ’0,0,0.01’ -init_stdev 0.1 -load_model fm_model -train_off true -prefix tr'
os.popen(cmd).readlines()
cmd = './libfm/libfm/bin/libFM -task c -train ./data/valid_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim ’1,1,8’ -iter 32 -method sgd -learn_rate 0.00000001 -regular ’0,0,0.01’ -init_stdev 0.1 -load_model fm_model -train_off true -prefix va'
os.popen(cmd).readlines()
cmd = './libfm/libfm/bin/libFM -task c -train ./data/test_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim ’1,1,8’ -iter 32 -method sgd -learn_rate 0.00000001 -regular ’0,0,0.01’ -init_stdev 0.1 -load_model fm_model -train_off true -prefix te -test2predict true'
os.popen(cmd).readlines()
Building the model
embed_dim = 32
sparse_max = 30000 # sparse_feature_dim = 117568
sparse_dim = 26
dense_dim = 13
out_dim = 400
Define the input placeholders

import tensorflow as tf
def get_inputs():
    dense_input = tf.placeholder(tf.float32, [None, dense_dim], name="dense_input")
    sparse_input = tf.placeholder(tf.int32, [None, sparse_dim], name="sparse_input")
    FFM_input = tf.placeholder(tf.float32, [None, 1], name="FFM_input")
    FM_input = tf.placeholder(tf.float32, [None, 1], name="FM_input")

    targets = tf.placeholder(tf.float32, [None, 1], name="targets")
    LearningRate = tf.placeholder(tf.float32, name = "LearningRate")
    return dense_input, sparse_input, FFM_input, FM_input, targets, LearningRate
Feed in the categorical features and fetch their embedding vectors from the embedding layer

def get_sparse_embedding(sparse_input):
    with tf.name_scope("sparse_embedding"):
        sparse_embed_matrix = tf.Variable(tf.random_uniform([sparse_max, embed_dim], -1, 1), name = "sparse_embed_matrix")
        sparse_embed_layer = tf.nn.embedding_lookup(sparse_embed_matrix, sparse_input, name = "sparse_embed_layer")
        sparse_embed_layer = tf.reshape(sparse_embed_layer, [-1, sparse_dim * embed_dim])
    return sparse_embed_layer
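
To make the lookup-then-reshape step concrete, here is a small pure-Python sketch of what happens to a single sample. It is an illustration only, not the TensorFlow code: the toy sizes and the helper lookup_and_flatten are made up for this example. As in the function above, all categorical fields share one embedding table, so every id has to be smaller than the number of rows in that table.

```python
import random

# Toy sizes so the result is easy to inspect; the post uses
# sparse_max = 30000, embed_dim = 32 and sparse_dim = 26.
toy_sparse_max = 10
toy_embed_dim = 4
toy_sparse_dim = 3

random.seed(0)
# Embedding table: one row of length toy_embed_dim per category id.
embed_matrix = [
    [random.uniform(-1, 1) for _ in range(toy_embed_dim)]
    for _ in range(toy_sparse_max)
]


def lookup_and_flatten(sparse_ids):
    """Mimic an embedding lookup followed by a reshape for one sample."""
    flat = []
    for category_id in sparse_ids:
        flat.extend(embed_matrix[category_id])  # pick one row per field
    return flat


sample = [7, 0, 3]  # one category id per categorical field
vector = lookup_and_flatten(sample)
print(len(vector))  # toy_sparse_dim * toy_embed_dim values
```

With the sizes used in the post, the same operation turns 26 ids into one vector of 26 * 32 = 832 values per sample.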
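Feed in the numeric features, concatenate them with the embedding vectors, and pass the result through three fully connected layers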

def get_dnn_layer(dense_input, sparse_embed_layer):
    with tf.name_scope("dnn_layer"):
        input_combine_layer = tf.concat([dense_input, sparse_embed_layer], 1) #(?, 845 = 832 + 13)
        fc1_layer = tf.layers.dense(input_combine_layer, out_dim, name = "fc1_layer", activation=tf.nn.relu)
        fc2_layer = tf.layers.dense(fc1_layer, out_dim, name = "fc2_layer", activation=tf.nn.relu)
        fc3_layer = tf.layers.dense(fc2_layer, out_dim, name = "fc3_layer", activation=tf.nn.relu)
    return fc3_layer
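
The shape comment in the code (845 = 832 + 13) follows directly from the hyperparameters. The arithmetic below is a quick sketch that also counts the trainable parameters of the embedding table and the three dense layers, assuming each dense layer has a bias vector (the default for tf.layers.dense). The helper dense_params is introduced only for this illustration.

```python
embed_dim = 32
sparse_max = 30000
sparse_dim = 26
dense_dim = 13
out_dim = 400


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


embed_width = sparse_dim * embed_dim        # flattened embedding vector
dnn_input_width = dense_dim + embed_width   # input width of fc1_layer

param_counts = {
    "sparse_embed_matrix": sparse_max * embed_dim,
    "fc1_layer": dense_params(dnn_input_width, out_dim),
    "fc2_layer": dense_params(out_dim, out_dim),
    "fc3_layer": dense_params(out_dim, out_dim),
}

print(embed_width, dnn_input_width)
for name, count in param_counts.items():
    print(name, count)
print("total", sum(param_counts.values()))
```

Most of the parameters sit in the embedding table, which is why the size of sparse_max matters more for memory than the width of the dense layers.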
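Build the computation graph
As described earlier, the outputs of the FFM and FM layers each pass through a fully connected layer and are then concatenated with the output of the three fully connected layers that process the numeric features and embedding vectors. Logistic regression is applied on top of the concatenated result.
The loss is LogLoss, minimized with FtrlOptimizer.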

tf.reset_default_graph()
train_graph = tf.Graph()
with train_graph.as_default():
    dense_input, sparse_input, FFM_input, FM_input, targets, lr = get_inputs()
    sparse_embed_layer = get_sparse_embedding(sparse_input)
    fc3_layer = get_dnn_layer(dense_input, sparse_embed_layer)

    ffm_fc_layer = tf.layers.dense(FFM_input, 1, name = "ffm_fc_layer")
    fm_fc_layer = tf.layers.dense(FM_input, 1, name = "fm_fc_layer")
    feature_combine_layer = tf.concat([ffm_fc_layer, fm_fc_layer, fc3_layer], 1) #(?, 402)

    with tf.name_scope("inference"):
        logits = tf.layers.dense(feature_combine_layer, 1, name = "logits_layer")
        pred = tf.nn.sigmoid(logits, name = "prediction")

    with tf.name_scope("loss"):
        # LogLoss: logistic regression onto the click-through rate
        # cost = tf.losses.sigmoid_cross_entropy(targets, logits )
        sigmoid_cost = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits, name = "sigmoid_cost")
        logloss_cost = tf.losses.log_loss(labels=targets, predictions=pred)
        cost = logloss_cost # + sigmoid_cost
        loss = tf.reduce_mean(cost)
    # Minimize the loss
    # train_op = tf.train.AdamOptimizer(lr).minimize(loss) #cost
    global_step = tf.Variable(0, name="global_step", trainable=False)
    optimizer = tf.train.FtrlOptimizer(lr) # tf.train.AdamOptimizer(lr) is an alternative
    gradients = optimizer.compute_gradients(loss) #cost
    train_op = optimizer.apply_gradients(gradients, global_step=global_step)

    # Accuracy
    with tf.name_scope("score"):
        correct_prediction = tf.equal(tf.to_float(pred > 0.5), targets)
        accuracy = tf.reduce_mean(tf.to_float(correct_prediction), name="accuracy")

    # auc, uop = tf.contrib.metrics.streaming_auc(pred, targets)
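
The graph defines two loss tensors, sigmoid_cost (computed from the logits) and logloss_cost (computed from the sigmoid output), and uses only the second. Both express the same binary cross-entropy. The pure-Python sketch below checks that on a few values; the function names are made up for this illustration, and the small eps mirrors the epsilon that tf.losses.log_loss adds inside the logarithm.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss_from_prob(y, p, eps=1e-7):
    """Binary log loss computed from a predicted probability."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def log_loss_from_logit(y, z):
    """Numerically stable form of the same loss, computed from the logit."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


for y, z in [(1, 2.0), (0, 2.0), (1, -3.0), (0, 0.0)]:
    from_prob = log_loss_from_prob(y, sigmoid(z))
    from_logit = log_loss_from_logit(y, z)
    print(y, z, round(from_prob, 4), round(from_logit, 4))
```

The logit form avoids taking the logarithm of a value that has already been rounded to 0 or 1, which is why it tends to be the safer choice for very confident predictions.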
Hyperparameters
The data set is large, so we only run one epoch.

# Number of Epochs
num_epochs = 1
# Batch Size
batch_size = 32

# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 25

save_dir = './save'

ffm_tr_out_path = './tr_ffm.out.logit'
ffm_va_out_path = './va_ffm.out.logit'
fm_tr_out_path = './tr.fm.logits'
fm_va_out_path = './va.fm.logits'
train_path = './data/train.txt'
valid_path = './data/valid.txt'
Read the FFM output

ffm_train = pd.read_csv(ffm_tr_out_path, header=None)
ffm_train = ffm_train[0].values

ffm_valid = pd.read_csv(ffm_va_out_path, header=None)
ffm_valid = ffm_valid[0].values
Read the FM output

fm_train = pd.read_csv(fm_tr_out_path, header=None)
fm_train = fm_train[0].values

fm_valid = pd.read_csv(fm_va_out_path, header=None)
fm_valid = fm_valid[0].values
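
These score files are later joined column by column with the data set, which only works if every file has exactly one score per data row. A small sanity check along these lines can catch a truncated or mismatched file early. It is a sketch under the assumption that each output file holds one number per line (which is what reading column 0 with pandas suggests); read_scores and check_same_length are hypothetical helpers, and the demo file is a throwaway stand-in.

```python
import os
import tempfile


def read_scores(path):
    """Read one float per line, skipping blank lines."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def check_same_length(named_columns):
    """Raise ValueError if the columns do not all have the same length."""
    lengths = {name: len(values) for name, values in named_columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("row counts differ: {}".format(lengths))
    return lengths


# Demo with a temporary file standing in for a real *.fm.logits file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fm.logits")
    with open(demo_path, "w") as handle:
        handle.write("0.12\n-1.5\n0.7\n")
    demo_scores = read_scores(demo_path)

lengths = check_same_length({"fm": demo_scores, "ffm": [0.3, 0.1, -0.2]})
print(demo_scores, lengths)
```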
Read the data set
Read the DNN data together with the FM and FFM outputs and concatenate them

train_data = pd.read_csv(train_path, header=None)
train_data = train_data.values

valid_data = pd.read_csv(valid_path, header=None)
valid_data = valid_data.values

cc_train = np.concatenate((ffm_train.reshape(-1, 1), fm_train.reshape(-1, 1), train_data), 1)
cc_valid = np.concatenate((ffm_valid.reshape(-1, 1), fm_valid.reshape(-1, 1), valid_data), 1)

np.random.shuffle(cc_train)
np.random.shuffle(cc_valid)

train_y = cc_train[:,-1]
test_y = cc_valid[:,-1]

train_X = cc_train[:,0:-1]
test_X = cc_valid[:,0:-1]
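
After the concatenation, each row of train_X starts with the FFM output, then the FM output, then the columns of the data file. The training loop selects features with hard-coded index lists (2 to 14 and 15 to 40). The sketch below derives those lists from dense_dim and sparse_dim instead, assuming the data file stores the 13 numeric columns first, then the 26 categorical columns, then the label, which is what the indices in the training code imply. The names FFM_COL, FM_COL, dense_cols and sparse_cols are introduced only for this example.

```python
dense_dim = 13
sparse_dim = 26

FFM_COL = 0   # FFM output
FM_COL = 1    # FM output
dense_cols = list(range(2, 2 + dense_dim))
sparse_cols = list(range(2 + dense_dim, 2 + dense_dim + sparse_dim))

print(dense_cols)
print(sparse_cols)

# One fake row laid out like a row of train_X (label already removed).
row = (
    [0.4, -0.2]
    + [float(i) for i in range(dense_dim)]
    + list(range(100, 100 + sparse_dim))
)
dense_part = [row[i] for i in dense_cols]
sparse_part = [row[i] for i in sparse_cols]
print(len(row), len(dense_part), len(sparse_part))
```

Deriving the indices this way keeps the feed dictionary consistent if the number of dense or sparse columns ever changes.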
Train the network
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import time
import datetime
from sklearn.metrics import log_loss
from sklearn.model_selection import learning_curve
from sklearn import metrics
def train_model(num_epochs):
    losses = {'train':[], 'test':[]}
    acc_lst = {'train':[], 'test':[]}
    pred_lst = []

    with tf.Session(graph=train_graph) as sess:

        # Keep track of gradient values and sparsity
        grad_summaries = []
        for g, v in gradients:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name.replace(':', '_')), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name.replace(':', '_')), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", loss)
        # acc_summary = tf.scalar_summary("accuracy", accuracy)

        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

        # Inference summaries
        inference_summary_op = tf.summary.merge([loss_summary])
        inference_summary_dir = os.path.join(out_dir, "summaries", "inference")
        inference_summary_writer = tf.summary.FileWriter(inference_summary_dir, sess.graph)

        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        saver = tf.train.Saver()
        for epoch_i in range(num_epochs):

            # Create batch generators for the training and test sets
            train_batches = get_batches(train_X, train_y, batch_size)
            test_batches = get_batches(test_X, test_y, batch_size)

            # Training iterations; record the training loss
            for batch_i in range(len(train_X) // batch_size):
                x, y = next(train_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14],1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40],1),
                    FFM_input: np.reshape(x.take(0,1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1,1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # _ = sess.run([train_op], feed) #cost
                step, train_loss, summaries, _, prediction, acc = sess.run(
                    [global_step, loss, train_summary_op, train_op, pred, accuracy], feed) #cost

                prediction = prediction.reshape(y.shape)
                losses['train'].append(train_loss)

                acc_lst['train'].append(acc)
                train_summary_writer.add_summary(summaries, step) #

                # AUC is undefined when a batch contains only one class
                if 0 < np.mean(y) < 1:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                # Show every <show_every_n_batches> batches
                if (epoch_i * (len(train_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print('{}: Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f} accuracy = {} auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(train_X) // batch_size),
                        train_loss,
                        acc,
                        auc))
                    # print(metrics.classification_report(y, np.float32(prediction > 0.5)))

            # Iterations over the test data
            for batch_i in range(len(test_X) // batch_size):
                x, y = next(test_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14],1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40],1),
                    FFM_input: np.reshape(x.take(0,1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1,1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # Get Prediction
                step, test_loss, summaries, prediction, acc = sess.run(
                    [global_step, loss, inference_summary_op, pred, accuracy], feed) #cost

                # Record the test loss and accuracy
                prediction = prediction.reshape(y.shape)
                losses['test'].append(test_loss)

                acc_lst['test'].append(acc)
                inference_summary_writer.add_summary(summaries, step) #
                pred_lst.append(prediction)

                # AUC is undefined when a batch contains only one class
                if 0 < np.mean(y) < 1:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                time_str = datetime.datetime.now().isoformat()
                if (epoch_i * (len(test_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    print('{}: Epoch {:>3} Batch {:>4}/{} test_loss = {:.3f} accuracy = {} auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(test_X) // batch_size),
                        test_loss,
                        acc,
                        auc))
                    print(metrics.classification_report(y, np.float32(prediction > 0.5)))

        # Save Model
        saver.save(sess, save_dir) #, global_step=epoch_i
        print('Model Trained and Saved')
        save_params((losses, acc_lst, pred_lst, save_dir))
    return losses, acc_lst, pred_lst, save_dir
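
The training loop relies on a get_batches helper that is not shown in this section. A minimal sketch of what such a generator could look like is given below, under the assumption that it yields consecutive slices of exactly batch_size rows and silently drops the incomplete final batch (the loop above calls next() exactly len(X) // batch_size times and reshapes every batch to [batch_size, 1]). The name get_batches_sketch marks it as a stand-in, not the project's actual helper.

```python
def get_batches_sketch(features, labels, batch_size):
    """Yield consecutive (features, labels) slices of exactly batch_size rows.

    Hypothetical stand-in for the helper used by the training loop;
    rows that do not fill a complete batch are dropped.
    """
    n_full = len(features) // batch_size
    for i in range(n_full):
        start = i * batch_size
        end = start + batch_size
        yield features[start:end], labels[start:end]


demo_X = [[i, i * 10] for i in range(10)]
demo_y = [i % 2 for i in range(10)]
batches = list(get_batches_sketch(demo_X, demo_y, batch_size=4))
print(len(batches))  # 10 rows with batch_size 4 give 2 full batches
```

Slicing works the same way on NumPy arrays, so the sketch applies equally to train_X and train_y.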
losses, acc_lst, pred_lst, load_dir = train_model(1)
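Print the training information on the validation set
Mean accuracy
Mean loss
Mean AUC
Mean predicted click-through rate
Precision, recall, F1 score and related figures
Most samples are negative and positives are comparatively rare, so a model that always predicts 0 already reaches roughly 75% accuracy. Accuracy is therefore not a trustworthy metric here.

We need to watch precision and recall on the positive class, but above all the LogLoss value, because the competition is scored on LogLoss rather than AUC.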

def train_info():
    print("Test Mean Acc : {}".format(np.mean(acc_lst['test']))) #test_pred_mean
    print("Test Mean Loss : {}".format(np.mean(losses['test']))) #test_pred_mean
    print("Mean Auc : {}".format(metrics.roc_auc_score(test_y[:-9], np.array(pred_lst).reshape(-1, 1))))
    print("Mean prediction : {}".format(np.mean(np.array(pred_lst).reshape(-1, 1))))
    print(metrics.classification_report(test_y[:-9], np.float32(np.array(pred_lst).reshape(-1, 1) > 0.5)))
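
The slice test_y[:-9] in train_info is there because the evaluation loop only processes full batches, so the last few labels never receive a prediction. The sketch below shows the arithmetic. The row count of 100009 is an inference, not a figure stated in the post: it is the 100000 rows in the report plus the 9 rows that the slice drops. rows_covered is a hypothetical helper.

```python
def rows_covered(n_rows, batch_size):
    """Number of rows that fall into full batches."""
    return (n_rows // batch_size) * batch_size


batch_size = 32
n_valid_rows = 100009   # assumed: 100000 reported rows + 9 dropped rows
covered = rows_covered(n_valid_rows, batch_size)
dropped = n_valid_rows - covered
print(covered, dropped)

# Slicing by the covered count also works when nothing is dropped,
# whereas labels[:-0] would return an empty sequence.
labels = [0, 1, 0, 1, 1]
print(labels[:rows_covered(len(labels), 2)])
```

Computing the slice from the batch size avoids having to update the hard-coded 9 whenever the validation set or batch size changes.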
Test Mean Acc : 0.7814300060272217
Test Mean Loss : 0.46838584542274475
Mean Auc : 0.7792937214782675
Mean prediction : 0.2552148997783661
             precision    recall  f1-score   support

        0.0       0.81      0.93      0.86     74426
        1.0       0.63      0.34      0.45     25574

avg / total       0.76      0.78      0.76    100000
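
To put these numbers in context, the class counts in the report can be turned into two trivial baselines: the accuracy of always predicting "no click", and the LogLoss of always predicting the positive rate. This is a back-of-the-envelope sketch based only on the support column above, not a result produced by the original notebook.

```python
import math

negatives = 74426   # support of class 0.0 in the validation report
positives = 25574   # support of class 1.0 in the validation report
total = negatives + positives

base_rate = positives / total

# Accuracy of a model that always predicts "no click".
all_zero_accuracy = negatives / total

# Log loss of a model that always predicts the base rate.
constant_log_loss = -(
    base_rate * math.log(base_rate)
    + (1 - base_rate) * math.log(1 - base_rate)
)

print("all-zero accuracy: {:.4f}".format(all_zero_accuracy))
print("constant-prediction log loss: {:.4f}".format(constant_log_loss))
```

The all-zero baseline already lands near 74% accuracy, so the reported 78% is only a small gain, while the constant-prediction LogLoss comes out at roughly 0.57, noticeably worse than the 0.468 reported above. That is the sense in which LogLoss is the more informative metric here.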
Viewing the loss in TensorBoard

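Summary
That is the complete click-through rate prediction pipeline. The model was not trained on the full data set, and many hyperparameters could still be tuned. After a single epoch, the LogLoss on the validation set is about 0.47 (0.468), and the other figures (accuracy and AUC) fall between 75% and 80%, which is close to the accuracy reached when training the FFM, GBDT and FM networks.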

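Further reading
Code for the 3rd place finish for Avazu Click-Through Rate Prediction
Kaggle: Display Advertising Challenge (CTR prediction)
用机器学习对CTR预估建模 (Modeling CTR prediction with machine learning)
Beginner's Guide to Click-Through Rate Prediction with Logistic Regression
2nd place solution for Avazu click-through rate prediction competition
常见计算广告点击率预估算法总结 (A summary of common CTR prediction algorithms in computational advertising)
3 Idiots' Approach for Display Advertising Challenge
Solution to the Outbrain Click Prediction competition
Deep Interest Network for Click-Through Rate Prediction
Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction
重磅！阿里妈妈首次公开自研CTR预估核心算法MLR (Alimama publishes MLR, its in-house core CTR prediction algorithm, for the first time)
阿里盖坤团队提出深度兴趣网络，更懂用户什么时候会剁手 (Alibaba's Gai Kun team proposes the Deep Interest Network to better understand when users will buy)
深入FFM原理与实践 (FFM in depth: principles and practice)
That's all for today's post.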
---------------------
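Author: 你先等等
Source: CSDN
Original post: https://blog.csdn.net/chengcheng1394/article/details/78940565
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.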

posted @ 2019-05-14 10:46 Django's blog
