Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 7

I. Before You Start

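 This article builds on:

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 1

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 2

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 3

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 4

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 5

  Event Recommendation Engine Challenge: Step-by-Step Walkthrough, Step 6

 Readers should go through those six articles first.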

 

II. Model Building and Prediction

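 Once the features from the previous steps have been built, there are many ways to train a model and make predictions. Here we use SGDClassifier from sklearn. In practice xgboost gives better results: most of our features are dense floating-point values, which suits GBDT-style models well.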
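To make the GBDT remark concrete, here is a minimal sketch rather than the pipeline used in this series. It swaps in scikit-learn's GradientBoostingClassifier as a stand-in for xgboost and uses synthetic data in place of data_train.csv. The sample count, the target rule, and the hyperparameters are all assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for data_train.csv: 7 dense float features per row.
rng = np.random.RandomState(0)
X = rng.rand(500, 7)

# Hypothetical non-linear target, picked only so the demo has some signal.
y = ((X[:, 0] * X[:, 1] + X[:, 2] ** 2) > 0.6).astype(int)

# Gradient-boosted trees (stand-in for xgboost); hyperparameters are illustrative.
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

# 5-fold cross-validated accuracy, one score per fold.
gbdt_scores = cross_val_score(gbdt, X, y, cv=5, scoring="accuracy")
print("GBDT mean accuracy: %.3f" % gbdt_scores.mean())
```

With the real features, the same two calls could be pointed at the seven feature columns and the interested column; whether boosting actually beats the linear model there has to be checked by cross-validation.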

 Pay attention to cross-validation: here we use 10-fold cross-validation.
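As a small illustration of what a 10-fold split does, the sketch below runs KFold over a toy array and checks that every row lands in exactly one test fold. The row count of 25 is an assumption made to keep the output readable; the real training set is far larger.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 25                      # hypothetical row count, for illustration only
X = np.zeros((n_rows, 7))        # placeholder feature matrix with 7 columns

kfold = KFold(n_splits=10, shuffle=False)

test_counts = np.zeros(n_rows, dtype=int)  # how often each row is held out
fold_sizes = []                            # size of each held-out fold
overlap_found = False                      # True if a row is in train and test

for train_idx, test_idx in kfold.split(X):
    if np.intersect1d(train_idx, test_idx).size > 0:
        overlap_found = True
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))

print(fold_sizes)
print(test_counts)
```

Because shuffle=False, the folds are contiguous blocks of rows in file order. If the training file is sorted (for example by user), passing shuffle=True with a fixed random_state may give more representative folds.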

 

import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
import warnings
warnings.filterwarnings('ignore')

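def train():
    """
    Train a classifier on the engineered features; the target is 1 (interested) or 0 (not interested).
    """
    trainDf = pd.read_csv('data_train.csv')
    # Select the seven feature columns as a plain 2-D ndarray
    # (recent scikit-learn releases reject np.matrix input).
    X = trainDf[['invited', 'user_reco', 'evt_p_reco',
                 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']].to_numpy()
    y = np.array(trainDf.interested)
    
    # Logistic regression fitted by SGD; scikit-learn < 1.1 spells this loss 'log'.
    clf = SGDClassifier(loss='log_loss', penalty='l2')
    clf.fit(X, y)
    return clf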

def validate():
    """
    Run 10-fold cross-validation and print the per-fold and the average accuracy.
    """
    trainDf = pd.read_csv('data_train.csv')
    # Plain 2-D ndarray of the seven feature columns
    # (recent scikit-learn releases reject np.matrix input).
    X = trainDf[['invited', 'user_reco', 'evt_p_reco',
                 'evt_c_reco', 'user_pop', 'frnd_infl', 'evt_pop']].to_numpy()
    y = np.array(trainDf.interested)
    
    kfold = KFold(n_splits=10, shuffle=False)
    avgAccuracy = 0
    run = 0
    for train_idx, test_idx in kfold.split(X, y):
        Xtrain, Xtest = X[train_idx], X[test_idx]
        ytrain, ytest = y[train_idx], y[test_idx]
        # scikit-learn < 1.1 spells this loss 'log'.
        clf = SGDClassifier(loss='log_loss', penalty='l2')
        clf.fit(Xtrain, ytrain)
        # Fraction of held-out rows predicted correctly in this fold
        accuracy = np.mean(clf.predict(Xtest) == ytest)
        print('accuracy(run %d) : %f' % (run, accuracy))
        # Accumulate so the average promised in the docstring can be reported
        avgAccuracy += accuracy
        run += 1
    print('average accuracy : %f' % (avgAccuracy / run))
        
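def test(clf):
    """
    Read the test data and generate predictions with the classifier.
    """
    origTestDf = pd.read_csv("test.csv")
    users = origTestDf.user
    events = origTestDf.event
    
    # data_test.csv is expected to hold the same feature columns, in the same order, as in training
    testDf = pd.read_csv("data_test.csv")
    nrows = len(testDf)
    Xp = testDf.to_numpy()
    yp = np.zeros((nrows, 2))
    
    # Predict all rows in one call: column 0 is the 0/1 label,
    # column 1 the signed distance to the decision boundary.
    yp[:, 0] = clf.predict(Xp)
    yp[:, 1] = clf.decision_function(Xp)
    
    # The with-block closes the file even if a write fails
    with open("result.csv", 'w') as fout:
        fout.write(",".join(["user", "event", "outcome", "dist"]) + "\n")
        for i in range(nrows):
            fout.write(",".join(map(str, [users[i], events[i], yp[i, 0], yp[i, 1]])) + "\n")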
        
clf = train()
validate()
test(clf)
print('done')
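The dist column written to result.csv is the classifier's signed distance from the decision boundary, so larger values mean the model leans more strongly toward "interested". One plausible use, sketched below as an assumption rather than as part of the original script, is to rank each user's events by dist. The rows are made-up values laid out like result.csv.

```python
import pandas as pd

# Hypothetical rows in the same layout as result.csv (user, event, outcome, dist).
result = pd.DataFrame({
    "user":    [1, 1, 1, 2, 2],
    "event":   [10, 11, 12, 20, 21],
    "outcome": [0, 1, 1, 1, 0],
    "dist":    [-0.4, 1.3, 0.2, 0.7, -1.1],
})

# Sort by user, then by descending dist, and collect each user's events in
# that order. groupby keeps the row order inside each group.
ranked = (
    result.sort_values(["user", "dist"], ascending=[True, False])
          .groupby("user")["event"]
          .apply(list)
)
print(ranked.to_dict())  # {1: [11, 12, 10], 2: [20, 21]}
```

On the real output, replacing the hand-made frame with pd.read_csv("result.csv") should give one ordered event list per user.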

 

 

III. Acknowledgments

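 This article draws on an earlier post (linked in the original). Thanks to its author for sharing, although I think that post contains a few minor problems.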

 

 

 

posted @ 2019-03-12 18:16  1直在路上1