DC Competition: Predicting College Student Financial Aid
This competition was completed with a combination of Python and SQL. It could of course be done in Python alone; SQL mainly came into play during the initial data processing.
Here is an overview of what will be covered:
1. Feature selection and processing (initial processing)
2. Modeling (Bayes classifier, random forest)
2.1 Implementation of the overall process
2.2 Cross-validation
2.3 Some thoughts on tuning
--------------------------------------------------------------------------------------------------------------------------------------------
1. Initial data processing:
Getting to know the raw data:
http://www.pkbigdata.com/common/cmpt/%E5%A4%A7%E5%AD%A6%E7%94%9F%E5%8A%A9%E5%AD%A6%E9%87%91%E7%B2%BE%E5%87%86%E8%B5%84%E5%8A%A9%E9%A2%84%E6%B5%8B_%E7%AB%9E%E8%B5%9B%E4%BF%A1%E6%81%AF.html
The competition page tells us there are six sets of student behavior data (book borrowing, library entry/exit, dorm entry/exit, daily card spending, college/department information, and financial aid records).
First, we all have at least some intuition about how financial aid works as a "business": the overall impression is that it correlates most strongly with grades and family circumstances, and that awards are decided per college or per class. With these basic notions in mind, we arrive at a first set of features:
# -*- coding:utf-8 -*-
import numpy as np
from pandas import DataFrame, Series
import pandas as pd
from datetime import datetime

filedress = r'E:\BaiduYunDownload\finalTest\final_test\final_test\library_final_test.txt'
lines = open(filedress).readlines()
df = []
for line in lines:
    # normalize the quoting, then split each record into its three fields
    ordata = line.replace(',"', '|').replace('"', '').strip().split('|')
    df.append(ordata)
dfile = pd.DataFrame(df)
dfile.columns = ['id', 'door', 'eztime']
dfile = dfile.drop_duplicates()

### Flag weekend records
def workday(row):
    timetuple = datetime.strptime(row['eztime'], '%Y/%m/%d %H:%M:%S')
    wordays = timetuple.isoweekday()
    # isoweekday() returns Mon=1 .. Sun=7, so weekend days are 6 and 7
    return 1 if wordays in [6, 7] else 0

dfile['workday'] = dfile.apply(workday, axis=1)

### Classify each timestamp (outside the semester / in an exam month / outside exam months)
def timeclass(row):
    fmt = '%Y/%m/%d %H:%M:%S'
    t = datetime.strptime(row['eztime'], fmt)
    if t > datetime.strptime('2015/09/01 00:00:00', fmt):
        return 'out2013'     # after the window we care about
    elif (datetime.strptime('2014/12/01 00:00:00', fmt) < t < datetime.strptime('2015/02/01 00:00:00', fmt)
          or datetime.strptime('2015/06/01 00:00:00', fmt) < t < datetime.strptime('2015/08/01 00:00:00', fmt)):
        return 'kaoshizhou'  # within an exam month
    else:
        return 'kaoshiwai'   # outside the exam months

dfile['timeclass'] = dfile.apply(timeclass, axis=1)
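The flags above are still one row per swipe record; before joining with the other tables they have to be rolled up to one row per student. A minimal sketch of that aggregation (not from the original post; it assumes pandas >= 0.25 named aggregation, and the output column names are illustrative):

# Sketch (assumed, not the author's code): aggregate the library log per student
lib_feats = dfile.groupby('id').agg(
    libtimes=('eztime', 'count'),       # total library entry records
    weekend_visits=('workday', 'sum'),  # how many fell on a weekend
).reset_index()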
With the basic processing done, the next step is to fill in the missing values; some columns also need standardization because their scales differ too widely. Here is the code:
import pandas as pd
import numpy as np
import random as rmd

df = pd.read_csv(r"C:\Users\ai\Desktop\mongo3.2\traindate.csv")
df.columns   # inspect the columns
del df[' ']  # drop the stray unnamed column left by the export

# map book-category codes to integers
btype_key = {
    '1': 99, 'V': 1, 'H': 2, 'R': 3, 'U': 4, 'P': 5, 'I': 6, 'K': 7,
    'TP': 8, 'TN': 9, 'D': 10, 'Q': 11, 'C': 12, 'T': 13, 'B': 14,
    'X': 15, 'J': 16, 'A': 17, 'non': 18, 'O': 19, 'Z': 20, 'N': 21,
    'F': 22, 'G': 23, 'E': 24, 'S': 25,
}
df['dbtype'] = df.BTYPE.map(btype_key)

# fill missing values: zeros or sentinels for counts, column means for spending
df['SCORE'].fillna(0, inplace=True)
df['rank'].fillna(9999, inplace=True)
df['ACADEMY'].fillna(99, inplace=True)
df['SUMMON'].fillna(df['SUMMON'].mean(), inplace=True)
df['CARDACTION'].fillna(df['CARDACTION'].mean(), inplace=True)
df['monmean'].fillna(df['monmean'].mean(), inplace=True)
df['LIBTIMES'].fillna(0, inplace=True)
df['LIBEARLY'].fillna(0, inplace=True)
df['LIBLATE'].fillna(0, inplace=True)
df['DOUTIMES'].fillna(0, inplace=True)
df['DEARLYUP'].fillna(0, inplace=True)
df['DLATEUP'].fillna(0, inplace=True)
df['DWEEKENDOUT'].fillna(0, inplace=True)
df['MAXBORR'].fillna(0, inplace=True)
df['BORRACTION'].fillna(0, inplace=True)

## http://www.cnblogs.com/chaosimple/p/4153167.html
## sklearn's preprocessing module handles standardization; see the page above for details
from sklearn import preprocessing
pp = preprocessing.scale(df['SUMMON'])
p1 = preprocessing.scale(df['CARDACTION'])
p2 = preprocessing.scale(df['monmean'])
p3 = preprocessing.scale(df['DWEEKENDOUT'])
p4 = preprocessing.scale(df['LIBTIMES'])
p5 = preprocessing.scale(df['BORRACTION'])
df['newsummon'] = pd.DataFrame(pp)
df['newcardtion'] = pd.DataFrame(p1)
df['newmonmean'] = pd.DataFrame(p2)
df['newdweekout'] = pd.DataFrame(p3)
df['newlibtimes'] = pd.DataFrame(p4)
df['newborraction'] = pd.DataFrame(p5)
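For reference, preprocessing.scale standardizes a column to zero mean and unit variance. The six near-identical calls above can also be collapsed into a loop; an equivalent refactor sketch (same behavior, just tidier):

# Sketch: the same standardization as above, driven by a
# source-column -> new-column mapping
cols_to_scale = {
    'SUMMON': 'newsummon',
    'CARDACTION': 'newcardtion',
    'monmean': 'newmonmean',
    'DWEEKENDOUT': 'newdweekout',
    'LIBTIMES': 'newlibtimes',
    'BORRACTION': 'newborraction',
}
for src, dst in cols_to_scale.items():
    df[dst] = preprocessing.scale(df[src])  # zero mean, unit variance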
At this point the initial processing is essentially done. However, our statistics show that roughly 85% of students receive no aid at all, and a class ratio that skewed makes for bad training data for any classifier. So we adjust the data structurally: stratified sampling, here in the form of downsampling the majority class.
Concretely: first split the data into train and test sets with train_test_split (which can produce stratified splits via its stratify parameter), then pull the proportion of no-aid students in the training set down a bit. Roughly 2:1 is generally good enough. The implementation:
# note: sklearn.cross_validation was later renamed sklearn.model_selection
from sklearn.cross_validation import train_test_split

data_train, data_test = train_test_split(df, test_size=0.3)  # hold out the test set

# hand-rolled random downsampling of the DataFrame's majority class
target = data_train[data_train.SCHOLARSHIP == 0].reset_index(drop=True)   # no-aid students
target1 = data_train[data_train.SCHOLARSHIP != 0].reset_index(drop=True)  # aid recipients
tlist = rmd.sample(range(len(target)), 2000)       # keep 2000 random no-aid rows
targetc = target.loc[tlist].reset_index(drop=True)
newtarget = pd.concat([targetc, target1]).reset_index(drop=True)
data_train = newtarget  # the final training set
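A quick sanity check on the resulting balance (a sketch, not from the original post; it assumes the SCHOLARSHIP column and data_train from above):

# Sketch: confirm the downsampled training set is close to the intended 2:1
n_none = (data_train.SCHOLARSHIP == 0).sum()
n_aid = (data_train.SCHOLARSHIP != 0).sum()
print('no-aid : aid = %.2f : 1' % (float(n_none) / n_aid))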
Everything above covers the initial data processing; the next post walks through model building and tuning, to be continued tomorrow.