酒店订房预测

提出问题:同一orderid下,有且仅有一个orderlabel为1的售卖房型。本项目的目的就是预测哪一个售卖房型(roomid)是用户最终预订的。
本项目主要分为两部分:
一.先对测试数据集的基本字段做一个简单的分析。
二.房型预测模型。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file = '../input/ctrip/competition_train.txt'

一.先对测试数据集的基本字段做一个简单的分析。
首先将训练集中的13个基本字段orderid、uid、orderdate、hotelid、basicroomid、roomid、orderlabel、star、rank 、returnvalue、price_deduct、basic_minarea、basic_maxarea信息提取出来进行分析。

basic_colnames = ['orderid','uid','orderdate','hotelid','basicroomid','roomid', 'orderlabel', 'star', 
                  'rank', 'returnvalue', 'price_deduct', 'basic_minarea', 'basic_maxarea']
basic_data = pd.read_csv(file, sep='\t', header=0, usecols=basic_colnames)

#basic_data.info()
#basic_data['orderid'].value_counts()
#basic_data['orderlabel'].value_counts()
#basic_data['uid'].value_counts()
#basic_data['hotelid'].value_counts()
#basic_data['basicroomid'].value_counts()
#basic_data['roomid'].value_counts()
#basic_data['star'].value_counts()
#basic_data['orderdate'].value_counts()

训练集一共有7724875条记录;
一共有211580个订单号(oederid),其中ORDER_12473322的纪录最多有686条;
成功订购211580份订单(orderlable=1);
一共有187162个用户(uid),其中USER_615018的纪录最多有2105条;
一共有51209家酒店(hotelid),其中HOTEL_135032的记录最多有34376条;
一共有277373种物理房型号(basicroomid),其中BASIC_205234的纪录最多有6589条;
一共有2924400种房型号(roomid),其中ROOM_18387063的纪录最多有204条。
酒店的级别只有4种,5、7、9、11,并且纪录数依次上升;
数据集的订单订购时间是从2013年4月14号至4月20号一周的时间。

basic_data.describe()

orderlabel	star	rank	returnvalue	price_deduct	basic_minarea	basic_maxarea
count	7.724875e+06	7.724875e+06	7.724875e+06	7.724875e+06	7.724875e+06	6.855372e+06	6.855372e+06
mean	2.738944e-02	8.843464e+00	1.156930e+01	2.443093e+02	2.893330e+03	8.692999e+01	8.914625e+01
std	1.632154e-01	2.167494e+00	1.124343e+01	2.519877e+02	6.910387e+03	8.960929e+01	9.037341e+01
min	0.000000e+00	5.000000e+00	3.000000e+00	2.000000e+02	2.300000e+02	-1.000000e+00	-1.000000e+00
25%	0.000000e+00	7.000000e+00	5.000000e+00	2.000000e+02	1.034000e+03	5.300000e+01	5.700000e+01
50%	0.000000e+00	9.000000e+00	9.000000e+00	2.000000e+02	1.673000e+03	7.100000e+01	7.300000e+01
75%	0.000000e+00	1.100000e+01	1.500000e+01	2.000000e+02	3.029000e+03	9.900000e+01	1.010000e+02
max	1.000000e+00	1.100000e+01	2.450000e+02	6.088100e+04	6.001970e+05	1.416100e+04	1.416100e+04

发现异常数据,basic_minarea和basic_maxarea这两列数据缺失,且其最小值小于0,这是不合理的。

#basic_data.isnull().any()
#basic_data[train_data['basic_minarea']<=0]
#basic_data[train_data['basic_maxarea']<=0]
basic_minarea小于0的时候,basic_maxarea也小于0,且同时等于-1,说明应该是数据缺失的情况下用-1进行填补。将所有有无效值的纪录丢弃,并且将成功预定的数据提取出来(orderlabel=1)

basic_data1 = basic_data.loc[(basic_data['basic_minarea']>0) & (basic_data['basic_maxarea']>0)]
basic_data1 = basic_data1[basic_data1['orderlabel']==1]

#basic_data1.info()
#basic_data1.describe()
#basic_data1['uid'].value_counts()
#basic_data1['hotelid'].value_counts()
#basic_data1['basicroomid'].value_counts()
#basic_data1['roomid'].value_counts()
#basic_data1['star'].value_counts()

经过数据清洗可以看出成功订购的有效纪录为179667条;
一共有160471个用户(uid),其中USER_609501成功订购次数最多为48次;
一共有41837家酒店(hotelid),其中HOTEL_132727被成功订购次数最多为225次;
一共有76861种物理房型号(basicroomid),其中BASIC_463407房型被成功订购次数最多为209次;
一共有134512种房型号(roomid),其中ROOM_22089993房间型号被成功订购次数最多为76次。
查看订购房型的价格分布

plt.figure(figsize=(15, 5))
plt.subplot(1,2,1)
plt.hist(basic_data1['price_deduct'])
plt.xlabel('价格')
plt.ylabel('订单数')
plt.title('房型订单价格分布')

#观察到价格主要在5000以内,看5000以内的分布
price_data = basic_data1[basic_data1['price_deduct']<=5000]['price_deduct']
plt.subplot(1,2,2)
plt.hist(price_data,bins=100)
plt.xlabel('价格')
plt.ylabel('订单数')
plt.title('房型订单价格分布')
plt.show()

房型价格主要在5000以内,其中1000附近的房型最多。
In [7]:
#预定酒店的星级分布
plt.figure(figsize=(15, 5))
plt.subplot(1,2,1)
basic_data1['star'].value_counts().plot(kind='pie',autopct='%.1f%%')
plt.title('酒店星级分布')

#分析酒店星级与价格之间的关系
star = [5,7,9,11]
price = []
for i in star:
    tmp = basic_data1.loc[basic_data1['star']==i]['price_deduct'].mean()
    price.append(tmp)
plt.subplot(1,2,2)
plt.bar(star, price)
for (a, b) in zip(star, price):
    plt.text(a, b+0.001, '%.1f'% b, ha='center', va='bottom', fontsize=10)
plt.xlabel('酒店星级')
plt.xticks(star)
plt.ylabel('平均订购价格')
plt.title('酒店星级 VS 订购价格')
plt.show()

酒店星级越高,订购价格越贵。
二.房型预测模型
以上都是一些基本字段的描述性分析,并未考虑到用户的历史行为数据。
在训练过程中,需要考虑的特征量是很多的。
1.数据选取(训练集与测试集)
因为比赛给出的测试数据集中没有orderlabel这一列,即无法知道每个orderid对应的orderlabel=1的记录,无法判断算法的准确率,所以我们的训练集和测试集都从‘competition_train.txt’中选取。从前面的描述性分析中得知,测试数据中有接近800万条记录,没法处理这么大的数据量,所以我们仅选取50万条记录,并按照8:2的原则划分为训练集和测试集。

import pandas as pd
import numpy as np
import re
import lightgbm as lgb

#数据
data = pd.read_csv(file, sep = '\t',nrows = 500000)
#data.info()

2.数据清洗
由于数据量庞大,特征字段就达到157个。所以我们数据清洗的主要任务:
1.将数据集中某一个属性全部缺失的去掉;
2.将缺失值全部补充为-1;
3.纠正数据集中的错误属性。
4.为了方便算法训练,去除数据中的字母等文字,并全部转化为int或float类型;

#data.describe()
#data.isnull.all()
#将数据集中某一个属性全部缺失的去掉
data = data.dropna(how='all',axis=1)
#data.info()
#data.isnull().any()
#将缺失值全部补充为-1
data = data.fillna(-1)
#data.head()
去掉的属性一共有25个。
'orderbehavior_3_ratio_1month','orderbehavior_4_ratio_1month','orderbehavior_5_ratio_1month','orderid_lastord',
'orderdate_lastord','hotelid_lastord','roomid_lastord','basicroomid_lastord','rank_lastord','return_lastord',
'price_last_lastord','roomservice_2_lastord','roomservice_3_lastord','roomservice_4_lastord','roomservice_5_lastord',
'roomservice_6_lastord','roomservice_8_lastord','roomtag_2_lastord','roomtag_3_lastord','roomtag_4_lastord',
'roomtag_5_lastord','roomtag_6_lastord','star_lastord','hotel_minprice_lastord','basic_minprice_lastord'

#数据中有一处有错,因为roomservice_5的取值只有0,1,2,所以user_roomservice_5_345ratio应该是user_roomservice_8_345ratio
data["user_roomservice_8_345ratio"]=data["user_roomservice_5_345ratio"]
del data["user_roomservice_5_345ratio"]

#数据处理--去除字母等文字
def remove(x):
    try:
        return re.search("\d+", x).group()
    except:
        return 0

dt1=pd.to_datetime(data["orderdate"])
data["orderdate"]=dt1.dt.dayofyear
for c in ["orderid","uid","hotelid","basicroomid","roomid"]:
    data[c] = data[c].apply(remove)

data = data.apply(pd.to_numeric, errors='ignore')

3.构造特征
经过数据清洗之后,原有的157个特征量剩下了132个。我们可以构建一些新的特征量来提高预测模型的准确率。

# 每个basicid价格的中位数
def df_median(df):
    add = pd.DataFrame(df.groupby(["orderid", "basicroomid"]).price_deduct.median()).reset_index()
    add.columns = ["orderid", "basicroomid", "basicroomid_price_deduct_median"]
    df = df.merge(add, on=["orderid", "basicroomid"], how="left")
    return df

# 每个basicid价格的最小值
def df_min(df):
    add = pd.DataFrame(df.groupby(["orderid", "basicroomid"]).price_deduct.min()).reset_index()
    add.columns = ["orderid", "basicroomid", "basicroomid_price_deduct_min"]
    df = df.merge(add, on=["orderid", "basicroomid"], how="left")
    return df

# 每个orderid价格的最小值
def df_min_orderid(df):
    add = pd.DataFrame(df.groupby(["orderid"]).price_deduct.min()).reset_index()
    add.columns = ["orderid", "orderid_price_deduct_min"]
    df = df.merge(add, on=["orderid"], how="left")
    return df

#排序特征
def df_rank_mean(df):
    add = pd.DataFrame(df.groupby(["basicroomid"]).orderid_price_deduct_min_rank.mean()).reset_index()
    add.columns = ["basicroomid","orderid_price_deduct_min_rank_mean"]
    df = df.merge(add, on=["basicroomid"], how="left")
    return df

def df_roomrank_mean(df):
    add = pd.DataFrame(df.groupby(["roomid"]).basicroomid_price_rank.mean()).reset_index()
    add.columns = ["roomid","basicroomid_price_rank_mean"]
    df = df.merge(add, on=["roomid"], how="left")
    return df
    
In [7]:
def merge_count(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].count()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_nunique(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].nunique()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_median(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].median()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_mean(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].mean()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_sum(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].sum()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_max(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].max()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_min(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].min()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

def merge_std(df,columns,value,cname):
    add = pd.DataFrame(df.groupby(columns)[value].std()).reset_index()
    add.columns=columns+[cname]
    df=df.merge(add,on=columns,how="left")
    return df

all = data.copy()

all["user_roomservice_2_0ratio"]=1-all["user_roomservice_2_1ratio"]
all["user_roomservice_3_0ratio"]=1-all["user_roomservice_3_123ratio"]
all["user_roomservice_5_0ratio"]=1-all["user_roomservice_5_1ratio"]
all["user_roomservice_7_1ratio"]=1-all["user_roomservice_7_0ratio"]
all["user_roomservice_8_2ratio"]=1-all["user_roomservice_8_345ratio"]-all["user_roomservice_8_1ratio"]

all["user_roomservice_4_1ratio_3month"] = 1 - all["user_roomservice_4_0ratio_3month"] - all["user_roomservice_4_2ratio_3month"] - all["user_roomservice_4_3ratio_3month"] - all["user_roomservice_4_4ratio_3month"] - all["user_roomservice_4_5ratio_3month"]
all["user_roomservice_4_1ratio_1month"] = 1 - all["user_roomservice_4_0ratio_1month"] - all["user_roomservice_4_2ratio_1month"] - all["user_roomservice_4_3ratio_1month"] - all["user_roomservice_4_4ratio_1month"] - all["user_roomservice_4_5ratio_1month"]
all["user_roomservice_4_1ratio_1week"] = 1 - all["user_roomservice_4_0ratio_1week"] - all["user_roomservice_4_2ratio_1week"] - all["user_roomservice_4_3ratio_1week"] - all["user_roomservice_4_4ratio_1week"] - all["user_roomservice_4_5ratio_1week"]

all["user_roomservice_2_max"] = np.argmax(all[["user_roomservice_2_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_3_max"] = np.argmax(all[["user_roomservice_3_%sratio" % i for i in [0,123]]].values, axis=1)
all["user_roomservice_4_max"] = np.argmax(all[["user_roomservice_4_%sratio" % i for i in range(6)]].values, axis=1)
all["user_roomservice_5_max"] = np.argmax(all[["user_roomservice_5_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_6_max"] = np.argmax(all[["user_roomservice_6_%sratio" % i for i in range(3)]].values, axis=1)
all["user_roomservice_7_max"] = np.argmax(all[["user_roomservice_7_%sratio" % i for i in range(2)]].values, axis=1)
all["user_roomservice_8_max"] = np.argmax(all[["user_roomservice_8_%sratio" % i for i in [1,2,345]]].values, axis=1)

all["user_roomservice_4_max_1week"]=np.argmax(all[["user_roomservice_4_%sratio_1month"%i for i in range(6)]].values,axis=1)
all["user_roomservice_4_max_1month"]=np.argmax(all[["user_roomservice_4_%sratio_1month"%i for i in range(6)]].values,axis=1)
all["user_roomservice_4_max_3month"]=np.argmax(all[["user_roomservice_4_%sratio_3month"%i for i in range(6)]].values,axis=1)

all["roomservice_8"]=all["roomservice_8"].apply(lambda x:2 if x>2 else x-1)
all["roomservice_3"]=all["roomservice_3"].apply(lambda x:1 if x>0 else 0)

for i in range(2,9):
    all["service_equal_%s"%i] = list(map(lambda x, y: 1 if x == y else 0, all["roomservice_%s"%i], all["user_roomservice_%s_max"%i]))
del all["user_roomservice_2_0ratio"]
del all["user_roomservice_3_0ratio"]
del all["user_roomservice_5_0ratio"]
del all["user_roomservice_7_1ratio"]
In [9]:
#添加转化率特征
#提取basicroomid的转化率
feature_df=all[["orderid","basicroomid","orderlabel"]].copy()
feature_df.sort_values("orderlabel")
feature_df=feature_df.drop_duplicates(["orderid","basicroomid"],keep="last")
basicroom_mean=pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.mean()).reset_index()
basicroom_mean.columns=["basicroomid","basicroomid_mean"]

basicroom_sum=pd.DataFrame(feature_df.groupby("basicroomid").orderlabel.sum()).reset_index()
basicroom_sum.columns=["basicroomid","basicroomid_sum"]
In [10]:
all = all.merge(basicroom_mean, on="basicroomid", how="left").fillna(0)
all = all.merge(basicroom_sum, on="basicroomid", how="left").fillna(0)

all=df_median(all)
all=df_min(all)
all=df_min_orderid(all)

all["basicroomid_price_rank"] = all['price_deduct'].groupby([all['orderid'], all['basicroomid']]).rank()
all["orderid_price_deduct_min_rank"] = all['orderid_price_deduct_min'].groupby(all['orderid']).rank()

all = df_rank_mean(all)
all = df_roomrank_mean(all)
In [11]:
all=merge_mean(all,["basicroomid"],"basic_week_ordernum_ratio","basic_week_ordernum_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_recent3_ordernum_ratio","basic_recent3_ordernum_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_comment_ratio","basic_comment_ratio_mean")
all=merge_mean(all,["basicroomid"],"basic_30days_ordnumratio","basic_30days_ordnumratio_mean")
all=merge_mean(all,["basicroomid"],"basic_30days_realratio","basic_30days_realratio_mean")
all=merge_mean(all,["roomid"],"room_30days_ordnumratio","room_30days_ordnumratio_mean")
all=merge_mean(all,["roomid"],"room_30days_realratio","room_30days_realratio_mean")


all["city_num"]=all["user_ordernum"]/all["user_citynum"]
all["area_price"]=all["user_avgprice"]/all["user_avgroomarea"]
all["price_max_min_rt"]=all["user_maxprice"]/all["user_minprice"]
all["basicroomid_price_deduct_min_minprice_rt"]=all["basicroomid_price_deduct_min"]/all["user_minprice"]

all["price_dif"]=all["basicroomid_price_deduct_min"]-all["price_deduct"]
all["price_dif_rt"]=all["basicroomid_price_deduct_min"]/all["price_deduct"]
all["price_dif_hotel"]=all["orderid_price_deduct_min"]-all["price_deduct"]
all["price_dif_hotel_rt"]=all["orderid_price_deduct_min"]/all["price_deduct"]
all["order_basic_minprice_dif"]=all["basicroomid_price_deduct_min"]-all["orderid_price_deduct_min"]
all["order_basic_minprice_rt"]=all["basicroomid_price_deduct_min"]/all["orderid_price_deduct_min"]
In [12]:
all["price_tail1"]=all["price_deduct"]%10
all["price_tail1"]=list(map(lambda x:1 if x==4 or x==7 else 0,all["price_tail1"]))
all["price_tail2"]=all["price_deduct"]%100

all["price_ori"] = list(map(lambda x, y:x+y, all["price_deduct"], all["returnvalue"]))
In [13]:
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]:
    all["ordertype_%s_num"%i] = list(map(lambda x, y: x*y, all["ordertype_%s_ratio"%i], all["user_ordernum"]))
    del all["ordertype_%s_ratio" % i]

# 所有的
for c in ["orderbehavior_1_ratio", "orderbehavior_2_ratio", "orderbehavior_6_ratio", "orderbehavior_7_ratio"]:
    all[c] = list(map(lambda x, y: x*y, all[c], all["user_ordernum"]))

# 一周的
for c in ["orderbehavior_3_ratio_1week", "orderbehavior_4_ratio_1week", "orderbehavior_5_ratio_1week"]:
    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1week"]))

# 一个月的
#for c in ["orderbehavior_3_ratio_1month", "orderbehavior_4_ratio_1month", "orderbehavior_5_ratio_1month"]:
#    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_1month"]))

# 三个月的
for c in ["orderbehavior_3_ratio_3month", "orderbehavior_4_ratio_3month", "orderbehavior_5_ratio_3month"]:
    all[c] = list(map(lambda x, y: x * y, all[c], all["user_ordnum_3month"]))


all["price_star"] = all["price_deduct"]/(all["star"]-1)

all["star_dif"] = all["user_avgstar"]-all["star"]

all["price_ave_dif_rt"] = all["price_deduct"]/all["user_avgdealprice"]
all["price_ave_star_dif"] = all["price_deduct"]/all["user_avgprice_star"]
all["price_h_w_rt"] = all["user_avgdealpriceholiday"] / all["user_avgdealpriceworkday"]

all["price_ave_dif"] = all["price_deduct"] - all["user_avgdealprice"]

all["user_roomservice_4_32_rt"] = all["user_roomservice_4_3ratio"] / all["user_roomservice_4_2ratio"]
all["user_roomservice_4_43_rt"] = all["user_roomservice_4_4ratio"] / all["user_roomservice_4_3ratio"]

4.建立模型
本项目采用的是lgb算法。用40万的数据进行训练。

train = all.iloc[:400000]
test = all.iloc[400000:]

#算法测试
train_data = train.copy()
train_y=train_data["orderlabel"].values
del train_data["orderlabel"]

#lgb算法
train_data1 = lgb.Dataset(train_data, label=train_y)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'min_child_weight': 1.5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'colsample_bylevel': 0.7,
    'learning_rate': 0.05,
    'tree_method': 'exact',
    'seed': 2019,
    'nthread': 12}
num_round = 500
model = lgb.train(params, train_data1, num_round)

5.预测结果
用10万数据进行测试,预测orderlabel为 0或1的概率,最后依据orderid分组,选择orderid组别中预测为1概率最大的roomid作为最终预测结果。

test_data = test.copy()
del test_data['orderlabel']
test_result = model.predict(test_data.values)
test_result = pd.DataFrame(test_result)

test_result.columns = ["prob"]
test_result["orderid"] = test_data["orderid"].values
test_result["pre_roomid"] = test_data["roomid"].values

result = test_result.sort_values(by=['orderid',"prob"],ascending = False)
result = result.drop_duplicates("orderid", keep="first")

#预测房型结果
test_predict = result.pivot_table(index='orderid', values='pre_roomid').copy()

#真实房型结果
test_data_tmp = test[test['orderlabel'] == 1][['orderid','roomid']]
test_truth = test_data_tmp.pivot_table(index='orderid', values='roomid').copy()

#测试集中orderid的总数为i,房型预测正确的orderid总数为j。
i = 0;j = 0
orderid = test_data_tmp['orderid']
for k in orderid:
    i = i+1
    if test_predict['pre_roomid'][k] == test_truth['roomid'][k]:
        j = j+1
print(i, j)
2924 1452

可以看到在测试集中一共有2924个orderid,其中订购房型预测正确的有1452个,准确率为49.66%。

posted @ 2019-09-12 10:35  Rener  阅读(537)  评论(0编辑  收藏  举报