Mercari Price Suggestion in Kaggle
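I recently came across a Kaggle competition whose task is to predict an item's price from known features such as its description, brand, category, and condition.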
Submissions are scored with Root Mean Squared Logarithmic Error (RMSLE), where p_i is the predicted price, a_i is the actual price, and n is the number of items:
\[\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2}\]
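To make the metric concrete, here is a minimal sketch of RMSLE in NumPy. The function name `rmsle` and the sample prices are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two price arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) computes log(x + 1), matching the formula term by term
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical prices, for illustration only
perfect = rmsle([10.0, 52.0, 35.0], [10.0, 52.0, 35.0])
off = rmsle([9.0, 60.0, 30.0], [10.0, 52.0, 35.0])
print(perfect, off)
```

Because the metric compares log(price + 1) values, training a model on `np.log1p(price)` with plain RMSE optimizes the same quantity.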
The submission file has two columns, test_id and price: the item ID from the test data and the predicted price.
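As a rough sketch of that layout, the snippet below writes a two-column file into an in-memory buffer. The IDs and prices are made-up placeholder values.

```python
import io
import pandas as pd

# Placeholder predictions; real values would come from a trained model
submission = pd.DataFrame({"test_id": [0, 1, 2], "price": [12.5, 30.0, 8.0]})

# Write without the DataFrame index so only the two required columns appear
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```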
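The training and test sets contain the following features:
- train_id / test_id: the item ID; each item has exactly one ID
- name: the item name
- item_condition_id: the condition of the item
- category_name: the category
- brand_name: the brand
- price: the price the item sold for; this column is absent from the test set and is the value we need to predict
- shipping: 1 if the shipping fee is paid by the seller and 0 if it is paid by the buyer, i.e. 1 means free shipping and 0 means the buyer pays
- item_description: the full description of the item; text that looks like a price has been removed and replaced with [rm]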
import pandas as pd
import numpy as np
df = pd.read_csv('input/train.tsv',sep='\t')
data information
df.head()
 | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id 1482535 non-null int64
name 1482535 non-null object
item_condition_id 1482535 non-null int64
category_name 1476208 non-null object
brand_name 849853 non-null object
price 1482535 non-null float64
shipping 1482535 non-null int64
item_description 1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
price distribution
df.price.describe()
count 1.482535e+06
mean 2.673752e+01
std 3.858607e+01
min 0.000000e+00
25% 1.000000e+01
50% 1.700000e+01
75% 2.900000e+01
max 2.009000e+03
Name: price, dtype: float64
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1) # one row, two columns, first panel: plt.subplot(nrows, ncols, index)
df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('log(Price+1) Distribution', fontsize=12)
Text(0.5, 1.0, 'log(Price+1) Distribution')
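- The price is right-skewed: most prices cluster around 10-20, while the maximum is 2009. A log transform is therefore needed; after the transform, log(price+1) is close to a normal distribution.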
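The effect of the transform can be checked numerically. The sketch below uses synthetic log-normal prices as a stand-in for the real column (an assumption; the distribution parameters are arbitrary) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic, long-tailed prices standing in for df.price (assumption)
rng = np.random.default_rng(0)
prices = pd.Series(np.round(rng.lognormal(mean=3.0, sigma=0.75, size=10_000)))

log_prices = np.log1p(prices)      # log(price + 1), defined even at price 0
skew_raw = prices.skew()           # strongly positive for a long right tail
skew_log = log_prices.skew()       # much closer to 0 after the transform

# expm1 is the exact inverse of log1p, used to map predictions back to prices
recovered = np.expm1(log_prices)
print(skew_raw, skew_log)
```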
Effect of shipping on price
df['shipping'].value_counts(normalize=True)
0 0.552726
1 0.447274
Name: shipping, dtype: float64
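- For 55.3% of items the buyer pays shipping, and for 44.7% the seller pays (free shipping), so it is worth checking whether shipping affects the price.
shipping_yes = df.loc[df['shipping'] == 1, 'price'] # seller pays shipping
shipping_no = df.loc[df['shipping'] == 0, 'price'] # buyer pays shipping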
fig,ax = plt.subplots(figsize=(8,5))
ax.hist(shipping_yes,color='r',alpha=0.5,bins=30,range=[0,100],label='shipping_yes')
ax.hist(shipping_no,color='green',alpha=0.5,bins=30,range=[0,100],label=
'shipping_no')
plt.xlabel('price',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('price_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()
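print("Mean price when the buyer pays shipping: %s dollars" %(round(shipping_no.mean(),2)))
print("Mean price when the seller pays shipping: %s dollars" %(round(shipping_yes.mean(),2)))
Mean price when the buyer pays shipping: 30.11 dollars
Mean price when the seller pays shipping: 22.57 dollars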
fig,ax = plt.subplots(figsize=(8,5))
ax.hist(np.log(shipping_yes+1),color='r',alpha=0.5,bins=50,label='shipping_yes')
ax.hist(np.log(shipping_no+1),color='green',alpha=0.5,bins=50,label=
'shipping_no')
plt.xlabel('log(price+1)',fontsize=12)
plt.ylabel('frequency',fontsize=12)
plt.title('log(price+1)_distribution by shipping method')
plt.tick_params(labelsize=12)
plt.legend()
plt.show()
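The same comparison can be made in one step with `groupby`. This is a small sketch on a toy frame with invented prices; on the real data the frame would be `df`.

```python
import pandas as pd

# Toy data (invented values): shipping 1 = seller pays, 0 = buyer pays
toy = pd.DataFrame({
    "shipping": [0, 0, 0, 1, 1, 1],
    "price": [30.0, 40.0, 20.0, 10.0, 25.0, 22.0],
})

# Mean, median, and row count of price per shipping group
summary = toy.groupby("shipping")["price"].agg(["mean", "median", "count"])
print(summary)
```

The median is worth reporting alongside the mean because the price distribution has a long right tail.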
Handling categorical data
"The dataset has {} records".format(df.shape[0])
'The dataset has 1482535 records'
- name, category, brand, and item_condition_id all need to be converted to categorical data
df['category_name'].value_counts()
# 1287 distinct categories in total
Women/Athletic Apparel/Pants, Tights, Leggings 60177
Women/Tops & Blouses/T-Shirts 46380
Beauty/Makeup/Face 34335
Beauty/Makeup/Lips 29910
Electronics/Video Games & Consoles/Games 26557
Beauty/Makeup/Eyes 25215
Electronics/Cell Phones & Accessories/Cases, Covers & Skins 24676
Women/Underwear/Bras 21274
Women/Tops & Blouses/Tank, Cami 20284
Women/Tops & Blouses/Blouse 20284
Women/Dresses/Above Knee, Mini 20082
Women/Jewelry/Necklaces 19758
Women/Athletic Apparel/Shorts 19528
Beauty/Makeup/Makeup Palettes 19103
Women/Shoes/Boots 18864
Beauty/Fragrance/Women 18628
Beauty/Skin Care/Face 15836
Women/Women's Handbags/Shoulder Bag 15328
Men/Tops/T-shirts 15108
Women/Dresses/Knee-Length 14770
Women/Athletic Apparel/Shirts & Tops 14738
Women/Shoes/Sandals 14662
Women/Jewelry/Bracelets 14497
Men/Shoes/Athletic 14257
Kids/Toys/Dolls & Accessories 13957
Women/Women's Accessories/Wallets 13616
Women/Jeans/Slim, Skinny 13392
Home/Home Décor/Home Décor Accents 13004
Women/Swimwear/Two-Piece 12758
Women/Shoes/Athletic 12662
...
Men/Suits/Four Button 1
Handmade/Bags and Purses/Other 1
Handmade/Dolls and Miniatures/Primitive 1
Handmade/Furniture/Fixture 1
Handmade/Housewares/Bathroom 1
Handmade/Woodworking/Sculptures 1
Men/Suits/One Button 1
Handmade/Geekery/Housewares 1
Kids/Safety/Crib Netting 1
Vintage & Collectibles/Furniture/Entertainment 1
Home/Furniture/Bathroom Furniture 1
Handmade/Glass/Vases 1
Handmade/Geekery/Videogame 1
Handmade/Woodworking/Sports 1
Handmade/Art/Aceo 1
Vintage & Collectibles/Paper Ephemera/Map 1
Handmade/Patterns/Painting 1
Handmade/Housewares/Cleaning 1
Home/Home Décor/Doorstops 1
Handmade/Accessories/Belt 1
Handmade/Patterns/Accessories 1
Vintage & Collectibles/Housewares/Towel 1
Other/Automotive/RV Parts & Accessories 1
Handmade/Paper Goods/Pad 1
Handmade/Accessories/Cozy 1
Kids/Diapering/Washcloths & Towels 1
Handmade/Pets/Blanket 1
Handmade/Needlecraft/Clothing 1
Handmade/Furniture/Shelf 1
Handmade/Quilts/Bed 1
Name: category_name, Length: 1287, dtype: int64
item_condition_id vs price
- A standard box plot
import seaborn as sns
sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))
<matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>
- Prices vary widely across item conditions
LightGBM, the competition workhorse
- settings
NUM_BRANDS = 4000
NUM_CATEGORIES = 1000
NAME_MIN_DF =10
MAX_FEATURES_ITEM_DESCRIPTION =50000
"There are %d items that do not have a category name" % df['category_name'].isnull().sum()
'There are 6327 items that do not have a category name'
"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()
'There are 632682 items that do not have a brand name'
"There are %d items that do not have an item_description " % df['item_description'].isnull().sum()
'There are 4 items that do not have an item_description '
def handling_missing_inplace(datasets):
    # Fill missing values with the placeholder 'missing'
    datasets['category_name'] = datasets['category_name'].fillna('missing')
    datasets['brand_name'] = datasets['brand_name'].fillna('missing')
    # 'No description yet' is a placeholder that is only noticed by reading the data closely
    datasets['item_description'] = datasets['item_description'].replace('No description yet', 'missing')
    datasets['item_description'] = datasets['item_description'].fillna('missing')

def cutting(datasets):
    # Keep the NUM_BRANDS most frequent brands; map everything else to 'missing'
    pop_brand = datasets['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    datasets.loc[~datasets['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    # Same for the NUM_CATEGORIES most frequent categories
    pop_category = datasets['category_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    datasets.loc[~datasets['category_name'].isin(pop_category), 'category_name'] = 'missing'

def to_category(datasets):
    # Convert the three columns to the pandas category dtype
    datasets['category_name'] = datasets['category_name'].astype('category')
    datasets['brand_name'] = datasets['brand_name'].astype('category')
    datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
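To see what the cutting step does, here is a toy sketch of the same idea: keep the n most frequent values and map the rest to 'missing'. The helper `keep_top_n` and the brand list are hypothetical and are not used by the pipeline.

```python
import pandas as pd

def keep_top_n(values: pd.Series, n: int, fill: str = "missing") -> pd.Series:
    """Keep the n most frequent values (ignoring `fill`); replace the rest with `fill`."""
    counts = values[values != fill].value_counts()  # sorted by frequency, descending
    top = counts.index[:n]
    return values.where(values.isin(top), fill)

# Hypothetical brand column
brands = pd.Series(["Nike", "Nike", "Nike", "Apple", "Apple", "Razer", "missing", "Target"])
cut = keep_top_n(brands, n=2)

# Converting to the category dtype stores each distinct value once
as_category = cut.astype("category")
print(cut.tolist())
print(list(as_category.cat.categories))
```

Capping the number of distinct brands and categories keeps the one-hot encoded matrices narrower, at the cost of merging rare values into a single bucket.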
- Inspecting the price counts shows that some items are priced at 0, so the rows with a price of 0 need to be removed
df['price'].value_counts().reset_index().sort_values(by='index').head()
 | index | price |
---|---|---|
25 | 3.0 | 18703 |
28 | 4.0 | 16139 |
17 | 5.0 | 31502 |
261 | 5.5 | 33 |
16 | 6.0 | 32260 |
df=df[df['price']!=0].reset_index(drop=True)
df.head()
 | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack # sparse matrix handling
# reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
import gc
import time
from sklearn.linear_model import Ridge
def main():
    start_time = time.time()
    train = pd.read_table('input/train.tsv', engine='c')
    # train=train[train['price']!=0]
    test = pd.read_table('input/test_stg2.tsv', engine='c')
    print('[{}] Finished to load data'.format(time.time() - start_time))
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)
    nrow_train = train.shape[0]
    # Train on log(price + 1) so that RMSE matches the competition's RMSLE
    y = np.log1p(train["price"])
    # Stack train and test so both share the same vocabularies and encodings
    merge: pd.DataFrame = pd.concat([train, test])
    # .copy() avoids assigning into a slice of `test` later on
    submission: pd.DataFrame = test[['test_id']].copy()
    del train
    del test
    gc.collect()
    handling_missing_inplace(merge)
    print('[{}] Finished to handle missing'.format(time.time() - start_time))
    cutting(merge)
    print('[{}] Finished to cut'.format(time.time() - start_time))
    to_category(merge)
    print('[{}] Finished to convert categorical'.format(time.time() - start_time))
    cv = CountVectorizer(min_df=NAME_MIN_DF)
    X_name = cv.fit_transform(merge['name'])
    print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))
    cv = CountVectorizer()
    X_category = cv.fit_transform(merge['category_name'])
    print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))
    tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
                         ngram_range=(1, 3),
                         stop_words='english')
    X_description = tv.fit_transform(merge['item_description'])
    print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))
    lb = LabelBinarizer(sparse_output=True)
    X_brand = lb.fit_transform(merge['brand_name'])
    print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))
    X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']],
                                          sparse=True).values)
    print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))
    # Concatenate all feature blocks column-wise into one sparse matrix
    sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
    print('[{}] Finished to create sparse merge'.format(time.time() - start_time))
    # The first nrow_train rows are the training data, the rest are test data
    X = sparse_merge[:nrow_train]
    X_test = sparse_merge[nrow_train:]
    #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144)
    d_train = lgb.Dataset(X, label=y)
    #d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
    #watchlist = [d_train, d_valid]
    params = {
        'learning_rate': 0.73,
        'application': 'regression',
        'max_depth': 3,
        'num_leaves': 100,
        'verbosity': -1,
        'metric': 'RMSE',
    }
    # Note: verbose_eval is accepted by older LightGBM releases; newer ones use logging callbacks instead
    model = lgb.train(params, train_set=d_train, num_boost_round=3000, verbose_eval=100)
    # Blend in log space: 56% LightGBM, 44% Ridge
    preds = 0.56*model.predict(X_test)
    model = Ridge(solver="sag", fit_intercept=True, random_state=42)
    model.fit(X, y)
    print('[{}] Finished to train ridge'.format(time.time() - start_time))
    preds += 0.44*model.predict(X=X_test)
    print('[{}] Finished to predict ridge'.format(time.time() - start_time))
    # Map predictions back from log(price + 1) to prices
    submission['price'] = np.expm1(preds)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("sample_submission_stg2.csv", index=False)

if __name__ == '__main__':
    main()
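The feature-stacking idea in main() can be tried on a few rows without the competition files. The sketch below is only an illustration under assumptions: the item names, brands, and prices are invented, and Ridge alone stands in for the LightGBM and Ridge blend.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import LabelBinarizer

# Invented toy listings
names = ["red shirt", "blue shirt", "gaming keyboard",
         "gold necklace", "red keyboard", "blue necklace"]
brands = ["missing", "Target", "Razer", "missing", "Razer", "Target"]
prices = np.array([10.0, 12.0, 52.0, 44.0, 48.0, 40.0])

# Bag-of-words counts for the name, one-hot columns for the brand
X_name = CountVectorizer().fit_transform(names)
X_brand = LabelBinarizer(sparse_output=True).fit_transform(brands)

# Column-wise concatenation keeps everything sparse
X = hstack((X_name, X_brand)).tocsr()

# Fit on log(price + 1), then map predictions back with expm1
y = np.log1p(prices)
model = Ridge(alpha=1.0).fit(X, y)
pred_prices = np.expm1(model.predict(X))
print(X.shape, pred_prices.round(2))
```

Here the 7 distinct name tokens and 3 brand values give a 6 x 10 matrix; on the real data the same construction yields a very wide matrix, which is why the sparse format matters.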