https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling

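The goal of this competition is to build a price estimation model from the product information supplied by sellers.

A rough analysis of the provided data:

Meaning of each column:
name: product name

item_condition_id: condition of the item as reported by the seller (I do not fully understand this field; it probably indicates how good the item's condition is)

category_name: each item has three associated category labels

brand_name: brand name

price: price (the value we need to predict)

shipping: 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer

item_description: description of the item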

The snippets below assume the following imports, and that train holds the training set as a pandas DataFrame:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)  # needed for py.iplot inside a notebook

 

1. First, analyze price, our target

train.price.describe()

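The summary statistics (count, mean, standard deviation, min, quartiles, max) give a clear first picture of the data.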

 

Next, look at the distribution of price

plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()

Here log(price+1) is used to bring the data closer to a normal distribution. A Box-Cox transform is another option worth trying.
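
To make the effect of the transform concrete, here is a minimal sketch on synthetic, right-skewed prices. The lognormal sample and the helper name skewness are assumptions for illustration, not part of the Mercari data or the original kernel. It compares the sample skewness before and after log(price+1), and shows that np.expm1 maps values back to the price scale.

```python
import numpy as np

def skewness(values):
    """Sample skewness (third standardized moment); hypothetical helper."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "prices" (assumption: lognormal, purely illustrative).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

log_prices = np.log1p(prices)      # same as np.log(prices + 1)
recovered = np.expm1(log_prices)   # inverse transform, back to the price scale

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))
```

If Box-Cox is tried instead, note that scipy.stats.boxcox requires strictly positive input, so any zero prices would need the same +1 shift first.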

 

2. Analyze shipping (who pays the shipping fee)

Compute the share of each shipping option

train.shipping.value_counts()/len(train)
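
The same shares can be obtained with value_counts(normalize=True), which avoids the manual division. A small sketch on a made-up frame (toy is a hypothetical stand-in for train):

```python
import pandas as pd

# Toy stand-in for train (assumption: values are made up).
toy = pd.DataFrame({"shipping": [0, 0, 0, 1, 1]})

by_division = toy.shipping.value_counts() / len(toy)       # form used in the post
by_normalize = toy.shipping.value_counts(normalize=True)   # built-in equivalent

print(by_normalize)
```

Both forms give the same numbers as long as the column has no missing values, since value_counts drops NaN by default while len counts every row.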

 

Compare the price distribution under the two shipping options

prc_shipBySeller = train.loc[train.shipping==1, 'price']
prc_shipByBuyer = train.loc[train.shipping==0, 'price']

fig, ax = plt.subplots(figsize=(20,10))
ax.hist(np.log(prc_shipBySeller+1), color='#8CB4E1', alpha=1.0, bins=50,
       label='Price when Seller pays Shipping')
ax.hist(np.log(prc_shipByBuyer+1), color='#007D00', alpha=0.7, bins=50,
       label='Price when Buyer pays Shipping')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.title('Price Distribution by Shipping Type', fontsize=17)
plt.tick_params(labelsize=15)
plt.show()
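
Overlaid histograms can be hard to read precisely, so a numeric summary per shipping option is a useful companion. The following is a sketch on made-up values (toy and its prices are assumptions, so the numbers say nothing about the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (assumption: made-up prices).
toy = pd.DataFrame({
    "price":    [10.0, 12.0, 30.0, 8.0, 9.0, 25.0],
    "shipping": [0, 0, 0, 1, 1, 1],
})

# Summarize log(price+1) separately for buyer-paid (0) and seller-paid (1).
toy["log_price"] = np.log1p(toy["price"])
summary = toy.groupby("shipping")["log_price"].agg(["count", "mean", "median"])

print(summary)
```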

 

3. Analyze category_name (the category labels)

Count the distinct category values

print("There are %d unique values in the category column." % train['category_name'].nunique())

Show the 5 most frequent categories and their counts

# TOP 5 RAW CATEGORIES
train['category_name'].value_counts()[:5]

Check how many items have no category

# missing categories
print("There are %d items that do not have a label." % train['category_name'].isnull().sum())

Split the category into its levels

# reference: BuryBuryZymon at https://www.kaggle.com/maheshdadhich/i-will-sell-everything-for-free-0-55
def split_cat(text):
    try:
        return text.split("/")
    except AttributeError:  # NaN (missing category) is a float with no .split
        return ("No Label", "No Label", "No Label")

train['general_cat'], train['subcat_1'], train['subcat_2'] = \
zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()
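
The split above relies on zip stopping at the shortest row, so any category with more than three levels silently loses the extra levels. An alternative sketch using the pandas string accessor is shown below; the toy frame and its category strings are illustrative assumptions. With n=2 the split happens at most twice, so extra levels stay inside subcat_2 instead of being dropped, which is a deliberate difference from the original behavior.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (assumption: illustrative category strings).
toy = pd.DataFrame({"category_name": [
    "Men/Tops/T-shirts",
    np.nan,
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
]})

# Fill missing categories, then split on the first two slashes only.
parts = (toy["category_name"]
         .fillna("No Label/No Label/No Label")
         .str.split("/", n=2, expand=True))
parts.columns = ["general_cat", "subcat_1", "subcat_2"]
toy = toy.join(parts)

print(toy[["general_cat", "subcat_1", "subcat_2"]])
```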

Count the distinct values at each sub-category level

print("There are %d unique first sub-categories." % train['subcat_1'].nunique())

print("There are %d unique second sub-categories." % train['subcat_2'].nunique())

 

Analyze general_cat, then the top 15 values of subcat_1

x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]
trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Number of Items by Main Category',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Category'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]
trace1 = go.Bar(x=x, y=y, text=pct,
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                ))
layout = dict(title= 'Number of Items by Sub Category (Top 15)',
              yaxis = dict(title='Count'),
              xaxis = dict(title='SubCategory'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

Use box plots to compare prices across the main categories

general_cats = train['general_cat'].unique()
x = [train.loc[train['general_cat']==cat, 'price'] for cat in general_cats]

data = [go.Box(x=np.log(x[i]+1), name=general_cats[i]) for i in range(len(general_cats))]

layout = dict(title="Price Distribution by General Category",
              yaxis = dict(title='Category'),
              xaxis = dict(title='log(price+1)'))
fig = dict(data=data, layout=layout)
py.iplot(fig)
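
The quantities a box plot draws (quartiles and median per group) can also be computed directly, which helps when the interactive figure is not available. A sketch on made-up data follows; toy, its categories, and its prices are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (assumption: made-up categories and prices).
toy = pd.DataFrame({
    "general_cat": ["Men", "Men", "Men", "Women", "Women", "Women"],
    "price": [10.0, 20.0, 40.0, 5.0, 15.0, 100.0],
})

# Quartiles of log(price+1) per main category, one row per category.
log_price = np.log1p(toy["price"])
box_stats = (log_price.groupby(toy["general_cat"])
             .quantile([0.25, 0.5, 0.75])
             .unstack())
box_stats.columns = ["q1", "median", "q3"]
box_stats["iqr"] = box_stats["q3"] - box_stats["q1"]

print(box_stats)
```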

4. Analyze brand_name

print("There are %d unique brand names in the training dataset." % train['brand_name'].nunique())

x = train['brand_name'].value_counts().index.values.astype('str')[:10]
y = train['brand_name'].value_counts().values[:10]

# trace1 = go.Bar(x=x, y=y, 
#                 marker=dict(
#                 color = y,colorscale='Portland',showscale=True,
#                 reversescale = False
#                 ))
# layout = dict(title= 'Top 10 Brand by Number of Items',
#               yaxis = dict(title='Count'),
#               xaxis = dict(title='Brand Name'))
# fig=dict(data=[trace1], layout=layout)
# py.iplot(fig)
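
Because the bar chart is commented out, x and y are computed but never displayed. A plain-pandas sketch that prints the same information, plus the share of items with no brand, is given below; toy and its brand values are made-up assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (assumption: made-up brand values).
toy = pd.DataFrame({"brand_name": ["Nike", "Nike", np.nan, "Apple", np.nan, "Nike"]})

top_brands = toy["brand_name"].value_counts().head(10)  # NaN is excluded by default
n_brands = toy["brand_name"].nunique()                  # NaN is not counted
missing_share = toy["brand_name"].isnull().mean()       # fraction of rows with no brand

print(top_brands)
print("unique brands:", n_brands)
print("share without a brand: %.2f%%" % (missing_share * 100))
```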

 

posted on 2018-05-03 09:03 by 郑哲