Big Data Analysis: Used Car Prices

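I. Background
We live in the era of big data. As data analysis tools and techniques keep improving, big data analysis skills are a real asset for career development. This project predicts used car prices. Cars are now the most common means of everyday transport, and most people rely on one for both long trips and daily errands.

According to the latest statistics from the public security traffic administration, 37.4% of households nationwide own a family car. Public transport options such as DiDi ride-hailing and taxis add many more vehicles on top of that, so the number of cars that will eventually be traded in is very large. Predicting used car prices through big data analysis is therefore an area worth developing.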

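II. Analysis Design

1. Dataset contents

Car_Name: car model
Selling_Price(lacs): price at which the owner wants to sell the car, in lakhs (1 lakh = 100,000 rupees)
Present_Price(lacs): the car's current ex-showroom price, in lakhs
Kms_Driven: distance the car has been driven, in kilometers
Fuel_Type: fuel type (petrol/diesel/CNG/LPG/electric)
Seller_Type: whether the seller is an individual or a dealer
Transmission: transmission type (automatic/manual)
Past_Owners: number of previous owners
Year: year the car was purchased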

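2. Overview of the analysis plan

(1) Preprocess and clean the dataset, then compute how strongly each variable correlates with the used car price.

(2) Visualize the dataset in Python to identify the factors that affect the used car price.

(3) Train models to further explore the factors that affect the used car price.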

III. Data Analysis Steps

Data source
Source: https://www.kaggle.com/datasets?search=car-price-prediction
Loading the dataset

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
%matplotlib inline
pd.set_option("display.max_rows", None, "display.max_columns", None)
warnings.simplefilter(action='ignore')
# Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
# load dataset
df_main = pd.read_csv('./car data.csv')
df_main.head()

The dataset has 301 rows and 9 columns.

Missing value analysis

df_main.info()

Summary statistics

# numerical stats
df_main.describe()

Counting missing values per column

df_main.isna().sum()

Data preprocessing

fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(25, 13)

# Mean selling price per category, plus selling price against mileage
sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], ax=axes[0][0])
sns.barplot(x=df_main['Fuel_Type'], y=df_main['Selling_Price'], ax=axes[0][1])
sns.barplot(x=df_main['Seller_Type'], y=df_main['Selling_Price'], ax=axes[1][0])
sns.barplot(x=df_main['Transmission'], y=df_main['Selling_Price'], ax=axes[1][1])
sns.barplot(x=df_main['Owner'], y=df_main['Selling_Price'], ax=axes[2][0])
sns.scatterplot(x=df_main['Kms_Driven'], y=df_main['Selling_Price'], ax=axes[2][1])

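From these plots we can conclude:

Prices fall as cars age. The year plot does not look proportional because the years are not equally represented in the data; with balanced data we would expect something closer to a linear trend.
Diesel cars have the highest prices of all fuel types, which is surprising given that diesel cars have a limited service life.
Automatic-transmission cars sold by dealers are priced high, which makes sense because few people like shifting gears while driving.
The data is clearly not evenly distributed across the values of each attribute. For example, there are few records for 2003 to 2011 and for 2018, and far fewer CNG cars than petrol cars.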
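Because a bar plot shows only the mean price per group, it can help to look at group sizes next to the means before reading much into a tall bar. The sketch below does this in plain Python on a few made-up (fuel type, price) pairs; the values and the helper name `group_stats` are illustrative assumptions, not taken from the dataset.

```python
from collections import defaultdict

def group_stats(rows):
    """Return {group: (count, mean)} for an iterable of (group, value) pairs."""
    buckets = defaultdict(list)
    for group, value in rows:
        buckets[group].append(value)
    return {g: (len(v), sum(v) / len(v)) for g, v in buckets.items()}

# Made-up (Fuel_Type, Selling_Price) pairs, for illustration only
rows = [
    ("Petrol", 3.0), ("Petrol", 4.5), ("Petrol", 2.5), ("Petrol", 6.0),
    ("Diesel", 9.0), ("Diesel", 11.0),
    ("CNG", 3.0),
]

for fuel, (count, mean) in group_stats(rows).items():
    print(f"{fuel:<7} n={count}  mean={mean:.2f}")
```

On the real data, the same check could be done with `df_main.groupby('Fuel_Type')['Selling_Price'].agg(['count', 'mean'])`.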

Automatic vs. manual transmission

fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(22, 15)
sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], hue=df_main['Transmission'], ax=ax1)
sns.scatterplot(x=df_main['Present_Price'], y=df_main['Selling_Price'], ax=ax2)

In the plot, automatic-transmission cars lead from 2012 onward, and a car's selling price rises roughly in proportion to its current showroom price.

# Convert the purchase year into the car's age in years. The dataset dates from 2020,
# so Age = 2020 - Year; the Year column is then dropped.
df_main['Age'] = 2020 - df_main['Year']
df_main.drop('Year', axis=1, inplace=True)
df_main.rename(columns={'Selling_Price': 'Selling_Price(lacs)',
                        'Present_Price': 'Present_Price(lacs)',
                        'Owner': 'Past_Owners'}, inplace=True)

Exploratory data analysis

Univariate analysis

df_main.columns

cat_cols = ['Fuel_Type', 'Seller_Type', 'Transmission', 'Past_Owners']
i = 0
while i < 4:
    # One figure per pair of categorical columns
    fig = plt.figure(figsize=[10, 4])
    # left panel
    plt.subplot(1, 2, 1)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1
    # right panel
    plt.subplot(1, 2, 2)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1
    plt.show()

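The categories are far from balanced, so the models have to cope with an imbalanced dataset: most cars have 0 previous owners, most are manual, most are sold by dealers, and most run on petrol.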

Box plots of the numerical variables for outlier detection

num_cols = ['Selling_Price(lacs)', 'Present_Price(lacs)', 'Kms_Driven', 'Age']
i = 0
while i < 4:
    # One figure per pair of numerical columns
    fig = plt.figure(figsize=[13, 3])
    # left panel
    plt.subplot(1, 2, 1)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1
    # right panel
    plt.subplot(1, 2, 2)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1
    plt.show()

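The dataset does contain outliers. They are kept rather than replaced for now, which also tests whether the models can fit both the regular and the outlying values well.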
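The points a box plot draws beyond its whiskers are, by the usual convention, the values more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule in plain Python follows; the prices are made up for illustration, and `iqr_outliers` is a hypothetical helper rather than part of the original analysis.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up selling prices in lakhs, not taken from the dataset
prices = [0.5, 1.2, 2.9, 3.5, 4.0, 4.75, 5.25, 6.0, 7.5, 35.0]
print(iqr_outliers(prices))  # [35.0]
```

Counting how many rows the rule flags per column gives a quick sense of how much of the data the decision to keep outliers affects.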

Quantile summary of the price and mileage variables

def num_summary(dataframe, numerical_col):
    # Print count, mean, std, min, max and the listed quantiles for one column
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    print(dataframe[numerical_col].describe(quantiles).T)

for num_col in ['Present_Price(lacs)', 'Selling_Price(lacs)', 'Kms_Driven']:
    num_summary(df_main, num_col)

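Multivariate analysis

# numeric_only=True (pandas 1.5+) skips the text columns, which newer pandas refuses to correlate
sns.heatmap(df_main.corr(numeric_only=True), annot=True, cmap="RdBu")

plt.show()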

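The car's current showroom price is strongly correlated with the seller's asking price. This is reasonable, since both reflect the car's value. The other variables correlate less strongly.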
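Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of what that number measures, the sketch below computes it by hand; the three short lists are invented values, not rows from the dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Invented values for five cars (prices in lakhs, age in years)
present_price = [5.0, 9.5, 4.0, 12.0, 7.0]
selling_price = [3.0, 6.0, 2.5, 8.5, 4.5]
age = [6, 3, 8, 2, 5]

print(round(pearson(present_price, selling_price), 3))  # close to +1
print(round(pearson(age, selling_price), 3))            # negative
```

A value near +1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.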

Results

df_main.corr(numeric_only=True)['Selling_Price(lacs)']

# Mean selling price by seller type and fuel type
df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Fuel_Type')

# Mean selling price by seller type and transmission
df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Transmission')

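Splitting the dataset

# The models need numeric inputs: drop the free-text model name and one-hot encode
# the remaining categorical columns (both steps appear in the complete code in Section V)
df_main.drop(labels='Car_Name', axis=1, inplace=True)
df_main = pd.get_dummies(data=df_main, drop_first=True)

y = df_main['Selling_Price(lacs)']
X = df_main.drop('Selling_Price(lacs)', axis=1)
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("x train: ", X_train.shape)
print("x test: ", X_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)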
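To make the 80/20 split concrete, here is a minimal sketch of a shuffled hold-out split written in plain Python. It assumes, as scikit-learn does to my knowledge, that the test share is rounded up; its shuffle is not the one `train_test_split` uses, so only the resulting sizes are comparable, not the specific rows. `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=1):
    """Shuffle row indices and split them into train and test index lists."""
    n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# 301 is the number of rows reported for the dataset
train_idx, test_idx = split_indices(301)
print(len(train_idx), len(test_idx))  # 240 61
```

Fixing the seed, like `random_state=1` in the original call, makes the split reproducible across runs.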

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Scores collected across models for the final comparison table
CV = []
R2_train = []
R2_test = []

def car_pred_model(model, model_name):
    # Train the model
    model.fit(X_train, y_train)

    # R2 score on the training set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train, y_pred_train)
    R2_train.append(round(R2_train_model, 2))

    # R2 score on the test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test, y_pred_test)
    R2_test.append(round(R2_test_model, 2))

    # 5-fold cross-validated R2 on the training set
    cross_val = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean, 2))

    # Print the results
    print("Train R2-score :", round(R2_train_model, 2))
    print("Test R2-score :", round(R2_test_model, 2))
    print("Train CV scores :", cross_val)
    print("Train CV mean :", round(cv_mean, 2))

    # Residual density of the training samples
    # (kdeplot replaces the deprecated sns.distplot(..., hist=False))
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    ax[0].set_title('Residual Plot of Train samples')
    sns.kdeplot(x=y_train - y_pred_train, ax=ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')

    # Scatter plot of actual vs. predicted test values
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x=y_test, y=y_pred_test)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')

    plt.show()
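The R2 score reported by the function compares the model's squared error with that of always predicting the mean: R2 = 1 - SS_res / SS_tot. The sketch below works through that definition in plain Python on made-up numbers, so the three reference cases (perfect, mean-only, worse than the mean) are easy to see; `r2` is a hypothetical stand-in for `r2_score`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # error of predicting the mean
    return 1 - ss_res / ss_tot

# Made-up target values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]

print(r2(y_true, [3.0, 5.0, 7.0, 9.0]))  # 1.0: perfect predictions
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: no better than the mean
print(r2(y_true, [9.0, 7.0, 5.0, 3.0]))  # negative: worse than the mean
```

A training R2 much higher than the test or cross-validated R2 is the usual sign of overfitting, which is why the function records all three.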

Standard linear regression (ordinary least squares)

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr, "Linear_regressor.pkl")

Lasso regression

from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3, 3, num=14)  # candidate regularization strengths

# Randomized search over alpha to find the best-performing value
ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))
car_pred_model(ls_rs, "lasso.pkl")

There is still room to improve the accuracy, but the fit is decent: the training residuals fall within about ±5.
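The candidate `alpha` values come from `np.logspace(-3, 3, num=14)`, which spaces 14 values evenly on a log scale between 0.001 and 1000, so each candidate is a constant multiple of the previous one. The sketch below reproduces that spacing in plain Python; `logspace` here is an illustrative re-implementation, not NumPy's function.

```python
def logspace(start, stop, num):
    """Return `num` values evenly spaced on a log10 scale from 10**start to 10**stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

alphas = logspace(-3, 3, 14)
print([f"{a:.3g}" for a in alphas])
```

Since `RandomizedSearchCV` tries 10 candidates by default (`n_iter=10`), the search above evaluates 10 of these 14 values rather than all of them.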

Random forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# Number of trees
n_estimators = list(range(500, 1000, 100))
# Maximum tree depth
max_depth = list(range(4, 9, 4))
# Minimum number of samples required to split a node
min_samples_split = list(range(4, 9, 2))
# Minimum number of samples in a leaf
min_samples_leaf = [1, 2, 5, 7]
# Number of features considered at each split
# (None means all features; it replaces 'auto', which scikit-learn 1.3 removed)
max_features = [None, 'sqrt']

# Search space for the randomized search
param_grid = {"n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

rf_rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid)

car_pred_model(rf_rs, 'random_forest.pkl')

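The accuracy is quite high: the training residuals cluster around 0, which shows that the random forest fits the data much more closely than the linear models.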
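It is worth seeing how small a share of the search space a randomized search actually visits. The sketch below enumerates every combination of the random forest grid with `itertools.product` and then draws 10 of them, mirroring the default `n_iter=10` of `RandomizedSearchCV`. It only illustrates the idea and does not reproduce scikit-learn's sampler; the two `max_features` entries are placeholders for the two options in the grid.

```python
import random
from itertools import product

param_grid = {
    "n_estimators": list(range(500, 1000, 100)),
    "max_depth": list(range(4, 9, 4)),
    "min_samples_split": list(range(4, 9, 2)),
    "min_samples_leaf": [1, 2, 5, 7],
    "max_features": [None, "sqrt"],
}

# Every combination an exhaustive grid search would have to fit
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos))  # 240

# A randomized search tries only a sample of them
rng = random.Random(0)
sampled = rng.sample(combos, 10)
for candidate in sampled:
    print(candidate)
```

With 5-fold cross-validation, 10 sampled candidates cost 50 model fits instead of the 1,200 an exhaustive search over all 240 combinations would need, at the price of possibly missing the best combination.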

Gradient boosted trees

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# Learning rate
learning_rate = [0.001, 0.01, 0.1, 0.2]
# Number of boosting stages
n_estimators = list(range(500, 1000, 100))
# Maximum tree depth
max_depth = list(range(4, 9, 4))
# Minimum number of samples required to split a node
min_samples_split = list(range(4, 9, 2))
# Minimum number of samples in a leaf
min_samples_leaf = [1, 2, 5, 7]
# Number of features considered at each split
# (None means all features; it replaces 'auto', which scikit-learn 1.3 removed)
max_features = [None, 'sqrt']

# Search space for the randomized search
param_grid = {"learning_rate": learning_rate,
              "n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

gb_rs = RandomizedSearchCV(estimator=gb, param_distributions=param_grid)
car_pred_model(gb_rs, "gradient_boosting.pkl")

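Results: the scale of this plot differs from the earlier ones, but it is still easy to see that GBDT is more accurate than all of the previous models, with training residuals falling within about ±0.2.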

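# One name per model evaluated above, in the same order. Ridge is fitted only in the
# complete code in Section V, so keep it in the list only if that step was run.
Technique = ["LinearRegression", "Ridge", "Lasso", "RandomForestRegressor", "GradientBoostingRegressor"]
results = pd.DataFrame({'Model': Technique, 'R Squared(Train)': R2_train, 'R Squared(Test)': R2_test, 'CV score mean(Train)': CV})
display(results)
# All four lengths must match for the table to build
print(len(Technique), len(R2_train), len(R2_test), len(CV))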
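The `CV score mean(Train)` column comes from 5-fold cross-validation on the training set: the rows are cut into five folds, and each fold takes one turn as validation data while the other four are used for fitting. A minimal sketch of how such folds can be laid out, assuming consecutive unshuffled folds and using a hypothetical helper `kfold_indices`:

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, validation) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# 240 is the training-set size implied by an 80/20 split of 301 rows
for fold, (train, val) in enumerate(kfold_indices(240), start=1):
    print(f"fold {fold}: train={len(train)} validation={len(val)}")
```

Averaging the five validation scores gives a steadier estimate than a single split, which matters on a dataset this small.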

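IV. Summary

We carried out data preprocessing, data visualization, and model training to explore what drives used car prices. The older the car and the more previous owners it has had, the lower its price tends to be. Most cars in the data are manual, although automatics are becoming more common. Manual petrol cars sold through dealers tend to be priced higher. Above all, the selling price is most closely tied to the car's current showroom price.

Through this experiment I learned how to preprocess and visualize data and how to tune hyperparameters according to how well a model fits the data, and I became more familiar with the whole workflow of model training and data analysis. I also learned how to draw useful conclusions from the results and make recommendations. In future work, I would like to analyze the data in more depth and keep it updated in real time, so that the analysis reflects market changes and can be adjusted promptly.

Working through this experiment hands-on gave me a better feel for big data analysis, but I still have much more to learn.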

V. Complete Code:

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

%matplotlib inline
pd.set_option("display.max_rows", None, "display.max_columns", None)
warnings.simplefilter(action='ignore')
# Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
# load dataset
df_main = pd.read_csv('./car data.csv')

df_main.head()

df_main.shape

df_main.info()

# numerical stats
df_main.describe()

df_main.isna().sum()

fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(25, 13)

sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], ax=axes[0][0])
sns.barplot(x=df_main['Fuel_Type'], y=df_main['Selling_Price'], ax=axes[0][1])
sns.barplot(x=df_main['Seller_Type'], y=df_main['Selling_Price'], ax=axes[1][0])
sns.barplot(x=df_main['Transmission'], y=df_main['Selling_Price'], ax=axes[1][1])
sns.barplot(x=df_main['Owner'], y=df_main['Selling_Price'], ax=axes[2][0])
sns.scatterplot(x=df_main['Kms_Driven'], y=df_main['Selling_Price'], ax=axes[2][1])

# Automatic vs. manual transmission
fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(22, 15)
sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], hue=df_main['Transmission'], ax=ax1)
sns.scatterplot(x=df_main['Present_Price'], y=df_main['Selling_Price'], ax=ax2)

# Convert the purchase year into the car's age in years. The dataset dates from 2020,
# so Age = 2020 - Year; the Year column is then dropped.
df_main['Age'] = 2020 - df_main['Year']
df_main.drop('Year', axis=1, inplace=True)

df_main.rename(columns={'Selling_Price': 'Selling_Price(lacs)',
                        'Present_Price': 'Present_Price(lacs)',
                        'Owner': 'Past_Owners'}, inplace=True)

df_main.columns

cat_cols = ['Fuel_Type', 'Seller_Type', 'Transmission', 'Past_Owners']
i = 0
while i < 4:
    # One figure per pair of categorical columns
    fig = plt.figure(figsize=[10, 4])

    # left panel
    plt.subplot(1, 2, 1)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    # right panel
    plt.subplot(1, 2, 2)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    plt.show()

# Box plots of the numerical variables for outlier detection
num_cols = ['Selling_Price(lacs)', 'Present_Price(lacs)', 'Kms_Driven', 'Age']
i = 0
while i < 4:
    # One figure per pair of numerical columns
    fig = plt.figure(figsize=[13, 3])

    # left panel
    plt.subplot(1, 2, 1)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    # right panel
    plt.subplot(1, 2, 2)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    plt.show()

# Quantile summary of the price and mileage variables
def num_summary(dataframe, numerical_col):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    print(dataframe[numerical_col].describe(quantiles).T)

for num_col in ['Present_Price(lacs)', 'Selling_Price(lacs)', 'Kms_Driven']:
    num_summary(df_main, num_col)

# numeric_only=True (pandas 1.5+) skips the text columns, which newer pandas refuses to correlate
sns.heatmap(df_main.corr(numeric_only=True), annot=True, cmap="RdBu")
plt.show()

df_main.corr(numeric_only=True)['Selling_Price(lacs)']

df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Fuel_Type')

df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Transmission')

df = df_main.copy()  # working copy of the data
# Fuel Type
df['Fuel_Type'] = df['Fuel_Type'].map({
    'Petrol': 0,
    'CNG': 1,
    'Diesel': 2
})
df['Fuel_Type'] = df['Fuel_Type'].astype(int)

# Seller Type
df['Seller_Type'] = df['Seller_Type'].map({
    'Dealer': 0,
    'Individual': 1,
})
df['Seller_Type'] = df['Seller_Type'].astype(int)

# Transmission
df['Transmission'] = df['Transmission'].map({
    'Manual': 0,
    'Automatic': 1,
})
# Cast the Transmission column (the original cast Seller_Type a second time here)
df['Transmission'] = df['Transmission'].astype(int)

import scipy.stats as stat

def plot_data(df, feature):
    # Left panel: histogram; right panel: normal Q-Q plot
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    print(feature)
    df[feature].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(df[feature], dist='norm', plot=plt)
    plt.show()


plot_data(df, 'Selling_Price(lacs)')
plot_data(df, 'Present_Price(lacs)')
plot_data(df, 'Kms_Driven')

# Drop the free-text model name, then one-hot encode the categorical columns
df_main.drop(labels='Car_Name', axis=1, inplace=True)

df_main.head()

df_main = pd.get_dummies(data=df_main, drop_first=True)

df_main.head()

y = df_main['Selling_Price(lacs)']
X = df_main.drop('Selling_Price(lacs)', axis=1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("x train: ", X_train.shape)
print("x test: ", X_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

CV = []
R2_train = []
R2_test = []

def car_pred_model(model, model_name):
    # Train the model
    model.fit(X_train, y_train)

    # R2 score on the training set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train, y_pred_train)
    R2_train.append(round(R2_train_model, 2))

    # R2 score on the test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test, y_pred_test)
    R2_test.append(round(R2_test_model, 2))

    # 5-fold cross-validated R2 on the training set
    cross_val = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean, 2))

    # Print the results
    print("Train R2-score :", round(R2_train_model, 2))
    print("Test R2-score :", round(R2_test_model, 2))
    print("Train CV scores :", cross_val)
    print("Train CV mean :", round(cv_mean, 2))

    # Residual density of the training samples
    # (kdeplot replaces the deprecated sns.distplot(..., hist=False))
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    ax[0].set_title('Residual Plot of Train samples')
    sns.kdeplot(x=y_train - y_pred_train, ax=ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')

    # Scatter plot of actual vs. predicted test values
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x=y_test, y=y_pred_test)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')

    plt.show()

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr, "Linear_regressor.pkl")

from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Creating Ridge model object
rg = Ridge()
# range of alpha
alpha = np.logspace(-3, 3, num=14)

# Creating RandomizedSearchCV to find the best estimator of hyperparameter
rg_rs = RandomizedSearchCV(estimator=rg, param_distributions=dict(alpha=alpha))

car_pred_model(rg_rs, "ridge.pkl")

from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3, 3, num=14)  # range for alpha

ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))

car_pred_model(ls_rs, "lasso.pkl")

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# Number of trees
n_estimators = list(range(500, 1000, 100))
# Maximum tree depth
max_depth = list(range(4, 9, 4))
# Minimum number of samples required to split a node
min_samples_split = list(range(4, 9, 2))
# Minimum number of samples in a leaf
min_samples_leaf = [1, 2, 5, 7]
# Number of features considered at each split
# (None means all features; it replaces 'auto', which scikit-learn 1.3 removed)
max_features = [None, 'sqrt']

# Search space for the randomized search
param_grid = {"n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

rf_rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid)

car_pred_model(rf_rs, 'random_forest.pkl')

print(rf_rs.best_estimator_)

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# Learning rate
learning_rate = [0.001, 0.01, 0.1, 0.2]
# Number of boosting stages
n_estimators = list(range(500, 1000, 100))
# Maximum tree depth
max_depth = list(range(4, 9, 4))
# Minimum number of samples required to split a node
min_samples_split = list(range(4, 9, 2))
# Minimum number of samples in a leaf
min_samples_leaf = [1, 2, 5, 7]
# Number of features considered at each split
# (None means all features; it replaces 'auto', which scikit-learn 1.3 removed)
max_features = [None, 'sqrt']

# Search space for the randomized search
param_grid = {"learning_rate": learning_rate,
              "n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

gb_rs = RandomizedSearchCV(estimator=gb, param_distributions=param_grid)
car_pred_model(gb_rs, "gradient_boosting.pkl")

# One name per model evaluated above, in the same order
Technique = ["LinearRegression", "Ridge", "Lasso", "RandomForestRegressor", "GradientBoostingRegressor"]
results = pd.DataFrame({'Model': Technique, 'R Squared(Train)': R2_train, 'R Squared(Test)': R2_test,
                        'CV score mean(Train)': CV})
display(results)

print(len(Technique), len(R2_train), len(R2_test), len(CV))