Big Data Analysis: Used Car Prices
I. Background
We live in the era of big data, and as analysis tools and techniques keep improving, mastering big data analysis skills can greatly benefit one's career development.
This project analyzes used car price prediction. The car is now a ubiquitous means of transport: whether for long trips or everyday errands, it is usually the vehicle of choice.
According to recent statistics from the public security traffic administration, household car ownership nationwide stands at 37.4%, on top of a large volume of public and ride-hailing transport such as Didi and taxis. The number of cars being traded in and replaced is therefore enormous, which makes big-data-driven prediction of used car prices a problem well worth developing.
II. Analysis Design
1. Contents of the dataset
- Car_Name: car model
- Selling_Price(lacs): the price the owner wants to sell the car for (in lakhs)
- Present_Price(lacs): the car's current showroom price (in lakhs)
- Kms_Driven: kilometers the car has been driven
- Fuel_Type: fuel type (Petrol/Diesel/CNG/LPG/Electric)
- Seller_Type: whether the seller is an individual or a dealer
- Transmission: transmission type (Automatic/Manual)
- Past_Owners: number of previous owners
- Year: year of purchase
2. Overview of the analysis plan
(1) Preprocess and clean the dataset, and compute the correlation between each variable and the used car price.
(2) Visualize the dataset with Python to identify the factors that influence used car prices.
(3) Train models to further explore those factors.
III. Analysis Steps
Data source
Dataset: https://www.kaggle.com/datasets?search=car-price-prediction
Importing the dataset
```python
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

%matplotlib inline
pd.set_option("display.max_rows", None, "display.max_columns", None)
warnings.simplefilter(action='ignore')
plt.style.use('seaborn')

# load dataset
df_main = pd.read_csv('./car data.csv')
df_main.head()
```

The dataset has 301 rows and 9 features.
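This can be confirmed directly (the same call appears in the complete listing in Section V):

```python
df_main.shape  # (301, 9): 301 records, 9 columns
```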
Missing-value analysis
```python
df_main.info()
```
Summary statistics
```python
# numerical stats
df_main.describe()
```
Checking the dataset for missing values
```python
df_main.isna().sum()
```
Data preprocessing
```python
fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(25, 13)

sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], ax=axes[0][0])
sns.barplot(x=df_main['Fuel_Type'], y=df_main['Selling_Price'], ax=axes[0][1])
sns.barplot(x=df_main['Seller_Type'], y=df_main['Selling_Price'], ax=axes[1][0])
sns.barplot(x=df_main['Transmission'], y=df_main['Selling_Price'], ax=axes[1][1])
sns.barplot(x=df_main['Owner'], y=df_main['Selling_Price'], ax=axes[2][0])
sns.scatterplot(x=df_main['Kms_Driven'], y=df_main['Selling_Price'], ax=axes[2][1])
```


From these plots we can conclude:
- As cars age, the price drops. The Year plot looks uneven because the data are not equally available for all years; otherwise we would see a roughly linear trend.
- Diesel cars command the highest prices of all fuel types, which is curious given that diesel engines have a finite service life.
- Automatic cars offered by dealers fetch high prices, which makes sense because few people like shifting gears while driving.
- We can clearly see that the data are not evenly distributed across all attribute values.
- For example, there is very little data from 2003 to 2011 or from 2018, and far fewer CNG records than petrol records.
Automatic vs. manual transmission
```python
fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(22, 15)
sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], hue=df_main['Transmission'], ax=ax1)
sns.scatterplot(x=df_main['Present_Price'], y=df_main['Selling_Price'], ax=ax2)
```
Since 2012, automatic cars have dominated the market, and a car's selling price is roughly proportional to its present price.
```python
# Convert model year into vehicle age; the dataset is from 2020, so Age = 2020 - Year, then drop Year
df_main['Age'] = 2020 - df_main['Year']
df_main.drop('Year', axis=1, inplace=True)
df_main.rename(columns={'Selling_Price': 'Selling_Price(lacs)',
                        'Present_Price': 'Present_Price(lacs)',
                        'Owner': 'Past_Owners'}, inplace=True)
```
Exploratory data analysis
Univariate analysis
```python
df_main.columns
```
```python
cat_cols = ['Fuel_Type', 'Seller_Type', 'Transmission', 'Past_Owners']
i = 0
while i < 4:
    fig = plt.figure(figsize=[10, 4])

    plt.subplot(1, 2, 1)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    plt.subplot(1, 2, 2)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    plt.show()
```
The categories are clearly imbalanced, which poses an imbalanced-dataset problem for model prediction: the vast majority of cars have 0 past owners, are manual, are sold by dealers, and run on petrol.
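A quick way to quantify this imbalance is to print the share of each category; this check is not in the original notebook, just a convenience built on the `cat_cols` list defined above:

```python
# proportion of each category in the four categorical columns
for col in cat_cols:
    print(df_main[col].value_counts(normalize=True), '\n')
```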
Boxplots of the numerical variables for outlier detection
```python
num_cols = ['Selling_Price(lacs)', 'Present_Price(lacs)', 'Kms_Driven', 'Age']
i = 0
while i < 4:
    fig = plt.figure(figsize=[13, 3])

    plt.subplot(1, 2, 1)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    plt.subplot(1, 2, 2)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    plt.show()
```
Outliers do exist in the dataset. We keep them for now instead of replacing them; this also lets us test later how well the models fit these points.
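If we later want to quantify those outliers, a standard 1.5×IQR rule would do; the helper below is a sketch and is not part of the original notebook:

```python
# count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numerical column
def count_outliers(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

for col in num_cols:
    print(col, count_outliers(df_main[col]))
```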
Quantile summaries of the price-related variables
```python
def num_summary(dataframe, numerical_col):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    print(dataframe[numerical_col].describe(quantiles).T)

for num_col in df_main[['Present_Price(lacs)', 'Selling_Price(lacs)', 'Kms_Driven']].columns:
    num_summary(df_main, num_col)
```
Multivariate analysis
```python
sns.heatmap(df_main.corr(), annot=True, cmap="RdBu")
plt.show()
```
The car's present price correlates strongly with the asking price, which is reasonable since both reflect the car's value; the other variables show much weaker correlations.
Results
```python
df_main.corr()['Selling_Price(lacs)']
```
```python
df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Fuel_Type')

df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Transmission')
```
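Before the split below, the complete listing in Section V first drops the high-cardinality Car_Name column and one-hot encodes the remaining categorical columns, so that every feature the models see is numeric; the essential step is:

```python
# drop the car name and one-hot encode categorical columns (from the complete listing)
df_main.drop(labels='Car_Name', axis=1, inplace=True)
df_main = pd.get_dummies(data=df_main, drop_first=True)
```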
Train/test split
```python
y = df_main['Selling_Price(lacs)']
X = df_main.drop('Selling_Price(lacs)', axis=1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("x train: ", X_train.shape)
print("x test: ", X_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)
```
```python
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

CV = []
R2_train = []
R2_test = []

def car_pred_model(model, model_name):
    # fit the model
    model.fit(X_train, y_train)

    # R2 on the training set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train, y_pred_train)
    R2_train.append(round(R2_train_model, 2))

    # R2 on the test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test, y_pred_test)
    R2_test.append(round(R2_test_model, 2))

    # cross-validated R2
    cross_val = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean, 2))

    # print results
    print("Train R2-score :", round(R2_train_model, 2))
    print("Test R2-score :", round(R2_test_model, 2))
    print("Train CV scores :", cross_val)
    print("Train CV mean :", round(cv_mean, 2))

    # residual plot of the training data
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    ax[0].set_title('Residual Plot of Train samples')
    sns.distplot((y_train - y_pred_train), hist=False, ax=ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')

    # y_test vs y_pred_test scatter plot
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x=y_test, y=y_pred_test)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')

    plt.show()
```
Standard linear regression (ordinary least squares)
```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr, "Linear_regressor.pkl")
```
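Between the linear and Lasso models, the complete listing in Section V also evaluates a Ridge regressor (hence the "Ridge" entry in the summary table at the end); its code from that listing is:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Ridge model with a randomized search over alpha
rg = Ridge()
alpha = np.logspace(-3, 3, num=14)  # range of alpha

rg_rs = RandomizedSearchCV(estimator=rg, param_distributions=dict(alpha=alpha))
car_pred_model(rg_rs, "ridge.pkl")
```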
Lasso regression
```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3, 3, num=14)  # range for alpha

ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))
car_pred_model(ls_rs, "lasso.pkl")
```
There is still room to improve the accuracy, but the result is decent: the residuals fluctuate within roughly ±5.
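To see which alpha the randomized search actually picked (not shown in the original notebook), one can inspect the fitted search object:

```python
# best hyperparameters found by the randomized search
print(ls_rs.best_params_)
```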
Random forest
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# number of base estimators
n_estimators = list(range(500, 1000, 100))
# maximum tree depth
max_depth = list(range(4, 9, 4))
# minimum samples required to split a node
min_samples_split = list(range(4, 9, 2))
# minimum samples per leaf
min_samples_leaf = [1, 2, 5, 7]
# number of features considered at each split
max_features = ['auto', 'sqrt']

# search grid
param_grid = {"n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

rf_rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid)

car_pred_model(rf_rs, 'random_forest.pkl')
```
The accuracy is quite high: the residuals cluster around 0, which shows the random forest's strong fitting ability.
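The complete listing in Section V also prints the best estimator found by the search:

```python
print(rf_rs.best_estimator_)
```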
Gradient boosting trees (GBDT)
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# learning rate
learning_rate = [0.001, 0.01, 0.1, 0.2]
# number of base estimators
n_estimators = list(range(500, 1000, 100))
# maximum tree depth
max_depth = list(range(4, 9, 4))
# minimum samples required to split a node
min_samples_split = list(range(4, 9, 2))
# minimum samples per leaf
min_samples_leaf = [1, 2, 5, 7]
# number of features considered at each split
max_features = ['auto', 'sqrt']

# search grid
param_grid = {"learning_rate": learning_rate,
              "n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

gb_rs = RandomizedSearchCV(estimator=gb, param_distributions=param_grid)
car_pred_model(gb_rs, "gradient_boosting.pkl")
```
Note that this plot's scale differs from all the earlier ones, but it is easy to see that GBDT outperforms every previous model: the residuals are distributed within roughly ±0.2.
```python
Technique = ["LinearRegression", "Ridge", "Lasso", "RandomForestRegressor", "GradientBoostingRegressor"]
results = pd.DataFrame({'Model': Technique,
                        'R Squared(Train)': R2_train,
                        'R Squared(Test)': R2_test,
                        'CV score mean(Train)': CV})
display(results)
print(len(Technique), len(R2_train), len(R2_test), len(CV))
```
IV. Conclusion
We carried out data preprocessing, visualization, and model training in turn to explore the factors that influence used car prices. The older a used car is and the more previous owners it has had, the lower its price tends to be. The data also show that manual cars still make up the vast majority, although automatics are becoming more and more common, and that manual petrol cars sold through dealers tend to command higher prices. Above all, the asking price is most closely tied to the car's present price.
Through this experiment I learned how to preprocess and visualize data, how to tune hyperparameters according to how well a model fits the data, and became more familiar with the whole model-training and data-analysis workflow. I also learned how to draw useful conclusions from the results and make recommendations. In future work, the data could be analyzed in more depth and updated in real time, so as to better reflect and adapt to market changes.
This hands-on experiment gave me a deeper appreciation of big data analysis, though there is still much more to learn.
V. Complete Code
```python
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

%matplotlib inline
pd.set_option("display.max_rows", None, "display.max_columns", None)
warnings.simplefilter(action='ignore')
plt.style.use('seaborn')

# load dataset
df_main = pd.read_csv('./car data.csv')

df_main.head()

df_main.shape

df_main.info()

# numerical stats
df_main.describe()

df_main.isna().sum()

fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(25, 13)

sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], ax=axes[0][0])
sns.barplot(x=df_main['Fuel_Type'], y=df_main['Selling_Price'], ax=axes[0][1])
sns.barplot(x=df_main['Seller_Type'], y=df_main['Selling_Price'], ax=axes[1][0])
sns.barplot(x=df_main['Transmission'], y=df_main['Selling_Price'], ax=axes[1][1])
sns.barplot(x=df_main['Owner'], y=df_main['Selling_Price'], ax=axes[2][0])
sns.scatterplot(x=df_main['Kms_Driven'], y=df_main['Selling_Price'], ax=axes[2][1])

# automatic vs. manual transmission
fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(22, 15)
sns.barplot(x=df_main['Year'], y=df_main['Selling_Price'], hue=df_main['Transmission'], ax=ax1)
sns.scatterplot(x=df_main['Present_Price'], y=df_main['Selling_Price'], ax=ax2)

# Convert model year into vehicle age; the dataset is from 2020, so Age = 2020 - Year, then drop Year
df_main['Age'] = 2020 - df_main['Year']
df_main.drop('Year', axis=1, inplace=True)

df_main.rename(columns={'Selling_Price': 'Selling_Price(lacs)',
                        'Present_Price': 'Present_Price(lacs)',
                        'Owner': 'Past_Owners'}, inplace=True)

df_main.columns

cat_cols = ['Fuel_Type', 'Seller_Type', 'Transmission', 'Past_Owners']
i = 0
while i < 4:
    fig = plt.figure(figsize=[10, 4])

    plt.subplot(1, 2, 1)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    plt.subplot(1, 2, 2)
    sns.countplot(x=cat_cols[i], data=df_main)
    i += 1

    plt.show()

# boxplots of the numerical variables for outlier detection
num_cols = ['Selling_Price(lacs)', 'Present_Price(lacs)', 'Kms_Driven', 'Age']
i = 0
while i < 4:
    fig = plt.figure(figsize=[13, 3])

    plt.subplot(1, 2, 1)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    plt.subplot(1, 2, 2)
    sns.boxplot(x=num_cols[i], data=df_main)
    i += 1

    plt.show()

# quantile summaries of the price-related variables
def num_summary(dataframe, numerical_col):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
    print(dataframe[numerical_col].describe(quantiles).T)

for num_col in df_main[['Present_Price(lacs)', 'Selling_Price(lacs)', 'Kms_Driven']].columns:
    num_summary(df_main, num_col)

sns.heatmap(df_main.corr(), annot=True, cmap="RdBu")
plt.show()

df_main.corr()['Selling_Price(lacs)']

df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Fuel_Type')

df_main.pivot_table(values='Selling_Price(lacs)', index='Seller_Type', columns='Transmission')

df = df_main.copy()  # back up the data

# Fuel Type
df['Fuel_Type'] = df['Fuel_Type'].map({
    'Petrol': 0,
    'CNG': 1,
    'Diesel': 2
})
df['Fuel_Type'] = df['Fuel_Type'].astype(int)

# Seller Type
df['Seller_Type'] = df['Seller_Type'].map({
    'Dealer': 0,
    'Individual': 1,
})
df['Seller_Type'] = df['Seller_Type'].astype(int)

# Transmission
df['Transmission'] = df['Transmission'].map({
    'Manual': 0,
    'Automatic': 1,
})
df['Transmission'] = df['Transmission'].astype(int)  # fixed: the original re-cast Seller_Type here

import scipy.stats as stat
import pylab

def plot_data(df, feature):
    # histogram on the left, normal probability plot on the right
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    print(feature)
    df[feature].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(df[feature], dist='norm', plot=pylab)
    plt.show()

plot_data(df, 'Selling_Price(lacs)')
plot_data(df, 'Present_Price(lacs)')
plot_data(df, 'Kms_Driven')

df_main.drop(labels='Car_Name', axis=1, inplace=True)

df_main.head()

df_main = pd.get_dummies(data=df_main, drop_first=True)

df_main.head()

y = df_main['Selling_Price(lacs)']
X = df_main.drop('Selling_Price(lacs)', axis=1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("x train: ", X_train.shape)
print("x test: ", X_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

CV = []
R2_train = []
R2_test = []

def car_pred_model(model, model_name):
    # fit the model
    model.fit(X_train, y_train)

    # R2 on the training set
    y_pred_train = model.predict(X_train)
    R2_train_model = r2_score(y_train, y_pred_train)
    R2_train.append(round(R2_train_model, 2))

    # R2 on the test set
    y_pred_test = model.predict(X_test)
    R2_test_model = r2_score(y_test, y_pred_test)
    R2_test.append(round(R2_test_model, 2))

    # cross-validated R2
    cross_val = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean = cross_val.mean()
    CV.append(round(cv_mean, 2))

    # print results
    print("Train R2-score :", round(R2_train_model, 2))
    print("Test R2-score :", round(R2_test_model, 2))
    print("Train CV scores :", cross_val)
    print("Train CV mean :", round(cv_mean, 2))

    # residual plot of the training data
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    ax[0].set_title('Residual Plot of Train samples')
    sns.distplot((y_train - y_pred_train), hist=False, ax=ax[0])
    ax[0].set_xlabel('y_train - y_pred_train')

    # y_test vs y_pred_test scatter plot
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x=y_test, y=y_pred_test)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')

    plt.show()

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
car_pred_model(lr, "Linear_regressor.pkl")

from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Creating Ridge model object
rg = Ridge()
# range of alpha
alpha = np.logspace(-3, 3, num=14)

# RandomizedSearchCV finds the best hyperparameter setting
rg_rs = RandomizedSearchCV(estimator=rg, param_distributions=dict(alpha=alpha))

car_pred_model(rg_rs, "ridge.pkl")

from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3, 3, num=14)  # range for alpha

ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))

car_pred_model(ls_rs, "lasso.pkl")

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# number of base estimators
n_estimators = list(range(500, 1000, 100))
# maximum tree depth
max_depth = list(range(4, 9, 4))
# minimum samples required to split a node
min_samples_split = list(range(4, 9, 2))
# minimum samples per leaf
min_samples_leaf = [1, 2, 5, 7]
# number of features considered at each split
max_features = ['auto', 'sqrt']

# search grid
param_grid = {"n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

rf_rs = RandomizedSearchCV(estimator=rf, param_distributions=param_grid)

car_pred_model(rf_rs, 'random_forest.pkl')

print(rf_rs.best_estimator_)

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# learning rate
learning_rate = [0.001, 0.01, 0.1, 0.2]
# number of base estimators
n_estimators = list(range(500, 1000, 100))
# maximum tree depth
max_depth = list(range(4, 9, 4))
# minimum samples required to split a node
min_samples_split = list(range(4, 9, 2))
# minimum samples per leaf
min_samples_leaf = [1, 2, 5, 7]
# number of features considered at each split
max_features = ['auto', 'sqrt']

# search grid
param_grid = {"learning_rate": learning_rate,
              "n_estimators": n_estimators,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "max_features": max_features}

gb_rs = RandomizedSearchCV(estimator=gb, param_distributions=param_grid)
car_pred_model(gb_rs, "gradient_boosting.pkl")

Technique = ["LinearRegression", "Ridge", "Lasso", "RandomForestRegressor", "GradientBoostingRegressor"]
results = pd.DataFrame({'Model': Technique,
                        'R Squared(Train)': R2_train,
                        'R Squared(Test)': R2_test,
                        'CV score mean(Train)': CV})
display(results)

print(len(Technique), len(R2_train), len(R2_test), len(CV))
```