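- year, month, day, week: the date of each record (year, month, day of the month, and day of the week).
- temp_2: the maximum temperature two days earlier.
- temp_1: the maximum temperature one day earlier.
- average: the historical average maximum temperature for this calendar day.
- actual: the label, i.e. the true maximum temperature on that day.
- friend: probably just noise. It is a value guessed by a friend and can be ignored.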
print('Data shape:', features.shape)
# Data shape: (348, 9)
# Process the date columns
import datetime

# Extract year, month, and day
years = features['year']
months = features['month']
days = features['day']

# Convert to datetime objects
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
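The string-building step above can also be done in one call. A minimal sketch, assuming a hypothetical toy frame `toy` that has the same `year`, `month`, and `day` columns as the dataset; `pd.to_datetime` accepts a DataFrame whose columns carry those names:

```python
import pandas as pd

# Hypothetical toy frame with the same three date columns as the dataset
toy = pd.DataFrame({
    'year': [2016, 2016, 2016],
    'month': [1, 1, 12],
    'day': [1, 2, 31],
})

# pd.to_datetime assembles timestamps from columns named year, month, day
toy_dates = pd.to_datetime(toy[['year', 'month', 'day']])
print(toy_dates.tolist())
```

The result is a pandas Series of timestamps rather than a list of `datetime.datetime` objects, but matplotlib can plot either on a date axis.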
# Set up plotting
import matplotlib.pyplot as plt

%matplotlib inline

# Set the default style
plt.style.use('fivethirtyeight')
# Set up the layout
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
fig.autofmt_xdate(rotation=45)

# Label: actual maximum temperature
ax1.plot(dates, features['actual'])
ax1.set_xlabel(''); ax1.set_ylabel('Temperature'); ax1.set_title('Max Temp')

# One day earlier
ax2.plot(dates, features['temp_1'])
ax2.set_xlabel(''); ax2.set_ylabel('Temperature'); ax2.set_title('Previous Max Temp')

# Two days earlier
ax3.plot(dates, features['temp_2'])
ax3.set_xlabel('Date'); ax3.set_ylabel('Temperature'); ax3.set_title('Two Days Prior Max Temp')

# My goofy friend's guess
ax4.plot(dates, features['friend'])
ax4.set_xlabel('Date'); ax4.set_ylabel('Temperature'); ax4.set_title('Friend Estimate')

plt.tight_layout(pad=2)
Figure 9-1: Plots of each feature
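Figure 9-2 shows a common conversion called one-hot encoding, whose purpose is to turn categorical attribute values into numbers. For every distinct value a feature can take, one new column is created; the column that matches a row's value is set to 1 and all the others are set to 0.
Figure 9-2: Feature encoding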
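To make the encoding concrete, here is a minimal sketch on a hypothetical three-row frame `toy` (made-up values, not the real dataset). Passing `dtype=int` is a choice made here so the new columns hold 0/1 integers regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    'temp_1': [45, 44, 41],
    'week': ['Fri', 'Sat', 'Fri'],
})

# Only object/category columns are encoded; numeric columns pass through
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The `week` column has two distinct values, so it is replaced by two columns, `week_Fri` and `week_Sat`, while `temp_1` is left unchanged.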
# One-hot encoding
features = pd.get_dummies(features)
features.head(5)

help(pd.get_dummies)

Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.

    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.

        .. versionadded:: 0.23.0

    Returns
    -------
    DataFrame
        Dummy-coded data.

    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.

    Examples
    --------
    >>> s = pd.Series(list('abca'))

    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0

    >>> s1 = ['a', 'b', np.nan]

    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0

    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1

    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})

    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1

    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0

    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0

    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
print('Shape of features after one-hot encoding:', features.shape) #Shape of features after one-hot encoding: (348, 15)
# Features and labels
import numpy as np

# Labels
labels = np.array(features['actual'])

# Remove the label column from the features
features = features.drop('actual', axis=1)

# Save the column names separately for later use
feature_list = list(features.columns)

# Convert to a NumPy array
features = np.array(features)
# Split the dataset
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25,
                                                                            random_state=42)
print('Training features:', train_features.shape)
print('Training labels:', train_labels.shape)
print('Test features:', test_features.shape)
print('Test labels:', test_labels.shape)

Training features: (261, 14)
Training labels: (261,)
Test features: (87, 14)
Test labels: (87,)
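The split sizes follow directly from the row count and `test_size`. A small arithmetic sketch, assuming scikit-learn rounds the test count up and gives the remainder to the training set:

```python
import math

n_samples = 348    # rows in the dataset, from the shape printed earlier
test_size = 0.25   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print('train rows:', n_train)
print('test rows:', n_test)
```

This reproduces the 261 training rows and 87 test rows reported by the split.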
# Import the algorithm
from sklearn.ensemble import RandomForestRegressor

# Build the model
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train
rf.fit(train_features, train_labels)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False)
# Predict on the test set
predictions = rf.predict(test_features)

# Compute the absolute errors
errors = abs(predictions - test_labels)

# Mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

print('MAPE:', np.mean(mape))
MAPE: 6.011244187972058
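MAPE averages the absolute error of each prediction expressed as a percentage of the true value. A minimal sketch on made-up numbers; the helper name `mean_absolute_percentage_error` and the zero-label check are additions for illustration, not part of the original code:

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """Return the MAPE in percent; undefined when a true value is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError('MAPE is undefined when a true value is 0')
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100)

# Hypothetical toy values: relative errors of 10%, 0%, and 5%
toy_true = [50.0, 60.0, 80.0]
toy_pred = [55.0, 60.0, 76.0]
toy_mape = mean_absolute_percentage_error(toy_true, toy_pred)
print('MAPE:', toy_mape)
```

The three relative errors average to roughly 5 percent. Because the metric divides by the true value, it only makes sense when the labels stay away from zero, which holds for these temperatures.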
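Step 1: Download and install.
Download graphviz-2.38.msi, double-click the .msi file once the download finishes, and keep clicking Next to install Graphviz. (Note: be sure to remember the installation path, because it is needed later when configuring the environment variables. The default installation path is C:\Program Files (x86)\Graphviz2.38.)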
dot -version
dot - graphviz version 2.38.0 (20140413.2041)
libdir = "D:\tools\GraphViz\bin"
Activated plugin library: gvplugin_dot_layout.dll
Using layout: dot:dot_layout
Activated plugin library: gvplugin_core.dll
Using render: dot:core
Using device: dot:dot:core
The plugin configuration file:
    D:\tools\GraphViz\bin\config6
        was successfully loaded.
    render      : cairo dot fig gd gdiplus map pic pov ps svg tk vml vrml xdot
    layout      : circo dot fdp neato nop nop1 nop2 osage patchwork sfdp twopi
    textlayout  : textlayout
    device      : bmp canon cmap cmapx cmapx_np dot emf emfplus eps fig gd gd2 gif gv imap imap_np ismap jpe jpeg jpg metafile pdf pic plain plain-ext png pov ps ps2 svg svgz tif tiff tk vml vmlz vrml wbmp xdot xdot1.2 xdot1.4
    loadimage   : (lib) bmp eps gd gd2 gif jpe jpeg jpg png ps svg
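The same check can be made from Python before any drawing code runs. A minimal sketch using only the standard library; the helper name `find_dot` is hypothetical:

```python
import shutil

def find_dot():
    """Return the full path of the Graphviz 'dot' executable, or None."""
    return shutil.which('dot')

dot_path = find_dot()
if dot_path is None:
    print('dot was not found on PATH; add the Graphviz bin directory to PATH')
else:
    print('dot found at', dot_path)
```

If this reports that `dot` is missing, the Graphviz `bin` directory has most likely not been added to the `PATH` environment variable yet.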
pip3 install graphviz
pip3 install pydot2
pip3 install pydotplus
pip3 install pydot
# Import the required tools
from sklearn.tree import export_graphviz
import pydot  # pip install pydot

# Take one tree from the forest
tree = rf.estimators_[5]

# Export it as a dot file
export_graphviz(tree, out_file='tree.dot', feature_names=feature_list, rounded=True, precision=1)

# Build the graph from the dot file
(graph, ) = pydot.graph_from_dot_file('tree.dot')

# Write the graph to a PNG file
graph.write_png('tree.png')
print('The depth of this tree is:', tree.tree_.max_depth)
# The depth of this tree is: 15
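A tree of depth 15 is hard to read in a picture. One option is to train a shallower forest purely for inspection. The sketch below is not from the original: it uses synthetic data with hypothetical feature names `f0`, `f1`, `f2`, caps the depth with `max_depth=3`, and prints the tree with `sklearn.tree.export_text`, which does not need Graphviz:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic stand-in data (assumption: not the temperature dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)
names = ['f0', 'f1', 'f2']

# A small forest whose trees are limited to three levels of splits
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(X, y)

# Inspect one tree as indented text
small_tree = rf_small.estimators_[5]
print('Depth:', small_tree.tree_.max_depth)
print(export_text(small_tree, feature_names=names))
```

A shallow forest like this is meant for illustration only; its accuracy will generally differ from that of the full model.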
Figure 9-9: Meaning of each value shown in the tree visualization
# Get the feature importances
importances = list(rf.feature_importances_)

# Pair each feature name with its rounded importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort from most to least important
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print each pair
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
Variable: temp_1               Importance: 0.7
Variable: average              Importance: 0.19
Variable: day                  Importance: 0.03
Variable: temp_2               Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
# x positions for the bars
x_values = list(range(len(importances)))

# Draw the bar chart
plt.bar(x_values, importances, orientation='vertical')

# Label the x ticks with the feature names
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
Figure 9-10: Random forest feature importances
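One way to decide how many features to keep is to accumulate the sorted importances until they pass a chosen threshold. A minimal sketch using the rounded, non-zero importances from the printed list; the 0.95 threshold is an assumption made here, not a value from the text:

```python
import numpy as np

# Rounded importances from the printed ranking (zero entries omitted)
ranked = [('temp_1', 0.7), ('average', 0.19), ('day', 0.03),
          ('temp_2', 0.02), ('friend', 0.02), ('month', 0.01)]

names = [name for name, _ in ranked]
cumulative = np.cumsum([value for _, value in ranked])

# Assumed cut-off: keep features until 95% of the importance is covered
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print('Features needed:', n_needed)
print('Selected:', names[:n_needed])
```

Because the printed values are rounded to two decimals, they sum to slightly less than 1; with the unrounded `rf.feature_importances_` the cut-off could fall at a different feature.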
# Try a model that uses only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000, random_state=42)

# Select these two features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Retrain the model
rf_most_important.fit(train_important, train_labels)

# Predict on the test set
predictions = rf_most_important.predict(test_important)

errors = abs(predictions - test_labels)

# Evaluate the result
mape = np.mean(100 * (errors / test_labels))

print('mape:', mape)
mape: 6.229055723613811
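To put the two results side by side, the short sketch below compares the MAPE values printed for the full model and for the two-feature model. The variable names are hypothetical; the numbers are copied from the outputs above:

```python
# MAPE values copied from the printed results
mape_all_features = 6.011244187972058
mape_two_features = 6.229055723613811

# Absolute change in percentage points and change relative to the full model
absolute_change = mape_two_features - mape_all_features
relative_change = 100 * absolute_change / mape_all_features

print('Absolute change (percentage points):', round(absolute_change, 3))
print('Relative change (%):', round(relative_change, 2))
```

Dropping twelve of the fourteen features raises the error by only about a fifth of a percentage point, which matches the importance ranking: almost all of the signal is carried by `temp_1` and `average`.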
# Date columns
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]

# Convert to datetime objects
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

# Build a table holding the dates and their true label values
true_data = pd.DataFrame(data={'date': dates, 'actual': labels})

# Likewise, build a table holding the test dates and the model predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]

test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]

test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]

predictions_data = pd.DataFrame(data={'date': test_dates, 'prediction': predictions})

# True values
plt.plot(true_data['date'], true_data['actual'], 'b-', label='actual')

# Predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label='prediction')
# rotation takes a number of degrees, so pass 60 rather than the string '60'
plt.xticks(rotation=60);
plt.legend()

# Axis labels and title
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');
