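Python Data Analysis and Data Mining for Beginners, Chapter 3: Common Machine Learning Algorithms, Section 2: Linear Regression (Part 2), Hands-On
Linear Regression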
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
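boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this needs an older version
X = boston.data[:, 5]  # RM: average number of rooms per dwelling
y = boston.target  # house price
print(X.shape)
print(y.shape)
print(boston.DESCR)  # dataset description
plt.scatter(X, y)  # scatter plot of a single feature: RM vs. price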
Signature: plt.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, hold=None, data=None, **kwargs)
Docstring: Make a scatter plot of x vs y. Marker size is scaled by s and marker color is mapped to c.
Parameters
x, y : array_like, shape (n, ). Input data.
s : scalar or array_like, shape (n, ), optional. Size in points^2. Default is rcParams['lines.markersize'] ** 2.
c : color, sequence, or sequence of color, optional, default: 'b'. c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.
marker : ~matplotlib.markers.MarkerStyle, optional, default: 'o'. See ~matplotlib.markers for more information on the different styles of markers scatter supports. marker can be either an instance of the class or the text shorthand for a particular marker.
cmap : ~matplotlib.colors.Colormap, optional, default: None. A ~matplotlib.colors.Colormap instance or registered name. cmap is only used if c is an array of floats. If None, defaults to rc image.cmap.
norm : ~matplotlib.colors.Normalize, optional, default: None. A ~matplotlib.colors.Normalize instance is used to scale luminance data to 0, 1. norm is only used if c is an array of floats. If None, use the default normalize.
vmin, vmax : scalar, optional, default: None. vmin and vmax are used in conjunction with norm to normalize luminance data. If either are None, the min and max of the color array is used. Note if you pass a norm instance, your settings for vmin and vmax will be ignored.
alpha : scalar, optional, default: None. The alpha blending value, between 0 (transparent) and 1 (opaque).
linewidths : scalar or array_like, optional, default: None. If None, defaults to (lines.linewidth,).
verts : sequence of (x, y), optional. If marker is None, these vertices will be used to construct the marker. The center of the marker is located at (0,0) in normalized units. The overall marker is rescaled by s.
edgecolors : color or sequence of color, optional, default: None. If None, defaults to 'face'. If 'face', the edge color will always be the same as the face color. If it is 'none', the patch boundary will not be drawn. For non-filled markers, the edgecolors kwarg is ignored and forced to 'face' internally.
Returns
paths : ~matplotlib.collections.PathCollection
Other Parameters
**kwargs : ~matplotlib.collections.Collection properties
See Also
plot : to plot scatter plots when markers are identical in size and color
Notes
- The plot function will be faster for scatterplots where markers don't vary in size or color.
- Any or all of x, y, s, and c may be masked arrays, in which case all masks will be combined and only unmasked points will be plotted.
- Fundamentally, scatter works with 1-D arrays; x, y, s, and c may be input as 2-D arrays, but within scatter they will be flattened. The exception is c, which will be flattened only if its size matches the size of x and y.
Note: In addition to the above described arguments, this function can take a data keyword argument. If such a data argument is given, the following arguments are replaced by data[<arg>]: all arguments with the following names: 'c', 'color', 'edgecolors', 'facecolor', 'facecolors', 'linewidths', 's', 'x', 'y'.
X
y.max()
Docstring: a.max(axis=None, out=None, keepdims=False)
X = X[y < 50]  # drop the samples with y >= 50 (X must be filtered before y is overwritten)
y = y[y < 50]
print(X.shape)
print(y.shape)
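The two filtering lines only work in that order, because the mask `y < 50` is recomputed from `y` each time. A minimal sketch of a slightly safer pattern, building the mask once and applying it to both arrays. The toy arrays `X_toy` and `y_toy` are made up for illustration and are not Boston data.

```python
import numpy as np

# Toy arrays (made up for illustration); 50.0 plays the role of the capped target value.
X_toy = np.array([5.1, 6.3, 7.0, 6.8, 5.9])
y_toy = np.array([21.0, 50.0, 33.5, 50.0, 18.2])

# Build the boolean mask once, then apply the same mask to both arrays.
mask = y_toy < 50
X_kept = X_toy[mask]
y_kept = y_toy[mask]

print(X_kept)  # features of the rows whose target is below 50
print(y_kept)  # the matching targets
```

With a stored mask, the order of the two assignments no longer matters.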
plt.scatter(X,y)
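The scatter plot suggests a roughly linear relationship between RM and price. As a rough illustration of what fitting a single variable means, the sketch below computes the least-squares slope and intercept in closed form. The data is synthetic: the names `rooms` and `price` and the numbers 9.0 and -34.0 are assumptions chosen for the demo, not values taken from the Boston dataset.

```python
import numpy as np

# Synthetic stand-in for one feature and a target (not the Boston data).
rng = np.random.default_rng(0)
rooms = rng.uniform(4.0, 8.0, size=200)
price = 9.0 * rooms - 34.0 + rng.normal(0.0, 1.0, size=200)

# Closed-form least squares for the line y = a * x + b.
x_mean, y_mean = rooms.mean(), price.mean()
a = np.sum((rooms - x_mean) * (price - y_mean)) / np.sum((rooms - x_mean) ** 2)
b = y_mean - a * x_mean

# np.polyfit with degree 1 solves the same least-squares problem.
a_ref, b_ref = np.polyfit(rooms, price, 1)
print(a, b)
```

The slope `a` is the covariance of x and y divided by the variance of x, and the intercept `b` makes the line pass through the point of means.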
Multiple Linear Regression
X = boston.data
y = boston.target
X = X[y < 50]
y = y[y < 50]
from sklearn.model_selection import train_test_split  # utility for splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # hold out 20% of the samples as the test set
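Conceptually, a shuffled 80/20 split permutes the row indices and cuts off a test portion. The sketch below shows that idea with NumPy only. `shuffle_split` is a hypothetical helper written for this illustration and is not how scikit-learn implements `train_test_split`.

```python
import numpy as np

def shuffle_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then cut off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so rows stay paired with their targets.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i is [2i, 2i + 1] and its target is i.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

In scikit-learn itself, passing `random_state` to `train_test_split` fixes the shuffle, so the scores reported later repeat between runs.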
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
Init signature: LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
Docstring: Ordinary least squares Linear Regression.
Parameters
fit_intercept : boolean, optional, default True. Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
normalize : boolean, optional, default False. This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
copy_X : boolean, optional, default True. If True, X will be copied; else, it may be overwritten.
n_jobs : int, optional, default 1. The number of jobs to use for the computation. If -1 all CPUs are used. This will only provide speedup for n_targets > 1 and sufficient large problems.
Attributes
coef_ : array, shape (n_features, ) or (n_targets, n_features). Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
intercept_ : array. Independent term in the linear model.
Notes
From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.
File: c:\users\qq123\anaconda3\lib\site-packages\sklearn\linear_model\base.py
Type: ABCMeta
lin_reg.fit(X_train,y_train)
Signature: lin_reg.fit(X, y, sample_weight=None)
Docstring: Fit linear model.
Parameters
X : numpy array or sparse matrix of shape [n_samples, n_features]. Training data.
y : numpy array of shape [n_samples, n_targets]. Target values. Will be cast to X's dtype if necessary.
sample_weight : numpy array of shape [n_samples]. Individual weights for each sample. (versionadded 0.17: parameter sample_weight support to LinearRegression.)
Returns
self : returns an instance of self.
lin_reg.coef_  # coefficients, one per feature
lin_reg.intercept_  # intercept
lin_reg.score(X_test, y_test)  # R^2 (coefficient of determination) on the test set
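For a regressor, `score` reports the coefficient of determination R^2, which compares the model's squared error with that of always predicting the mean. A minimal sketch of that formula follows. The function name `r2` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
perfect = r2(y_true, y_true)                       # predictions match exactly
baseline = r2(y_true, np.full(4, y_true.mean()))   # always predict the mean
print(perfect, baseline)
```

A perfect model scores 1, a model no better than the mean scores 0, and a worse model can score below 0.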
K-Nearest Neighbors Regression
from sklearn.neighbors import KNeighborsRegressor  # load the KNN regressor
knn_reg = KNeighborsRegressor()  # create the regressor with default settings
knn_reg.fit(X_train,y_train)
knn_reg.score(X_test,y_test)
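With its default settings (5 neighbors, uniform weights, Euclidean distance), a KNN regressor predicts the plain average of the targets of the nearest training points. The sketch below illustrates that idea on made-up one-dimensional data. `knn_predict` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Hypothetical helper: average the targets of the k nearest training points."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data (made up): two clusters of points far apart.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([1.0, 2.0, 3.0, 20.0, 22.0])
pred = knn_predict(X_toy, y_toy, np.array([1.2]), k=3)
print(pred)
```

Here the three nearest points to 1.2 are 1.0, 2.0 and 0.0, so the prediction is the mean of their targets.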
from sklearn.model_selection import GridSearchCV
para_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg_grid = KNeighborsRegressor(n_jobs = -1)
grid_search = GridSearchCV(knn_reg_grid,para_grid,verbose =1)
grid_search.fit(X_train,y_train)
grid_search.best_estimator_
grid_search.best_score_
grid_search.best_estimator_.score(X_test,y_test)
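To get a feel for how much work the grid search does, it helps to count the parameter combinations it has to try. The sketch below expands the same grid with the standard library only. `expand` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
from itertools import product

para_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

def expand(grid):
    """Yield every parameter combination from a list of grid dictionaries."""
    for sub_grid in grid:
        keys = sorted(sub_grid)
        # Cartesian product of the value lists within one dictionary.
        for values in product(*(sub_grid[k] for k in keys)):
            yield dict(zip(keys, values))

candidates = list(expand(para_grid))
print(len(candidates))  # 10 uniform + 10 * 5 distance combinations
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count times the number of folds. Note also that `best_score_` is the mean cross-validated score on the training data, so it will generally differ from the score on `X_test`.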
Ranking Features by Coefficient
lin_reg.coef_  # coefficients
np.argsort(lin_reg.coef_)
Signature: np.argsort(a, axis=-1, kind='quicksort', order=None)
Docstring: Returns the indices that would sort an array.
Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.
Parameters
a : array_like. Array to sort.
axis : int or None, optional. Axis along which to sort. The default is -1 (the last axis). If None, the flattened array is used.
kind : {'quicksort', 'mergesort', 'heapsort'}, optional. Sorting algorithm.
order : str or list of str, optional. When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.
Returns
index_array : ndarray, int. Array of indices that sort a along the specified axis. If a is one-dimensional, a[index_array] yields a sorted a.
See Also
sort : Describes sorting algorithms used.
lexsort : Indirect stable sort with multiple keys.
ndarray.sort : Inplace sort.
argpartition : Indirect partial sort.
Notes
See sort for notes on the different sorting algorithms.
As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values. The enhanced sort order is documented in sort.
Examples
One dimensional array:
>>> x = np.array([3, 1, 2])
>>> np.argsort(x)
array([1, 2, 0])
Two-dimensional array:
>>> x = np.array([[0, 3], [2, 2]])
>>> x
array([[0, 3],
       [2, 2]])
>>> np.argsort(x, axis=0)  # sorts along first axis (down)
array([[0, 1],
       [1, 0]])
>>> np.argsort(x, axis=1)  # sorts along last axis (across)
array([[0, 1],
       [0, 1]])
Indices of the sorted elements of a N-dimensional array:
>>> ind = np.unravel_index(np.argsort(x, axis=None), x.shape)
>>> ind
(array([0, 1, 1, 0]), array([0, 0, 1, 1]))
>>> x[ind]  # same as np.sort(x, axis=None)
array([0, 2, 2, 3])
Sorting with keys:
>>> x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
>>> x
array([(1, 0), (0, 1)],
      dtype=[('x', '<i4'), ('y', '<i4')])
>>> np.argsort(x, order=('x','y'))
array([1, 0])
>>> np.argsort(x, order=('y','x'))
array([0, 1])
lin_reg.coef_[np.argsort(lin_reg.coef_)]  # coefficients in ascending order
boston.feature_names
boston.feature_names[np.argsort(lin_reg.coef_)]
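Indexing the feature names with the `argsort` result lines each name up with its coefficient, from most negative to most positive. The sketch below shows the same pattern on made-up values. The names `feat_a` to `feat_d` and the coefficients are hypothetical, not results from the Boston model.

```python
import numpy as np

# Hypothetical feature names and coefficients, for illustration only.
names = np.array(["feat_a", "feat_b", "feat_c", "feat_d"])
coefs = np.array([0.8, -12.5, 3.1, -0.02])

# Ascending order: most negative effect first, most positive last.
order = np.argsort(coefs)
for name, coef in zip(names[order], coefs[order]):
    print(f"{name}: {coef:+.2f}")

# Ranking by absolute size instead, largest effect first.
by_magnitude = names[np.argsort(-np.abs(coefs))]
print(by_magnitude)
```

One caveat when reading such a ranking: a coefficient's size depends on the units of its feature, so the comparison is most meaningful when the features are on similar scales.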