Learning scikit-learn, a Machine Learning Library
I. Loading Data
from sklearn import datasets
1. Datasets bundled with sklearn
Iris
from sklearn import datasets
datasets.load_iris()
Handwritten digits
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
print(digits.target.shape)
print(digits.images.shape)
2. Generating datasets
Generating random data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=6, n_features=5, n_informative=2,
n_redundant=2, n_classes=2, n_clusters_per_class=2, scale=1.0,
random_state=20)
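Generating clustered data with sklearn.datasets.make_blobs
make_blobs is commonly used to generate test data for clustering algorithms. Given the number of features, the number of centers, the per-cluster standard deviation, and the bounding box for the centers, it draws points from isotropic Gaussian blobs, one per center. The result can be used to check how well a clustering algorithm performs.
sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)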
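A minimal sketch of calling make_blobs, assuming 300 samples and a fixed random_state of 42 (both values are chosen here only for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs

# Draw 300 two-dimensional points grouped around 3 randomly placed centers.
# random_state is fixed only so that repeated runs give the same data.
X, y = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

print(X.shape)         # (300, 2)
print(np.unique(y))    # the cluster labels assigned to the points
print(np.bincount(y))  # number of points per cluster
```

X holds the coordinates and y holds the index of the center each point was drawn from, so y can serve as ground truth when scoring a clustering result.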
Generating circular and moon-shaped data with sklearn.datasets.make_circles and make_moons
sklearn.datasets.make_circles(n_samples=100, shuffle=True, noise=None, random_state=None, factor=0.8)
from sklearn.datasets import make_circles
x1, y1 = make_circles(n_samples=1000, factor=0.5, noise=0.1)
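The notes name make_moons but only call make_circles. A short sketch of both side by side, assuming the same sample count and noise level for each and a fixed random_state (illustrative choices, not from the original):

```python
from sklearn.datasets import make_circles, make_moons

# Two concentric circles; factor is the ratio of inner to outer radius.
X_circles, y_circles = make_circles(n_samples=1000, factor=0.5,
                                    noise=0.1, random_state=0)

# Two interleaving half-circles ("moons").
X_moons, y_moons = make_moons(n_samples=1000, noise=0.1, random_state=0)

print(X_circles.shape, X_moons.shape)  # (1000, 2) (1000, 2)
```

Both functions return two-class, two-dimensional data that is not linearly separable, which makes them handy for comparing linear and nonlinear classifiers.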
II. Data Preprocessing
from sklearn import preprocessing
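Core methods: fit, transform, fit_transform
Standardization: StandardScaler rescales each feature to zero mean and unit standard deviation
Min-max scaling: MinMaxScaler maps each feature to the interval [0, 1]
Normalization: X_normalized = preprocessing.normalize(X, norm='l2') scales each sample (row) to unit L2 norm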
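The scalers above can be sketched on a small array. The 3x3 matrix X and the extra row X_new are made-up illustration data:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

# fit learns the per-column mean and standard deviation; transform applies them.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)  # equivalent to scaler.fit_transform(X)

# The fitted scaler can be reused on new data (for example, a test set).
X_new = np.array([[1.0, 0.0, 0.0]])
X_new_std = scaler.transform(X_new)

# Rescale each column to the range [0, 1].
X_minmax = preprocessing.MinMaxScaler().fit_transform(X)

# Rescale each row to unit L2 norm (works per sample, not per feature).
X_normalized = preprocessing.normalize(X, norm='l2')

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(np.linalg.norm(X_normalized, axis=1))
```

Fitting on the training data and calling only transform on later data keeps the scaling parameters consistent between the two.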
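One-hot encoding of categorical features: OneHotEncoder can encode both features and class labels. For labels, the input must be a 2-D array, e.g. encoder.fit_transform(np.array(a).reshape(-1, 1)).toarray(), where encoder is a OneHotEncoder instance and a is a 1-D sequence of labels
Feature binarization: Binarizer(threshold=1.1) maps values above the threshold to 1 and all others to 0
Label encoding: LabelEncoder converts labels to integers
LabelBinarizer converts labels to one-hot vectors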
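A small sketch of the encoders listed above. The label list and the numeric row passed to Binarizer are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import (Binarizer, LabelBinarizer,
                                   LabelEncoder, OneHotEncoder)

labels = ['cat', 'dog', 'bird', 'dog']

# LabelEncoder: classes are sorted, so bird=0, cat=1, dog=2.
int_labels = LabelEncoder().fit_transform(labels)

# LabelBinarizer: one column per class, a single 1 in each row.
onehot_labels = LabelBinarizer().fit_transform(labels)

# OneHotEncoder expects a 2-D array, hence the reshape(-1, 1).
encoder = OneHotEncoder()
onehot = encoder.fit_transform(np.array(labels).reshape(-1, 1)).toarray()

# Binarizer: values strictly greater than the threshold become 1.
binary = Binarizer(threshold=1.1).fit_transform(np.array([[0.5, 1.1, 3.0]]))

print(int_labels)
print(onehot_labels)
print(onehot)
print(binary)
```

In this three-class case OneHotEncoder and LabelBinarizer give the same 0/1 pattern; OneHotEncoder returns a sparse matrix by default, which is why toarray() is called.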
III. Splitting the Dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)
k-fold cross-validation:
from sklearn.model_selection import cross_val_score
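A minimal sketch of 5-fold cross-validation with cross_val_score. The iris data, the LogisticRegression estimator, and max_iter=1000 are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# max_iter is raised so the solver converges on unscaled features.
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold serves once as the test set.
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```

The mean of the fold scores is usually a steadier estimate of generalization than a single train/test split.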
IV. Importing Models:
Linear regression
from sklearn.linear_model import LinearRegression
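# The normalize parameter was removed in scikit-learn 1.2; scale features with StandardScaler beforehand if needed.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)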
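A short sketch of fitting and predicting with LinearRegression, using made-up points that lie exactly on the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points on the line y = 2x + 1 (illustration data only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # learned slope, approximately [2.]
print(model.intercept_)  # learned intercept, approximately 1.0

# Predict for an unseen input; the result is approximately [9.]
prediction = model.predict(np.array([[4.0]]))
print(prediction)
```

All scikit-learn estimators follow this same fit / predict pattern, so the models in the rest of this section can be used the same way.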
Logistic regression (LR)
from sklearn.linear_model import LogisticRegression
Naive Bayes (NB)
from sklearn import naive_bayes
Decision tree (DT)
from sklearn.tree import DecisionTreeClassifier
Support vector machine (SVM)
from sklearn.svm import SVC
k-nearest neighbors (KNN)
from sklearn import neighbors
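The classifiers imported above share the same interface. A sketch that trains each on the iris data and reports test accuracy; the dataset, the split, and all hyperparameters (GaussianNB as the Naive Bayes variant, n_neighbors=5, and so on) are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    'LR': LogisticRegression(max_iter=1000),
    'NB': GaussianNB(),
    'DT': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
}

accuracies = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)                      # train on the training split
    accuracies[name] = clf.score(X_test, y_test)   # mean accuracy on the test split
    print(name, accuracies[name])
```

Because every estimator exposes fit and score, swapping one model for another only changes the constructor line.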
V. Model Evaluation
Validation curve
from sklearn.model_selection import validation_curve
Confusion matrix
from sklearn.metrics import confusion_matrix
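A sketch of both evaluation tools on the digits data. The SVC estimator, the gamma range, and cv=3 are illustrative assumptions, not settings from the notes:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Validation curve: score the model for several values of one hyperparameter.
param_range = np.logspace(-4, -2, 3)  # gamma = 1e-4, 1e-3, 1e-2
train_scores, val_scores = validation_curve(
    SVC(), X, y, param_name='gamma', param_range=param_range, cv=3)

# Each row corresponds to one gamma value, each column to one CV fold.
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))

# Confusion matrix: rows are true classes, columns are predicted classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = SVC(gamma=0.001).fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

A large gap between the mean training score and the mean validation score at a given gamma suggests overfitting at that setting; off-diagonal entries in the confusion matrix show which digits get confused with each other.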
VI. Saving the Model:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import the standalone joblib package instead.
import joblib
joblib.dump(model, 'model.pickle')
model = joblib.load('model.pickle')