LazyPredict: Machine Learning Models for the Lazy

Machine learning offers many model algorithms, such as linear regression, logistic regression, KNN classification, support vector machines, and random forests, and we usually have to call each model's fit and predict steps one by one. Is there a tool that automates the modeling process and reports metrics such as accuracy, precision, recall, F1, and ROC AUC? There is: LazyPredict is a Python package built precisely for this kind of lazy modeling and prediction. It is well suited to newcomers in fields such as bioinformatics and smart breeding.

LazyPredict is designed to simplify the comparison and selection of machine learning models. It quickly benchmarks many models (roughly 40 for regression tasks and 30 for classification tasks) on your data. Its syntax is almost identical to scikit-learn's, so it plugs into existing code with little effort, makes it easy to compare each model's performance metrics, and lets you tune the best model to improve performance further.


Steps

Step 1

Install the lazypredict library with:

pip install lazypredict

Step 2

Import pandas to load our dataset.

import pandas as pd

Step 3

Load the dataset.

# load the Mall Customers segmentation dataset
df = pd.read_csv('Mall_Customers.csv')

Step 4

Print the first few rows of the dataset.
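For example, with standard pandas (nothing lazypredict-specific here):

# inspect the first five rows to identify the feature and target columns
print(df.head())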


Here, the Y variable is the Spending Score column, while the remaining columns are the X variables.

Now that we have identified the X and Y variables, we split them into training and test sets.

# import train_test_split for splitting the dataset
from sklearn.model_selection import train_test_split
# define the X and y variables
X = df.loc[:, df.columns != 'Spending Score (1-100)']
y = df['Spending Score (1-100)']
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Step 5

We import the lazypredict library installed earlier. It contains two classes: one for classification and one for regression.

# import lazypredict
import lazypredict
# import the regression class from lazypredict
from lazypredict.Supervised import LazyRegressor
# import the classification class from lazypredict.Supervised
from lazypredict.Supervised import LazyClassifier

After importing, we will use LazyRegressor, since we are working on a regression problem; if you are working on a classification problem, use LazyClassifier instead. The steps are the same for both types of problem.

# define the models with LazyRegressor
multiple_ML_model = LazyRegressor(verbose=0, ignore_warnings=True, predictions=True)
# fit the models and get each model's predictions
models, predictions = multiple_ML_model.fit(X_train, X_test, y_train, y_test)

Here, predictions=True means you want each model's performance scores as well as each model's predicted values.

The models variable contains each model's accuracy along with some other important information.
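For instance, a quick way to inspect the leaderboard (models is a pandas DataFrame indexed by model name; the metric columns depend on the task, e.g. RMSE for regression runs):

# print the full leaderboard of fitted models
print(models)
# rank by a metric of interest, e.g. RMSE for a regression run
print(models.sort_values('RMSE').head())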


It fitted 42 ML models on my regression problem. Since this guide focuses on how to test many models rather than on improving their accuracy, I am not concerned with each model's individual accuracy here.

Check each model's predictions.
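For example (predictions is returned because we passed predictions=True above; it is a DataFrame with one column per fitted model):

# each column holds one model's predictions on the test set
print(predictions.head())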


You can use these predictions to build a confusion matrix (for classification problems), as sketched after the classification snippet below.

If you are working on a classification problem, this is how to use the lazypredict library:

# define the models with LazyClassifier
multiple_ML_model = LazyClassifier(verbose=0,
          ignore_warnings=True,
          predictions=True)
# fit the models and get each model's predictions
models, predictions = multiple_ML_model.fit(
          X_train, X_test, y_train, y_test)

Examples

Classification

from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=123)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)

| Model                          | Accuracy | Balanced Accuracy | ROC AUC  | F1 Score | Time Taken |
|:-------------------------------|---------:|------------------:|---------:|---------:|-----------:|
| LinearSVC                      | 0.989474 |          0.987544 | 0.987544 | 0.989462 |  0.0150008 |
| SGDClassifier                  | 0.989474 |          0.987544 | 0.987544 | 0.989462 |  0.0109992 |
| MLPClassifier                  | 0.985965 |          0.986904 | 0.986904 | 0.985994 |  0.426     |
| Perceptron                     | 0.985965 |          0.984797 | 0.984797 | 0.985965 |  0.0120046 |
| LogisticRegression             | 0.985965 |          0.98269  | 0.98269  | 0.985934 |  0.0200036 |
| LogisticRegressionCV           | 0.985965 |          0.98269  | 0.98269  | 0.985934 |  0.262997  |
| SVC                            | 0.982456 |          0.979942 | 0.979942 | 0.982437 |  0.0140011 |
| CalibratedClassifierCV         | 0.982456 |          0.975728 | 0.975728 | 0.982357 |  0.0350015 |
| PassiveAggressiveClassifier    | 0.975439 |          0.974448 | 0.974448 | 0.975464 |  0.0130005 |
| LabelPropagation               | 0.975439 |          0.974448 | 0.974448 | 0.975464 |  0.0429988 |
| LabelSpreading                 | 0.975439 |          0.974448 | 0.974448 | 0.975464 |  0.0310006 |
| RandomForestClassifier         | 0.97193  |          0.969594 | 0.969594 | 0.97193  |  0.033     |
| GradientBoostingClassifier     | 0.97193  |          0.967486 | 0.967486 | 0.971869 |  0.166998  |
| QuadraticDiscriminantAnalysis  | 0.964912 |          0.966206 | 0.966206 | 0.965052 |  0.0119994 |
| HistGradientBoostingClassifier | 0.968421 |          0.964739 | 0.964739 | 0.968387 |  0.682003  |
| RidgeClassifierCV              | 0.97193  |          0.963272 | 0.963272 | 0.971736 |  0.0130029 |
| RidgeClassifier                | 0.968421 |          0.960525 | 0.960525 | 0.968242 |  0.0119977 |
| AdaBoostClassifier             | 0.961404 |          0.959245 | 0.959245 | 0.961444 |  0.204998  |
| ExtraTreesClassifier           | 0.961404 |          0.957138 | 0.957138 | 0.961362 |  0.0270066 |
| KNeighborsClassifier           | 0.961404 |          0.95503  | 0.95503  | 0.961276 |  0.0560005 |
| BaggingClassifier              | 0.947368 |          0.954577 | 0.954577 | 0.947882 |  0.0559971 |
| BernoulliNB                    | 0.950877 |          0.951003 | 0.951003 | 0.951072 |  0.0169988 |
| LinearDiscriminantAnalysis     | 0.961404 |          0.950816 | 0.950816 | 0.961089 |  0.0199995 |
| GaussianNB                     | 0.954386 |          0.949536 | 0.949536 | 0.954337 |  0.0139935 |
| NuSVC                          | 0.954386 |          0.943215 | 0.943215 | 0.954014 |  0.019989  |
| DecisionTreeClassifier         | 0.936842 |          0.933693 | 0.933693 | 0.936971 |  0.0170023 |
| NearestCentroid                | 0.947368 |          0.933506 | 0.933506 | 0.946801 |  0.0160074 |
| ExtraTreeClassifier            | 0.922807 |          0.912168 | 0.912168 | 0.922462 |  0.0109999 |
| CheckingClassifier             | 0.361404 |          0.5      | 0.5      | 0.191879 |  0.0170043 |
| DummyClassifier                | 0.512281 |          0.489598 | 0.489598 | 0.518924 |  0.0119965 |

Regression

from lazypredict.Supervised import LazyRegressor
from sklearn import datasets
from sklearn.utils import shuffle
import numpy as np

# note: load_boston was removed in scikit-learn 1.2, so running this
# example as-is requires an older scikit-learn version
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)

offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
print(models)

| Model                         | Adjusted R-Squared | R-Squared |  RMSE | Time Taken |
|:------------------------------|-------------------:|----------:|------:|-----------:|
| SVR                           |               0.83 |      0.88 |  2.62 |       0.01 |
| BaggingRegressor              |               0.83 |      0.88 |  2.63 |       0.03 |
| NuSVR                         |               0.82 |      0.86 |  2.76 |       0.03 |
| RandomForestRegressor         |               0.81 |      0.86 |  2.78 |       0.21 |
| XGBRegressor                  |               0.81 |      0.86 |  2.79 |       0.06 |
| GradientBoostingRegressor     |               0.81 |      0.86 |  2.84 |       0.11 |
| ExtraTreesRegressor           |               0.79 |      0.84 |  2.98 |       0.12 |
| AdaBoostRegressor             |               0.78 |      0.83 |  3.04 |       0.07 |
| HistGradientBoostingRegressor |               0.77 |      0.83 |  3.06 |       0.17 |
| PoissonRegressor              |               0.77 |      0.83 |  3.11 |       0.01 |
| LGBMRegressor                 |               0.77 |      0.83 |  3.11 |       0.07 |
| KNeighborsRegressor           |               0.77 |      0.83 |  3.12 |       0.01 |
| DecisionTreeRegressor         |               0.65 |      0.74 |  3.79 |       0.01 |
| MLPRegressor                  |               0.65 |      0.74 |  3.80 |       1.63 |
| HuberRegressor                |               0.64 |      0.74 |  3.84 |       0.01 |
| GammaRegressor                |               0.64 |      0.73 |  3.88 |       0.01 |
| LinearSVR                     |               0.62 |      0.72 |  3.96 |       0.01 |
| RidgeCV                       |               0.62 |      0.72 |  3.97 |       0.01 |
| BayesianRidge                 |               0.62 |      0.72 |  3.97 |       0.01 |
| Ridge                         |               0.62 |      0.72 |  3.97 |       0.01 |
| TransformedTargetRegressor    |               0.62 |      0.72 |  3.97 |       0.01 |
| LinearRegression              |               0.62 |      0.72 |  3.97 |       0.01 |
| ElasticNetCV                  |               0.62 |      0.72 |  3.98 |       0.04 |
| LassoCV                       |               0.62 |      0.72 |  3.98 |       0.06 |
| LassoLarsIC                   |               0.62 |      0.72 |  3.98 |       0.01 |
| LassoLarsCV                   |               0.62 |      0.72 |  3.98 |       0.02 |
| Lars                          |               0.61 |      0.72 |  3.99 |       0.01 |
| LarsCV                        |               0.61 |      0.71 |  4.02 |       0.04 |
| SGDRegressor                  |               0.60 |      0.70 |  4.07 |       0.01 |
| TweedieRegressor              |               0.59 |      0.70 |  4.12 |       0.01 |
| GeneralizedLinearRegressor    |               0.59 |      0.70 |  4.12 |       0.01 |
| ElasticNet                    |               0.58 |      0.69 |  4.16 |       0.01 |
| Lasso                         |               0.54 |      0.66 |  4.35 |       0.02 |
| RANSACRegressor               |               0.53 |      0.65 |  4.41 |       0.04 |
| OrthogonalMatchingPursuitCV   |               0.45 |      0.59 |  4.78 |       0.02 |
| PassiveAggressiveRegressor    |               0.37 |      0.54 |  5.09 |       0.01 |
| GaussianProcessRegressor      |               0.23 |      0.43 |  5.65 |       0.03 |
| OrthogonalMatchingPursuit     |               0.16 |      0.38 |  5.89 |       0.01 |
| ExtraTreeRegressor            |               0.08 |      0.32 |  6.17 |       0.01 |
| DummyRegressor                |              -0.38 |     -0.02 |  7.56 |       0.01 |
| LassoLars                     |              -0.38 |     -0.02 |  7.56 |       0.01 |
| KernelRidge                   |             -11.50 |     -8.25 | 22.74 |       0.01 |

Summary and Recommendations

The LazyPredict library can train around 70 classification and regression models in a few lines of Python. It is a very handy tool because it gives an overall picture of how models perform on your data and lets you compare them side by side.

Each model is trained with its default parameters, since LazyPredict performs no hyperparameter tuning. Once the best-performing model has been selected, you can tune it to improve performance further.
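As a sketch of that follow-up step, suppose RandomForestRegressor topped the leaderboard; a small scikit-learn grid search over two of its hyperparameters might look like this (the parameter values below are illustrative, not recommendations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# try a few values of two key hyperparameters with 5-fold cross-validation
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_)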

Note that this library is intended for exploratory testing only: it tells you which models are likely to perform well on your dataset. It is advisable to set up a dedicated virtual environment with conda, which gives LazyPredict a separate environment and avoids version conflicts with your other environments.
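For example (the environment name and Python version below are arbitrary choices):

conda create -n lazypredict-env python=3.9 -y
conda activate lazypredict-env
pip install lazypredict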

References:

https://mp.weixin.qq.com/s/5wQerXhb9PcgsiE31okapg
https://mp.weixin.qq.com/s/o92ZJMMJHqKAFf0Sup6_8A

