Simple Classification with AutoML

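I recently ran into a tricky problem at work: given fixed-length sequences of numbers, classify them. I started with a deep learning network, but the results weren't great, so I switched to classical machine learning to see how far some prior knowledge would get me. Classical machine learning turned out to be a headache to learn, and the hyperparameter tuning was discouraging, so I decided to try automated machine learning (AutoML) instead.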

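There are plenty of AutoML libraries. I went with AutoGluon, which was developed by Mu Li (李沐). It can handle pretty much any structured (tabular) data, so you can simply follow the recipe.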

Straight to the code:

import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

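# Read a .csv table directly and do some rough processing to get a training set and a test set.
# `i` holds the path of the .csv file (defined elsewhere in my script).
data_df = pd.read_csv(i)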
print(data_df['label'].value_counts())
train_df = data_df.sample(frac=0.8, axis=0, random_state=2022)
test_df = data_df[~data_df.index.isin(train_df.index)]

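# Below is the AutoML part; the learning curve for this code is very low.
# Convert the training data into the dataset type the AutoML framework expects
train_data = TabularDataset(train_df)
# Tell the predictor which column to predict and which metric it should optimize
predictor = TabularPredictor(label='label', eval_metric='f1')
# Now train. The arguments to fit() are:
# 1st: the training data; I drop a few columns that I don't want used in training
# 2nd: presets, the training mode; I picked the best-quality one, which is also the slowest
# 3rd: auto_stack, automatic stacking; I haven't figured this one out yet
# 4th: time_limit, in seconds; without it, training really does just sit there without moving
# 5th: verbosity, how much log output to print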
predictor.fit(train_data.drop(columns=['sdate','2']),
              presets='best_quality',
              auto_stack=True,
              time_limit=7200,
              verbosity=2)

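# Pull out the test-set labels first
y_test = test_df['label']
# Make predictions
y_pred = predictor.predict(test_df.drop(columns=['1', '2', 'label']))
# From here you can compute metrics on the test set, etc.
# ....

# This shows how each of the trained models performs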
predictor.leaderboard(test_df, silent=True)
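
For the "compute metrics on the test set" step that the script leaves as `....`, here is a minimal sketch of what the F1 score (the `eval_metric` chosen above) actually measures. It is plain Python so it runs on its own. The function name `binary_classification_report` and the toy label lists are hypothetical and not part of the original script; in practice `y_test` and `y_pred` from the script could be passed in instead, since both are iterable. As far as I know, `predictor.evaluate(test_df)` or scikit-learn's `f1_score` would report the same kind of numbers without any hand-written code.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Return precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels standing in for y_test and y_pred (hypothetical data)
y_test_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

report = binary_classification_report(y_test_demo, y_pred_demo)
print(report)
```

With imbalanced labels (which is why the script prints `value_counts()` up front), looking at precision and recall separately tends to be more informative than accuracy alone.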

The training output looks like this:

No path specified. Models will be saved in: "AutogluonModels/ag-20221101_075738\"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 7200s
AutoGluon will save models to "AutogluonModels/ag-20221101_075738\"
AutoGluon Version:  0.5.2
Python Version:     3.9.13
Operating System:   Windows
Train Data Rows:    527506
Train Data Columns: 21
Label Column: alarm+1
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1855.73 MB
	Train Data (Original)  Memory Usage: 88.62 MB (4.8% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...