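Machine Learning Modeling: Feature Selection First, or Split the Dataset First?
Short answer: split the dataset first, then do feature selection. This avoids data leakage. If the selector is fit on the full dataset, the labels of the future test samples influence which features are kept, so the test score comes out optimistically biased.
The test set should be treated as unseen data and used only once, at the very end. Every step that is fit to data, feature selection included, should follow this principle.
Code example: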
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#=================== Wrong approach: feature selection first, then split ===================
# #--- The wrong approach appears to give good results ---
# # random data:
# X = np.random.randn(500, 10000)
# y = np.random.choice(2, size=500)
# selector = SelectKBest(k=25)
# # first select features
# X_selected = selector.fit_transform(X,y)
# # then split
# X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
# # fit a simple logistic regression
# lr = LogisticRegression()
# lr.fit(X_selected_train,y_train)
# # predict on the test set and get the test accuracy:
# y_pred = lr.predict(X_selected_test)
# acc = accuracy_score(y_test, y_pred)
# print(acc)
# # Results over several runs: 0.712, 0.688, 0.792, 0.776, 0.648
# #--- Check the wrong approach on fresh data: it generalizes poorly ---
# X_new = np.random.randn(500, 10000)
# y_new = np.random.choice(2, size=500)
# # select the same features in the new data
# X_new_selected = selector.transform(X_new)
# # predict and get the accuracy:
# y_new_pred = lr.predict(X_new_selected)
# acc_new = accuracy_score(y_new, y_new_pred)
# print(acc_new)
# # Results over several runs: 0.498, 0.504, 0.492, 0.538
#============= Correct approach: split first, then select features =============
# random data:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)
# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# then select features using the training set only
selector = SelectKBest(k=25)
X_train_selected = selector.fit_transform(X_train,y_train)
# fit again a simple logistic regression
lr = LogisticRegression()
lr.fit(X_train_selected,y_train)
# select the same features on the test set, predict, and get the test accuracy:
X_test_selected = selector.transform(X_test)
y_pred = lr.predict(X_test_selected)
acc = accuracy_score(y_test, y_pred)
print(acc)
# Results over several runs: 0.48, 0.472, 0.52 (about chance level, as expected for random labels)
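
The same rule applies when the score comes from cross-validation rather than a single split: the selector has to be refit on the training part of every fold. The following is a minimal sketch of one way to do that, wrapping the selector and the model in scikit-learn's `make_pipeline` and scoring with `cross_val_score`. The function name `compare_selection_strategies`, the smaller data size, and the seed are illustrative choices and are not part of the original post.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_selection_strategies(seed=0, n_samples=200, n_features=2000, k=25):
    """Return (leaky_mean_acc, honest_mean_acc) on pure-noise data.

    Hypothetical helper for illustration; sizes are kept small so it runs fast.
    """
    rng = np.random.default_rng(seed)
    # Labels are independent of the features, so the true accuracy is 50%.
    X = rng.standard_normal((n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)

    # Leaky: the selector is fit on all rows BEFORE cross-validation,
    # so every validation fold has already influenced the chosen features.
    X_leaky = SelectKBest(k=k).fit_transform(X, y)
    leaky = cross_val_score(
        LogisticRegression(max_iter=1000), X_leaky, y, cv=5
    ).mean()

    # Honest: the selector lives inside the pipeline, so cross_val_score
    # refits it on the training part of each fold only.
    pipe = make_pipeline(SelectKBest(k=k), LogisticRegression(max_iter=1000))
    honest = cross_val_score(pipe, X, y, cv=5).mean()
    return leaky, honest


if __name__ == "__main__":
    leaky_acc, honest_acc = compare_selection_strategies()
    print("selection before CV (leaky): ", leaky_acc)
    print("selection inside pipeline:   ", honest_acc)
```

On noise data like this, the leaky score should look clearly better than chance while the pipeline score should stay near 0.5, mirroring the two experiments above. The exact numbers depend on the seed and the data size.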