Comparing several ways to split data into training and test sets: KFold, RepeatedKFold, RepeatedStratifiedKFold, train_test_split
KFold
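In KFold():
- With shuffle=False, the folds are identical on every run;
- random_state only takes effect when shuffle=True (recent scikit-learn versions raise an error if random_state is set while shuffle=False); as long as random_state stays the same, the split stays the same, so results can be reproduced.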
import numpy as np
from sklearn.model_selection import KFold

# X = range(10)
X = np.array([1, 3, 5, 7, 9, 21])
# kf = KFold(n_splits=3, random_state=None, shuffle=True)
kf = KFold(n_splits=3, random_state=24, shuffle=True)
# split() yields index arrays; use them to look up the actual values
for train, test in kf.split(X):
    print('train_index:', train)
    print('test_index:', test)
    print('train:', X[train])
    print('test:', X[test])
    print('\n')
With random_state fixed, repeated runs give the same result:
train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]
train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5 7 9 21]
test: [1 3]
train_index: [0 1 4 5]
test_index: [2 3]
train: [ 1 3 9 21]
test: [5 7]
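The two bullet points about shuffle and random_state can also be checked programmatically rather than by eye. The following is a minimal sketch; the helper name collect_test_folds is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([1, 3, 5, 7, 9, 21])

def collect_test_folds(kf, data):
    """Hypothetical helper: gather the test indices of every fold as lists."""
    return [test.tolist() for _, test in kf.split(data)]

# shuffle=False: folds are contiguous blocks in the original order.
plain = collect_test_folds(KFold(n_splits=3, shuffle=False), X)
print(plain)  # [[0, 1], [2, 3], [4, 5]]

# shuffle=True with the same seed: two independent splitters agree.
first = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
second = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
print(first == second)  # True
```

Comparing the fold lists directly is usually more reliable than comparing printed output, especially once the data set is larger than a handful of samples.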
RepeatedKFold
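In RepeatedKFold:
- n_splits controls the number of folds;
- n_repeats controls how many times the whole K-fold split is repeated (each repeat uses a different shuffle);
- as long as random_state stays the same, every run of the program gives the same result, so it can be reproduced and verified.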
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1, 3, 5, 7, 9, 21])
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)
# yields n_splits * n_repeats (train, test) index pairs in total
for train, test in cv.split(X):
    print('train_index:', train)
    print('test_index:', test)
    print('train:', X[train])
    print('test:', X[test])
    print('\n')
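With random_state fixed, repeated runs of the program give the same result (but the individual repeats controlled by n_repeats differ from one another):
# repeat 1 of n_repeats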
train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5 7 9 21]
test: [1 3]
train_index: [0 1 3 4]
test_index: [2 5]
train: [1 3 7 9]
test: [ 5 21]
train_index: [0 1 2 5]
test_index: [3 4]
train: [ 1 3 5 21]
test: [7 9]
# repeat 2 of n_repeats
train_index: [1 2 4 5]
test_index: [0 3]
train: [ 3 5 9 21]
test: [1 7]
train_index: [0 3 4 5]
test_index: [1 2]
train: [ 1 7 9 21]
test: [3 5]
train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]
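To make the relationship between n_splits and n_repeats concrete, here is a small sketch that counts the splits and groups them by repeat. It assumes the same toy array as above; the variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1, 3, 5, 7, 9, 21])
n_splits, n_repeats = 3, 2
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)

# The total number of (train, test) pairs is n_splits * n_repeats.
print(cv.get_n_splits(X))  # 6

# Folds come out repeat by repeat; within one repeat the test folds
# together cover every sample index exactly once.
test_folds = [test for _, test in cv.split(X)]
coverage = []
for r in range(n_repeats):
    chunk = test_folds[r * n_splits:(r + 1) * n_splits]
    covered = np.sort(np.concatenate(chunk)).tolist()
    coverage.append(covered)
    print(f"repeat {r + 1}:", [t.tolist() for t in chunk], "covers", covered)
```

Each repeat is therefore a complete K-fold pass over the data, just with a different shuffle.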
RepeatedStratifiedKFold
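RepeatedStratifiedKFold:
- RepeatedStratifiedKFold uses stratified sampling, so y must be passed to split() and each fold keeps roughly the same class proportions as y;
- X is documented as an array of shape (n_samples, n_features), so a 2D array is used here, although only y and the number of samples determine the splits;
- everything else works the same as RepeatedKFold.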
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.array([[1, 3, 5, 7, 9, 11],
              [2, 4, 6, 8, 10, 12],
              [21, 25, 23, 27, 29, 22],
              [30, 36, 34, 38, 32, 35],
              [41, 42, 43, 44, 45, 46],
              [51, 52, 53, 54, 55, 56]])
y = [0, 0, 0, 1, 1, 1]
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)
# y is required here so that the folds can be stratified by class
for train, test in cv.split(X, y):
    print('train_index:', train)
    print('test_index:', test)
    print('train: \n', X[train])
    print('test: \n', X[test])
    print('\n')
Output:
# repeat 1 of n_repeats
train_index: [1 2 3 4]
test_index: [0 5]
train:
[[ 2 4 6 8 10 12]
[21 25 23 27 29 22]
[30 36 34 38 32 35]
[41 42 43 44 45 46]]
test:
[[ 1 3 5 7 9 11]
[51 52 53 54 55 56]]
train_index: [0 2 4 5]
test_index: [1 3]
train:
[[ 1 3 5 7 9 11]
[21 25 23 27 29 22]
[41 42 43 44 45 46]
[51 52 53 54 55 56]]
test:
[[ 2 4 6 8 10 12]
[30 36 34 38 32 35]]
train_index: [0 1 3 5]
test_index: [2 4]
train:
[[ 1 3 5 7 9 11]
[ 2 4 6 8 10 12]
[30 36 34 38 32 35]
[51 52 53 54 55 56]]
test:
[[21 25 23 27 29 22]
[41 42 43 44 45 46]]
# repeat 2 of n_repeats
train_index: [1 2 3 4]
test_index: [0 5]
train:
[[ 2 4 6 8 10 12]
[21 25 23 27 29 22]
[30 36 34 38 32 35]
[41 42 43 44 45 46]]
test:
[[ 1 3 5 7 9 11]
[51 52 53 54 55 56]]
train_index: [0 2 4 5]
test_index: [1 3]
train:
[[ 1 3 5 7 9 11]
[21 25 23 27 29 22]
[41 42 43 44 45 46]
[51 52 53 54 55 56]]
test:
[[ 2 4 6 8 10 12]
[30 36 34 38 32 35]]
train_index: [0 1 3 5]
test_index: [2 4]
train:
[[ 1 3 5 7 9 11]
[ 2 4 6 8 10 12]
[30 36 34 38 32 35]
[51 52 53 54 55 56]]
test:
[[21 25 23 27 29 22]
[41 42 43 44 45 46]]
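The stratification itself is easier to see by counting classes in each test fold than by reading the raw rows. A minimal sketch, assuming the same labels as above and a placeholder X of zeros (the feature values play no role in how the folds are chosen):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.zeros((6, 1))  # placeholder features, assumed for illustration only
y = np.array([0, 0, 0, 1, 1, 1])
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)

fold_counts = []
for train, test in cv.split(X, y):
    # Count how many samples of each class land in this test fold.
    counts = np.bincount(y[test], minlength=2)
    fold_counts.append(counts.tolist())
    print("test_index:", test, "class counts:", counts)
```

With three samples per class and three folds, every test fold should contain exactly one sample of each class, which mirrors the 1:1 class ratio in y.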
train_test_split
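train_test_split:
- It splits the data once according to a ratio such as test_size=0.2, with no folds;
- KFold, by contrast, produces several folds and every fold serves as the test set once, so KFold is slightly more involved than train_test_split.
- KFold returns indices, which you then use to look up the actual values; train_test_split returns the actual values directly.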
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array([1,3,5,7,9,11])
y = [21,25,23,27,29,22]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train:', X_train)
print('X_test:', X_test)
print('y_train:', y_train)
print('y_test:', y_test)
Output:
# The actual values are returned directly, not indices
X_train: [11 5 9 7]
X_test: [1 3]
y_train: [22, 23, 29, 27]
y_test: [21, 25]
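If indices are wanted from train_test_split as well, for example to line the result up with the index-based output of KFold, one option is to split an array of positions instead of the data. This is a sketch of that idea rather than a dedicated scikit-learn feature; the names indices, idx_train and idx_test are chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([1, 3, 5, 7, 9, 11])
y = np.array([21, 25, 23, 27, 29, 22])

# Split the positions 0..n-1 instead of the data itself.
indices = np.arange(len(X))
idx_train, idx_test = train_test_split(indices, test_size=0.2, random_state=42)

# The same seed and the same number of samples give the same shuffle,
# so indexing with idx_train / idx_test reproduces a direct split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("idx_train:", idx_train, "idx_test:", idx_test)
print(np.array_equal(X[idx_train], X_train))  # True
print(np.array_equal(y[idx_test], y_test))    # True

# test_size=0.2 of 6 samples is 1.2, which is rounded up to 2 test samples.
print(len(idx_test), len(idx_train))  # 2 4
```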