Comparing several ways to split training and test sets: KFold, RepeatedKFold, RepeatedStratifiedKFold, train_test_split

KFold

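In KFold():

When shuffle=False, the folds are the same on every run (consecutive blocks in the original sample order);
random_state only has an effect when shuffle=True (recent scikit-learn versions raise a ValueError if random_state is set while shuffle=False); as long as random_state stays the same, the split is identical on every run, so results can be reproduced.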

import numpy as np
from sklearn.model_selection import KFold

# X = range(10)
X = np.array([1,3,5,7,9,21])

# kf = KFold(n_splits=3, random_state=None, shuffle=True)
kf = KFold(n_splits=3, random_state=24, shuffle=True)

for train,test in kf.split(X):
    print('train_index:',train)
    print('test_index:',test)
    print('train:', X[train])
    print('test:',X[test])
    print('\n')

With random_state fixed, repeated runs give the same output:

train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]


train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5  7  9 21]
test: [1 3]


train_index: [0 1 4 5]
test_index: [2 3]
train: [ 1  3  9 21]
test: [5 7]
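
As a quick check of the two points above, here is a minimal sketch that compares the folds directly instead of reading them off the printed output. The helper name collect_test_folds is made up for illustration; everything else is the same KFold API used in the snippet above.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([1, 3, 5, 7, 9, 21])

def collect_test_folds(cv, data):
    """Return the test indices of every fold as plain Python lists."""
    return [test.tolist() for _, test in cv.split(data)]

# Without shuffling, the folds are consecutive blocks in the original order.
unshuffled = collect_test_folds(KFold(n_splits=3, shuffle=False), X)
print('shuffle=False:', unshuffled)

# Two independent splitters with the same seed produce identical folds.
first = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
second = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
print('same seed, same folds:', first == second)
```

Whatever the seed, the test folds of one KFold pass never overlap, and together they cover every sample exactly once.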

RepeatedKFold

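In RepeatedKFold:

n_splits sets the number of folds;
n_repeats sets how many times the K-fold split is repeated (each repetition uses a different randomization);
when random_state stays the same, every run of the program gives the same result, so experiments are reproducible.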

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1,3,5,7,9,21])
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)

for train,test in cv.split(X):
    print('train_index:',train)
    print('test_index:',test)
    print('train:', X[train])
    print('test:',X[test])
    print('\n')

With random_state fixed, repeated runs of the program give the same output (though the n_repeats repetitions within one run differ from each other):

# n_repeats: repetition 1
train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5  7  9 21]
test: [1 3]

train_index: [0 1 3 4]
test_index: [2 5]
train: [1 3 7 9]
test: [ 5 21]

train_index: [0 1 2 5]
test_index: [3 4]
train: [ 1  3  5 21]
test: [7 9]

# n_repeats: repetition 2
train_index: [1 2 4 5]
test_index: [0 3]
train: [ 3  5  9 21]
test: [1 7]

train_index: [0 3 4 5]
test_index: [1 2]
train: [ 1  7  9 21]
test: [3 5]

train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]
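
The grouping into repetitions shown above can also be checked in code. The sketch below assumes the splits come out repetition by repetition (which is how the output above is laid out) and slices them into groups of n_splits. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1, 3, 5, 7, 9, 21])
n_splits, n_repeats = 3, 2
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)

splits = list(cv.split(X))
total = cv.get_n_splits(X)  # n_splits * n_repeats
print('total splits:', total)

# Group the test folds by repetition: the first n_splits belong to repetition 1, and so on.
repetitions = [
    [test.tolist() for _, test in splits[r * n_splits:(r + 1) * n_splits]]
    for r in range(n_repeats)
]
for r, folds in enumerate(repetitions, start=1):
    covered = sorted(i for fold in folds for i in fold)
    print(f'repetition {r}: test folds {folds}, covered indices {covered}')
```

Each repetition is a complete K-fold pass on its own, so every sample appears in a test fold exactly once per repetition.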

RepeatedStratifiedKFold

RepeatedStratifiedKFold:

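RepeatedStratifiedKFold performs stratified sampling, so split() needs y, and each fold keeps roughly the same class proportions as y;
X is documented as an array of shape (n_samples, n_features), as in the example here; the folds are computed from y, so X mainly has to contain the same number of samples as y;
everything else works the same as RepeatedKFold.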

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1,3,5,7,9,11],
              [2,4,6,8,10,12],
              [21,25,23,27,29,22],
              [30,36,34,38,32,35],
              [41,42,43,44,45,46],
              [51,52,53,54,55,56]])
y = [0,0,0,1,1,1]

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)

for train,test in cv.split(X,y):
    print('train_index:',train)
    print('test_index:',test)
    print('train: \n', X[train])
    print('test: \n',X[test])
    print('\n')

Output:

# n_repeats: repetition 1
train_index: [1 2 3 4]
test_index: [0 5]
train: 
 [[ 2  4  6  8 10 12]
 [21 25 23 27 29 22]
 [30 36 34 38 32 35]
 [41 42 43 44 45 46]]
test: 
 [[ 1  3  5  7  9 11]
 [51 52 53 54 55 56]]


train_index: [0 2 4 5]
test_index: [1 3]
train: 
 [[ 1  3  5  7  9 11]
 [21 25 23 27 29 22]
 [41 42 43 44 45 46]
 [51 52 53 54 55 56]]
test: 
 [[ 2  4  6  8 10 12]
 [30 36 34 38 32 35]]


train_index: [0 1 3 5]
test_index: [2 4]
train: 
 [[ 1  3  5  7  9 11]
 [ 2  4  6  8 10 12]
 [30 36 34 38 32 35]
 [51 52 53 54 55 56]]
test: 
 [[21 25 23 27 29 22]
 [41 42 43 44 45 46]]

# n_repeats: repetition 2
train_index: [1 2 3 4]
test_index: [0 5]
train: 
 [[ 2  4  6  8 10 12]
 [21 25 23 27 29 22]
 [30 36 34 38 32 35]
 [41 42 43 44 45 46]]
test: 
 [[ 1  3  5  7  9 11]
 [51 52 53 54 55 56]]

train_index: [0 2 4 5]
test_index: [1 3]
train: 
 [[ 1  3  5  7  9 11]
 [21 25 23 27 29 22]
 [41 42 43 44 45 46]
 [51 52 53 54 55 56]]
test: 
 [[ 2  4  6  8 10 12]
 [30 36 34 38 32 35]]

train_index: [0 1 3 5]
test_index: [2 4]
train: 
 [[ 1  3  5  7  9 11]
 [ 2  4  6  8 10 12]
 [30 36 34 38 32 35]
 [51 52 53 54 55 56]]
test: 
 [[21 25 23 27 29 22]
 [41 42 43 44 45 46]]
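
To make the "proportional sampling" point more visible, here is a small sketch with imbalanced labels. The label vector (six samples of class 0, three of class 1) and the all-zero placeholder X are assumptions chosen for illustration, not data from the example above.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical imbalanced labels: class 0 is twice as frequent as class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
# Placeholder features; only the number of rows has to match y.
X = np.zeros((len(y), 2))

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)

test_class_counts = []
for train, test in cv.split(X, y):
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(y[test], minlength=2).tolist()
    test_class_counts.append(counts)
    print('test_index:', test, 'class counts [class 0, class 1]:', counts)
```

Every test fold should contain two samples of class 0 and one of class 1, which is the same 2:1 ratio as the full label vector.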

train_test_split

train_test_split:

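This splits the data once according to the test_size ratio (test_size=0.2 here), with no folds;
KFold, by contrast, splits the data into several folds and uses each fold as the test set once, so KFold is a little more involved than train_test_split.
KFold returns indices, which you then use to look up the actual values; train_test_split returns the actual values directly.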

import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([1,3,5,7,9,11])
y = [21,25,23,27,29,22]

X_train,  X_test,  y_train,  y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train:', X_train)
print('X_test:', X_test)
print('y_train:', y_train)
print('y_test:', y_test)

Output:

# The actual values are returned directly, not indices
X_train: [11  5  9  7]
X_test: [1 3]
y_train: [22, 23, 29, 27]
y_test: [21, 25]
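
The difference in return types can be put side by side in a short sketch. Note that with 6 samples and test_size=0.2 the test set is rounded up to 2 samples, which matches the output above. Splitting an index array alongside the data is just one possible way to recover positions from train_test_split; the names idx, idx_train and idx_test are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.array([1, 3, 5, 7, 9, 11])
y = np.array([21, 25, 23, 27, 29, 22])

# train_test_split hands back the values themselves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print('X_test values:', X_test)

# KFold yields index arrays; index into X to get the values.
kf = KFold(n_splits=3, shuffle=True, random_state=42)
train_idx, test_idx = next(kf.split(X))
print('first fold test indices:', test_idx, '-> values:', X[test_idx])

# If positions are needed from train_test_split, split an index array
# with the same test_size and random_state.
idx = np.arange(len(X))
idx_train, idx_test = train_test_split(idx, test_size=0.2, random_state=42)
print('test positions:', idx_test, '-> values:', X[idx_test])
```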

posted @ 2023-05-15 16:29 温小皮
