Comparing several ways to split data into training and test sets: KFold, RepeatedKFold, RepeatedStratifiedKFold, train_test_split

KFold

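In KFold():

  • with shuffle=False, the folds are the same on every run (consecutive blocks of indices, in order);
  • random_state is only meaningful with shuffle=True (recent scikit-learn versions raise a ValueError if random_state is set while shuffle=False); as long as random_state stays the same, the split stays the same, so results can be reproduced.
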
import numpy as np
from sklearn.model_selection import KFold

# Alternative input to try: X = range(10)
X = np.array([1, 3, 5, 7, 9, 21])

# Without a fixed seed the split changes on every run:
# kf = KFold(n_splits=3, random_state=None, shuffle=True)
kf = KFold(n_splits=3, random_state=24, shuffle=True)

for train,test in kf.split(X):
    print('train_index:',train)
    print('test_index:',test)
    print('train:', X[train])
    print('test:',X[test])
    print('\n')

With random_state fixed, repeated runs give the same result:

train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]


train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5  7  9 21]
test: [1 3]


train_index: [0 1 4 5]
test_index: [2 3]
train: [ 1  3  9 21]
test: [5 7]
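
The two KFold points above can be checked directly. The following is a minimal sketch, assuming the same six-element array; the helper name collect_test_folds is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([1, 3, 5, 7, 9, 21])


def collect_test_folds(kf, data):
    """Hypothetical helper: return the test indices of every fold as plain lists."""
    return [test.tolist() for _, test in kf.split(data)]


# shuffle=False (the default): folds are consecutive index blocks, identical on every run.
ordered = collect_test_folds(KFold(n_splits=3, shuffle=False), X)
print(ordered)  # [[0, 1], [2, 3], [4, 5]]

# shuffle=True with the same seed: two independent splitters produce the same folds.
first = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
second = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=24), X)
print(first == second)  # True
```

Whatever the seed, every index lands in exactly one test fold; only the assignment of indices to folds changes.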

RepeatedKFold

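In RepeatedKFold:

  • n_splits controls the number of folds;
  • n_repeats controls how many times the whole K-fold split is repeated (each repetition uses a different shuffle);
  • as long as random_state stays the same, every run of the program gives the same result, so it can be reproduced.
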
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1,3,5,7,9,21])
cv = RepeatedKFold(n_splits=3, n_repeats=2, random_state=42)

for train,test in cv.split(X):
    print('train_index:',train)
    print('test_index:',test)
    print('train:', X[train])
    print('test:',X[test])
    print('\n')

With random_state fixed, repeated runs of the program give the same result (but the splits differ between the n_repeats repetitions):

# n_repeats: 1st repetition
train_index: [2 3 4 5]
test_index: [0 1]
train: [ 5  7  9 21]
test: [1 3]

train_index: [0 1 3 4]
test_index: [2 5]
train: [1 3 7 9]
test: [ 5 21]

train_index: [0 1 2 5]
test_index: [3 4]
train: [ 1  3  5 21]
test: [7 9]

# n_repeats: 2nd repetition
train_index: [1 2 4 5]
test_index: [0 3]
train: [ 3  5  9 21]
test: [1 7]

train_index: [0 3 4 5]
test_index: [1 2]
train: [ 1  7  9 21]
test: [3 5]

train_index: [0 1 2 3]
test_index: [4 5]
train: [1 3 5 7]
test: [ 9 21]
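
As a rough check of how n_splits and n_repeats combine, the sketch below groups the flat sequence of splits by repetition. The grouping by slicing assumes the splits come out repetition by repetition, which matches the output listed above.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([1, 3, 5, 7, 9, 21])
n_splits, n_repeats = 3, 2
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)

splits = list(cv.split(X))
print(len(splits))        # 6, i.e. n_splits * n_repeats
print(cv.get_n_splits())  # 6

# Group the flat list of (train, test) pairs by repetition.
repeats = [splits[r * n_splits:(r + 1) * n_splits] for r in range(n_repeats)]
for r, folds in enumerate(repeats, start=1):
    # Within one repetition, the test folds together cover every index exactly once.
    covered = sorted(int(i) for _, test in folds for i in test)
    print(f"repetition {r}: test indices cover {covered}")
```

Each repetition is a complete K-fold pass over the data, so every sample is tested n_repeats times in total.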

RepeatedStratifiedKFold

RepeatedStratifiedKFold:

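  • RepeatedStratifiedKFold uses stratified sampling, so it needs y and samples each class in proportion to its share of y;
  • X is conventionally a 2D array of shape (n_samples, n_features), but split() only uses it to determine the number of samples, so 2D is not strictly required;
  • everything else works the same as RepeatedKFold.
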
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1,3,5,7,9,11],
              [2,4,6,8,10,12],
              [21,25,23,27,29,22],
              [30,36,34,38,32,35],
              [41,42,43,44,45,46],
              [51,52,53,54,55,56]])
y = [0,0,0,1,1,1]

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)

for train,test in cv.split(X,y):
    print('train_index:',train)
    print('test_index:',test)
    print('train: \n', X[train])
    print('test: \n',X[test])
    print('\n')

Output:

# n_repeats: 1st repetition
train_index: [1 2 3 4]
test_index: [0 5]
train: 
 [[ 2  4  6  8 10 12]
 [21 25 23 27 29 22]
 [30 36 34 38 32 35]
 [41 42 43 44 45 46]]
test: 
 [[ 1  3  5  7  9 11]
 [51 52 53 54 55 56]]


train_index: [0 2 4 5]
test_index: [1 3]
train: 
 [[ 1  3  5  7  9 11]
 [21 25 23 27 29 22]
 [41 42 43 44 45 46]
 [51 52 53 54 55 56]]
test: 
 [[ 2  4  6  8 10 12]
 [30 36 34 38 32 35]]


train_index: [0 1 3 5]
test_index: [2 4]
train: 
 [[ 1  3  5  7  9 11]
 [ 2  4  6  8 10 12]
 [30 36 34 38 32 35]
 [51 52 53 54 55 56]]
test: 
 [[21 25 23 27 29 22]
 [41 42 43 44 45 46]]

# n_repeats: 2nd repetition
train_index: [1 2 3 4]
test_index: [0 5]
train: 
 [[ 2  4  6  8 10 12]
 [21 25 23 27 29 22]
 [30 36 34 38 32 35]
 [41 42 43 44 45 46]]
test: 
 [[ 1  3  5  7  9 11]
 [51 52 53 54 55 56]]

train_index: [0 2 4 5]
test_index: [1 3]
train: 
 [[ 1  3  5  7  9 11]
 [21 25 23 27 29 22]
 [41 42 43 44 45 46]
 [51 52 53 54 55 56]]
test: 
 [[ 2  4  6  8 10 12]
 [30 36 34 38 32 35]]

train_index: [0 1 3 5]
test_index: [2 4]
train: 
 [[ 1  3  5  7  9 11]
 [ 2  4  6  8 10 12]
 [30 36 34 38 32 35]
 [51 52 53 54 55 56]]
test: 
 [[21 25 23 27 29 22]
 [41 42 43 44 45 46]]
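
To make the stratification visible, the sketch below prints the labels that end up in each test fold. It assumes the same y as above and passes a 1D array of zeros as a stand-in for X, since only y drives the stratified split; the name X_placeholder is chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0, 0, 0, 1, 1, 1])
# Placeholder for X: only its length matters when generating the splits.
X_placeholder = np.zeros(len(y))

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=42)

test_labels = []
for train, test in cv.split(X_placeholder, y):
    labels = sorted(y[test].tolist())
    test_labels.append(labels)
    print('test_index:', test, 'test labels:', labels)
```

With three samples per class and three folds, every test fold holds one sample of class 0 and one of class 1, which preserves the 50/50 class ratio of y.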

train_test_split

train_test_split:

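  • train_test_split splits the data once according to the given ratio (here test_size=0.2) and produces no folds;
  • KFold splits the data into several folds, and each fold serves as the test set once, so KFold is a little more involved than train_test_split;
  • KFold returns indices, which you then use to look up the actual values; train_test_split returns the values directly.
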
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([1,3,5,7,9,11])
y = [21,25,23,27,29,22]

X_train,  X_test,  y_train,  y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train:', X_train)
print('X_test:', X_test)
print('y_train:', y_train)
print('y_test:', y_test)

Output:

# The actual values are returned directly, not indices
X_train: [11  5  9  7]
X_test: [1 3]
y_train: [22, 23, 29, 27]
y_test: [21, 25]
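
If the row indices are needed as well, one option is to pass an index array through train_test_split along with the data. This is a sketch of that idea rather than the only way to do it; the names indices, idx_train and idx_test are chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([1, 3, 5, 7, 9, 11])
y = np.array([21, 25, 23, 27, 29, 22])
indices = np.arange(len(X))  # extra array, split consistently with X and y

X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=42
)

print('idx_test:', idx_test)
# Looking up the original array by the returned indices reproduces the returned values.
print(np.array_equal(X[idx_test], X_test))  # True
print(len(X_test))  # 2, because 0.2 * 6 = 1.2 is rounded up to 2 test samples
```

All arrays passed in one call are split with the same shuffle, so idx_test always lines up with X_test and y_test.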
posted @ 2023-05-15 16:29  温小皮