13. Dataset Loading

① keras.datasets — load the built-in benchmark datasets

② tf.data.Dataset.from_tensor_slices — load the data into memory and convert it to tensors

  • shuffle: randomize the sample order
  • map: apply a preprocessing function to every sample
  • batch: group samples into batches
  • repeat: iterate over the data repeatedly

③ Pipeline — multi-threaded loading of large datasets (a combined sketch appears at the end of this section)


 (1) keras.datasets: loading the built-in datasets

  ① boston_housing

  Boston housing price regression dataset.

  ② mnist / fashion_mnist

  MNIST / Fashion-MNIST: handwritten-digit and clothing-image classification datasets.

  ③ cifar10/100

  Small color-image classification datasets (10 and 100 classes).

  ④ imdb

  IMDB sentiment classification dataset for NLP (see the loading sketch below).
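
  All of these load with a single load_data() call. A minimal sketch of the two that are not demonstrated below (the shapes in the comments and the num_words cap are illustrative assumptions, not from the original post):

from tensorflow.keras import datasets

# Regression data: 13 numeric features per house, a scalar price as the target.
(x_train, y_train), (x_test, y_test) = datasets.boston_housing.load_data()
print(x_train.shape, y_train.shape)  # (404, 13) (404,)

# IMDB reviews encoded as lists of word indices; keep only the 10k most frequent words.
(xi_train, yi_train), (xi_test, yi_test) = datasets.imdb.load_data(num_words=10000)
print(len(xi_train), yi_train[:5])   # 25000 reviews with binary sentiment labels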

 

(2) MNIST handwritten digit dataset

  

 

  Each digit image is 28x28 pixels with a single channel; the 70k images are split into a 60k training set and a 10k test set.

import tensorflow as tf
from tensorflow.keras import datasets

(x, y), (x_test, y_test) = datasets.mnist.load_data()
print(x[1])       # pixel values of one 28x28 image
print(x.shape)
print(y[:20])

y_onehot = tf.one_hot(y, depth=10)  # one-hot encode the integer labels
print(y_onehot[:10])

Output:

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  51 159 253
  159  50   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0  48 238 252 252
  252 237   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0  54 227 253 252 239
  233 252  57   6   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0  10  60 224 252 253 252 202
   84 252 253 122   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0 163 252 252 252 253 252 252
   96 189 253 167   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  51 238 253 253 190 114 253 228
   47  79 255 168   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0  48 238 252 252 179  12  75 121  21
    0   0 253 243  50   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0  38 165 253 233 208  84   0   0   0   0
    0   0 253 252 165   0   0   0   0   0]
 [  0   0   0   0   0   0   0   7 178 252 240  71  19  28   0   0   0   0
    0   0 253 252 195   0   0   0   0   0]
 [  0   0   0   0   0   0   0  57 252 252  63   0   0   0   0   0   0   0
    0   0 253 252 195   0   0   0   0   0]
 [  0   0   0   0   0   0   0 198 253 190   0   0   0   0   0   0   0   0
    0   0 255 253 196   0   0   0   0   0]
 [  0   0   0   0   0   0  76 246 252 112   0   0   0   0   0   0   0   0
    0   0 253 252 148   0   0   0   0   0]
 [  0   0   0   0   0   0  85 252 230  25   0   0   0   0   0   0   0   0
    7 135 253 186  12   0   0   0   0   0]
 [  0   0   0   0   0   0  85 252 223   0   0   0   0   0   0   0   0   7
  131 252 225  71   0   0   0   0   0   0]
 [  0   0   0   0   0   0  85 252 145   0   0   0   0   0   0   0  48 165
  252 173   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  86 253 225   0   0   0   0   0   0 114 238 253
  162   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  85 252 249 146  48  29  85 178 225 253 223 167
   56   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  85 252 252 252 229 215 252 252 252 196 130   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  28 199 252 252 253 252 252 233 145   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0  25 128 252 253 252 141  37   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]]
(60000, 28, 28)
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]

tf.Tensor(
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]], shape=(10, 10), dtype=float32)
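
As a quick sanity check (not part of the original snippet), tf.argmax reverses the one-hot encoding and should reproduce the integer labels printed above:

labels_back = tf.argmax(y_onehot, axis=1)  # index of the 1 in each row
print(labels_back[:20])                    # matches y[:20]: [5 0 4 1 9 2 1 3 1 4 ...]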

(3)CIFAR10/100

 

Each image is 32x32 pixels with 3 channels; the 60k images are split into a 50k training set and a 10k test set (matching the (50000, 32, 32, 3) shape below).

 

(x, y), (x_test, y_test) = datasets.cifar10.load_data()
print(x.shape)   # (50000, 32, 32, 3)
print(y[:20])    # labels come back with shape (N, 1), not (N,)

y_onehot = tf.one_hot(y, depth=10)  # shape (N, 1, 10) because of the extra label dimension
print(y_onehot[:10])

Output:

(50000, 32, 32, 3)
[[6]
 [9]
 [9]
 [4]
 [1]
 [1]
 [2]
 [7]
 [8]
 [3]
 [4]
 [7]
 [7]
 [2]
 [9]
 [9]
 [9]
 [3]
 [2]
 [6]]

tf.Tensor(
[[[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

 [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]

 [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]], shape=(10, 1, 10), dtype=float32)

(4) tf.data.Dataset

  ① tf.data.Dataset.from_tensor_slices()  

  Converts NumPy arrays (or a tuple of them) into a Dataset whose elements are tensors.

(x, y), (x_test, y_test) = datasets.cifar10.load_data()
db = tf.data.Dataset.from_tensor_slices(x_test)  # build a Dataset from the NumPy array
print(db)  # a Dataset object, not the data itself

db_ = next(iter(db))  # iter(db) gives an iterator; next() yields one image
print(db_.shape)      # (32, 32, 3)

  ② .shuffle 

  Deep networks are good at memorizing their input; if samples are always fed in the same fixed order, the network can exploit that ordering as a shortcut and generalize poorly, so the dataset should be shuffled.

db = tf.data.Dataset.from_tensor_slices(x_test)
db = db.shuffle(10000)  # buffer_size = 10000
print(db)               # <ShuffleDataset shapes: (32, 32, 3), types: tf.uint8>
print(next(iter(db)))   # one randomly drawn image
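
  Note that shuffling images and labels as two separate datasets would break their pairing. A hedged sketch of the usual approach: slice them together and shuffle the (image, label) pairs, with a buffer large enough to cover the split:

db = tf.data.Dataset.from_tensor_slices((x_test, y_test))  # paired (image, label) elements
db = db.shuffle(10000)  # buffer covers all 10000 test samples, so the shuffle is complete
x_sample, y_sample = next(iter(db))
print(x_sample.shape, y_sample.numpy())  # (32, 32, 3) together with its matching label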

  ③ .map

  Applies the given preprocessing function to every element of the dataset.

def preProcess(x, y):
    x = tf.cast(x, dtype=tf.float32) / 255.  # scale pixels to [0, 1]
    y = tf.cast(y, dtype=tf.int32)
    y = tf.one_hot(y, depth=10)              # one-hot encode the label
    return x, y

(x, y), (x_test, y_test) = datasets.cifar10.load_data()
db = tf.data.Dataset.from_tensor_slices((x_test, y_test))
db2 = db.map(preProcess)   # apply preProcess to every (image, label) pair
print(next(iter(db2)))
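
  Printing the full tensors is noisy; a smaller check (an illustrative addition, not in the original) is to look only at the shapes and dtypes that preProcess produces:

sample_x, sample_y = next(iter(db2))
print(sample_x.shape, sample_x.dtype)  # (32, 32, 3) tf.float32  -- pixels scaled to [0, 1]
print(sample_y.shape, sample_y.dtype)  # (1, 10) tf.float32      -- one-hot label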

  ④ .batch

db3 = db2.batch(32)   # 32 samples per batch
res = next(iter(db3))
print(res[0].shape, res[1].shape)  # (32, 32, 32, 3) (32, 1, 10)
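
  The (32, 1, 10) label shape carries the extra dimension of cifar10's (N, 1) labels through one_hot. If flat (32, 10) label batches are preferred, one option (a hypothetical variant, not from the original post) is to squeeze the label inside the map function:

def preProcessFlat(x, y):                    # hypothetical variant of preProcess
    x = tf.cast(x, dtype=tf.float32) / 255.
    y = tf.squeeze(y)                        # (1,) -> scalar label
    y = tf.one_hot(tf.cast(y, tf.int32), depth=10)
    return x, y

db3_flat = db.map(preProcessFlat).batch(32)
res = next(iter(db3_flat))
print(res[0].shape, res[1].shape)  # (32, 32, 32, 3) (32, 10)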

  ⑤ .repeat()

db4 = db3.repeat()  # no count given: repeats forever; repeat(2) would go through the data twice
print(db4)
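
  Putting the pieces together covers item ③ from the overview: a "pipeline" here just means chaining these calls and letting tf.data run the map on multiple threads and prefetch the next batch while the model trains. A minimal sketch, assuming the preProcess function from above (in newer TF versions AUTOTUNE lives at tf.data.AUTOTUNE instead):

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_db = tf.data.Dataset.from_tensor_slices((x, y))
train_db = (train_db
            .shuffle(10000)                                # randomize sample order
            .map(preProcess, num_parallel_calls=AUTOTUNE)  # preprocess on multiple threads
            .batch(128)                                    # group into batches of 128
            .repeat()                                      # loop over the data indefinitely
            .prefetch(AUTOTUNE))                           # overlap loading with training

for step, (x_batch, y_batch) in enumerate(train_db):
    print(step, x_batch.shape, y_batch.shape)  # (128, 32, 32, 3) (128, 1, 10)
    if step == 2:   # repeat() never ends, so stop after a few batches
        break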

 

 

 

  

posted on 2019-12-06 15:26 by Luaser