Introduction to tfds.load() and tf.data.Dataset
tfds.load() has the following parameters:
tfds.load(
name, split=None, data_dir=None, batch_size=None, shuffle_files=False,
download=True, as_supervised=False, decoders=None, read_config=None,
with_info=False, builder_kwargs=None, download_and_prepare_kwargs=None,
as_dataset_kwargs=None, try_gcs=False
)
The important parameters are as follows (a usage example follows this list):
- name: the name of the dataset
- split: how to split the dataset
- data_dir: the location of the data, or where it will be downloaded to
- batch_size: the batch size
- shuffle_files: whether to shuffle the input files
- as_supervised: return (input, label) tuples (by default, elements are returned as dictionaries)
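For a quick sense of how these parameters fit together, here is a minimal sketch of a typical call (the parameter values are arbitrary choices for illustration, not defaults):
import tensorflow_datasets as tfds

# Load the MNIST training split as (image, label) tuples, shuffled and batched.
train_ds = tfds.load(
    'mnist',
    split='train',
    shuffle_files=True,
    as_supervised=True,   # yield (image, label) tuples instead of dicts
    batch_size=32,
)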
1. Splitting the data
# Take the train split (the dataset is divided into train and test splits)
train_ds = tfds.load('mnist', split='train')
# Take both splits
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])
# Take both splits and merge them into one
train_test_ds = tfds.load('mnist', split='train+test')
# Records 10 (included) to 20 (excluded) of the train split
train_10_20_ds = tfds.load('mnist', split='train[10:20]')
# The first 10% of the train split
train_10pct_ds = tfds.load('mnist', split='train[:10%]')
# The first 10% of train + the last 80% of train
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')
#---------------------------------------------------
# 10-fold cross-validation:
# Each validation set takes 10% of the train split:
# [0%:10%], [10%:20%], ..., [90%:100%].
vals_ds = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
# Each training set takes the complementary 90%:
# [10%:100%] (validation set is [0%:10%]),
# [0%:10%] + [20%:100%] (validation set is [10%:20%]), ...,
# [0%:90%] (validation set is [90%:100%]).
trains_ds = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
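A brief sketch of how these fold lists might be consumed; the actual training and evaluation code is only a placeholder assumption here:
# vals_ds and trains_ds are lists of 10 datasets, one per fold.
for fold, (train_fold, val_fold) in enumerate(zip(trains_ds, vals_ds)):
    # Normally a fresh model would be trained on train_fold and evaluated on
    # val_fold here; counting batches is only a stand-in for that work.
    n_val_batches = sum(1 for _ in val_fold.batch(1000))
    print(f'fold {fold}: {n_val_batches} validation batches')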
Splits can also be expressed with the ReadInstruction API, which has the same effect as the string syntax above:
# The full `train` split.
train_ds = tfds.load('mnist', split=tfds.core.ReadInstruction('train'))
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=[
tfds.core.ReadInstruction('train'),
tfds.core.ReadInstruction('test'),
])
# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
train_test_ds = tfds.load('mnist', split=ri)
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
'train', from_=10, to=20, unit='abs'))
# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
'train', to=10, unit='%'))
# The first 10% of train + the last 80% of train.
ri = (tfds.core.ReadInstruction('train', to=10, unit='%') +
tfds.core.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist', split=ri)
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist', split=[
    (tfds.core.ReadInstruction('train', to=k, unit='%') +
     tfds.core.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
2. The returned object
The returned object is a tf.data.Dataset and, when with_info=True, also a tfds.core.DatasetInfo.
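For instance, a minimal sketch of retrieving the DatasetInfo alongside the dataset:
# with_info=True makes tfds.load return a (dataset, info) pair.
ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.features['label'].num_classes)   # 10
print(info.splits['train'].num_examples)    # 60000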
3. Specifying the directory
Specifying the directory is straightforward (by default the data is placed under the user's home directory):
train_ds = tfds.load('mnist', split='train', data_dir='~/user')
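To check where the prepared files actually live, one option (sketched here with the same hypothetical data_dir) is to inspect the dataset builder:
builder = tfds.builder('mnist', data_dir='~/user')
print(builder.data_dir)   # resolved path of the prepared dataset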
4. Getting img and label
Because the return value is a tf.data.Dataset, we can transform the dataset before iterating over it, so that the data comes out in the format we need.
tf.data.Dataset has the following important methods:
4.1 shuffle
Shuffles the data.
shuffle(
    buffer_size, seed=None, reshuffle_each_iteration=None
)
# Randomly shuffles the elements of this dataset.
# The dataset fills a buffer with buffer_size elements, then randomly samples elements
# from this buffer, replacing the selected elements with new ones. For perfect shuffling,
# the buffer size must be greater than or equal to the full size of the dataset.
# For example, if your dataset contains 10,000 elements but buffer_size is set to 1,000,
# shuffle will initially pick a random element from only the first 1,000 elements in the
# buffer. Once an element is picked, its slot in the buffer is filled by the next
# (i.e. the 1,001st) element, keeping the buffer at 1,000 elements.
# reshuffle_each_iteration controls whether the shuffle order should differ between epochs.
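A small sketch of shuffle in action (the buffer size and seed are arbitrary illustration values):
dataset = tf.data.Dataset.range(10)
# A buffer at least as large as the dataset gives a fully random order;
# reshuffle_each_iteration (default True) reshuffles on every epoch.
shuffled = dataset.shuffle(buffer_size=10, seed=42)
print(list(shuffled.as_numpy_iterator()))   # the values 0-9 in some random order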
4.2 batch
Sets the batch size (how many examples per batch); when iterating, the dataset returns that many examples at a time.
batch(
batch_size, drop_remainder=False
)
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
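# Expected output: [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]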
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
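# Expected output: [array([0, 1, 2]), array([3, 4, 5])] (the incomplete final batch is dropped)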
Returns a Dataset.
4.3 map
Works much like the ordinary map; it applies a transformation to every element of the dataset.
map(
map_func, num_parallel_calls=None, deterministic=None
)
dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
dataset = dataset.map(lambda x: x + 1)
list(dataset.as_numpy_iterator())
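# Expected output: [2, 3, 4, 5, 6]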
Returns a Dataset.
4.4 as_numpy_iterator
Returns an iterator that converts all elements of the dataset to numpy.
Use as_numpy_iterator to inspect the contents of your dataset. To see the shapes and types of the elements, print the dataset elements directly instead of using as_numpy_iterator.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
    print(element)
# tf.Tensor(1, shape=(), dtype=int32)
# tf.Tensor(2, shape=(), dtype=int32)
# tf.Tensor(3, shape=(), dtype=int32)
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
    print(element)
# 1
# 2
# 3
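Tying this back to tfds: elements loaded without as_supervised are dictionaries, which can be inspected the same way (a minimal sketch):
ds = tfds.load('mnist', split='train')
for example in ds.take(1).as_numpy_iterator():
    print(example['image'].shape, example['label'])   # (28, 28, 1) and an integer label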
4.5 Example of transforming the dataset
The following pattern yields data in the required format:
# First use map() to resize img and scale it to [0, 1], then shuffle, then set the batch_size returned per iteration
dataset_train = dataset_train.map(lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)).shuffle(1024).batch(batch_size)
# The test set is not shuffled; only resize img
dataset_test = dataset_test.map(lambda img, label: (tf.image.resize(img, (224, 224)) / 255.0, label)).batch(batch_size)
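For context, dataset_train and dataset_test above would come from a call along these lines; the dataset name tf_flowers, the split percentages, and batch_size are assumptions for illustration:
batch_size = 32
dataset_train, dataset_test = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:]'],   # tf_flowers ships only a train split
    as_supervised=True,                      # so map() receives (img, label) pairs
)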
Iterating over the data:
for images, labels in dataset_train:
    labels_pred = model(images, training=True)
    loss = tf.keras.losses.sparse_categorical_crossentropy(y_true=labels, y_pred=labels_pred)
    loss = tf.reduce_mean(loss)
    ......
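The elided part of the loop would typically compute and apply gradients; a minimal sketch of one full training step, assuming a Keras model and optimizer created elsewhere:
optimizer = tf.keras.optimizers.Adam()   # assumed optimizer, for illustration only

for images, labels in dataset_train:
    with tf.GradientTape() as tape:
        labels_pred = model(images, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y_true=labels, y_pred=labels_pred))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))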