Tensorflow2.0进阶学习-分布式自定义循环训练 (十)

其实分布式最核心的方式就是tf.distribute.Strategy这个函数，只要在这个函数的管理范围都可以。自定义循环训练

自定义循环训练

引包
数据准备
创建一个分布式策略
模型准备
跑起来
加载模型
循环数据

引包

引入一些基础包

# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os

print(tf.__version__)

数据准备

fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Adding a dimension to the array -> new shape == (28, 28, 1)
# We are doing this because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Getting the images in [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

创建一个分布式策略

tf.distribute.MirroredStrategy 策略是如何运作的？

所有变量和模型图都复制在副本上。
输入都均匀分布在副本中。
每个副本在收到输入后计算输入的损失和梯度。
通过求和，每一个副本上的梯度都能同步。
同步后，每个副本上的复制的变量都可以同样更新。

有了strategy 后，之后代码就在它的管理范围内写就好啦，就分布式辽。

# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()

来让我看看到底几个设备，只有一个GPU

print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Number of devices: 1

将数据变成注入流水线，就像流水一样，源源不断地注入模型中进行训练。

# 对数据集的超参数进行设置
BUFFER_SIZE = len(train_images)

BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

EPOCHS = 10

设置完毕后，创建数据集分发它们

train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE) 
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE) 
# 数据也要用strategy进行修饰
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)

模型准备

def create_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
    ])

  return model

# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

定义损失函数的时候要注意一下。

如果使用 tf.keras.losses 类（如下面的示例所示），则需要将损失缩减显式地指定为 NONE 或 SUM。与tf.distribute.Strategy 一起使用时，不允许使用 AUTO 和 SUM_OVER_BATCH_SIZE。不允许使用 AUTO，因为用户应明确考虑他们想要的缩减量，以确保在分布式情况下缩减量正确。不允许使用SUM_OVER_BATCH_SIZE，因为当前它只能按副本批次大小进行划分，而将按副本数量划分划留给用户，这可能很容易遗漏。因此，我们转而要求用户自己显式地执行缩减操作。

SparseCategoricalAccuracy 就是交叉熵函数，分类用。

with strategy.scope():
  # Set reduction to `none` so we can do the reduction afterwards and divide by
  # global batch size.
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True,
      reduction=tf.keras.losses.Reduction.NONE)  # NONE 或者 SUM
  def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

追踪训练信息

with strategy.scope():
  test_loss = tf.keras.metrics.Mean(name='test_loss')

  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='train_accuracy')
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy')

跑起来

# model, optimizer, and checkpoint must be created under `strategy.scope`.
with strategy.scope():
  model = create_model()

  optimizer = tf.keras.optimizers.Adam()

  checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
# 训练的处理
def train_step(inputs):

  images, labels = inputs

  with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = compute_loss(labels, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_accuracy.update_state(labels, predictions)
  return loss 
# 测试的处理
def test_step(inputs):
  images, labels = inputs

  predictions = model(images, training=False)
  t_loss = loss_object(labels, predictions)

  test_loss.update_state(t_loss)
  test_accuracy.update_state(labels, predictions)

以下就是跑起来的代码

# `run` replicates the provided computation and runs it
# with the distributed input.
@tf.function  # 加这个可以加快训练，但是无法断点了
def distributed_train_step(dataset_inputs):
  per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                         axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
  return strategy.run(test_step, args=(dataset_inputs,))

for epoch in range(EPOCHS):
  # TRAIN LOOP
  total_loss = 0.0
  num_batches = 0
  for x in train_dist_dataset:
    total_loss += distributed_train_step(x)
    num_batches += 1
  train_loss = total_loss / num_batches

  # TEST LOOP
  for x in test_dist_dataset:
    distributed_test_step(x)

  if epoch % 2 == 0:
    checkpoint.save(checkpoint_prefix)

  template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
              "Test Accuracy: {}")
  print (template.format(epoch+1, train_loss,
                         train_accuracy.result()*100, test_loss.result(),
                         test_accuracy.result()*100))

  test_loss.reset_states()
  train_accuracy.reset_states()
  test_accuracy.reset_states()

加载模型

加载最近的模型并进行测试
使用 tf.distribute.Strategy 设置了检查点的模型可以使用或不使用策略进行恢复

eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='eval_accuracy')

new_model = create_model()
new_optimizer = tf.keras.optimizers.Adam()

test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)

@tf.function
def eval_step(images, labels):
  predictions = new_model(images, training=False)
  eval_accuracy(labels, predictions)
  
checkpoint = tf.train.Checkpoint(optimizer=new_optimizer, model=new_model)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

for images, labels in test_dataset:
  eval_step(images, labels)

print ('Accuracy after restoring the saved model without strategy: {}'.format(
    eval_accuracy.result()*100))

循环数据

循环数据有 tf.function 内外之分，在外就自己写写，和Torch差不多

for _ in range(EPOCHS):
  total_loss = 0.0
  num_batches = 0
  # 走的是代数
  train_iter = iter(train_dist_dataset)
	
  for _ in range(10):
  	# 走的是批次
    total_loss += distributed_train_step(next(train_iter))
    num_batches += 1
  average_train_loss = total_loss / num_batches

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, average_train_loss, train_accuracy.result()*100))
  train_accuracy.reset_states()

在内这样写可以for x in ... 和上面iter(next()) 等价。

@tf.function
def distributed_train_epoch(dataset):
  total_loss = 0.0
  num_batches = 0
  for x in dataset:
    per_replica_losses = strategy.run(train_step, args=(x,))
    total_loss += strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
    num_batches += 1
  return total_loss / tf.cast(num_batches, dtype=tf.float32)

for epoch in range(EPOCHS):
  train_loss = distributed_train_epoch(train_dist_dataset)

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, train_loss, train_accuracy.result()*100))

  train_accuracy.reset_states()

posted @ 2022-05-04 16:02 赫凯阅读(27) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· Tensorflow2.0进阶学习-Keras 的分布式训练 (九)

· Tensorflow2.0学习-快速入门 (一)

· Tensorflow 2.x入门教程

· [翻译] 使用 TensorFlow 进行分布式训练

· [源码解析] TensorFlow 分布式 DistributedStrategy 之基础篇

阅读排行：
· TypeScript + Deepseek 打造卜卦网站：技术与玄学的结合
· Manus的开源复刻OpenManus初探
· 写一个简单的SQL生成工具
· AI 智能体引爆开源社区「GitHub 热点速览」
· C#/.NET/.NET Core技术前沿周刊 | 第 29 期（2025年3.1-3.9）

公告

**可以写一些人生感悟，不止于技术！**

昵称：赫凯
园龄： 7年
粉丝： 1
关注： 0

+加关注

2025年3月

日

一

二

三

四

五

六

赫凯

Tensorflow2.0进阶学习-分布式自定义循环训练 (十)

自定义循环训练

引包

数据准备

创建一个分布式策略

模型准备

跑起来

加载模型

循环数据

公告

搜索

常用链接

我的标签

随笔分类

随笔档案

阅读排行榜

评论排行榜

推荐排行榜

最新评论