Distributed Training: DeepSpeed ZeRO 1/2/3 + Accelerate, Megatron-LM

1 Introduction

Github: https://github.com/microsoft/DeepSpeed

  1. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  2. ZeRO-Offload: Democratizing Billion-Scale Model Training
  3. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
  4. ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

ZeRO (Zero Redundancy Optimizer) is a distributed data-parallel (Data Parallel) scheme that eliminates memory redundancy, divided into Stage 1, Stage 2, and Stage 3; DeepSpeed is Microsoft's official engineering implementation of the ZeRO method described in the papers.

ZeRO-Offload extends ZeRO by offloading data and computation (chiefly the optimizer states and the Adam update step) from the GPU to the CPU, so that much larger models can be trained on limited GPU memory.

ZeRO-Infinity is likewise an offloading technique. ZeRO-Offload is geared more toward the single-GPU scenario, whereas ZeRO-Infinity takes a typically industrial approach and targets extremely large-scale training, offloading to CPU memory and NVMe storage.

ZeRO++ is a communication optimization for ZeRO Stage 3, improving three aspects (a config sketch follows the list):

  1. Each server (node) keeps a full copy of the model parameters, eliminating cross-server all-gather operations (hpZ);
  2. During communication, block-based quantization converts model parameters from FP16 to INT8 (qwZ);
  3. The ring-based ReduceScatter for gradients is replaced with a hierarchical, quantized AllToAll (qgZ).
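
As a minimal sketch (assuming the ZeRO++ flag names documented in the DeepSpeed ZeRO++ tutorial, namely zero_quantized_weights, zero_hpz_partition_size, and zero_quantized_gradients), these optimizations are switched on through the zero_optimization block of a DeepSpeed config:

# Hypothetical DeepSpeed config fragment enabling ZeRO++ on top of ZeRO Stage 3
zeropp_config = {
    "zero_optimization": {
        "stage": 3,                       # ZeRO++ builds on ZeRO Stage 3
        "zero_quantized_weights": True,   # qwZ: block-quantized (fp16 -> int8) weight all-gather
        "zero_hpz_partition_size": 8,     # hpZ: keep a full weight copy per node (here: 8 GPUs per node)
        "zero_quantized_gradients": True, # qgZ: hierarchical, quantized all-to-all gradient reduction
    }
}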

Megatron-LM is NVIDIA's framework for large-scale language model training. Compared with DeepSpeed it offers stronger tensor (model) parallelism and pipeline parallelism, while DeepSpeed has the edge in data parallelism (ZeRO).

2 Preliminaries

2.1 Distributed Parallelism Strategies

Models whose training fits on a single GPU

  • Data Parallelism (Data Parallel, DP): every GPU holds a full copy of the model but receives different data; the shards across all GPUs together make up the full dataset

Models whose training does not fit on a single GPU

  • Pipeline Parallelism (Pipeline Parallel, PP): split the model by layers
  • Tensor Parallelism (Tensor Parallel, TP): split the weights within each layer

Hybrid strategy (3D parallelism)

  • Data parallelism + pipeline parallelism + tensor parallelism

Example of 3D parallelism:

  • 2-way data parallel + 4-way pipeline parallel + 4-way tensor parallel (2 × 4 × 4 = 32 GPUs in total), as illustrated in the sketch below
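
For illustration only (real frameworks build these groups with torch.distributed process groups, and the rank ordering is configurable), the sketch below maps each of the 32 global ranks in this example onto its data/pipeline/tensor coordinates:

# 2-way DP x 4-way PP x 4-way TP = 32 GPUs; assume TP varies fastest, then PP, then DP
DP, PP, TP = 2, 4, 4

def rank_to_coords(rank: int):
    tp_rank = rank % TP
    pp_rank = (rank // TP) % PP
    dp_rank = rank // (TP * PP)
    return dp_rank, pp_rank, tp_rank

for rank in range(DP * PP * TP):
    dp, pp, tp = rank_to_coords(rank)
    print(f"global rank {rank:2d} -> dp={dp}, pp={pp}, tp={tp}")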

2.2 Estimating Compute and Memory Requirements for LLM Inference and Training

2.2.1 Numeric Precision Formats

For large language models, choosing an appropriate precision format is critical. High-precision formats such as FP32 suit demanding tasks but consume more resources; FP16 and bfloat16 maintain accuracy while significantly reducing compute cost. Low-precision formats such as INT8 and FP4 fit resource-constrained environments, especially inference, where compressed storage and compute improve deployment efficiency. Using these formats appropriately balances performance and resource utilization and lets LLMs be deployed in a wider range of scenarios.

Name | Abbreviation | Bytes | Bits
Single-precision floating-point format | fp32 | 4 bytes | 32 bits
Half-precision floating-point format | fp16 | 2 bytes | 16 bits
Brain floating-point format | bf16 | 2 bytes | 16 bits
8-bit integer format | int8 | 1 byte | 8 bits
4-bit floating-point format | fp4 | 0.5 bytes | 4 bits
4-bit NormalFloat format | nf4 | 0.5 bytes | 4 bits
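
For the natively supported dtypes, the per-element sizes above can be confirmed directly with PyTorch (fp4 and nf4 are not native torch dtypes; they come from 4-bit quantization libraries such as bitsandbytes):

import torch

# Bytes and bits per element for the dtypes PyTorch supports natively
for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    n_bytes = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {n_bytes} byte(s) = {n_bytes * 8} bits")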

2.2.2 Estimating VRAM Requirements - Inference

Taking LLaMA 2 7B as an example, the VRAM requirements are as follows:

These figures count only the model parameters themselves and do not include additional runtime memory such as optimizer states and activations.

Type | Precision | Model size | Inference/Training | Minimum VRAM (rough estimate)
Full precision | FP32 | 7B | Inference | 7B * 4 bytes = 28 GB
Half precision | FP16 | 7B | Inference | 7B * 2 bytes = 14 GB
Low precision | INT8 | 7B | Inference | 7B * 1 byte = 7 GB
Low precision | INT4 | 7B | Inference | 7B * 0.5 bytes = 3.5 GB
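
The same back-of-the-envelope arithmetic as the table, written as a small helper (weight memory only; activations and the KV cache are ignored):

def weight_vram_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Weight-only VRAM estimate in GB: parameter count x bytes per parameter."""
    return n_params_billion * bytes_per_param

for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"LLaMA 2 7B @ {precision}: ~{weight_vram_gb(7, bytes_per_param):.1f} GB")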

3 ZeRO

Video: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Today, Mixed-Precision Training and the Adam Optimizer are the de facto standard for distributed LLM training.

During training, ZeRO divides the contents of each GPU's memory into two categories (using the Adam optimizer as the example):

  1. Model States:
    • Parameters (fp16)
    • Gradients (fp16)
    • Optimizer States (fp32): the fp32 parameter copy, Adam momentum, and Adam variance

VRAM calculation: let the number of parameters be Ψ; storing the model states then requires 2Ψ + 2Ψ + (4Ψ + 4Ψ + 4Ψ) = 16Ψ bytes (fp16 parameters, fp16 gradients, plus the fp32 parameter copy, momentum, and variance).

  2. Residual States:
    • Activations
    • Temporary buffers
    • Unusable memory fragmentation

3.1 Stage 1, 2, 3

  • P_os (Stage 1): partition the optimizer states across the N_d data-parallel GPUs, reducing per-GPU model-state memory from 16Ψ to 4Ψ + 12Ψ/N_d bytes
  • P_os+g (Stage 2): additionally partition the gradients, reducing it to 2Ψ + 14Ψ/N_d bytes
  • P_os+g+p (Stage 3): additionally partition the parameters themselves, reducing it to 16Ψ/N_d bytes (see the calculator sketch below)
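
To make these formulas concrete, here is a small illustrative calculator (the 7B / 64-GPU numbers are hypothetical examples, not from the original note):

def zero_model_state_gb(psi_billion: float, n_dp: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for mixed-precision Adam (2 + 2 + 12 = 16 bytes per parameter)."""
    psi = psi_billion * 1e9
    params, grads, optim = 2 * psi, 2 * psi, 12 * psi  # fp16 params, fp16 grads, fp32 copy + momentum + variance
    if stage == 0:    # plain data parallelism: everything replicated
        total = params + grads + optim
    elif stage == 1:  # P_os: shard optimizer states
        total = params + grads + optim / n_dp
    elif stage == 2:  # P_os+g: also shard gradients
        total = params + (grads + optim) / n_dp
    else:             # P_os+g+p (Stage 3): also shard parameters
        total = (params + grads + optim) / n_dp
    return total / 1e9

for stage in (0, 1, 2, 3):
    print(f"7B params, 64 GPUs, ZeRO stage {stage}: {zero_model_state_gb(7, 64, stage):.1f} GB per GPU")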

3.2 ZeRO-Offload

3.2.1 Communication Volume Analysis

3.3 ZeRO-Infinity

3.4 ZeRO++

4 DeepSpeed + Accelerate

https://www.bilibili.com/video/BV1hb421E7WY/

4.1 Environment Setup

VSCode extension

4.2 Baseline

1. Install deepspeed & accelerate

pip install deepspeed accelerate

2. Accelerate config file

accelerate config

In which compute environment are you running? This machine
Which type of machine are you using? Multi-GPU
How many different machines will you use (use more than 1 for multi node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/No]:
Do you wish to optimize your script with torch dynamo? [yes/No]:
Do you want to use DeepSpeed? [yes/No]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/No]: No
What should be your DeepSpeed's ZeRO optimization stage? 2
Where to offload optimizer states? none
Where to offload parameters? none
How many gradient accumulation steps you're passing in your script? [1]: 1
Do you want to use gradient clipping? [yes/No]: No
Do you want to enable `deepspeed.zero.init` when using ZeRO Stage 3 for constructing massive models? [yes/No]: No
Do you want to enable Mixture-of-Experts training (MoE)? [yes/No]:
How many GPU(s) should be used for distributed training? [1]: 2
Do you wish to use FP16 or BF16 (mixed precision)? bf16

accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml

3. Copy the config file to the current directory

cp [source] [destination]

cp /root/.cache/huggingface/accelerate/default_config.yaml ./

4. Run

accelerate launch --config_file default_config.yaml ddp_accelerate.py
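
The note does not show the contents of ddp_accelerate.py; a minimal, hypothetical sketch of such a script with Hugging Face Accelerate (the toy model, random data, and hyperparameters are placeholders) could look like this:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def main():
    # Picks up default_config.yaml (DeepSpeed ZeRO-2, bf16) when run via `accelerate launch`
    accelerator = Accelerator()

    # Placeholder model and data standing in for a real LLM fine-tuning setup
    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 2))
    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 2, (1024,)))
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Accelerate wraps model/optimizer/dataloader for DeepSpeed data parallelism
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for epoch in range(3):
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            accelerator.backward(loss)  # use instead of loss.backward() so DeepSpeed hooks run
            optimizer.step()
        accelerator.print(f"epoch {epoch}: last loss {loss.item():.4f}")

if __name__ == "__main__":
    main()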

5. Create a new terminal to monitor GPU usage

nvidia-smi -l 1

4.3 Custom configuration using deepspeed_config.json

https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed#deepspeed-config-file
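
A minimal, hypothetical example of what such a file might contain (ZeRO Stage 2 + bf16; the "auto" values are resolved by the Accelerate integration from the script and launcher), written here as a Python dict and dumped to deepspeed_config.json:

import json

# Sketch of a DeepSpeed config: ZeRO Stage 2, bf16, no CPU offload
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "none"},
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Re-running accelerate config and answering yes to "Do you want to specify a json file to a DeepSpeed config?" lets the launcher pick up this file instead of the questionnaire answers.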

5 Megatron-LM

Github: https://github.com/NVIDIA/Megatron-LM

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  2. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  3. Reducing Activation Recomputation in Large Transformer Models
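
As a toy illustration of the tensor parallelism Megatron-LM is built around (not Megatron's actual API), splitting a linear layer's weight matrix column-wise across two workers and concatenating their partial outputs reproduces the single-GPU result:

import numpy as np

# Column-parallel linear layer: Y = X @ A, with A split column-wise across two "GPUs"
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))       # activations (batch=4, hidden=8)
A = rng.standard_normal((8, 6))       # full weight matrix

A1, A2 = np.split(A, 2, axis=1)       # each worker holds half of the output columns
Y1 = X @ A1                           # computed on worker 0
Y2 = X @ A2                           # computed on worker 1

Y = np.concatenate([Y1, Y2], axis=1)  # all-gather along the column dimension
assert np.allclose(Y, X @ A)          # matches the unpartitioned computation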

Reference
