Distributed Training: DeepSpeed ZeRO 1/2/3 + Accelerate, Megatron-LM

1 Introduction

Github: https://github.com/microsoft/DeepSpeed

  1. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  2. ZeRO-Offload: Democratizing Billion-Scale Model Training
  3. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
  4. ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

ZeRO (Zero Redundancy Optimizer) is a distributed data-parallel (Data Parallel) scheme that eliminates memory redundancy, divided into Stage 1, Stage 2, and Stage 3; DeepSpeed is Microsoft's official engineering implementation of the ZeRO method described in the papers.

ZeRO-Offload extends ZeRO by offloading data and computation (chiefly the optimizer states and the Adam update step) from the GPU to the CPU, so that much larger models can be trained on limited GPU memory.

ZeRO-Infinity is likewise an offloading technique. ZeRO-Offload is geared more toward the single-GPU scenario, whereas ZeRO-Infinity takes a typically industrial approach and targets extremely large-scale training, offloading to CPU memory and NVMe storage.

ZeRO++ is a communication optimization for ZeRO Stage 3, improving three aspects (a config sketch follows the list):

  1. Each server (node) keeps a full copy of the model parameters, eliminating cross-server all-gather operations (hpZ);
  2. During communication, block-based quantization converts model parameters from FP16 to INT8 (qwZ);
  3. The ring-based ReduceScatter for gradients is replaced with a hierarchical, quantized AllToAll (qgZ).
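
As a minimal sketch (assuming the ZeRO++ flag names documented in the DeepSpeed ZeRO++ tutorial, namely zero_quantized_weights, zero_hpz_partition_size, and zero_quantized_gradients), these optimizations are switched on through the zero_optimization block of a DeepSpeed config:

# Hypothetical DeepSpeed config fragment enabling ZeRO++ on top of ZeRO Stage 3
zeropp_config = {
    "zero_optimization": {
        "stage": 3,                       # ZeRO++ builds on ZeRO Stage 3
        "zero_quantized_weights": True,   # qwZ: block-quantized (fp16 -> int8) weight all-gather
        "zero_hpz_partition_size": 8,     # hpZ: keep a full weight copy per node (here: 8 GPUs per node)
        "zero_quantized_gradients": True, # qgZ: hierarchical, quantized all-to-all gradient reduction
    }
}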

Megatron-LM is NVIDIA's framework for large-scale language model training. Compared with DeepSpeed it offers stronger tensor (model) parallelism and pipeline parallelism, while DeepSpeed has the edge in data parallelism (ZeRO).

2 Preliminaries

2.1 Distributed Parallelism Strategies

Models whose training fits on a single GPU

  • Data Parallelism (Data Parallel, DP): every GPU holds a full copy of the model but receives different data; the shards across all GPUs together make up the full dataset

Models whose training does not fit on a single GPU

  • Pipeline Parallelism (Pipeline Parallel, PP): split the model by layers
  • Tensor Parallelism (Tensor Parallel, TP): split the weights within each layer

Hybrid strategy (3D parallelism)

  • Data parallelism + pipeline parallelism + tensor parallelism

Example of 3D parallelism:

  • 2-way data parallel + 4-way pipeline parallel + 4-way tensor parallel (2 × 4 × 4 = 32 GPUs in total), as illustrated in the sketch below
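
For illustration only (real frameworks build these groups with torch.distributed process groups, and the rank ordering is configurable), the sketch below maps each of the 32 global ranks in this example onto its data/pipeline/tensor coordinates:

# 2-way DP x 4-way PP x 4-way TP = 32 GPUs; assume TP varies fastest, then PP, then DP
DP, PP, TP = 2, 4, 4

def rank_to_coords(rank: int):
    tp_rank = rank % TP
    pp_rank = (rank // TP) % PP
    dp_rank = rank // (TP * PP)
    return dp_rank, pp_rank, tp_rank

for rank in range(DP * PP * TP):
    dp, pp, tp = rank_to_coords(rank)
    print(f"global rank {rank:2d} -> dp={dp}, pp={pp}, tp={tp}")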

2.2 Estimating Compute and Memory Requirements for LLM Inference and Training

2.2.1 Numeric Precision Formats

For large language models, choosing an appropriate precision format is critical. High-precision formats such as FP32 suit demanding tasks but consume more resources; FP16 and bfloat16 maintain accuracy while significantly reducing compute cost. Low-precision formats such as INT8 and FP4 fit resource-constrained environments, especially inference, where compressed storage and compute improve deployment efficiency. Using these formats appropriately balances performance and resource utilization and lets LLMs be deployed in a wider range of scenarios.

Name | Abbreviation | Bytes | Bits
Single-precision floating-point format | fp32 | 4 bytes | 32 bits
Half-precision floating-point format | fp16 | 2 bytes | 16 bits
Brain floating-point format | bf16 | 2 bytes | 16 bits
8-bit integer format | int8 | 1 byte | 8 bits
4-bit floating-point format | fp4 | 0.5 bytes | 4 bits
4-bit NormalFloat format | nf4 | 0.5 bytes | 4 bits
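
For the natively supported dtypes, the per-element sizes above can be confirmed directly with PyTorch (fp4 and nf4 are not native torch dtypes; they come from 4-bit quantization libraries such as bitsandbytes):

import torch

# Bytes and bits per element for the dtypes PyTorch supports natively
for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    n_bytes = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {n_bytes} byte(s) = {n_bytes * 8} bits")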

2.2.2 Estimating VRAM Requirements - Inference

Taking LLaMA 2 7B as an example, the VRAM requirements are as follows:

These figures count only the model parameters themselves and do not include additional runtime memory such as optimizer states and activations.

Type | Precision | Model size | Inference/Training | Minimum VRAM (rough estimate)
Full precision | FP32 | 7B | Inference | 7B * 4 bytes = 28 GB
Half precision | FP16 | 7B | Inference | 7B * 2 bytes = 14 GB
Low precision | INT8 | 7B | Inference | 7B * 1 byte = 7 GB
Low precision | INT4 | 7B | Inference | 7B * 0.5 bytes = 3.5 GB
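
The same back-of-the-envelope arithmetic as the table, written as a small helper (weight memory only; activations and the KV cache are ignored):

def weight_vram_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Weight-only VRAM estimate in GB: parameter count x bytes per parameter."""
    return n_params_billion * bytes_per_param

for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"LLaMA 2 7B @ {precision}: ~{weight_vram_gb(7, bytes_per_param):.1f} GB")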

3 ZeRO

Video: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Today, Mixed-Precision Training and the Adam Optimizer are the de facto standard for distributed LLM training.

During training, ZeRO divides the contents of each GPU's memory into two categories (using the Adam optimizer as the example):

  1. Model States:
    • Parameters (fp16)
    • Gradients (fp16)
    • Optimizer States (fp32): the fp32 parameter copy, Adam momentum, and Adam variance

VRAM calculation: let the number of parameters be Ψ; storing the model states then requires 2Ψ + 2Ψ + (4Ψ + 4Ψ + 4Ψ) = 16Ψ bytes (fp16 parameters, fp16 gradients, plus the fp32 parameter copy, momentum, and variance).

  2. Residual States:
    • Activations
    • Temporary buffers
    • Unusable memory fragmentation

3.1 Stage 1, 2, 3

  • P_os (Stage 1): partition the optimizer states across the N_d data-parallel GPUs, reducing per-GPU model-state memory from 16Ψ to 4Ψ + 12Ψ/N_d bytes
  • P_os+g (Stage 2): additionally partition the gradients, reducing it to 2Ψ + 14Ψ/N_d bytes
  • P_os+g+p (Stage 3): additionally partition the parameters themselves, reducing it to 16Ψ/N_d bytes (see the calculator sketch below)
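
To make these formulas concrete, here is a small illustrative calculator (the 7B / 64-GPU numbers are hypothetical examples, not from the original note):

def zero_model_state_gb(psi_billion: float, n_dp: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for mixed-precision Adam (2 + 2 + 12 = 16 bytes per parameter)."""
    psi = psi_billion * 1e9
    params, grads, optim = 2 * psi, 2 * psi, 12 * psi  # fp16 params, fp16 grads, fp32 copy + momentum + variance
    if stage == 0:    # plain data parallelism: everything replicated
        total = params + grads + optim
    elif stage == 1:  # P_os: shard optimizer states
        total = params + grads + optim / n_dp
    elif stage == 2:  # P_os+g: also shard gradients
        total = params + (grads + optim) / n_dp
    else:             # P_os+g+p (Stage 3): also shard parameters
        total = (params + grads + optim) / n_dp
    return total / 1e9

for stage in (0, 1, 2, 3):
    print(f"7B params, 64 GPUs, ZeRO stage {stage}: {zero_model_state_gb(7, 64, stage):.1f} GB per GPU")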

3.2 ZeRO-Offload

3.2.1 Communication Volume Analysis

3.3 ZeRO-Infinity

3.4 ZeRO++

4 DeepSpeed + Accelerate

https://www.bilibili.com/video/BV1hb421E7WY/

4.1 Environment Setup

VSCode extension

4.2 Baseline

1. Install deepspeed & accelerate

pip install deepspeed accelerate

2. Accelerate config file

accelerate config

In which compute environment are you running? This machine
Which type of machine are you using? Multi-GPU
How many different machines will you use (use more than 1 for multi node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/No]:
Do you wish to optimize your script with torch dynamo? [yes/No]:
Do you want to use DeepSpeed? [yes/No]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/No]: No
What should be your DeepSpeed's ZeRO optimization stage? 2
Where to offload optimizer states? none
Where to offload parameters? none
How many gradient accumulation steps you're passing in your script? [1]: 1
Do you want to use gradient clipping? [yes/No]: No
Do you want to enable `deepspeed.zero.init` when using ZeRO Stage 3 for constructing massive models? [yes/No]: No
Do you want to enable Mixture-of-Experts training (MoE)? [yes/No]:
How many GPU(s) should be used for distributed training? [1]: 2
Do you wish to use FP16 or BF16 (mixed precision)? bf16

accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml

3. Copy the config file to the current directory

cp [source] [destination]

cp /root/.cache/huggingface/accelerate/default_config.yaml ./

4. Run

accelerate launch --config_file default_config.yaml ddp_accelerate.py
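
The note does not show the contents of ddp_accelerate.py; a minimal, hypothetical sketch of such a script with Hugging Face Accelerate (the toy model, random data, and hyperparameters are placeholders) could look like this:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

def main():
    # Picks up default_config.yaml (DeepSpeed ZeRO-2, bf16) when run via `accelerate launch`
    accelerator = Accelerator()

    # Placeholder model and data standing in for a real LLM fine-tuning setup
    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 2))
    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 2, (1024,)))
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Accelerate wraps model/optimizer/dataloader for DeepSpeed data parallelism
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for epoch in range(3):
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            accelerator.backward(loss)  # use instead of loss.backward() so DeepSpeed hooks run
            optimizer.step()
        accelerator.print(f"epoch {epoch}: last loss {loss.item():.4f}")

if __name__ == "__main__":
    main()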

5. Create a new terminal to monitor GPU usage

nvidia-smi -l 1

4.3 Custom configuration using deepspeed_config.json

https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed#deepspeed-config-file
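
A minimal, hypothetical example of what such a file might contain (ZeRO Stage 2 + bf16; the "auto" values are resolved by the Accelerate integration from the script and launcher), written here as a Python dict and dumped to deepspeed_config.json:

import json

# Sketch of a DeepSpeed config: ZeRO Stage 2, bf16, no CPU offload
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "none"},
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Re-running accelerate config and answering yes to "Do you want to specify a json file to a DeepSpeed config?" lets the launcher pick up this file instead of the questionnaire answers.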

5 Megatron-LM

Github: https://github.com/NVIDIA/Megatron-LM

  1. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  2. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  3. Reducing Activation Recomputation in Large Transformer Models
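
As a toy illustration of the tensor parallelism Megatron-LM is built around (not Megatron's actual API), splitting a linear layer's weight matrix column-wise across two workers and concatenating their partial outputs reproduces the single-GPU result:

import numpy as np

# Column-parallel linear layer: Y = X @ A, with A split column-wise across two "GPUs"
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))       # activations (batch=4, hidden=8)
A = rng.standard_normal((8, 6))       # full weight matrix

A1, A2 = np.split(A, 2, axis=1)       # each worker holds half of the output columns
Y1 = X @ A1                           # computed on worker 0
Y2 = X @ A2                           # computed on worker 1

Y = np.concatenate([Y1, Y2], axis=1)  # all-gather along the column dimension
assert np.allclose(Y, X @ A)          # matches the unpartitioned computation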

Reference
