Quantization: fp16, bf16, int8, fp4, nf4

1 GPU Memory Usage

1.1 How to Compute

How do we compute GPU memory usage during training?

For full fp32 fine-tuning with AdamW, the footprint breaks down as:
Model weights: 4 Bytes * num_param
Optimizer states: 4 Bytes * 2 * num_param (AdamW keeps two moment estimates per parameter)
Gradients: 4 Bytes * num_param
Forward activations: varies with batch size, sequence length, and model architecture
Sum: about 16 Bytes * num_param, plus the activation memory
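
A minimal sketch of this arithmetic in Python (assuming fp32 weights, gradients, and AdamW states, and ignoring activations, which vary with batch size and sequence length):

```python
def training_memory_gib(num_param: int) -> float:
    """Rough fp32 + AdamW training footprint in GiB, excluding activations."""
    weights = 4 * num_param        # fp32 weights
    gradients = 4 * num_param      # fp32 gradients
    optimizer = 2 * 4 * num_param  # AdamW first and second moments
    return (weights + gradients + optimizer) / 1024**3

# A 7B-parameter model: 7e9 * 16 Bytes ~ 104 GiB before any activations.
print(f"{training_memory_gib(7_000_000_000):.0f} GiB")
```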

1.2 How to Reduce

Strategy 1: Reduce what must be kept in memory during training (a Trainer sketch follows this list).

Baseline: full-precision fine-tuning with nothing reduced.
+ Gradient Accumulation (forward activations): split a large batch into micro-batches and accumulate gradients over several forward/backward passes before each optimizer step.
+ Gradient Checkpointing (forward activations): TrainingArguments(gradient_checkpointing=True); intermediate activations are not saved and are recomputed during the backward pass, so training takes more time to use less memory.
+ Adafactor Optimizer (optimizer states): Adafactor stores factored second-moment statistics instead of AdamW's two full per-parameter moments.
+ Freeze Model (forward activations / gradients): frozen layers need no gradients or optimizer states.
+ Limit Data Length (forward activations): shorter input sequences produce smaller activation tensors.
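
How these switches look in code, as a minimal sketch against the HuggingFace Trainer API (the bert-base-uncased checkpoint and the frozen module name are placeholder choices; pick values that match your model):

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Freeze Model: train only the classification head
# (the .bert attribute matches this checkpoint; other architectures differ).
for param in model.bert.parameters():
    param.requires_grad = False

# Each argument corresponds to one row in the list above; data length is
# capped separately at tokenization time via max_length/truncation.
args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=1,   # small micro-batch held in memory at once...
    gradient_accumulation_steps=32,  # ...accumulated into an effective batch of 32
    gradient_checkpointing=True,     # recompute activations: more time, less memory
    optim="adafactor",               # factored optimizer states instead of AdamW
)
```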

Strategy 2: Reduce the number of trainable parameters.
Parameter-efficient fine-tuning, PEFT (Prompt Tuning, LoRA, ...), freezes the base model and trains only a small set of added parameters; see the sketch below.
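
A minimal LoRA sketch with the peft library (the gpt2 base checkpoint and the rank/alpha values are placeholder choices):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling applied to the update
    lora_dropout=0.1,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```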
Strategy 3: Reduce the number of bytes each parameter occupies.
The default precision is single precision (fp32), which uses 32 bits to represent one number.

Name                                  Abbreviation   Size
Single-precision floating point       fp32           4 Bytes (32 bits)
Half-precision floating point         fp16           2 Bytes (16 bits)
Brain floating point (BFloat16)       bf16           2 Bytes (16 bits)
8-bit integer                         int8           1 Byte (8 bits)
4-bit floating point                  fp4            0.5 Bytes (4 bits)
4-bit NormalFloat                     nf4            0.5 Bytes (4 bits)
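
How these precisions map onto from_pretrained, as a sketch (the int8 and 4-bit paths require the bitsandbytes package and a CUDA GPU; gpt2 is a placeholder checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Half precision: request the dtype directly.
model_fp16 = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)
model_bf16 = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

# int8 and 4-bit quantization go through bitsandbytes.
model_int8 = AutoModelForCausalLM.from_pretrained(
    "gpt2", quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
model_nf4 = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4; "fp4" is the other option
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in higher precision
    ),
)
```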

2 Precision

02 - Half precision & LLaMA 2
03 - Half precision & ChatGLM 3
04 - 8 Bit
05 - 4 Bit & QLoRA (see the sketch below)
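
For the 4 Bit & QLoRA step, a minimal sketch that combines an nf4-quantized base model with trainable LoRA adapters (the checkpoint name and hyperparameters are placeholders):

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: frozen nf4-quantized base weights plus trainable LoRA adapters.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder checkpoint
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(
    model, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16)
)
model.print_trainable_parameters()
```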

Reference

手把手带你实战HuggingFace Transformers-实战篇 (Hands-on HuggingFace Transformers, practice series)
