【Coursera GenAI with LLM】 Week 2 PEFT Class Notes
With PEFT, we train only a small portion of the parameters!
What uses memory while training a model?
- Trainable weights
- Optimizer states
- Gradients
- Forward Activations
- Temporary memory
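To see why full fine-tuning is so memory-hungry, here is a back-of-envelope sketch in plain Python. The byte counts and the Adam-style optimizer (two extra states per weight) are common defaults, not numbers from the course, and the sketch deliberately leaves out activations and temporary buffers, which depend on batch size and sequence length:

```python
# Rough memory estimate for full fine-tuning with an Adam-style optimizer.
# Illustrative only: ignores forward activations and temporary buffers.

def training_memory_gb(num_params, bytes_per_param=4):
    weights = num_params * bytes_per_param        # trainable weights (fp32)
    gradients = num_params * bytes_per_param      # one gradient per weight
    optimizer = 2 * num_params * bytes_per_param  # Adam: momentum + variance
    return (weights + gradients + optimizer) / 1e9

# A 1B-parameter model in fp32 already needs ~16 GB before activations:
print(training_memory_gb(1_000_000_000))  # -> 16.0
```

This is why training state can be many times larger than the weights themselves, and why shrinking the trainable-parameter count (PEFT) shrinks the gradient and optimizer memory along with it.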
PEFT Trade-offs
- Parameter Efficiency
- Memory Efficiency
- Model Performance
- Training Speed
- Inference Costs
PEFT Methods
- Selective: select subset of initial LLM parameters to fine-tune
- Re-parameterize: re-parameterize model weights using a low-rank representation. ex. LoRA
- Additive: add trainable layers or parameters to the model while keeping all of the original LLM weights frozen
  - Adapter methods: add new trainable layers to the architecture of the model, typically inside the encoder or decoder components after the attention or feed-forward layers.
  - Soft prompt methods: keep the model architecture fixed and frozen, and focus on manipulating the input to achieve better performance
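As a concrete illustration of the additive adapter idea, here is a minimal NumPy sketch of a bottleneck adapter with a residual connection. The widths (model dimension 64, bottleneck 8) are made up for the example, and zero-initializing the up-projection so the adapter starts as a no-op is a common convention, not something specified in the notes:

```python
import numpy as np

# Sketch of an additive adapter: a small trainable bottleneck inserted
# after a frozen layer, with a residual connection.
d_model, d_bottleneck = 64, 8  # illustrative sizes

rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.1  # trainable
W_up = np.zeros((d_bottleneck, d_model))                 # trainable, zero-init

def adapter(h):
    # Residual bottleneck: h + up(relu(down(h)))
    return h + np.maximum(h @ W_down, 0) @ W_up

h = rng.normal(size=(5, d_model))   # e.g. hidden states for 5 tokens
print(np.allclose(adapter(h), h))   # True at init: W_up is zero, so no change yet
```

Only `W_down` and `W_up` are trained (8*64*2 = 1,024 parameters here); the surrounding layer's weights stay frozen.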
Recap of how the Transformer works
- The input prompt is turned into tokens
- The tokens are converted to embedding vectors and passed into the encoder and/or decoder parts of the transformer.
- Inside the encoder and decoder there are two kinds of neural networks: self-attention and feed-forward networks.
- The weights of these networks are learned during pre-training.
- During full fine-tuning, every parameter in these layers is updated.
Alternatively, instead of updating every parameter in that last step, we can use LoRA.
LoRA (Low-Rank Adaptation of LLMs): a re-parameterization strategy that reduces the number of parameters trained during fine-tuning. All of the original model weights are frozen, and a pair of small rank-decomposition matrices is injected alongside them; only those matrices are trained. The result is a LoRA fine-tuned LLM for a specific task.
Because so few parameters are trained, LoRA fine-tuning can often run on a single GPU instead of requiring several.
You can swap in different pairs of matrices for different tasks, and since those matrices are typically very small, storing many task-specific pairs is cheap.
Bigger matrices do not automatically mean better performance: ranks in the range of 4-32 can provide a good trade-off between reducing trainable parameters and preserving performance.
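A minimal NumPy sketch of the LoRA decomposition, using the course's example dimensions (a 512 x 64 weight matrix with rank r = 8). Initializing B to zero so the update starts as a no-op follows the usual LoRA convention and is an assumption on my part, not something stated in the notes:

```python
import numpy as np

# LoRA: freeze W, train only the low-rank pair (B, A).
d_in, d_out, r = 512, 64, 8   # course example dimensions

rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))      # frozen pretrained weights
A = rng.normal(size=(r, d_out)) * 0.01  # trainable, r x d_out
B = np.zeros((d_in, r))                 # trainable, zero-init -> delta starts at 0

def lora_forward(x):
    # Original path plus the low-rank update: x W + x B A
    return x @ W + (x @ B) @ A

trainable = A.size + B.size  # 512 + 4096 = 4,608 trainable parameters
full = W.size                # 32,768 parameters in the full matrix
print(trainable, full, round(1 - trainable / full, 2))  # 4608 32768 0.86
```

For this layer, LoRA trains roughly 86% fewer parameters than full fine-tuning, and because B starts at zero the model's behavior is unchanged before training begins.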
Prompt Tuning: different from prompt engineering, you add additional trainable tokens (soft prompts) to your prompt and leave it to the supervised learning process to determine their optimal values.
Soft prompts: the weights of the model are frozen, but the embedding vectors of the soft prompt get updated over time to optimize the model's completion of the prompt.
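The mechanics can be sketched in NumPy: a small matrix of trainable embeddings is prepended to the (frozen) token embeddings before they enter the model. The sizes below (vocabulary 1,000, embedding width 64, 20 virtual tokens) are illustrative placeholders:

```python
import numpy as np

# Prompt tuning: prepend trainable "soft prompt" vectors to the
# frozen input token embeddings; only soft_prompt is updated in training.
vocab_size, d_model, n_soft = 1000, 64, 20  # illustrative sizes

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # frozen
soft_prompt = rng.normal(size=(n_soft, d_model)) * 0.5    # trainable

def embed_with_soft_prompt(token_ids):
    token_embeds = embedding_table[token_ids]           # frozen lookup
    return np.concatenate([soft_prompt, token_embeds])  # prepend virtual tokens

out = embed_with_soft_prompt(np.array([1, 2, 3]))
print(out.shape)  # (23, 64): 20 soft tokens + 3 real tokens
```

Unlike real tokens, the soft prompt vectors are not tied to vocabulary entries; they are free vectors in embedding space that gradient descent can move anywhere.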
The bigger the model, the more effective prompt tuning is.
Remaining question: smaller LLMs can still struggle with one-shot and few-shot inference.