【Coursera GenAI with LLM】 Week 3 LLM-powered applications Class Notes
Model optimizations to improve application performance
- Distillation: uses a larger model (the teacher) to train a smaller model (the student). You freeze the teacher's weights and generate completions; the student generates completions for the same inputs, and the difference between the two output distributions is the distillation loss. The student minimizes this loss (usually combined with the standard loss against the ground-truth labels), updating its weights accordingly. You then use the smaller model for inference to lower your storage and compute budget.
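A minimal PyTorch sketch of the combined loss, assuming the teacher's logits were computed with its weights frozen (e.g., under `torch.no_grad()`); `temperature` and `alpha` here are illustrative hyperparameters, not values from the course:

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions so the student learns the teacher's
    # relative probabilities ("soft labels"), not just its top pick.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth ("hard") labels.
    student = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * student
```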
- Quantization: post-training quantization transforms a model's weights to a lower-precision representation, such as 16-bit floating point or 8-bit integer. This reduces the memory footprint of your model.
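For intuition, a toy sketch of symmetric int8 quantization of a single tensor; real toolchains (e.g., PyTorch quantization, bitsandbytes) work per layer or per channel and also calibrate activations:

```python
import torch

def quantize_int8(weights: torch.Tensor):
    # One scale for the whole tensor: the max magnitude maps to 127.
    scale = weights.abs().max() / 127.0
    q = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale  # approximate fp32 reconstruction

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())  # small rounding error
```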
- Pruning: removes redundant model parameters that contribute little to the model's performance, such as weights with values at or near zero.
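A minimal sketch using PyTorch's built-in pruning utilities; the layer shape and the 30% amount are arbitrary choices for illustration:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 128)
# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")  # bake the mask into the weight tensor
print(float((layer.weight == 0).float().mean()))  # ~0.3
```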
Cheat Sheet
RAG (Retrieval-Augmented Generation): retrieve text relevant to the user's query from an external data source and include it in the prompt, so the model grounds its answer in current or domain-specific information (sketched below).
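A minimal sketch of the RAG loop; the toy keyword retriever and the `call_llm` stub are stand-ins for a real vector store and model API:

```python
DOCS = [
    "The Eiffel Tower is 330 metres tall.",
    "Mont Blanc is the highest mountain in France.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real model API call

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question, k=1))
    prompt = (f"Answer using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)
```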
Chain-of-thought prompting: include worked examples with intermediate reasoning steps in the prompt, so the model reasons step by step before giving its final answer (example below).
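An illustrative one-shot chain-of-thought prompt; the worked example (adapted from the Wei et al. chain-of-thought paper) demonstrates the reasoning pattern the model should imitate:

```python
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
```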
Program-Aided Language Model (PAL)
- LLM + code interpreter: the LLM writes code (e.g., Python) as its reasoning steps and an external interpreter executes it, working around the problem that LLMs are unreliable at arithmetic (see the sketch below).
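A minimal sketch of the PAL pattern; `call_llm` is a hypothetical stub that returns the kind of Python a PAL-prompted model would emit for this question:

```python
def call_llm(prompt: str) -> str:
    # Stub: a PAL-prompted model would generate code like this.
    return ("money_initial = 23\n"
            "money_spent = 15\n"
            "answer = money_initial - money_spent\n")

def pal_solve(question: str) -> float:
    prompt = f"# Q: {question}\n# Write Python that sets a variable `answer`.\n"
    code = call_llm(prompt)
    scope: dict = {}
    exec(code, scope)  # the interpreter, not the LLM, does the arithmetic
    return scope["answer"]

print(pal_solve("I had $23 and spent $15 on lunch. How much is left?"))  # 8
```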
Orchestrator: manages the flow of information between the LLM, external applications, and external data sources, e.g., LangChain (a toy version of the loop is sketched below).
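A toy sketch of the control flow an orchestrator automates; this is not LangChain's API, just the underlying loop, and `call_llm` and the single `search` tool are hypothetical stubs:

```python
def search(query: str) -> str:
    return "stub result for: " + query  # stands in for a real search API

TOOLS = {"search": search}

def orchestrate(call_llm, user_input: str, max_steps: int = 5) -> str:
    transcript = user_input
    for _ in range(max_steps):
        reply = call_llm(transcript)
        if reply.startswith("ACTION "):  # e.g. "ACTION search: llama habitat"
            name, _, arg = reply[len("ACTION "):].partition(": ")
            result = TOOLS[name](arg)    # run the external tool
            transcript += f"\n{reply}\nRESULT: {result}"
        else:
            return reply                 # the model gave a final answer
    return "stopped: step limit reached"
```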
ReAct: a prompting framework that synergizes reasoning and acting in LLMs; each cycle of the prompt interleaves:
- Thought: reason about the current situation
- Action: an external task the model can carry out, chosen from an allowed set of actions (search, lookup, finish)
- Observation: the new information returned by the action, appended to the prompt for the next Thought step (see the example trace after this list)
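An illustrative ReAct trace in the Thought/Action/Observation format; the question and tool outputs are invented for illustration, and the action names match the allowed set above:

```python
REACT_EXAMPLE = """\
Question: What is the tallest mountain in the country where the Eiffel Tower is located?
Thought: I need to find which country the Eiffel Tower is in.
Action: search[Eiffel Tower]
Observation: The Eiffel Tower is a wrought-iron lattice tower in Paris, France.
Thought: Now I need the tallest mountain in France.
Action: search[tallest mountain in France]
Observation: Mont Blanc is the highest mountain in France.
Thought: I have the answer.
Action: finish[Mont Blanc]
"""
```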