8 Innovative BERT Knowledge Distillation Papers That Have Changed The Landscape of NLP
Contemporary state-of-the-art NLP models are difficult to deploy in production. Knowledge distillation offers tools for tackling this problem, along with several others, but it has its quirks.
BERT’s inefficiency has not gone unnoticed, and many researchers have pursued ways to reduce its cost and size. Some of the most active research is in model compression techniques such as smaller architectures (structured pruning), distillation, quantization, and unstructured pruning. A few of the more impactful papers include:
- DistilBERT used knowledge distillation to transfer knowledge from a BERT base model into a 6-layer version (a minimal sketch of the distillation loss follows this list).
- TinyBERT implemented a more complicated distillation setup to better transfer the knowledge from the baseline model into a 4-layer version.
- The Lottery Ticket Hypothesis applied magnitude pruning during pre-training of a BERT model to create a sparse architecture that generalized well across fine-tuning tasks.
- Movement Pruning combined magnitude and gradient information to remove redundant parameters while fine-tuning with distillation.
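
To make the distillation idea concrete, here is a minimal sketch of a soft-target distillation loss in PyTorch. The function name, the temperature, and the `alpha` weighting are illustrative assumptions rather than DistilBERT’s exact recipe, whose training objective also includes masked-language-modeling and embedding-alignment terms.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pushes the
    student toward the teacher's temperature-softened distribution.
    (Hypothetical helper for illustration, not DistilBERT's full objective.)"""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard supervised cross-entropy on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# Example: a batch of 4 examples over 3 classes with random logits.
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The temperature smooths the teacher’s output so the student also learns from the relative probabilities of the “wrong” classes, which is where much of the teacher’s knowledge lives.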
This post focuses on text classification problems with a limited number of training samples.