Does Knowledge Distillation Really Work?
Summary
The paper examines student fidelity to the teacher in knowledge distillation. Its main contribution is the claim that fidelity is not tied to student accuracy but correlates with calibration, and that most previous knowledge distillation methods focus on accuracy rather than fidelity, even though fidelity matters for student generalization. The authors also test different ways of modifying the distillation data and the training procedure, but none achieves high enough fidelity, and some changes can even harm generalization. The title of the paper is eye-catching, and its content is thought-provoking.
Motivation
Knowledge distillation does not typically work as it is commonly understood: there often remains a large discrepancy between the predictions of the teacher and the student (a sketch of the standard distillation objective follows the questions below).
1. Why can't the student match the teacher? Difficulties in optimization.
2. What determines how closely the student matches the teacher? The distillation data.
3. Does matching the teacher closely give better student generalization? Not necessarily; the best-generalizing students often do not match the teacher closely.
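For concreteness, below is a minimal sketch of the standard distillation objective (softened-softmax matching in the style of Hinton et al.) that the student is supposed to optimize; the temperature `T` and mixing weight `alpha` are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD objective: KL between temperature-softened teacher and
    student distributions, mixed with the usual cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)  # T**2 keeps the gradient scale comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```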
Background
In self-distillation, the student fails to match the teacher and, paradoxically, student generalization improves as a result.
For large models, fidelity is aligned with generalization.
Conclusion
- Good student accuracy does not imply good distillation fidelity;
- Student fidelity is correlated with calibration when distilling ensembles (a sketch of a standard calibration metric follows this list);
- Optimization is challenging in knowledge distillation;
- There is a trade-off between optimization complexity and distillation data quality.
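Since the second point ties fidelity to calibration, here is a minimal sketch of how calibration is commonly quantified via expected calibration error (ECE); this is a generic implementation for illustration, not the paper's exact evaluation protocol.

```python
import torch

@torch.no_grad()
def expected_calibration_error(probs, labels, n_bins=15):
    """Generic ECE: bin predictions by confidence, then average the gap
    between accuracy and confidence in each bin, weighted by bin size."""
    conf, pred = probs.max(dim=1)
    correct = pred.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.tensor(0.0)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece.item()
```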
Notes
1. Why is high fidelity an important objective?
When is knowledge transfer successful? The paper measures fidelity by teacher-student agreement (a sketch follows the experiments below).
Experiments about self-distillation.
MNIST: self-distillation does not improve generalization
ResNet-56: increasing fidelity decreases student generalization.
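As a concrete reference, below is a minimal sketch of the two fidelity metrics used in this line of work, top-1 agreement and average predictive KL between teacher and student; the assumption that both models output raw logits over the same classes is mine.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fidelity_metrics(student_logits, teacher_logits):
    """Top-1 agreement: fraction of inputs on which student and teacher
    predict the same class. Predictive KL: mean KL(teacher || student)."""
    agree = (student_logits.argmax(dim=1) == teacher_logits.argmax(dim=1)).float().mean()
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=1),  # log q_student
        F.softmax(teacher_logits, dim=1),      # p_teacher
        reduction="batchmean",
    )
    return agree.item(), kl.item()
```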
What can self-distillation tell us about knowledge distillation in general?
If the teacher generalizes significantly better than an independently trained student, fidelity dominates the other regularization effects associated with not matching the teacher.
Even so, higher-fidelity students do not always generalize better (Section 5).
If distillation already improves generalization, why care about fidelity?
Knowledge distillation does often improve generalization, but it is still worth understanding the relationship between fidelity and generalization, and how to maximize fidelity.
There is often a significant disparity in generalization between large teacher models and smaller students, and higher fidelity could help close that gap.
Possible causes of low distillation fidelity
Architecture
Student capacity
Identifiability (data): Section 5
Optimization: Section 6
2. Identifiability: Are We Using the Right Distillation Dataset?
Should we do more data augmentation?
The best augmentation policies for generalization, MixUp and GAN-generated data, are not the best policies for fidelity.
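As an illustration of what augmenting the distillation set looks like, here is a minimal sketch of MixUp-style distillation data, where the teacher relabels mixed inputs with soft targets; `teacher` is a placeholder for any trained model returning logits, and the Beta(alpha, alpha) mixing is the usual MixUp convention rather than the paper's exact setup.

```python
import torch

@torch.no_grad()
def mixup_distillation_batch(x, teacher, alpha=1.0):
    """Build a MixUp batch for distillation: mix pairs of inputs and let the
    teacher provide soft targets on the mixed examples."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]        # convex combination of inputs
    soft_targets = teacher(x_mix).softmax(dim=1)   # teacher's labels on mixed data
    return x_mix, soft_targets
```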
Should data augmentation be close to the data distribution?
Data augmentation has an effect beyond improving identifiability: it also has a regularizing effect.
An insufficient quantity of teacher labels is not the primary obstacle to high fidelity.
The data recycling hypothesis
Modifying the distillation data can slightly improve fidelity, but the evidence does not support blaming poor distillation fidelity on the wrong choice of distillation data.
3. Optimization: Does the Student Match the Teacher on Distillation Data?
More distillation data lowers train agreement
For heavier augmentations, the agreement drops dramatically.
The optimization method is unable to achieve high fidelity even on the distillation dataset when extensive data augmentation or synthetic data is used.
Why is train agreement so low?
Distillation fidelity cannot be significantly improved by training longer or with a different optimizer.
Is there any modification of the problem that can produce a high-fidelity student? When the student is initialized close to the teacher, it converges to the same basin as the teacher and achieves nearly 100% agreement. In practice, however, optimization converges to sub-optimal solutions, leading to poor distillation fidelity.
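Below is a minimal sketch of that initialization probe in a self-distillation setting (identical teacher and student architectures), assuming both are `torch.nn.Module`s; the Gaussian noise scale is an illustrative choice, not the paper's value.

```python
import copy
import torch

def init_student_near_teacher(teacher, noise_scale=0.01):
    """Self-distillation probe: start the student from a small perturbation of
    the teacher's weights, so optimization tends to stay in the teacher's basin."""
    student = copy.deepcopy(teacher)  # identical architecture, copied weights
    with torch.no_grad():
        for p in student.parameters():
            p.add_(noise_scale * torch.randn_like(p))
    return student
```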
Related materials:
[Classic digest] Knowledge Distillation, a classic work
https://zhuanlan.zhihu.com/p/102038521