TextBrewer 0.2.1

Released by @airaria on 11 Nov 02:38

New Features

  • More flexible distillation: Supports feeding different batches to the student and teacher, so the two batches no longer need to be identical. This makes it possible to distill models with different vocabularies (e.g., from RoBERTa to BERT). See the documentation for details, and the first sketch after this list.

  • Faster distillation: Users can pre-compute and cache the teacher's outputs, then feed the cache to the distiller to skip the teacher's forward pass during training. See the documentation for details, and the second sketch after this list.
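
A minimal sketch of how a dataset can pair a teacher batch with a student batch when the two models use different tokenizers. The `'teacher'`/`'student'` dict keys are an assumption about the batch convention this feature uses; check the documentation for the authoritative format.

```python
from torch.utils.data import Dataset

class PairedBatchDataset(Dataset):
    """Yield dict batches that pair a teacher batch with a student batch.

    Useful when the teacher and student use different vocabularies
    (e.g., a RoBERTa teacher and a BERT student): each example is
    tokenized twice, once per tokenizer.
    """
    def __init__(self, texts, tokenizer_T, tokenizer_S, max_length=128):
        self.texts = texts
        self.tokenizer_T = tokenizer_T  # e.g., a RoBERTa tokenizer
        self.tokenizer_S = tokenizer_S  # e.g., a BERT tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        kwargs = dict(truncation=True, padding='max_length',
                      max_length=self.max_length, return_tensors='pt')
        enc_T = self.tokenizer_T(text, **kwargs)
        enc_S = self.tokenizer_S(text, **kwargs)
        # NOTE: the 'teacher'/'student' keys are an assumed convention;
        # see the TextBrewer docs for the exact batch format.
        return {'teacher': {k: v.squeeze(0) for k, v in enc_T.items()},
                'student': {k: v.squeeze(0) for k, v in enc_S.items()}}
```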

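And a sketch of pre-computing the teacher's outputs once so the distiller can reuse them instead of re-running the teacher at every step. The `'teacher_cache'` key is hypothetical, used only for illustration; the documentation describes the exact way to hand the cache to the distiller.

```python
import torch

@torch.no_grad()
def cache_teacher_outputs(model_T, dataloader, device='cuda'):
    """Run the teacher once over the data and store its logits.

    The resulting list can be wrapped in a DataLoader and fed to the
    distiller so the teacher's forward pass is skipped during training.
    """
    model_T.eval()
    model_T.to(device)
    cached = []
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model_T(**batch).logits  # assumes an HF-style model output
        # NOTE: 'teacher_cache' is a hypothetical key for illustration;
        # see the TextBrewer docs for the exact cache format.
        cached.append({'student': {k: v.cpu() for k, v in batch.items()},
                       'teacher_cache': logits.cpu()})
    return cached
```
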
Improvements

  • MultiTaskDistiller is now a subclass of GeneralDistiller and supports the intermediate feature matching loss; see the configuration sketch after this list.
  • TensorBoard now records more detailed losses (KD loss, hard-label loss, matching losses, etc.).
  • pkd_loss now accepts tensors of shape (batch_size, length, hidden_size) or (batch_size, hidden_size). In the latter case, the loss is computed directly on the input tensors, without first taking the hidden states at the first position; see the shape demo after this list.
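
Because MultiTaskDistiller now derives from GeneralDistiller, an intermediate_matches list in DistillationConfig can be used with it as well. A minimal configuration sketch; the layer indices, loss choice, and weights below are illustrative, not recommendations:

```python
from textbrewer import DistillationConfig

# Match selected teacher layers to student layers with a
# hidden-state MSE loss. Indices and weights are illustrative.
distill_config = DistillationConfig(
    temperature=8,
    intermediate_matches=[
        {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden',
         'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden',
         'loss': 'hidden_mse', 'weight': 1},
    ],
)
```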
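
A short demo of the two accepted pkd_loss input shapes (the textbrewer.losses import path is an assumption):

```python
import torch
from textbrewer.losses import pkd_loss  # import path assumed

batch_size, length, hidden_size = 8, 128, 768

# 3-D inputs: the loss is computed on the states at the first position.
state_S = torch.randn(batch_size, length, hidden_size)
state_T = torch.randn(batch_size, length, hidden_size)
loss_3d = pkd_loss(state_S, state_T)

# 2-D inputs (new in 0.2.1): the tensors are used as-is,
# with no first-position slicing.
cls_S = torch.randn(batch_size, hidden_size)
cls_T = torch.randn(batch_size, hidden_size)
loss_2d = pkd_loss(cls_S, cls_T)
```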