This repository hosts the code for *Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks*. It builds on starter code from fairseq, huggingface, and transformer-xl.
See the paper for details, a comparison with entmax, and OOD results.
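The core idea, at a high level, is to prune a fixed percentage of attention positions globally, using statistics gathered from data, rather than sparsifying each attention row independently. Below is a minimal NumPy sketch of that idea; the function names (`global_attention_mask`, `sparse_attention`) and the specific thresholding scheme are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def global_attention_mask(attn_maps, sparseness):
    """Build a binary mask keeping only the globally strongest attention
    positions. `attn_maps` stacks attention matrices collected over a
    data sample; `sparseness` is the fraction of positions to prune
    (e.g. 0.9 for the 90% row in the results table below).
    Illustrative sketch, not the repo's implementation."""
    mean_attn = attn_maps.mean(axis=0)            # average over the data sample
    k = int(mean_attn.size * sparseness)          # number of positions to zero out
    threshold = np.sort(mean_attn, axis=None)[k]  # value of the k-th smallest entry
    return (mean_attn >= threshold).astype(mean_attn.dtype)

def sparse_attention(scores, mask):
    """Apply the fixed global mask before the softmax: pruned positions
    get -inf so they receive zero probability mass."""
    masked = np.where(mask > 0, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

The mask is computed once from data and then reused at inference time, which is where the time and memory savings in the table below come from.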
Results on the SQuAD question answering task:
| Sparseness (%) | Exact / F1 | Time (s) | GPU memory (GB) |
|---|---|---|---|
| 0 | 81.02 / 88.63 | 95.41 | 6.85 |
| 90 | 79.62 / 87.32 | 86.44 | 5.00 |