Chapters 6, 7, 8, and 10 of the Deep Learning book: http://www.deeplearningbook.org/
The original sequence-to-sequence paper: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
One of the original attention papers (Bahdanau et al.): https://arxiv.org/pdf/1409.0473.pdf
For the class last spring, when we covered attention, I used https://talbaumel.github.io/attention/ for its figures and simplified explanations.
I also pointed everyone to Chris Olah's Distill article: https://distill.pub/2016/augmented-rnns/
As for how attention mechanisms actually work, I think I gained the most insight from the pointer networks paper https://arxiv.org/pdf/1506.03134.pdf. For some reason, I had always thought of attention as something closer to a pointer network.
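To make that comparison concrete, here is a minimal numpy sketch of the machinery the two share; all names, dimensions, and weights are illustrative toys, not taken from any of the papers above. Both attention and a pointer network compute a softmax distribution over input positions from the same kind of score; attention uses that distribution to mix encoder states into a context vector, while a pointer network emits the distribution itself as the output.

```python
# A minimal sketch contrasting attention with a pointer network.
# All names and dimensions are illustrative, not from any specific paper.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 5, 8                      # T encoder positions, hidden size d
enc = rng.normal(size=(T, d))    # encoder hidden states h_1..h_T
dec = rng.normal(size=d)         # current decoder state s_t

# Additive (Bahdanau-style) scoring: score_i = v . tanh(W1 h_i + W2 s_t)
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
scores = np.tanh(enc @ W1.T + dec @ W2.T) @ v
weights = softmax(scores)        # a distribution over the T input positions

# Attention: the distribution mixes encoder states into a context vector
# that conditions the prediction of the next output token.
context = weights @ enc

# Pointer network: the same distribution IS the output -- the model "points"
# at an input position instead of emitting a token from a fixed vocabulary.
pointer = int(weights.argmax())
```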
Sections 8.4-8.6 of Jurafsky & Martin's Speech and Language Processing (3rd ed. draft): https://web.stanford.edu/~jurafsky/slp3/8.pdf
Sections 10.4-10.6 and 10.10 of the Deep Learning book's RNN chapter: http://www.deeplearningbook.org/contents/rnn.html
Neural MT tutorial slides from MTMA 2015: http://www.statmt.org/mtma15/uploads/mtma15-neural-mt.pdf
Graham Neubig's tutorial on neural machine translation and sequence-to-sequence models: https://arxiv.org/pdf/1703.01619.pdf
Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: https://www.aclweb.org/anthology/D14-1179
Mikolov et al. (2010), Recurrent Neural Network Based Language Model: http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf