- Architecture
- simple version & Basic idea
- overview
- Encoder, Decoder
- Attention
- Coding: Attention (a minimal sketch follows after this outline)
- comprehensive version
- Encoder-only: BERT
- architecture
- pre-training
- Decoder-only: GPT
- Brief history: what's new, what's different
- pre-training
- tokenizer, embedding, transformer block, lm_head (see the decoder-only sketch after this outline)
- Encoder-Decoder
- Comparison
- BERT vs. GPT
- decoder-only vs. encoder-decoder
- Coding
- Encoder-only: BERT
- simple version & Basic idea
- Training process
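For the "Coding: Attention" item in the outline, a minimal sketch of scaled dot-product attention in PyTorch. The function name, tensor shapes, and toy inputs are assumptions for illustration, not taken from the report's code:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k); mask: optional tensor broadcastable to
    (seq_len, seq_len), with 0 at positions that must not be attended to.
    """
    d_k = q.size(-1)
    # Scale by sqrt(d_k) so the softmax does not saturate for large d_k.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)    # attention distribution over keys
    return torch.matmul(weights, v), weights   # weighted sum of the values

# Toy usage: batch of 2 sequences, length 4, dimension 8.
q = k = v = torch.randn(2, 4, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([2, 4, 8]) torch.Size([2, 4, 4])
```

Running this under autograd also gives the backward pass for free, which is one way to sanity-check a hand-derived attention gradient (see the backpropagation note below).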
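For the GPT component list (tokenizer → embedding → transformer block → lm_head), a toy decoder-only composition. It reuses `nn.TransformerEncoderLayer` with a causal mask to stand in for a GPT block, so the block internals do not match GPT-2 exactly, and all sizes are made-up placeholders:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Toy decoder-only stack: token/position embeddings -> blocks -> lm_head."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding table
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        b, t = ids.shape
        x = self.tok_emb(ids) + self.pos_emb(torch.arange(t, device=ids.device))
        # Causal mask: True marks positions a token is NOT allowed to attend to,
        # which is what turns this encoder stack into decoder-only behaviour.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=ids.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)   # next-token logits, shape (batch, t, vocab_size)

# Toy usage: the tokenizer step is skipped and replaced by random token ids.
ids = torch.randint(0, 1000, (2, 16))
logits = MiniGPT()(ids)
print(logits.shape)  # torch.Size([2, 16, 1000])
```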
- Are the differing results caused by a bug in the algorithm, or just by different random number generation? (see the seeding sketch at the end of these notes)
- Paste the code here, with comments explaining the approach.
- Would a derivation of attention's backpropagation help with understanding the more detailed analysis later on, and if so, how?
- The transformer's basic framework and its architecture are both structural topics; is there really a difference in how they are explained? (If there will be a presentation, maybe cover them together.)
- Remember to include the pre-training workflow.
- Concepts / MLM: more approaches (see the masking sketch at the end of these notes)
- Next Sentence Prediction
- reconstructing corrupted sentences
- Encoder-Decoder: seems like a simple combination of the two
- Pre-training vs. post-training: are there general methods for post-training, but architecture-specific approaches for pre-training?
- Comparison: are there other reasons?
- Didn't read this part closely.
- Couldn't find max-length anywhere in the report.
- Learn more about preprocessing and infrastructure.
- The following section seems to mention this as well; how is it different from the comprehensive treatment of the transformers' basic architecture?
- Remember to look at the other learning resources linked in flu shot learning, and cross-check them against this one.
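On the question above about differing results: one quick check is to fix every random seed and rerun; if the outputs still differ, the discrepancy is in the algorithm rather than in random number generation. A minimal sketch (assumes CPU-only runs; CUDA adds further nondeterminism settings):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Fix the Python, NumPy, and PyTorch RNGs so two runs start identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(0)
a = torch.randn(3)
set_seed(0)
b = torch.randn(3)
print(torch.equal(a, b))  # True: same seed -> same draws, so any remaining
                          # run-to-run difference points at the code itself
```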
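For the MLM note, a rough sketch of BERT-style token masking. The 15% rate and 80/10/10 split follow the BERT paper; the vocabulary size and the [MASK] token id below are made-up placeholders:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption.

    Picks ~15% of positions as prediction targets, then replaces 80% of them
    with [MASK], 10% with a random token, and leaves 10% unchanged.
    Returns (corrupted_ids, labels); labels are -100 at non-target positions
    so cross-entropy ignores them.
    """
    labels = input_ids.clone()
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100

    # 80% of targets -> [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    input_ids[masked] = mask_token_id

    # 10% of targets -> a random token (half of the remaining 20%)
    randomised = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    input_ids[randomised] = torch.randint(vocab_size, labels.shape)[randomised]

    # the final 10% keep their original token, which the model must still predict
    return input_ids, labels

# Toy usage with a hypothetical vocab of 1000 and [MASK] id 103.
ids = torch.randint(5, 1000, (2, 16))
corrupted, labels = mask_tokens(ids.clone(), mask_token_id=103, vocab_size=1000)
```

In the original BERT, Next Sentence Prediction is a second objective trained alongside MLM, implemented as a binary classifier over the [CLS] representation, so it changes the data construction and the head rather than the masking step above.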