Pytorch implementation of Google AI's 2018 BERT on moemen95's Pytorch-Project-Template.
BERT (2018): Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper URL: https://arxiv.org/abs/1810.04805
moemen95's Pytorch-Project-Template has the specific structure shown above. It proposes a baseline for any Pytorch project so that you can focus only on the model implementation, and it provides some examples as well. Follow the link to see what it offers.
This repository is a reconstruction of dhlee347's Pytorchic BERT and codertimo's BERT-pytorch on the Pytorch template. Its purpose is to learn how Pytorch and BERT work, so only pretraining and validation are available in this repository.
To understand BERT, I recommend reading the articles below.
(English)
(Korean)
In the paper, the authors use the masked language model and next sentence prediction tasks for pretraining. Here is a short explanation of those two (copied from codertimo's BERT-pytorch).
Original Paper : 3.3.1 Task #1: Masked LM
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Randomly, 15% of the input tokens are changed, based on the following sub-rules:
- 80% of the selected tokens become the [MASK] token
- 10% of the selected tokens become a [RANDOM] token (another word)
- 10% of the selected tokens remain the same, but still need to be predicted
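To make the 80/10/10 rule above concrete, here is a minimal sketch of the masking step (not this repository's actual code; the helper name and vocabulary list are assumptions for illustration):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Corrupt ~15% of the tokens BERT-style and return (corrupted, targets)."""
    output, targets = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token                      # the model must predict the original token
            dice = random.random()
            if dice < 0.8:
                output[i] = mask_token              # 80%: replace with [MASK]
            elif dice < 0.9:
                output[i] = random.choice(vocab)    # 10%: replace with a random word
            # remaining 10%: keep the token unchanged, but it is still a prediction target
    return output, targets

tokens = "the man went to the store with his dog".split()
vocab = ["the", "man", "went", "store", "dog", "his", "with", "to", "a"]
masked, targets = mask_tokens(tokens, vocab)
```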
Original Paper : 3.3.2 Task #2: Next Sentence Prediction
Input = [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label = Is Next

Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = Not Next

"Can these two sentences be naturally connected?" - understanding the relationship between two text sentences, which is not directly captured by language modeling.
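The pair construction for this task can be sketched roughly as below (a simplified illustration, not this repository's data pipeline; the helper name is hypothetical):

```python
import random

def make_nsp_example(sentences, index):
    """Build one Next Sentence Prediction example from a list of sentences:
    50% of the time pair a sentence with its true successor (label 1, Is Next),
    otherwise with a random sentence from the corpus (label 0, Not Next)."""
    sent_a = sentences[index]
    if random.random() < 0.5 and index + 1 < len(sentences):
        sent_b, is_next = sentences[index + 1], 1      # Is Next
    else:
        sent_b, is_next = random.choice(sentences), 0  # Not Next (this simplified sketch may rarely pick the true successor)
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    return tokens, is_next
```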
Iter (loss=8.964 / NSP_acc=0.302): 100%|███████████████████████████████████████████████| 2746/2746 [36:34<00:00, 1.37it/s]
[INFO]: Epoch 1/50 : Average Loss 16.002 / NSP acc: 0.506
Iter (loss=4.536 / NSP_acc=0.281): 100%|███████████████████████████████████████████████| 2746/2746 [36:28<00:00, 1.37it/s]
[INFO]: Epoch 2/50 : Average Loss 7.178 / NSP acc: 0.526
Iter (loss=3.408 / NSP_acc=0.260): 100%|███████████████████████████████████████████████| 2746/2746 [36:31<00:00, 1.29it/s]
[INFO]: Epoch 3/50 : Average Loss 4.440 / NSP acc: 0.544
In pretraining with a Korean corpus (the Sejong corpus), after 300k iterations with batch size 32, I was able to reach 78% accuracy on the Next Sentence Prediction task, and the average loss went down to 2.643.
With the Korean corpus, the result with batch size 32 is better than with batch size 96; more frequent parameter updates seem to lead closer to the optimum. The pictures below are the loss graphs of the Language Model loss and the Next Sentence Prediction classification loss.
Because of the difference in magnitude of the two loss values, the graphs show that the model learns the NSP task after the language model task.
I'm preparing an English corpus for another experiment.
Basically, your corpus should be prepared with two sentences per line, separated by a tab (\t):
Welcome to the \t the jungle\n
I can stay \t here all night\n
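For reference, reading such a tab-separated corpus can look like the sketch below (the file name is a placeholder, and this is not necessarily how the repository's loader is implemented):

```python
def load_corpus(path):
    """Read a corpus where each line contains two sentences separated by a tab."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            sent_a, sent_b = line.split("\t", 1)   # split on the first tab only
            pairs.append((sent_a.strip(), sent_b.strip()))
    return pairs

pairs = load_corpus("corpus.txt")  # placeholder path
```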
In configs/bert_exp_0.json, you can edit almost all hyper-parameters.
If you are fine with using Byte Pair Encoding, a vocab file will be generated according to your corpus; otherwise, you need to build your own. While the model runs, it does basic text cleaning and tokenizes the corpus with BPE. You will find the BPE model and vocab files in the experiment/bert_exp_0 directory.
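If you do need to build a vocab yourself, a BPE model and vocab file can be produced with the sentencepiece library as sketched below. This is only one possible workflow (file names and vocab size are placeholders), not necessarily the BPE implementation this repository uses:

```python
import sentencepiece as spm

# Train a BPE model on the raw corpus; this writes my_bpe.model and my_bpe.vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # raw text file(s); placeholder name
    model_prefix="my_bpe",   # output prefix for the .model / .vocab files
    vocab_size=30000,        # placeholder size
    model_type="bpe",
)

# Load the trained model and tokenize a sentence.
sp = spm.SentencePieceProcessor(model_file="my_bpe.model")
print(sp.encode("Welcome to the jungle", out_type=str))
```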
Run run.sh.