
Verne Decoder Transformer

This repository contains an implementation of a Transformer decoder in PyTorch, written from scratch. The goal of the project is to generate text in the style of Jules Verne's literary works, following the original Transformer model proposed in the "Attention Is All You Need" paper and its subsequent improvements. It is also an application of what I learned from Andrej Karpathy's latest YouTube series.

 

▶ Usage

To use this code, first clone the repository:

git clone https://github.com/joaoflf/transformer_decoder_pytorch.git
cd transformer_decoder_pytorch

Next, install the dependencies:

pip install -r requirements.txt

 

Training the Model

The train.py script trains the model. It accepts the following command line arguments:

  • --iters: Total iterations to train. Default is 5000.
  • --batch-size: Batch size. Default is 32.
  • --lr: Learning rate. Default is 3e-4.
  • --device: Device to use for training. Default is "cuda" if CUDA is available, otherwise "mps".
  • --checkpoint_dir: Directory to save the model checkpoints. Default is "checkpoints".

Example usage:

python train.py --iters 10000 --batch-size 64 --lr 1e-4 --device cuda --checkpoint_dir my_checkpoints

This will train the model for 10,000 iterations with a batch size of 64 and a learning rate of 1e-4, using a CUDA device. The model checkpoints will be saved in the my_checkpoints directory.
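
For reference, the flag defaults described above can be wired up with argparse roughly as follows. This is only an illustrative sketch of the command-line interface, not necessarily the exact code in train.py:

```python
import argparse

import torch

parser = argparse.ArgumentParser(description="Train the decoder transformer (illustrative sketch)")
parser.add_argument("--iters", type=int, default=5000, help="total training iterations")
parser.add_argument("--batch-size", type=int, default=32)
parser.add_argument("--lr", type=float, default=3e-4)
parser.add_argument(
    "--device",
    default="cuda" if torch.cuda.is_available() else "mps",  # prefer CUDA, fall back to Apple MPS
)
parser.add_argument("--checkpoint_dir", default="checkpoints")
args = parser.parse_args()
```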

 

Generating New Text

The generate.py script generates new text from a trained model. It accepts the following command line arguments:

  • --checkpoint_path: Path to the model checkpoint. This argument is required.
    • You can download the latest trained weights here
  • --num_tokens: Number of tokens to generate. Default is 100.

Example usage:

python generate.py --checkpoint_path my_checkpoints/model_state_10000.pt --num_tokens 500

This will generate 500 new tokens from the model checkpoint at my_checkpoints/model_state_10000.pt.
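
Under the hood, generation for a decoder-only model like this is an autoregressive sampling loop. The helper below is a minimal sketch; the function name and the assumption that the model returns logits of shape (batch, time, vocab) are mine, not the repo's exact API:

```python
import torch

@torch.no_grad()
def sample_tokens(model, idx, num_tokens, block_size):
    """Append `num_tokens` sampled token ids to the context `idx` of shape (B, T)."""
    model.eval()
    for _ in range(num_tokens):
        idx_cond = idx[:, -block_size:]                     # keep only the last block_size tokens
        logits = model(idx_cond)                            # assumed shape: (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)     # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)   # sample one id per batch element
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```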

🏈 Game Plan

  • ✅ Start with a basic bigram model and a simple table-lookup embedding layer (sketch below).

    iterations: 10,000
    batch_size: 32
    | Metric     | Value |
    | ---------- | ----- |
    | Train Loss | 2.57  |
    | Val Loss   | N/A   |

 
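A bigram baseline like this is typically just an embedding table whose rows are read directly as next-token logits. A minimal sketch, with class and attribute names that are illustrative rather than the repo's:

```python
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i of the table holds the logits for the token that follows token i.
        self.token_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss
```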

  • ✅ Add a self-attention block and introduce basic positional embeddings (see the sketch below).

    iterations: 10,000
    batch_size: 32
    block_size: 8
    embed_size: 256
    | Metric     | Value  |
    | ---------- | ------ |
    | Train Loss | 2.4980 |
    | Val Loss   | 2.5421 |

 
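The core of this step is a single causal self-attention head, with a learned position embedding added to the token embedding before attention. A minimal sketch of such a head, using the block_size/embed_size hyperparameters listed above (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, embed_size, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                       # x: (B, T, embed_size)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)     # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled dot-product attention
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                       # (B, T, head_size)

# Positional information is added before attention, e.g.:
#   x = token_embedding(idx) + position_embedding(torch.arange(T, device=idx.device))
```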

  • ✅ Implement multi-head self-attention (sketch below).

    iterations: 10,000
    batch_size: 32
    block_size: 8
    embed_size: 256
    num_heads: 8
    | Metric     | Value |
    | ---------- | ----- |
    | Train Loss | 2.1   |
    | Val Loss   | 2.13  |

 
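Multi-head attention runs several such heads in parallel, each on a smaller head_size = embed_size // num_heads slice, and concatenates their outputs. A sketch that reuses the SelfAttentionHead class from the previous step (illustrative, not the repo's exact module):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, num_heads, block_size):
        super().__init__()
        head_size = embed_size // num_heads
        self.heads = nn.ModuleList(
            [SelfAttentionHead(embed_size, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(embed_size, embed_size)  # mixes information across heads

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)  # back to (B, T, embed_size)
        return self.proj(out)
```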

  • ✅ Add a feed-forward network and stack multiple blocks of multi-head attention (sketch below).

    iterations: 10,000
    batch_size: 32
    block_size: 8
    embed_size: 256
    num_heads: 8
    num_blocks: 4
    | Metric     | Value |
    | ---------- | ----- |
    | Train Loss | 3.13  |
    | Val Loss   | 3.17  |

    *The network is now too deep, which hurts training performance.*

 
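The feed-forward sub-layer is a small position-wise MLP, and at this stage the blocks were simply chained one after another. The sketch below (again with illustrative names, building on the attention sketches above) shows why that hurts: without skip connections, every gradient has to pass through the full depth of the stack.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP applied to every token independently."""

    def __init__(self, embed_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),  # 4x expansion, as in the original paper
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )

    def forward(self, x):
        return self.net(x)

class NaiveBlock(nn.Module):
    """Attention followed by the MLP, with no residuals or LayerNorm yet."""

    def __init__(self, embed_size, num_heads, block_size):
        super().__init__()
        self.attention = MultiHeadAttention(embed_size, num_heads, block_size)
        self.ffwd = FeedForward(embed_size)

    def forward(self, x):
        # Plain composition: stacking several of these gives gradients no shortcut path,
        # which matches the degraded losses reported above.
        return self.ffwd(self.attention(x))
```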

  • ✅ Implement Layer Normalization and residual connections, and scale up the model (sketch below).

     GPU: M1 Pro 10-core
     iterations: 5,000
     batch_size: 64
     block_size: 256
     embed_size: 384
     num_heads: 6
     num_blocks: 6
     dropout: 0.2
    | Metric     | Value |
    | ---------- | ----- |
    | Train Loss | 1.02  |
    | Val Loss   | 1.19  |

     

    Generated Text

    F the fact of this life appeared for its last ten
    to the Northern minutes which formed me a mountain number of our worthy and
    millions that we have made for land known of the Central Sea."
    
    "Well," said the Professor; "it is a depth of extraordinary track,
    their island wood."
    
    "But it is quite getting at Ned Land."
    
    At this moment, I saw the amed horizontal horrible at last would the
    hargonal man. I came to fain the extraordinary and excitement power on
    the other you."
    

 
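The fix in this step is the standard pre-norm block: LayerNorm before each sub-layer and a residual (skip) connection around it, plus dropout for the scaled-up model. A sketch, again building on the illustrative modules above (the exact dropout placement is my assumption):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm decoder block: x + sublayer(norm(x)) for attention and the MLP."""

    def __init__(self, embed_size, num_heads, block_size, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.attention = MultiHeadAttention(embed_size, num_heads, block_size)
        self.ln2 = nn.LayerNorm(embed_size)
        self.ffwd = FeedForward(embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # The residual additions give gradients a direct path through the stack,
        # which is what makes the 6-block, 384-dim configuration trainable.
        x = x + self.dropout(self.attention(self.ln1(x)))
        x = x + self.dropout(self.ffwd(self.ln2(x)))
        return x
```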

  • ✅ Replace the char-level tokenizer with tiktoken ('gpt2') (sketch below).

     GPU: M1 Pro 10-core
     iterations: 5,000
     batch_size: 64
     block_size: 256
     embed_size: 384
     num_heads: 6
     num_blocks: 6
     dropout: 0.2
    | Metric     | Value |
    | ---------- | ----- |
    | Train Loss | 0.128 |
    | Val Loss   | 7.09  |

    The model now overfits, as the training data is too small. With the new tokenizer the vocabulary grows to 50k+ tokens, which increases training time by roughly 4x (~4 it/s -> ~1 it/s on an M1 Pro 10-core). The generated text is now much more coherent and readable.

     

    Generated Text

    "Then," he said, "it is impossible in a contrary, your
    cannot be easy to the weight being about. We must put
    utterly at last observation to the end of this gallery."
    
    "My dear uncle," I ventured mildly to his answer. "Let
    the way to the old--of no means a minute or of the sentence as he did not care answer.
    
    The fartherfied forth in the high seas of the volcano. I looked around. The
    excellent Professor, and did not speak English with
    fancy a most despairing form a dull
    rocks. His telescope began to
    uncle, which his great deal of supper, appeared to be
    a wide thinking of steed--one that we were to
    discovered surrounding us on all sides point.
    
    TheHaving got over this occasion, I sought for it
    my head simply
    eating made from his making the circumstances.
    
    Our stock of my uncle partly confounded towards Hans.
    
    The Icelander gently pressed our departure, and the guide, I began to feel
    a powerful arms. My uncle made no longer moved myface
    ready. I began to think or not.
    

 
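For reference, switching to the GPT-2 BPE tokenizer with tiktoken looks roughly like this (the sample string is just an example):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary of ~50k tokens

ids = enc.encode("Journey to the Centre of the Earth")
print(ids)              # a short list of integer token ids
print(enc.decode(ids))  # round-trips back to the original string
```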
