BabyGPT 🐀

Building on the intuition of Karpathy's ng-video-lectures and minGPT, BabyGPT provides a working GPT at a much smaller scale (16 and 256 out channels, a 5-layer GPT, fine-tuned). BabyGPT grew out of a toyGPT that was written to understand transformers from scratch, and has been scaled down further, as you will see below. Visit the notebooks: they scale up from simple language models and attention mechanisms to transformers and finally BabyGPT. Whereas toyGPT separates out every layer of a transformer individually, BabyGPT implements the attention mechanism manually. The purpose of building such small GPTs is to understand transformer functions at a much more granular level.

The models

To train the small models we use tinystories; you can download the weights from Hugging Face. We set max_iters to 5000 on a Tesla T4. For the original (OG) model we use 256 out channels.

| model | context length | n_layers | n_head | n_embd | train loss | val loss | parameters | data |
|---|---|---|---|---|---|---|---|---|
| 15M | 16 | 4 | 4 | 16 | 2.4633 | 2.4558 | 13k | stories15M.bin |
| 42M | 32 | 8 | 8 | 32 | 2.3772 | 2.3821 | 1.01M | stories42M.bin |
| BabyGPT Original | 64 | 8 | 8 | 256 | 1.3954 | 1.5959 | 6.37M | data |

Note: the 110M model is omitted for now. The RAM blew up..!!

The Roadmap 🚀

If you wish to understand the nitty-gritty of how transformers work from scratch, this roadmap will guide you. We start by implementing simple bigram and ngram language models, then work our way up to transformers, a GPT from scratch, and finally BabyGPT.

(Roadmap diagram)

We also apply a low-rank approximation and a lit-llama port to BabyGPT. Finally, we train the models and generate tokens.

Low rank approximation

Low-rank approximation improves parameter efficiency (it is a compression technique). A LoRa_model.py has been added (on 256 out channels), giving a parameter reduction of roughly 2x. All we need to do is choose a rank parameter and compute the attention projections through the low-rank factors. In the LoRa notebook, FLOPs are estimated following the Chinchilla paper.
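
As a rough illustration of the idea (this is a sketch, not the code in LoRa_model.py), a dense projection can be replaced by two skinny matrices of rank r, so a layer stores r·(in+out) weights instead of in·out. The class name and rank value below are only assumptions for the example:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Factorized replacement for a dense (out x in) weight matrix:
    W is approximated by B @ A with A (rank x in) and B (out x rank)."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_features) / in_features ** 0.5)
        self.B = nn.Parameter(torch.randn(out_features, rank) / rank ** 0.5)

    def forward(self, x):
        # x @ (B @ A)^T computed as two small matmuls
        return x @ self.A.t() @ self.B.t()

# parameter count for a single 256-dim attention projection
dense = 256 * 256            # 65,536 weights
low_rank = 8 * (256 + 256)   # 4,096 weights at rank 8
print(dense, low_rank)
```

The whole-model reduction (~2x) is smaller than this per-layer saving because not every weight matrix is factorized.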

Quantization on low rank approximation

Quantization has also been performed on the LoRA model, and a FLOPs calculation has been added. For the BabyGPT model with 256 out channels we get 0.407 petaFLOPs. The quantization is included in the LoRa notebook. In terms of size, we currently see a reduction by a factor of about 1.3.
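
For reference, the Chinchilla-style training-compute estimate is roughly 6 FLOPs per parameter per token. The numbers below are illustrative, not the exact values used in the notebook:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Chinchilla approximation: total training compute ~ 6 * N * D."""
    return 6 * n_params * n_tokens

# e.g. a ~6.37M-parameter model seen over ~10M tokens (illustrative numbers)
print(f"{training_flops(6.37e6, 1e7):.3e} FLOPs")  # ~3.8e+14
```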

We support LLaMA for BabyGPT ⚡ ⚡

1. LLaMA version 1 🦙

An implementation of the lit-llama model has been ported to BabyGPT (based on LLaMA version 1). You can find the notebook here -> llama_implementation. Run the model with python llama_model_v1.py from the llama folder; MFU has also been added. Training and generating tokens is covered below.

Note: we have ported build_rope_cache(), apply_rope() and RMSNorm() from version 1. We are not using the version 1 weights or checkpoints (those are for much larger models: 7B, 13B, 65B, etc.). You can download the weights and port LLaMA to your own version.
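
For intuition, RMSNorm rescales each vector by the root mean square of its components (no mean subtraction, no bias). The snippet below is a simplified sketch of that operation, not the exact code ported from lit-llama:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm as used in LLaMA (simplified sketch)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # divide by the RMS over the last dimension, then apply a learned gain
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.scale
```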

2. LLaMA 2 🦄

We have ported Meta's LLaMA 2 into BabyGPT. You can find the implementation in llama/llama2.py; run it with python llama2.py from the llama folder. A calculation of FLOPs is provided along with the model.

The FLOPs needed to compute the k, v cache for one token are 2 * 2 * num_layers * embed_dim^2. Find more on how to estimate memory and compute in kipply's blog. MFU has also been added to the LLaMA 2 model.
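
Spelled out as code (the layer count and embedding dimension below are illustrative values):

```python
def kv_cache_flops_per_token(num_layers: int, embed_dim: int) -> int:
    """2 projections (k and v) x 2 FLOPs per multiply-add x embed_dim^2 weights, per layer."""
    return 2 * 2 * num_layers * embed_dim ** 2

print(kv_cache_flops_per_token(8, 256))  # 2,097,152 FLOPs per token
```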

Note: we are not using the original LLaMA weights from Meta, and we use arbitrary values for the 70B configuration. You can port it to your own model using your own weights.

Tokenization

Tokenization is done with sentencepiece. Unlike the tokenizer in the quant folder, here we export a tokenizer.bin. Run python tokenizer.py from the llama folder (meta pieces added); the .bin file can then be used for further inference.
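
A minimal sketch of how the sentencepiece model can be used once trained (the example text is arbitrary):

```python
import sentencepiece as spm

# load the trained model shipped under llama/
sp = spm.SentencePieceProcessor(model_file="llama/tokenizer.model")

ids = sp.encode("My tea's gone cold, I'm wondering why", out_type=int)
print(ids)             # token ids; these are what the .bin export serializes
print(sp.decode(ids))  # round-trips back to the original text
```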

LLaMA with Model FLOPs Utilization (MFU) ⚡

We need efficient memory usage for LLMs. Hardware accelerators are often characterized by Hardware FLOPs Utilization, an estimate of the ratio of FLOPs observed on a given device to its theoretical peak FLOPs, to reason about trade-offs between memory usage and compute. MFU is the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of a system operating at peak FLOPs. The theoretical peak matmul throughput of a Tesla T4 is around 8.1 TFLOPS, so we calculate the MFU of the LLaMA trainer model against that; see the trainer notebook under LLaMA-trainer. We get an MFU of 0.0527723427% on 3.22M parameters. This would of course increase as the number of parameters increases; for a 530B-parameter model, MFU is around 30% on A100 GPUs. We use Section B of the PaLM paper for reference.
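
A back-of-the-envelope version of that calculation, using the PaLM convention of roughly 6 FLOPs per parameter per training token. The throughput number below is illustrative, not the measured value from the notebook:

```python
def model_flops_utilization(tokens_per_sec: float, n_params: float,
                            peak_flops: float = 8.1e12) -> float:
    """Observed training FLOPs per second divided by the accelerator's peak
    (8.1 TFLOPS assumed for a Tesla T4)."""
    observed = tokens_per_sec * 6 * n_params
    return observed / peak_flops

print(f"{model_flops_utilization(220, 3.22e6) * 100:.4f}%")  # illustrative throughput
```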

Quantization

LLMs require many GPUs to run, so we need ways to reduce these requirements while preserving the model's performance. Various techniques have been developed to shrink model size; you may have heard of quantization and distillation. It has been found that instead of 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half precision, which halves the model size.

To go further, 8-bit quantization was introduced. This uses a quarter of the precision, and thus only a quarter of the model size, but it is not done by simply dropping another half of the bits; there is a lot more to this topic. Look at the Hugging Face quantization docs.
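
The size arithmetic is straightforward. Using the parameter count reported for the quantized model below (the gap to the measured sizes comes from layers that are not quantized and from serialization overhead):

```python
n_params = 3_222_637  # parameter count reported in the result block below

print(f"fp32 : {n_params * 4 / 1e6:.2f} MB")  # 4 bytes/weight -> ~12.89 MB
print(f"fp16 : {n_params * 2 / 1e6:.2f} MB")  # 2 bytes/weight -> ~6.45 MB
print(f"int8 : {n_params * 1 / 1e6:.2f} MB")  # 1 byte/weight  -> ~3.22 MB
```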

See quant.md for how to perform LLaMA quantization, and the quantization notebook for a beginner's introduction to quantization. Several benchmarks have been run; for example, on a GPU the 7B-parameter model in bfloat16 takes about 15 GB, while BabyGPT takes only a few megabytes..!!! quantization.py has been adapted from the lit-llama repo. A tokenizer using sentencepiece has been added as well; different kinds of weight operations can be performed from tokenizer.model.

Our result

For post training quantization, we are able to reduce the model size by a factor of almost 4.

```python
import torch

# BabyGPTmodel and config come from the repo's model definition
model_fp = BabyGPTmodel(config)
model_fp.eval()

# dynamically quantize the nn.Linear layers to int8
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp,            # the original model
    {torch.nn.Linear},   # the set of layers to dynamically quantize
    dtype=torch.qint8)
```

```
number of parameters: 3222637
12.9688 MB
3.4603 MB
```

Note: just for quantization we are using a bigger model with about 3.22M parameters.
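
One common way to obtain size numbers like those above is to serialize the state dict and check the file size on disk; the helper below is a sketch of that idea, not necessarily how the notebook measures it:

```python
import os
import torch

def model_size_mb(model: torch.nn.Module, path: str = "tmp_weights.pt") -> float:
    """Save the state_dict to disk and return its size in MB."""
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

# print(f"fp32: {model_size_mb(model_fp):.4f} MB")
# print(f"int8: {model_size_mb(model_int8):.4f} MB")
```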

Performance Benchmark

Performance benchmarking has been done on the BabyGPTmodel and the quantized model. Below are the results.

(Benchmark results chart)

It has been added to the quantization Notebook in the quant folder.

Files

```
BabyGPT
├── bigram_lm.py
├── ngram_lm.py
├── model.py
├── Lora_model.py
├── Llama_model.py
├── Attention
│   ├── dot product attention.py
│   ├── multi headed attention.py
│   ├── cross attention.py
│   ├── spatial attention.py
├── Notebook
│   ├── Dot product attention
│   ├── multiheaded attention
│   ├── gpt from scratch
│   ├── spatial transformer
│   ├── babyGPT
│   ├── LoRa
│   ├── llama_implementation
│   ├── mixed precision
├── Train
│   ├── babygpt_trainer.py
│   ├── llama_trainer.py
├── transformers
│   ├── transformer_model.py
│   ├── babyGPT.py
├── Quant
│   ├── quantization.py
│   ├── quantization notebook
│   ├── tokenizer.model
│   ├── tokenizer.vocab
│   ├── tokenizer.py
│   ├── model.pth
│   ├── quant.md
├── llama
│   ├── llama2.py
│   ├── llama_model_v1.py
│   ├── tokenizer.py
│   ├── tokenizer.vocab
│   ├── tokenizer.bin
│   ├── tokenizer.model
├── text.txt
├── trainer.ipynb
├── requirements.txt
```

Run 🏃

Clone the Repo and run the following:

```
git clone https://github.com/soumyadip1995/BabyGPT.git
```

To run the bigram and ngram language models: python bigram_lm.py and python ngram_lm.py.

To run BabyGPT: python babygpt.py from the transformers folder.

To run a simple transformer model: python transformer_model.py.

To run the low-rank approximation model: python LoRa_model.py.

To run the LLaMA version 1 model: python llama_model_v1.py from the llama folder.

To run LLaMA 2: python llama2.py from the llama folder.

Run the different attention mechanisms from the Attention folder.

Auto Mixed Precision

A very preliminary automatic mixed precision (FP16/FP32) has been added; it requires a CUDA-enabled GPU. A combination of PyTorch's autocast() and GradScaler() is used for mixed precision; see the PyTorch AMP tutorial for more. Unfortunately the GPU blew up during training, and CPU autocast currently only supports bfloat16, so it takes a very long time to train. If anyone can improve upon it, that would be awesome. Check the Mixed Precision Notebook.
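
The training-step pattern looks roughly like this; model, optimizer, get_batch and max_iters stand in for the repo's own objects, so treat it as a sketch of the autocast/GradScaler recipe rather than the notebook's exact code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for step in range(max_iters):
    xb, yb = get_batch("train")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):  # fp16 on CUDA (CPU autocast supports only bfloat16)
        logits, loss = model(xb, yb)
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)         # unscales gradients, then runs optimizer.step()
    scaler.update()                # adjust the scale factor for the next step
```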

Train and Generate 🏃

If you wish to get started with BabyGPT and LLaMA but don't want to go through all the hassle of learning everything about transformer models, you can simply start by running the code from the Train folder.

To train and generate text from both the BabyGPT model and the LLaMA model, run python babygpt_trainer.py and python llama_trainer.py from the Train folder.

Both have been trained on Tesla T4 GPUs. You can increase or decrease max_iters as you wish; training takes a few minutes.

Results 📋

You can see the results from both models in the trainer notebook.

Result from the llama trainer.

Number of params = 3.22 M

```
number of parameters: 3222381
step 0: train loss 4.6894, val loss 4.6895
step 500: train loss 2.1731, val loss 2.1832
step 1000: train loss 1.7580, val loss 1.8032
step 1500: train loss 1.5790, val loss 1.6645
step 2000: train loss 1.4482, val loss 1.5992
step 2500: train loss 1.3538, val loss 1.5874
step 3000: train loss 1.2574, val loss 1.5971
...
step 9000: train loss 0.5236, val loss 2.4614
step 9500: train loss 0.4916, val loss 2.5494
step 10000: train loss 0.4680, val loss 2.6631
step 10500: train loss 0.4448, val loss 2.6970
step 10999: train loss 0.4341, val loss 2.7462
```

```
Detroit, revior myself 'til I confused to get the big clead Mastles
Slaughterhouse on the blue, that's when he pine I'm hop with the cowprinton
robaly I want to a lox on my tempt

But now we can't never find a gift killed broke
Big before anyone could ever hear the first as I was cooped chill
But i this o for a big star
I said get chased up!
(Hello darkness, my old friend)[Eminem:]
If my legacy I acged buving in the tub (might what?)
I would know one [*Barrns, worried :]
Yeah, so kon bitch, it's
```

Seems like the model overfits towards the end (the train loss keeps falling while the validation loss rises), so that will need more work. Spitting some Eminem yo.. :smile:

Data

The data folder contains the text document which has the lyrics to all of Eminem's songs.

Running the notebooks

| Notebook | Description |
|---|---|
| Dot product attention | colab |
| Multi headed attention | colab |
| GPT from scratch | colab (approx. 860k parameters) |
| Spatial Transformers | colab |
| BabyGPT | colab (16, 256 out channels) |
| LoRa | colab (256 out channels) |
| lit-llama for BabyGPT | colab (16 out channels for lit-llama) |
| Trainer for BabyGPT and llama | colab (16 out channels for BabyGPT, 256 out channels for llama) |

text.txt is based on Eminem's Stan.

Acknowledgements

  1. PyTorch GPT tutorial
  2. Karpathy's YouTube videos and tutorials
  3. Karpathy's minGPT
  4. O'Reilly notes for NLP
  5. lit-llama repository
  6. LLaMA from Facebook
  7. Chinchilla paper
  8. Karpathy's nn zero to hero
  9. Lightning AI for low-rank approximation
  10. IST-DASLab for GPTQ

LICENSES

Licenses have been updated to include facebookresearch/llama, lit-llama and IST-DASLab. You can use this project under the GNU, Apache and MIT licenses.

TO DO

  1. look into triton
  2. Fix the readme.md
  3. Inference using libtorch or each separately
  4. look into sentencepiece and tokenizer
  5. look into tinystories
