
LoRA Fine Tuning #82

Draft · wants to merge 11 commits into main

Conversation

danablend

This is just the first draft so we can start building this feature.

  • Added dataloader.py, which loads data for training
  • Added train.py, with the current training loop
  • Added lora.py, a LoRA wrapper for the stage 1 transformer
  • Added a dummy_dataset folder with 25 data samples to use while testing (VCTK, speaker p311)
  • Commented out the initial inference code that runs when the stage 1 model is built

There is no batch processing in the training loop yet (I was getting some dimension mismatches in the KVCache.update function, probably not that difficult to solve).

The dataloader works fine, but everything else needs work. This is just a rough initial draft so we can start working on this together! :-)

Hope to hear some insights! I have time to put into this feature, so any pointers would be great. I'm not super well-versed in AI, so bear with me.

@danablend
Author

I'm going to try getting a very basic thing to run. Currently, there are a couple issues:

  • The LoRA parameter sizes don't match the token embeddings and output layers properly, which causes a tensor dimension mismatch error. Also, I'm not sure these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring out what the rank value should be (rough LoRA sketch at the end of this comment).
  • def _train_infer_stage_one breaks

Can make it clean once we've got a basic loop running successfully.
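For reference, here's the rough shape of the LoRA wrapper I'm aiming for. This is just a sketch in the spirit of nanoGPT-LoRA, not the code in this PR; the rank/alpha values and the idea of wrapping an existing nn.Linear are assumptions, mainly to make the shape bookkeeping explicit:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a low-rank update: y = Wx + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: in_features -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> out_features
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op so initial outputs match the base model
        self.dropout = nn.Dropout(dropout)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(self.dropout(x)))
```

The in/out features of A and B must come from the wrapped layer itself, which is where my current dimension mismatch most likely comes from.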

Contributor

@vatsalaggarwal left a comment

I think the main thing is that it looks like you're trying to finetune the whole network end-to-end from first stage -> second stage -> mbd -> deepfilternet with L2 loss... I would recommend just finetuning the first stage using next-token prediction loss... (ref: https://github.com/karpathy/nanoGPT/blob/master/model.py#L184)...
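Something like this (purely a sketch with toy shapes; the vocab size and tensor names here are placeholders, not the actual model's):

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 2, 16, 2562                 # toy shapes; vocab size is just a placeholder
logits = torch.randn(B, T, vocab_size)          # model output for a (B, T) token sequence
tokens = torch.randint(0, vocab_size, (B, T))   # flattened/interleaved text + audio token ids

# Next-token prediction: position t predicts token t+1 (causal masking makes this valid everywhere).
pred = logits[:, :-1, :].reshape(-1, vocab_size)   # (B*(T-1), vocab_size)
targets = tokens[:, 1:].reshape(-1)                # (B*(T-1),)
loss = F.cross_entropy(pred, targets, ignore_index=-100)  # set ignore_index to the pad id if padding is present
print(loss)
```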

It also looks like you might be trying to finetune through a file load operation, which would kill the gradient (https://github.com/metavoiceio/metavoice-src/pull/82/files?file-filters%5B%5D=.py&file-filters%5B%5D=dotfile&show-viewed-files=true#diff-ed183d67207df065a11e1289f19d34cc2abbc5448dea952683cfe9728c342b95R270)?

For this, you'll need to change your data loader slightly to output only the first two hierarchies of the encodec tokens in a flattened interleaved manner.

Last thing is - why are you trying to finetune? If it's mostly for generalisation to a new speaker, it should be sufficient to finetune only the first stage...

dataloader.py Outdated
wav = wav.mean(axis=0, keepdims=True)

wav = wav.unsqueeze(0) # Add batch dimension
wav = wav.unsqueeze(0) # Add channel dimension
Contributor

This would cause an error if wav.ndim=2?
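Something shape-robust along these lines might be safer (illustrative only; it assumes the waveform comes out of torchaudio.load as (channels, time), or as a 1-D tensor):

```python
import torch

def to_model_shape(wav: torch.Tensor) -> torch.Tensor:
    """Normalise a loaded waveform to (batch=1, channels=1, time)."""
    if wav.ndim == 1:                        # (time,)
        wav = wav.unsqueeze(0)               # -> (1, time)
    if wav.ndim == 2:                        # (channels, time)
        wav = wav.mean(dim=0, keepdim=True)  # downmix to mono -> (1, time)
    return wav.unsqueeze(0)                  # -> (1, 1, time)
```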

dataloader.py Outdated
# Padding for text tokens
text_tokens = pad_sequence(text_tokens, batch_first=True, padding_value=0)

# Audio waveform - padding to longest waveform on 4th dimension
Contributor

Why do we have 4 dimensions? It should be either (batch_size, time) or (batch_size, channels=1, time)?
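For variable-length items, a collate along these lines gives (batch_size, time) directly (sketch only; the dict keys are placeholders, not the PR's):

```python
from torch.nn.utils.rnn import pad_sequence

def collate(batch):
    """batch: list of dicts with 1-D LongTensors 'text_tokens' and 'encodec_tokens'."""
    text = pad_sequence([b["text_tokens"] for b in batch], batch_first=True, padding_value=0)
    audio = pad_sequence([b["encodec_tokens"] for b in batch], batch_first=True, padding_value=0)
    return text, audio   # both (batch_size, time); no extra channel dimensions needed
```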

Author

Since we only need the Encodec tokens for training stage 1, I have removed the audio waveforms from the dataloader.

dataloader.py Outdated
MBD_SAMPLE_RATE = 24000

MAX_DURATION_IN_SECONDS = 15
MAX_INPUT_LENGTH = int(MBD_SAMPLE_RATE * MAX_DURATION_IN_SECONDS)
Contributor

Where is this being used?

Author

danablend commented Mar 6, 2024

Nowhere. I just removed it in the latest commit :-)

guidance_scale=torch.tensor(3.0, device=device, dtype=precision),
end_of_audio_token=9999, # don't end early for compilation stage.
)
# y = generate(
Contributor

Why comment this out? How do you test the model without this?

Author

I'll add this back in, this was for faster initialization on my local setup (slow GPU)

train.py Outdated

print("Setting model to training mode...")
self.model.train() # Set the model to training mode
self.llm_second_stage.model.train() # Set the model to training mode
Contributor

We probably don't need to train this stage at all if you're only adapting speaker identities

Author

Agree. Removed stage 2 in latest commit :-)

@vatsalaggarwal
Contributor

> I'm going to try getting a very basic thing to run. Currently, there are a couple issues:
>
> LoRA parameter sizes don't match properly with the token embeddings and output layers. This causes tensor dimensions mismatch error. Also, not sure if these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring what the rank value should be.
> def _train_infer_stage_one breaks
> Can make it clean once we've got a basic loop running successfully

Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraint of 16GB of VRAM, I would probably:

  • only work with the first stage (this should reduce memory usage significantly, as the diffusion model is ~5GB by itself)
  • swap out the Adam optimiser for an SGD optimiser, just to get things running; this should also reduce memory usage significantly (see the sketch after this list)
  • finetune only the first layer, and ignore the token embeddings / logit output layer
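For the freezing/optimiser part, something like this (a sketch; `model` stands in for the stage-1 transformer with LoRA modules already attached, and the `lora_` naming convention is an assumption):

```python
import torch

# Freeze everything, then re-enable only the LoRA parameters.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "lora_" in name:          # adjust to whatever naming the wrapper actually uses
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # SGD keeps no per-parameter state, unlike Adam
```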

@danablend
Author

> > I'm going to try getting a very basic thing to run. Currently, there are a couple issues:
> > LoRA parameter sizes don't match properly with the token embeddings and output layers. This causes tensor dimensions mismatch error. Also, not sure if these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring what the rank value should be.
> > def _train_infer_stage_one breaks
> > Can make it clean once we've got a basic loop running successfully
>
> Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraints of a 16GB VRAM, I would probably:
>
> • only work with the first stage (this should reduce memory usage significantly as the diffusion model is ~5GB itself)
> • swap out the Adam optimiser with a SGD optimiser (just to get things running - this should reduce memory usage significantly)
> • finetune only the first layer, ignore the token embeddings/logit output layer

Hey @vatsalaggarwal, these are all very helpful insights - much appreciated!

I will implement the points you mentioned:

  • Only work with first stage
  • Modify dataloader to return the first two hierarchies of the encodec tokens
  • Swap over to SGD optimizer to save on memory
  • Only fine tune the first layer of the first stage to adapt to new speakers (the use case)

It makes sense to just work with stage 1 with cross-entropy loss; I just wasn't able to get there on my own, so thanks for that insight.

Should be able to commit these changes later today or tomorrow!

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 5, 2024

> Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.

Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?

@danablend
Author

> > Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.
>
> Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?

It massively helped - I've not worked with transformer networks before (only diffusion & GANs), so studying nanoGPT after you sent it today was very helpful!

Am just making a few more changes for today, then I'm pushing an update :-)

Switched from Adam to SGD optimizer

Modified the DataLoader to return the first two encodec token hierarchies as a flattened interleaved tensor (let me know if that looks ok to you?)

Modified LoRA wrapper to only fine tune speaker_cond_pos layer. In nanoGPT-LoRA only the causal attention layer is fine tuned (https://github.com/danielgrittner/nanoGPT-LoRA/blob/master/model.py#L128). Would it be worth trying something similar?

Modified the training loop to forward-pass entire batches at a time. Loss calculation doesn't work yet; the GT labels need to be matched with the generated probabilities. I need some direction here.
@danablend
Author

danablend commented Mar 6, 2024

Just committed - here's an overview:

  • Switched from Adam to SGD optimizer
  • Modified DataLoader to return first two Encodec token hierarchies as a flattened interleaved tensor (let me know if that looks ok to you?)
  • Modified LoRA wrapper to only fine tune speaker_cond_pos layer
  • Modified training loop
    • Forward pass entire batch at a time
    • Wasn't able to format input_pos as (B, T) due to tensor dimension mismatch in KVCache. Unsure if this causes downstream effects or if that's fine?
    • Loss function call causes error right now. We need to match the GT labels with the model probs. I need some direction here, because I'm not quite sure about what the "unit" of the model's raw output is and what format we need the labels and model outputs to have to correctly calculate the loss. I assumed that the model would output something like a probability distribution of (B, S, V), B=batch_size, S=gen_encodec_tokens_count, V=vocab_size. But it's (B, T, V), with T=text_prompt_token_count. Not sure exactly how to get the data into the right format here for loss calculation, but the remainder of this implementation should be straightforward once this works.

EDIT1: After sleeping on it, I think it is an issue with the way the GT Encodec tokens are extracted and prepared in the DataLoader which is causing the problem, and not the format of the model output.

EDIT2: Was wondering if you have more detailed documentation about the model architecture / diagrams to help further understanding?

@danablend
Author

danablend commented Mar 6, 2024

I might start to understand a little bit here.

Would the idea be to generate a single token at a random location for each batch by giving the model:
(prompt) + (random number of GT encodec tokens) as the input?

So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?

Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].

This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.

Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?

Is that the right intuition?

If so, would the GT Encodec labels be determined by the raw output of the EncodecModel.encode function, or do we need to run it through the first stage adapter to get the correct vocab indices? If you're not sure, I'll see if I can find out.

Again, thanks a lot for pointing me in the right direction, it's very helpful given my limited knowledge!

It almost trains; I have just made a mistake somewhere causing:

RuntimeError: Trying to backward through the graph a second time

I'm guessing it's because we need to do all the preprocessing in the dataloader rather than in the training loop.

Let me know any thoughts. It's getting close :-)
@danablend
Author

danablend commented Mar 6, 2024

I've modified the training loop to be as I described above, and it seems to work (although I'm not 100% sure that the GT Encodec indices are correctly determined).

Somewhere I have made a mistake that causes tensors to be retained in the computational graph across batches, which leads to
"RuntimeError: Trying to backward through the graph a second time" on the 2nd iteration. Maybe you can see where I made this mistake, @vatsalaggarwal?

Let me know any thoughts! :-)

Script can be launched with mixed precision using accelerate now:
accelerate launch train.py --mixed-precision bfloat16 for bfloat16
accelerate launch train.py --mixed-precision float16 for float16

Moved loss calculation to LoRA wrapper model.

Modified training loop to be similar to that of nanoGPT. This involves using a sliding window for prompts & labels, which should more accurately replicate what the model is actually producing at logits level. If my intuition is wrong about this, please correct me.
@danablend
Author

danablend commented Mar 6, 2024

Still some work to do in the dataloader to ensure proper windows (aka blocks in nanoGPT) of X, Y data are prepared for good training. Right now useless data might be used because a randomly selected window can land on padded zero values. This can be fixed in the dataloader, or I might end up creating a get_batch function similar to nanoGPT's.

Haven't been able to solve the error which occurs when calling .backward for the 2nd time:
"RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)."

I'm sure it's a very obvious error I've made, but I have not been able to isolate it.

@danablend
Author

danablend commented Mar 7, 2024

I think I've isolated the backward error (see above) to the caching mechanism in the model. Will work on a solution today and then we should be able to have something training

EDIT: It is definitely the caching mechanism. I just got it to run when clearing and detaching K,V tensors in KVCache between iterations.
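For anyone following along, a toy illustration of the failure mode and the detach workaround (made-up cache tensor, not the real KVCache class; the cleaner fix, as pointed out below, is to not use KV caching during training at all):

```python
import torch

cache = torch.zeros(1, 8)                 # stand-in for a persistent K/V buffer carried across steps
w = torch.randn(8, 8, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    x = torch.randn(1, 8)
    # Without .detach(), the cache keeps a reference into the previous step's graph,
    # and the second backward() raises "Trying to backward through the graph a second time".
    cache = (x @ w + cache).detach()
    loss = (x @ w * cache).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```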

@vatsalaggarwal
Contributor

in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

@danablend
Author

> in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

Completely understand, no pressure! I'm just posting updates for whenever you have the time :-)

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 7, 2024

the updates are super helpful though!

@danablend
Author

danablend commented Mar 7, 2024

I have the model training the LoRA layers now, but the data preparation process is currently garbage, and I'm probably also calculating the loss with a unit mismatch between the prompt input, the inference output, and the GT labels.

Will play with this for a bit, but I might need some help correctly interpreting the "units" of the variables here so we can prepare them correctly for the prompt as well as for the loss function.

Gonna draw some inspiration from nanoGPT and improve the data preparation, and then we're getting close to running some tests.

The model is fine tuning now. Not correctly, but it is fine tuning.

Data must be prepared correctly now.

But first, Attention and KVCache must be modified to be compatible with processing batches where batch size > 1.
Training loop is pretty clean now, and all data preparation is now done in the dataloader.py file.

Loss becomes "nan" when entries in a batch have a lot of variance between each other (eg one entry had to be padded a lot during collation due to big difference in lengths on either prompt or encodec tokens tensors or both).

Issue could perhaps be solved by grouping longer data points together, keeping them of similar length to avoid a lot of padding. Would love to hear any thoughts here.
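One possible approach (just a sketch; the helper is made up): bucket samples of similar length before batching so each batch needs little padding.

```python
import random

def length_bucketed_batches(lengths, batch_size):
    """Group sample indices of similar length so per-batch padding stays small.

    lengths: list of per-sample sequence lengths; returns a shuffled list of index batches.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])           # sort indices by length
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)                                                 # keep some randomness between steps
    return batches
```

Feeding these index batches to the DataLoader via its batch_sampler argument would keep padding per batch small.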
@vatsalaggarwal
Contributor

> For this input (b, t, vocab_size), would b be the 2 predicted hierarchies of encodec tokens and not the batch size?

b is batch size, t is timesteps, vocab_size is vocab_size... the hierarchies are flattened and interleaved into one for the first stage model

> My intuition is wrong here, and the output of the model is (2, S, V) where I would expect (B, 2, S, V)

does my previous comment help?

> I tried this, but it causes dimension mismatch in KVCache and Attention

> I will work on making KVCache and Attention compatible with a (B, S) input_pos, but you know much more about this than me, so maybe you could share your insight on this?

KVCaching isn't relevant during training, that should be switched off...

@vatsalaggarwal
Contributor

> I think I've isolated the backward error (see above) to the caching mechanism in the model.

this shouldn't be used during training

> Haven't been able to solve the error which occurs when calling .backward for the 2nd time

Do you mean for the second iteration? Have you zeroed gradients?

> I might start to understand a little bit here.
>
> Would the idea be to generate a single token at a random location for each batch by giving the model:
> (prompt) + (random number of GT encodec tokens) as the input?
>
> So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?
>
> Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].
>
> This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.
>
> Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?
>
> Is that the right intuition?

Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.

@danablend
Author

> > I think I've isolated the backward error (see above) to the caching mechanism in the model.
>
> this shouldn't be used during training
>
> > Haven't been able to solve the error which occurs when calling .backward for the 2nd time
>
> Do you mean for the second iteration? Have you zeroed gradients?
>
> > I might start to understand a little bit here.
> > Would the idea be to generate a single token at a random location for each batch by giving the model:
> > (prompt) + (random number of GT encodec tokens) as the input?
> > So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?
> > Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].
> > This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.
> > Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?
> > Is that the right intuition?
>
> Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.

Thanks for explaining, that makes sense. I've implemented it, similar to NanoGPT. Pushing momentarily.

Data loading should be more memory optimized, but this runs.

Gonna run some tests to ensure this correctly trains the LoRA. Might wanna test LoRAs on different layers.
Corrected mistake in data preparation. Will start training some LoRAs now and see if this fine tuning code is correctly set up.
@danablend
Author

The LoRA layer (1st layer in the model) has 98k trainable parameters. Will try training now to validate the current code.

Disabled Accelerate for now.

Properly aligned all dtypes between the model and dataloader. Previously loaded speaker embeddings were not converted to the correct dtype.

Copied most of the nanoGPT-LoRA training parameters.

Renamed "epochs" to "iters" like in nanoGPT.
update to further mimic the nanoGPT-LoRA training process
@danablend
Author

danablend commented Mar 11, 2024

Currently sweeping learning rate and LoRA rank, alpha, and dropout. Graphs look like this so far:
[training loss curves screenshot]

I'm unsure whether the data is fed into the model properly, @vatsalaggarwal. I measure a very high loss (8-12) on the frozen foundational model when evaluating it, but the audio sounds just fine. During training of the LoRA layers I can reduce the loss to 4-5, which indicates to me that the LoRAs work but that the data is formatted differently from how it was formatted when the foundational model was trained.

The data is prepared like in nanoGPT but with audio tokens added after the text tokens:

  1. Loop through all text & wav files of the dataset
  2. Tokenize the text with BPE
  3. Extract Encodec tokens from the wav with the pretrained EncodecModel (bandwidth = 6.0 kbps) and keep only the first 2 hierarchies (first 2 indices) of the predicted Encodec tokens. Then add 1024 to every 2nd Encodec token so the hierarchies stay separate once flattened/interleaved, starting from the 2nd index (see the sketch after this list). This is similar to how the predicted tokens came out when I debugged the demo with app.py.
  4. Sequence = <tokenized text + encodec tokens + EOS>, concatenated. Here I have tried EOS = 1024 and EOS = 2048, but it didn't make a big difference in the loss measurements. I also tried inserting EOS between the tokenized text and the encodec tokens, but this also made little difference in the loss evaluation on the foundational model.
  5. Concat the sequence to the full dataset tensor (one huge 1-dimensional tensor with EOS between each data point), like in nanoGPT
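The flatten/interleave step from point 3, as a sketch (the 1024 offset is the one described above; the tensor names are illustrative):

```python
import torch

def flatten_interleave(encodec_tokens: torch.Tensor, offset: int = 1024) -> torch.Tensor:
    """encodec_tokens: (num_hierarchies, T) codebook indices; returns a 1-D interleaved sequence.

    Keeps only the first two hierarchies and offsets the second one so the two
    codebooks don't collide in the shared vocabulary: [h0_0, h1_0+1024, h0_1, h1_1+1024, ...]
    """
    h0, h1 = encodec_tokens[0], encodec_tokens[1] + offset
    return torch.stack([h0, h1], dim=1).reshape(-1)

# toy example: two hierarchies of 4 tokens each
tokens = torch.arange(8).reshape(2, 4)
print(flatten_interleave(tokens))  # tensor([0, 1028, 1, 1029, 2, 1030, 3, 1031])
```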

During Training:

  1. A sliding window of block_size length randomly selects a window from the full dataset tensor, feeds it into the model and performs next-token pred.
  2. Loss is calculated with Cross Entropy on the raw logits, and the target idxs are obtained by shifting the input window +1 to the right on our data to get the sequence that the model is supposed to predict, like in nanoGPT.

*input_pos is always set to torch.arange(0, block_size), since we are sliding a block_size window over the data to get the input
*block_size = 2048

Is this the same way that the foundational model was trained @vatsalaggarwal? I would just expect the loss to start out very low, given its ability to generalize extremely well. Were there any nuances when you trained the model that this doesn't account for?

@vatsalaggarwal
Contributor

I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.

The way you mentioned works for training text-only models, but we had some trouble training text+audio models that way... so the data formats etc. were slightly different; hopefully the upcoming PR will clarify!

@danablend
Author

> I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.
>
> The way you mentioned works for training text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly differently, hopefully upcoming PR should clarify!

That is great to hear! I am very interested to see how the solution looks.

Should quickly be able to follow up with added LoRAs to @lucapericlp's solution once pushed. Can run some sweeps to find us the best configuration as well :-)

Thanks!

@makorihi

> I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.
>
> The way you mentioned works for training text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly differently, hopefully upcoming PR should clarify!

Eagerly awaiting this as well! Was fun reading the progress here too @danablend :)

@vatsalaggarwal
Contributor

@danablend / @makorihi check out #93 (review)

@danablend
Author

> @danablend / @makorihi check out #93 (review)

Have just been reading through it! I'll get it up on my system and add LoRAs to this as soon as possible, hopefully today or tomorrow!

This is great - cheers @lucapericlp @vatsalaggarwal.

@vatsalaggarwal
Contributor

@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..
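Concretely, something like this (a sketch; the pad id of 2048 comes from the data-prep code mentioned further down this thread, the rest is illustrative):

```python
import torch
import torch.nn.functional as F

PAD_ID = 2048  # padding token mentioned later in this thread

def build_row(text_tokens: torch.Tensor, audio_tokens: torch.Tensor, row_len: int) -> torch.Tensor:
    """One training row: text tokens | audio tokens | padding (no arbitrary mid-sequence windows)."""
    row = torch.cat([text_tokens, audio_tokens])
    pad = torch.full((row_len - row.numel(),), PAD_ID, dtype=row.dtype)
    return torch.cat([row, pad])

def next_token_loss(logits: torch.Tensor, rows: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, vocab); rows: (B, T). Padding positions are excluded from the loss."""
    targets = rows[:, 1:].clone()
    targets[targets == PAD_ID] = -100                 # ignore padded targets
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```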

@makorihi

makorihi commented Mar 13, 2024

@vatsalaggarwal

> @danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..

looking at the README in the other PR, I see

audio_files|captions
./data/audio.wav|./data/caption.txt

but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system

ah, sorry, I misunderstood 😅

@danablend
Author

> @vatsalaggarwal
>
> > @danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..
>
> looking at the README in the other PR, I see
>
> audio_files|captions
> ./data/audio.wav|./data/caption.txt
>
> but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system

I believe padding is appended in the data preparation step here. Padding token is 2048 as set here

@danablend
Author

I have gotten LoRAs to train based off of @lucapericlp's awesome work.

Gonna clean it up and prepare it for review. Probably going to happen over the weekend.

@vatsalaggarwal
Contributor

> I have gotten LoRAs to train based off of @lucapericlp's awesome work.
>
> Gonna clean it up and prepare for review. Probably going to happen over the weekend.

NICE! what sort of losses are you getting, and audio?

Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

@danablend
Author

danablend commented Mar 15, 2024

> > I have gotten LoRAs to train based off of @lucapericlp's awesome work.
> > Gonna clean it up and prepare for review. Probably going to happen over the weekend.
>
> NICE! what sort of losses are you getting, and audio?
>
> Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with added LoRA layers yet, so I'm going to get some audio samples once that's there.

I took the LoRA from earlier, which is an adaptation of the one from nanoGPT-LoRA. Would you prefer if I used the PEFT library from HF? I could probably change it without too much hassle; it's not much extra code now that the fine tuning & data loading have been cracked.
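For the loading side, one option is to merge the trained LoRA back into the base weight for inference. Just a sketch using the earlier LoRALinear naming, not code from this PR:

```python
import torch

@torch.no_grad()
def merge_lora(base_weight: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Return W' = W + (alpha/r) * B @ A, so inference needs no extra modules.

    base_weight: (out_features, in_features); lora_a: (r, in_features); lora_b: (out_features, r).
    """
    return base_weight + (alpha / r) * (lora_b @ lora_a)
```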

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 15, 2024

@lucapericlp is the LoRA whiz... what do you think we should do?

@vatsalaggarwal
Contributor

> I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with added LoRA layers yet, so I'm going to get some audio samples once that's there.

That sounds pretty good!! Would just be worried about overfitting, but otherwise excited to hear the samples!!

@lucapericlp
Contributor

Re LoRA suggestions, from my experience, the two most impactful factors when comparing vs a full param finetune were:

  1. adapting all layers
  2. training for longer (increasing batch_size with any newly freed VRAM from decrease in learnable params makes this more feasible)

With those insights, I was able to get to within < 1.0pp of "accuracy" degradation versus the full-parameter finetune on the task I was training on.

Increasing rank does improve performance (we started with 8) but, I found, it has diminishing returns and yields marginal gains versus the other levers. Alpha, in that case, was kept static as per the paper (iirc), and we found that following the heuristic of alpha = 2*rank wasn't particularly helpful in moving the needle.

Might be useful: I had posted this thread, which has articles I found helpful (even though some were published afterwards and replicated my private findings).

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 16, 2024

@lucapericlp we were also wondering if it’s worth integrating with PEFT or is the current way of doing things fine? cf #82 (comment)

@lucapericlp
Contributor

I'd say using PEFT is preferable, purely from the standpoint of reducing the codebase footprint for "standard" ops where possible.
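For reference, the PEFT route would look roughly like this (LoraConfig and get_peft_model are the standard peft entry points; the target_modules names are placeholders and would need to match the actual attention/projection module names in the stage-1 transformer):

```python
from peft import LoraConfig, get_peft_model

# target_modules below are placeholders, not the repo's actual layer names.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["wqkv", "wo"],
)
model = get_peft_model(model, config)   # `model` = the stage-1 transformer, assumed already loaded
model.print_trainable_parameters()      # sanity check: only the LoRA params should be trainable
```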

@vatsalaggarwal
Contributor

@danablend any news?
