
LoRA Fine Tuning #82

Draft · wants to merge 11 commits into main

Conversation

danablend

This is just the first draft so we can start building this feature.

  • Added dataloader.py, which loads data for training
  • Added train.py, with the current training loop
  • Added lora.py, a LoRA wrapper for the stage 1 transformer
  • Added a dummy_dataset folder with 25 data samples to use while testing (VCTK, speaker p311)
  • Commented out the initial inference code that runs when the stage 1 model is built

There is no batch processing in the training loop yet (I was getting some dimension mismatches in the KVCache.update function, probably not that difficult to solve).

The dataloader works fine, but everything else needs work. This is just a rough initial draft so we can start working on this together! :-)

Hope to hear some insights! I have time to put into this feature, so any pointers would be great. I'm not super well-versed in AI, so bear with me.

@danablend
Author

I'm going to try getting a very basic thing to run. Currently, there are a couple issues:

  • The LoRA parameter sizes don't match the token embeddings and output layers properly, which causes a tensor dimension mismatch error. Also, I'm not sure these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring out what the rank value should be (rough LoRA sketch at the end of this comment).
  • def _train_infer_stage_one breaks

Can make it clean once we've got a basic loop running successfully.
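For reference, here's the rough shape of the LoRA wrapper I'm aiming for. This is just a sketch in the spirit of nanoGPT-LoRA, not the code in this PR; the rank/alpha values and the idea of wrapping an existing nn.Linear are assumptions, mainly to make the shape bookkeeping explicit:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a low-rank update: y = Wx + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # A: in_features -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # B: r -> out_features
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op so initial outputs match the base model
        self.dropout = nn.Dropout(dropout)
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(self.dropout(x)))
```

The in/out features of A and B must come from the wrapped layer itself, which is where my current dimension mismatch most likely comes from.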

Contributor

@vatsalaggarwal left a comment

I think the main thing is that it looks like you're trying to finetune the whole network end-to-end from first stage -> second stage -> mbd -> deepfilternet with L2 loss... I would recommend just finetuning the first stage using next-token prediction loss... (ref: https://github.com/karpathy/nanoGPT/blob/master/model.py#L184)...
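Something like this (purely a sketch with toy shapes; the vocab size and tensor names here are placeholders, not the actual model's):

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 2, 16, 2562                 # toy shapes; vocab size is just a placeholder
logits = torch.randn(B, T, vocab_size)          # model output for a (B, T) token sequence
tokens = torch.randint(0, vocab_size, (B, T))   # flattened/interleaved text + audio token ids

# Next-token prediction: position t predicts token t+1 (causal masking makes this valid everywhere).
pred = logits[:, :-1, :].reshape(-1, vocab_size)   # (B*(T-1), vocab_size)
targets = tokens[:, 1:].reshape(-1)                # (B*(T-1),)
loss = F.cross_entropy(pred, targets, ignore_index=-100)  # set ignore_index to the pad id if padding is present
print(loss)
```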

It also looks like you might be trying to finetune through a file load operation, which would kill the gradient (https://github.com/metavoiceio/metavoice-src/pull/82/files?file-filters%5B%5D=.py&file-filters%5B%5D=dotfile&show-viewed-files=true#diff-ed183d67207df065a11e1289f19d34cc2abbc5448dea952683cfe9728c342b95R270)?

For this, you'll need to change your data loader slightly to output only the first two hierarchies of the encodec tokens in a flattened interleaved manner.

Last thing is - why are you trying to finetune? If it's mostly for generalisation to a new speaker, it should be sufficient to finetune only the first stage...

dataloader.py Outdated
wav = wav.mean(axis=0, keepdims=True)

wav = wav.unsqueeze(0) # Add batch dimension
wav = wav.unsqueeze(0) # Add channel dimension
Contributor

This would cause an error if wav.ndim=2?
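Something shape-robust along these lines might be safer (illustrative only; it assumes the waveform comes out of torchaudio.load as (channels, time), or as a 1-D tensor):

```python
import torch

def to_model_shape(wav: torch.Tensor) -> torch.Tensor:
    """Normalise a loaded waveform to (batch=1, channels=1, time)."""
    if wav.ndim == 1:                        # (time,)
        wav = wav.unsqueeze(0)               # -> (1, time)
    if wav.ndim == 2:                        # (channels, time)
        wav = wav.mean(dim=0, keepdim=True)  # downmix to mono -> (1, time)
    return wav.unsqueeze(0)                  # -> (1, 1, time)
```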

dataloader.py Outdated
# Padding for text tokens
text_tokens = pad_sequence(text_tokens, batch_first=True, padding_value=0)

# Audio waveform - padding to longest waveform on 4th dimension
Contributor

Why do we have 4 dimensions? It should be either (batch_size, time) or (batch_size, channels=1, time)?
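For variable-length items, a collate along these lines gives (batch_size, time) directly (sketch only; the dict keys are placeholders, not the PR's):

```python
from torch.nn.utils.rnn import pad_sequence

def collate(batch):
    """batch: list of dicts with 1-D LongTensors 'text_tokens' and 'encodec_tokens'."""
    text = pad_sequence([b["text_tokens"] for b in batch], batch_first=True, padding_value=0)
    audio = pad_sequence([b["encodec_tokens"] for b in batch], batch_first=True, padding_value=0)
    return text, audio   # both (batch_size, time); no extra channel dimensions needed
```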

Author

Since we only need the Encodec tokens for training stage 1, I have removed the audio waveforms from the dataloader.

dataloader.py Outdated
MBD_SAMPLE_RATE = 24000

MAX_DURATION_IN_SECONDS = 15
MAX_INPUT_LENGTH = int(MBD_SAMPLE_RATE * MAX_DURATION_IN_SECONDS)
Contributor

Where is this being used?

Author

danablend commented Mar 6, 2024

Nowhere. I just removed it in the latest commit :-)

guidance_scale=torch.tensor(3.0, device=device, dtype=precision),
end_of_audio_token=9999, # don't end early for compilation stage.
)
# y = generate(
Contributor

Why comment this out? How do you test the model without this?

Author

I'll add this back in, this was for faster initialization on my local setup (slow GPU)

train.py Outdated

print("Setting model to training mode...")
self.model.train() # Set the model to training mode
self.llm_second_stage.model.train() # Set the model to training mode
Contributor

We probably don't need to train this stage at all if you're only adapting speaker identities

Author

Agree. Removed stage 2 in latest commit :-)

@vatsalaggarwal
Contributor

> I'm going to try getting a very basic thing to run. Currently, there are a couple issues:
>
> LoRA parameter sizes don't match properly with the token embeddings and output layers. This causes tensor dimensions mismatch error. Also, not sure if these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring what the rank value should be.
> def _train_infer_stage_one breaks
> Can make it clean once we've got a basic loop running successfully

Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraint of 16GB of VRAM, I would probably:

  • only work with the first stage (this should reduce memory usage significantly, as the diffusion model is ~5GB by itself)
  • swap out the Adam optimiser for an SGD optimiser, just to get things running; this should also reduce memory usage significantly (see the sketch after this list)
  • finetune only the first layer, and ignore the token embeddings / logit output layer
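For the freezing/optimiser part, something like this (a sketch; `model` stands in for the stage-1 transformer with LoRA modules already attached, and the `lora_` naming convention is an assumption):

```python
import torch

# Freeze everything, then re-enable only the LoRA parameters.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "lora_" in name:          # adjust to whatever naming the wrapper actually uses
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # SGD keeps no per-parameter state, unlike Adam
```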

@danablend
Author

> > I'm going to try getting a very basic thing to run. Currently, there are a couple issues:
> > LoRA parameter sizes don't match properly with the token embeddings and output layers. This causes tensor dimensions mismatch error. Also, not sure if these are the two layers we actually want to add LoRAs to. Probably need to test a lot, including figuring what the rank value should be.
> > def _train_infer_stage_one breaks
> > Can make it clean once we've got a basic loop running successfully
>
> Got you. Are you not currently having to load all the models onto GPU VRAM though? Given the constraints of a 16GB VRAM, I would probably:
>
> • only work with the first stage (this should reduce memory usage significantly as the diffusion model is ~5GB itself)
> • swap out the Adam optimiser with a SGD optimiser (just to get things running - this should reduce memory usage significantly)
> • finetune only the first layer, ignore the token embeddings/logit output layer

Hey @vatsalaggarwal, these are all very helpful insights - much appreciated!

I will implement the points you mentioned:

  • Only work with first stage
  • Modify dataloader to return the first two hierarchies of the encodec tokens
  • Swap over to SGD optimizer to save on memory
  • Only fine tune the first layer of the first stage to adapt to new speakers (the use case)

It makes sense to just work with stage 1 with cross-entropy loss; I just wasn't able to get there on my own, so thanks for that insight.

Should be able to commit these changes later today or tomorrow!

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 5, 2024

> Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.

Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?

@danablend
Author

> > Makes sense to just work with stage 1 with cross entropy loss, I just wasn't able to get there - so thanks for that insight.
>
> Not sure what you mean, but does https://github.com/karpathy/nanoGPT/blob/master/model.py#L184 help?

It massively helped - I've not worked with transformer networks before (only diffusion & GANs), so studying nanoGPT after you sent it today was very helpful!

Am just making a few more changes for today, then I'm pushing an update :-)

Switched from Adam to SGD optimizer

Modified the DataLoader to return the first two encodec token hierarchies as a flattened interleaved tensor (let me know if that looks ok to you?)

Modified LoRA wrapper to only fine tune speaker_cond_pos layer. In nanoGPT-LoRA only the causal attention layer is fine tuned (https://github.com/danielgrittner/nanoGPT-LoRA/blob/master/model.py#L128). Would it be worth trying something similar?

Modified the training loop to forward-pass entire batches at a time. Loss calculation doesn't work yet; the GT labels need to be matched with the generated probabilities. I need some direction here.
@danablend
Author

danablend commented Mar 6, 2024

Just committed - here's an overview:

  • Switched from Adam to SGD optimizer
  • Modified DataLoader to return first two Encodec token hierarchies as a flattened interleaved tensor (let me know if that looks ok to you?)
  • Modified LoRA wrapper to only fine tune speaker_cond_pos layer
  • Modified training loop
    • Forward pass entire batch at a time
    • Wasn't able to format input_pos as (B, T) due to tensor dimension mismatch in KVCache. Unsure if this causes downstream effects or if that's fine?
    • Loss function call causes error right now. We need to match the GT labels with the model probs. I need some direction here, because I'm not quite sure about what the "unit" of the model's raw output is and what format we need the labels and model outputs to have to correctly calculate the loss. I assumed that the model would output something like a probability distribution of (B, S, V), B=batch_size, S=gen_encodec_tokens_count, V=vocab_size. But it's (B, T, V), with T=text_prompt_token_count. Not sure exactly how to get the data into the right format here for loss calculation, but the remainder of this implementation should be straightforward once this works.

EDIT1: After sleeping on it, I think it is an issue with the way the GT Encodec tokens are extracted and prepared in the DataLoader which is causing the problem, and not the format of the model output.

EDIT2: Was wondering if you have more detailed documentation about the model architecture / diagrams to help further understanding?

@danablend
Author

danablend commented Mar 6, 2024

I might start to understand a little bit here.

Would the idea be to generate a single token at a random location for each batch by giving the model:
(prompt) + (random number of GT encodec tokens) as the input?

So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?

Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].

This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.

Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?

Is that the right intuition?

If so, would the GT Encodec labels be determined by the raw output of the EncodecModel.encode function, or do we need to run it through the first stage adapter to get the correct vocab indices? If you're not sure, I'll see if I can find out.

Again, thanks a lot for pointing me in the right direction, it's very helpful given my limited knowledge!

It almost trains; I have just made a mistake somewhere causing:

RuntimeError: Trying to backward through the graph a second time

I'm guessing it's because we need to do all the preprocessing in the dataloader rather than in the training loop.

Let me know any thoughts. It's getting close :-)
@danablend
Author

danablend commented Mar 6, 2024

I've modified the training loop to be as I described above, and it seems to work (although I'm not 100% sure that the GT Encodec indices are correctly determined).

Somewhere I have made a mistake that causes tensors to be retained in the computational graph across batches, which leads to
"RuntimeError: Trying to backward through the graph a second time" on the 2nd iteration. Maybe you can see where I made this mistake, @vatsalaggarwal?

Let me know any thoughts! :-)

Script can be launched with mixed precision using accelerate now:
accelerate launch train.py --mixed-precision bfloat16 for bfloat16
accelerate launch train.py --mixed-precision float16 for float16

Moved loss calculation to LoRA wrapper model.

Modified training loop to be similar to that of nanoGPT. This involves using a sliding window for prompts & labels, which should more accurately replicate what the model is actually producing at logits level. If my intuition is wrong about this, please correct me.
@danablend
Author

danablend commented Mar 6, 2024

Still some work to do in the dataloader to ensure proper windows (aka blocks in nanoGPT) of X, Y data are prepared for good training. Right now useless data might be used because a randomly selected window can land on padded zero values. This can be fixed in the dataloader, or I might end up creating a get_batch function similar to nanoGPT's.

Haven't been able to solve the error which occurs when calling .backward for the 2nd time:
"RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)."

I'm sure it's a very obvious error I've made, but I have not been able to isolate it.

@danablend
Author

danablend commented Mar 7, 2024

I think I've isolated the backward error (see above) to the caching mechanism in the model. Will work on a solution today and then we should be able to have something training

EDIT: It is definitely the caching mechanism. I just got it to run when clearing and detaching K,V tensors in KVCache between iterations.
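For anyone following along, a toy illustration of the failure mode and the detach workaround (made-up cache tensor, not the real KVCache class; the cleaner fix, as pointed out below, is to not use KV caching during training at all):

```python
import torch

cache = torch.zeros(1, 8)                 # stand-in for a persistent K/V buffer carried across steps
w = torch.randn(8, 8, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    x = torch.randn(1, 8)
    # Without .detach(), the cache keeps a reference into the previous step's graph,
    # and the second backward() raises "Trying to backward through the graph a second time".
    cache = (x @ w + cache).detach()
    loss = (x @ w * cache).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```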

@vatsalaggarwal
Contributor

in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

@danablend
Author

> in the middle of finishing something, haven't had time to look at this, will do it soon, sorry!

Completely understand, no pressure! I'm just posting updates for whenever you have the time :-)

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 7, 2024

the updates are super helpful though!

@danablend
Author

danablend commented Mar 7, 2024

I have the model training the LoRA layers now, but the data preparation process is currently garbage, and I'm probably also calculating the loss with a unit mismatch between the prompt input, the inference output, and the GT labels.

Will play with this for a bit, but I might need some help correctly interpreting the "units" of the variables here so we can prepare them correctly for the prompt as well as for the loss function.

Gonna draw some inspiration from nanoGPT and improve the data preparation, and then we're getting close to running some tests.

The model is fine tuning now. Not correctly, but it is fine tuning.

Data must be prepared correctly now.

But first, Attention and KVCache must be modified to be compatible with processing batches where batch size > 1.
Training loop is pretty clean now, and all data preparation is now done in the dataloader.py file.

Loss becomes "nan" when entries in a batch have a lot of variance between each other (eg one entry had to be padded a lot during collation due to big difference in lengths on either prompt or encodec tokens tensors or both).

Issue could perhaps be solved by grouping longer data points together, keeping them of similar length to avoid a lot of padding. Would love to hear any thoughts here.
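One possible approach (just a sketch; the helper is made up): bucket samples of similar length before batching so each batch needs little padding.

```python
import random

def length_bucketed_batches(lengths, batch_size):
    """Group sample indices of similar length so per-batch padding stays small.

    lengths: list of per-sample sequence lengths; returns a shuffled list of index batches.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])           # sort indices by length
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)                                                 # keep some randomness between steps
    return batches
```

Feeding these index batches to the DataLoader via its batch_sampler argument would keep padding per batch small.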
@vatsalaggarwal
Contributor

> For this input (b, t, vocab_size), would b be the 2 predicted hierarchies of encodec tokens and not the batch size?

b is batch size, t is timesteps, vocab_size is vocab_size... the hierarchies are flattened and interleaved into one for the first stage model

> My intuition is wrong here, and the output of the model is (2, S, V) where I would expect (B, 2, S, V)

does my previous comment help?

> I tried this, but it causes dimension mismatch in KVCache and Attention

> I will work on making KVCache and Attention compatible with a (B, S) input_pos, but you know much more about this than me, so maybe you could share your insight on this?

KVCaching isn't relevant during training, that should be switched off...

@vatsalaggarwal
Contributor

> I think I've isolated the backward error (see above) to the caching mechanism in the model.

this shouldn't be used during training

> Haven't been able to solve the error which occurs when calling .backward for the 2nd time

Do you mean for the second iteration? Have you zeroed gradients?

> I might start to understand a little bit here.
>
> Would the idea be to generate a single token at a random location for each batch by giving the model:
> (prompt) + (random number of GT encodec tokens) as the input?
>
> So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?
>
> Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].
>
> This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.
>
> Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?
>
> Is that the right intuition?

Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.

@danablend
Author

> > I think I've isolated the backward error (see above) to the caching mechanism in the model.
>
> this shouldn't be used during training
>
> > Haven't been able to solve the error which occurs when calling .backward for the 2nd time
>
> Do you mean for the second iteration? Have you zeroed gradients?
>
> > I might start to understand a little bit here.
> > Would the idea be to generate a single token at a random location for each batch by giving the model:
> > (prompt) + (random number of GT encodec tokens) as the input?
> > So we would concat (B, P) and (B, T_random) prompt and encodec tokens respectively, and then use that to generate (B, V) tensor logits for B tokens per batch?
> > Then input_pos = torch.arange(0, P+T_rand) for a given T_rand ∈ [0, T_max - 1].
> > This would then give us prediction tensor for the batch (B, V), for which we would calculate the loss.
> > Then the GT labels would be (B, 1), where we take (P+T_rand) on the 2nd dimension from the GT encodec tokens (the next GT token in the sequence that the model is supposed to predict)?
> > Is that the right intuition?
>
> Didn't get this... best thing might be to work through the NanoGPT training loop... but rough idea is to create a row of (B, S) which contains all the text and audio tokens concatenated together. Then, you apply next-token pred. This does the right thing because of causal masking.

Thanks for explaining, that makes sense. I've implemented it, similar to NanoGPT. Pushing momentarily.

Data loading should be more memory optimized, but this runs.

Gonna run some tests to ensure this correctly trains the LoRA. Might wanna test LoRAs on different layers.
Corrected mistake in data preparation. Will start training some LoRAs now and see if this fine tuning code is correctly set up.
@danablend
Author

The LoRA layer (1st layer in the model) has 98k trainable parameters. Will try training now to validate the current code.

Disabled Accelerate for now.

Properly aligned all dtypes between the model and dataloader. Previously loaded speaker embeddings were not converted to the correct dtype.

Copied most of the nanoGPT-LoRA training parameters.

Renamed "epochs" to "iters" like in nanoGPT.
update to further mimic the nanoGPT-LoRA training process
@danablend
Author

danablend commented Mar 11, 2024

Currently sweeping learning rate and LoRA rank, alpha, and dropout. Graphs look like this so far:
[training loss curves screenshot]

I'm unsure whether the data is fed into the model properly, @vatsalaggarwal. I measure a very high loss (8-12) on the frozen foundational model when evaluating it, but the audio sounds just fine. During training of the LoRA layers I can reduce the loss to 4-5, which indicates to me that the LoRAs work but that the data is formatted differently from how it was formatted when the foundational model was trained.

The data is prepared like in nanoGPT but with audio tokens added after the text tokens:

  1. Loop through all text & wav files of the dataset
  2. Tokenize the text with BPE
  3. Extract Encodec tokens from the wav with the pretrained EncodecModel (bandwidth = 6.0 kbps) and keep only the first 2 hierarchies (first 2 indices) of the predicted Encodec tokens. Then add 1024 to every 2nd Encodec token so the hierarchies stay separate once flattened/interleaved, starting from the 2nd index (see the sketch after this list). This is similar to how the predicted tokens came out when I debugged the demo with app.py.
  4. Sequence = <tokenized text + encodec tokens + EOS>, concatenated. Here I have tried EOS = 1024 and EOS = 2048, but it didn't make a big difference in the loss measurements. I also tried inserting EOS between the tokenized text and the encodec tokens, but this also made little difference in the loss evaluation on the foundational model.
  5. Concat the sequence to the full dataset tensor (one huge 1-dimensional tensor with EOS between each data point), like in nanoGPT
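The flatten/interleave step from point 3, as a sketch (the 1024 offset is the one described above; the tensor names are illustrative):

```python
import torch

def flatten_interleave(encodec_tokens: torch.Tensor, offset: int = 1024) -> torch.Tensor:
    """encodec_tokens: (num_hierarchies, T) codebook indices; returns a 1-D interleaved sequence.

    Keeps only the first two hierarchies and offsets the second one so the two
    codebooks don't collide in the shared vocabulary: [h0_0, h1_0+1024, h0_1, h1_1+1024, ...]
    """
    h0, h1 = encodec_tokens[0], encodec_tokens[1] + offset
    return torch.stack([h0, h1], dim=1).reshape(-1)

# toy example: two hierarchies of 4 tokens each
tokens = torch.arange(8).reshape(2, 4)
print(flatten_interleave(tokens))  # tensor([0, 1028, 1, 1029, 2, 1030, 3, 1031])
```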

During Training:

  1. A sliding window of block_size length randomly selects a window from the full dataset tensor, feeds it into the model and performs next-token pred.
  2. Loss is calculated with Cross Entropy on the raw logits, and the target idxs are obtained by shifting the input window +1 to the right on our data to get the sequence that the model is supposed to predict, like in nanoGPT.

*input_pos is always set to torch.arange(0, block_size), since we are sliding a block_size window over the data to get the input
*block_size = 2048

Is this the same way that the foundational model was trained @vatsalaggarwal? I would just expect the loss to start out very low, given its ability to generalize extremely well. Were there any nuances when you trained the model that this doesn't account for?

@vatsalaggarwal
Contributor

I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.

The way you mentioned works for training text-only models, but we had some trouble training text+audio models that way... so the data formats etc. were slightly different; hopefully the upcoming PR will clarify!

@danablend
Author

> I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.
>
> The way you mentioned works for training text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly differently, hopefully upcoming PR should clarify!

That is great to hear! I am very interested to see how the solution looks.

Should quickly be able to follow up with added LoRAs to @lucapericlp's solution once pushed. Can run some sweeps to find us the best configuration as well :-)

Thanks!

@makorihi

> I think @lucapericlp is close to turning this into a working solution (without LoRA), so might be better to add LoRA to that... hopefully should be out by EOD.
>
> The way you mentioned works for training text-only model, but we had some trouble training text+audio models that way... so the data formats etc were slightly differently, hopefully upcoming PR should clarify!

Eagerly awaiting this as well! Was fun reading the progress here too @danablend :)

@vatsalaggarwal
Contributor

@danablend / @makorihi check out #93 (review)

@danablend
Author

> @danablend / @makorihi check out #93 (review)

Have just been reading through it! I'll get it up on my system and add LoRAs to this as soon as possible, hopefully today or tomorrow!

This is great - cheers @lucapericlp @vatsalaggarwal.

@vatsalaggarwal
Contributor

@danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..
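Concretely, something like this (a sketch; the pad id of 2048 comes from the data-prep code mentioned further down this thread, the rest is illustrative):

```python
import torch
import torch.nn.functional as F

PAD_ID = 2048  # padding token mentioned later in this thread

def build_row(text_tokens: torch.Tensor, audio_tokens: torch.Tensor, row_len: int) -> torch.Tensor:
    """One training row: text tokens | audio tokens | padding (no arbitrary mid-sequence windows)."""
    row = torch.cat([text_tokens, audio_tokens])
    pad = torch.full((row_len - row.numel(),), PAD_ID, dtype=row.dtype)
    return torch.cat([row, pad])

def next_token_loss(logits: torch.Tensor, rows: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, vocab); rows: (B, T). Padding positions are excluded from the loss."""
    targets = rows[:, 1:].clone()
    targets[targets == PAD_ID] = -100                 # ignore padded targets
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```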

@makorihi

makorihi commented Mar 13, 2024

@vatsalaggarwal

> @danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..

looking at the README in the other PR, I see

audio_files|captions
./data/audio.wav|./data/caption.txt

but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system

ah, sorry, I misunderstood 😅

@danablend
Author

> @vatsalaggarwal
>
> > @danablend I think the main thing to keep in mind is: each row is composed of "text tokens | audio tokens | padding"... I think you were taking arbitrary segments of these (per NanoGPT) as that's what they do in the text-LLM trainings, but it doesn't work so great for text->speech training..
>
> looking at the README in the other PR, I see
>
> audio_files|captions
> ./data/audio.wav|./data/caption.txt
>
> but you mention there's a | padding necessary in the finetune dataset as well? or is that post-ingestion within the system

I believe padding is appended in the data preparation step here. Padding token is 2048 as set here

@danablend
Author

I have gotten LoRAs to train based off of @lucapericlp's awesome work.

Gonna clean it up and prepare it for review. Probably going to happen over the weekend.

@vatsalaggarwal
Contributor

> I have gotten LoRAs to train based off of @lucapericlp's awesome work.
>
> Gonna clean it up and prepare for review. Probably going to happen over the weekend.

NICE! what sort of losses are you getting, and audio?

Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

@danablend
Author

danablend commented Mar 15, 2024

> > I have gotten LoRAs to train based off of @lucapericlp's awesome work.
> > Gonna clean it up and prepare for review. Probably going to happen over the weekend.
>
> NICE! what sort of losses are you getting, and audio?
>
> Did you use https://github.com/huggingface/peft, or did you wrap your own as in this PR in the end?

I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with added LoRA layers yet, so I'm going to get some audio samples once that's there.

I took the LoRA from earlier, which is an adaptation of the one from nanoGPT-LoRA. Would you prefer if I used the PEFT library from HF? I could probably change it without too much hassle; it's not much extra code now that the fine tuning & data loading have been cracked.
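For the loading side, one option is to merge the trained LoRA back into the base weight for inference. Just a sketch using the earlier LoRALinear naming, not code from this PR:

```python
import torch

@torch.no_grad()
def merge_lora(base_weight: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Return W' = W + (alpha/r) * B @ A, so inference needs no extra modules.

    base_weight: (out_features, in_features); lora_a: (r, in_features); lora_b: (out_features, r).
    """
    return base_weight + (alpha / r) * (lora_b @ lora_a)
```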

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 15, 2024

@lucapericlp is the LoRA whiz... what do you think we should do?

@vatsalaggarwal
Contributor

> I've seen losses down at around 0.42 with 15m parameters, just on the last block's attention layer. I haven't actually written the code that loads the model with added LoRA layers yet, so I'm going to get some audio samples once that's there.

That sounds pretty good!! Would just be worried about overfitting, but otherwise excited to hear the samples!!

@lucapericlp
Contributor

Re LoRA suggestions, from my experience, the two most impactful factors when comparing vs a full param finetune were:

  1. adapting all layers
  2. training for longer (increasing batch_size with any newly freed VRAM from decrease in learnable params makes this more feasible)

With those insights, I was able to get to within < 1.0pp of "accuracy" degradation versus the full-parameter finetune on the task I was training on.

Increasing rank does improve performance (we started with 8) but, I found, it has diminishing returns and yields marginal gains versus the other levers. Alpha, in that case, was kept static as per the paper (iirc), and we found that following the heuristic of alpha = 2*rank wasn't particularly helpful in moving the needle.

Might be useful: I had posted this thread, which has articles I found helpful (even though some were published afterwards and replicated my private findings).

@vatsalaggarwal
Contributor

vatsalaggarwal commented Mar 16, 2024

@lucapericlp we were also wondering if it’s worth integrating with PEFT or is the current way of doing things fine? cf #82 (comment)

@lucapericlp
Contributor

I'd say using PEFT is preferable, purely from the standpoint of reducing the codebase footprint for "standard" ops where possible.
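For reference, the PEFT route would look roughly like this (LoraConfig and get_peft_model are the standard peft entry points; the target_modules names are placeholders and would need to match the actual attention/projection module names in the stage-1 transformer):

```python
from peft import LoraConfig, get_peft_model

# target_modules below are placeholders, not the repo's actual layer names.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["wqkv", "wo"],
)
model = get_peft_model(model, config)   # `model` = the stage-1 transformer, assumed already loaded
model.print_trainable_parameters()      # sanity check: only the LoRA params should be trainable
```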

@vatsalaggarwal
Contributor

@danablend any news?
