
Add trainer integration test for llava to ensure accelerate autocasting works correctly #30489

Open

frasermince wants to merge 5 commits into base: main from frasermince/trainer-integration-test
Conversation

frasermince (Contributor)

What does this PR do?

This PR adds a new integration test to ensure that accelerate autocasting works correctly. It came out of a discussion found here; that PR should probably be merged first (or this one merged into it).
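At a high level, the test (per the diff below) loads a quantized llava checkpoint, attaches a LoRA adapter, runs a short Trainer training loop under mixed precision, and then checks the model's outputs. A condensed sketch of the Trainer portion, with illustrative arguments rather than the PR's exact settings (`model` and `train_dataset` stand in for the objects built further down in this thread):

```python
from transformers import Trainer, TrainingArguments

# Sketch only: `model` is the quantized llava + LoRA model and `train_dataset`
# is a small image/prompt dataset; both are sketched later in this thread.
training_args = TrainingArguments(
    output_dir="/tmp/llava_trainer_test",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=2,
    fp16=True,  # mixed precision is what exercises accelerate's autocasting
    report_to="none",
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```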

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@@ -0,0 +1,120 @@
import unittest
frasermince (Contributor, Author)

This file should be the only new change. That will be clearer once the other PR is merged. Usually I would set this up to merge into the first PR, but this feels a bit different since the changes are on a fork.

frasermince force-pushed the frasermince/trainer-integration-test branch from b723f1f to 908ff93 on April 25, 2024 at 17:32

output = model(**inputs)
expected_slice = torch.tensor(
[[-3.5664, -3.5625, -0.4309], [-5.8242, -5.6914, -1.3242], [-5.4805, -5.9375, 1.1465]],
frasermince (Contributor, Author)

I saw this pattern in the llava-next tests. These values came from using this model before training. I'm not quite sure this is correct, so please let me know if there is something else we want to test. Perhaps instead we want the trained model's output before applying the downcasting change?

frasermince (Contributor, Author)

Also note that these values do not yet pass allclose. I wanted to open this PR anyway to generate discussion around what the right thing to test is.
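For reference, the comparison pattern borrowed from the llava-next tests is roughly the sketch below; the slice indices and tolerances are illustrative, and quantized or autocast runs usually need looser tolerances than full-precision ones.

```python
import torch


def check_logits_slice(model, inputs, expected_slice, atol=1e-3, rtol=1e-3):
    """Compare a small slice of the model's logits against recorded reference values.

    `expected_slice` is captured once from a known-good run and hard-coded in the
    test, which is what the tensor literal in this diff is doing.
    """
    with torch.no_grad():
        output = model(**inputs)
    logits_slice = output.logits[0, :3, :3].float().cpu()
    torch.testing.assert_close(logits_slice, expected_slice, atol=atol, rtol=rtol)
```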

"llava-hf/bakLlava-v1-hf", quantization_config=bits_and_bytes_config
)
adapter_name = "lora_default"
peft_config = LoraConfig(
frasermince (Contributor, Author)

I am somewhat unclear on where the slow tests run, but I assume there is some limit on memory, so I tried to give a reasonable LoRA config for this test. If you think there is a simpler or more idiomatic way to do this test, please let me know.
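For context, a memory-conscious setup for this kind of test might look like the sketch below; the rank, target modules, and 4-bit settings are assumptions on my part, not necessarily the exact values in this diff.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

# 4-bit quantization keeps the base model's memory footprint small.
bits_and_bytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/bakLlava-v1-hf", quantization_config=bits_and_bytes_config
)

# A low-rank adapter on the attention projections keeps the number of
# trainable parameters small while still exercising the training path.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config, adapter_name="lora_default")
```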

frasermince force-pushed the frasermince/trainer-integration-test branch from 908ff93 to 1cd13e0 on April 25, 2024 at 17:35
@slow
@require_bitsandbytes
def test_model_trainer_integration_test(self):
def image_prompt_generator():
frasermince (Contributor, Author)

I'm not entirely sure this is the simplest or most idiomatic way to create this test dataset, so please let me know if there is a better way.
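One fairly idiomatic option is `datasets.Dataset.from_generator`; a sketch under the assumption that each example is an image plus a prompt string (the image URL, prompt text, and column names below are placeholders, not necessarily what the PR uses):

```python
import requests
from PIL import Image
from datasets import Dataset


def image_prompt_generator():
    # Placeholder fixture: the COCO "two cats" image used in many llava examples.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    for _ in range(4):
        yield {"image": image, "prompt": "USER: <image>\nDescribe the image. ASSISTANT:"}


train_dataset = Dataset.from_generator(image_prompt_generator)
```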

frasermince force-pushed the frasermince/trainer-integration-test branch 4 times, most recently from 7ce15e3 to 9440dd6 on April 25, 2024 at 18:13
amyeroberts (Collaborator)

cc @muellerzr for first review

frasermince force-pushed the frasermince/trainer-integration-test branch from 9440dd6 to e9e3feb on May 2, 2024 at 21:58
frasermince force-pushed the frasermince/trainer-integration-test branch from e9e3feb to bc31529 on May 2, 2024 at 22:07
frasermince (Contributor, Author)

Updated this now that the previous PR is merged! I am very concerned about OOMs being an issue here, however. I think there are some open questions around:

  1. How we ensure models are compatible with the Trainer and accelerate
  2. How we test training a model in CI, given how memory intensive this can be (one option for keeping memory down is sketched below)
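On the second point, one pattern used elsewhere in the test suite is to train against a tiny, randomly initialized model rather than a full checkpoint, which keeps memory well within CI limits at the cost of not testing real weights. A rough sketch (the sizes are arbitrary, and this is not what the PR currently does):

```python
from transformers import (
    CLIPVisionConfig,
    LlamaConfig,
    LlavaConfig,
    LlavaForConditionalGeneration,
)

# Tiny, randomly initialized llava: enough to exercise the Trainer/accelerate
# code paths without the memory cost of a real checkpoint.
vision_config = CLIPVisionConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    image_size=30,
    patch_size=6,
)
text_config = LlamaConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    vocab_size=32064,  # must be larger than the default image_token_index (32000)
)
config = LlavaConfig(vision_config=vision_config, text_config=text_config)
tiny_model = LlavaForConditionalGeneration(config)
```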

frasermince force-pushed the frasermince/trainer-integration-test branch from 8f35687 to 3c7e7e1 on May 5, 2024 at 19:10