Fix VLM train_on_response_only #60

Open: wants to merge 10 commits into main

Conversation

@coding-famer (Contributor)

Make train_on_response_only compatible with VLM. Fix unslothai/unsloth#1396

For VLMs, tokenization occurs within the data_collator rather than in SFTTrainer._prepare_dataset. To keep the train_on_response_only interface consistent, I modified the trainer's data_collator to handle label construction.
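The general idea, as a minimal sketch (illustrative names only, not the exact code in this PR; ResponseOnlyCollatorSketch and find_response_spans are made-up placeholders):

class ResponseOnlyCollatorSketch:
    def __init__(self, collator, find_response_spans):
        self.collator = collator                          # original (vision) data collator
        self.find_response_spans = find_response_spans    # hypothetical helper returning one (start, end) span per example

    def __call__(self, examples):
        batch = self.collator(examples)                   # for VLMs, tokenization happens here
        labels = batch["input_ids"].clone()               # assumes the wrapped collator returns PyTorch tensors
        labels[:] = -100                                  # mask everything by default
        for i, (start, end) in enumerate(self.find_response_spans(batch["input_ids"])):
            labels[i, start:end] = batch["input_ids"][i, start:end]  # keep only response tokens
        batch["labels"] = labels
        return batch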

@patel-zeel commented Mar 3, 2025

Hi @coding-famer! This PR has most of the changes and will be completed with the following:

  • Since the vision tokenizer also has an image_processor, _find_common_token_ids should take tokenizer.tokenizer instead of tokenizer.
  • Change batch["labels"] = _train_on_responses_only(batch) to batch["labels"] = _train_on_responses_only(batch)["labels"], so we don't create a nested dictionary under the key "labels".

This colab will run end-to-end once the above changes are done.

@coding-famer (Contributor Author)

Hi @coding-famer! This PR has most of the changes and will be completed with the following:

  • Since the vision tokenizer also has an image_processor, _find_common_token_ids should take tokenizer.tokenizer instead of tokenizer.
  • Change batch["labels"] = _train_on_responses_only(batch) to batch["labels"] = _train_on_responses_only(batch)["labels"], so we don't create a nested dictionary under the key "labels".

This colab will run end-to-end once the above changes are done.

Hi @patel-zeel, thanks for your suggestions. Updated to fix point 2. For point 1, I believe the vision tokenizer will behave the same as the text tokenizer when image is None.

@patel-zeel commented Mar 5, 2025

@coding-famer Could you please help me understand why we would pass image=None for this case? In normal cases, the following line would need a check for VLM and need to access the text tokenizer explicitly with tokenizer.tokenizer.

tokenizer = trainer.processing_class if hasattr(trainer, "processing_class") else trainer.tokenizer

There might be bugs in the way I am using this feature, so I'd appreciate it if you could point to the required changes in my colab as well.
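A minimal sketch of what that explicit access could look like (illustrative only; it assumes the VLM processor exposes its text tokenizer as a .tokenizer attribute):

tokenizer = trainer.processing_class if hasattr(trainer, "processing_class") else trainer.tokenizer
# VLM processors bundle an image_processor with a text tokenizer;
# fall back to the object itself for text-only models.
text_tokenizer = tokenizer.tokenizer if hasattr(tokenizer, "tokenizer") else tokenizer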

@coding-famer (Contributor Author)

@coding-famer Could you please help me understand why we would pass image=None for this case? In normal cases, the following line would need a check for VLM and need to access the text tokenizer explicitly with tokenizer.tokenizer.

tokenizer = trainer.processing_class if hasattr(trainer, "processing_class") else trainer.tokenizer

There might be bugs in the way I am using this feature, so I'd appreciate it if you could point to the required changes in my colab as well.

@patel-zeel Sorry, I meant that processor(ids) here will produce the same output as processor.tokenizer(ids), so this modification is unnecessary.
And I think the colab notebook works correctly? Please let me know what error occurs if we don't set tokenizer = tokenizer.tokenizer.

@patel-zeel commented Mar 5, 2025

@coding-famer Oh, I got your point and also found the issue. Actually, the first argument to processor is images, so my colab is running into an error. We need to make sure the text part goes to the text argument, but could using a keyword argument break for some tokenizers?

Edit: I verified that the Llama, Pixtral, Qwen and LLaVA processors all accept images as their first argument (perhaps they kept it that way on purpose so that no one mistakenly passes only text to the processor and wonders why the model is so poor!).
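As a small illustration of the pitfall (a sketch; the checkpoint name is only an example of a VLM processor whose first positional argument is images):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # example checkpoint

# processor("Hello!")  # risky: the string lands in the `images` slot and may error
batch = processor(text="Hello!", return_tensors="pt")  # explicit keyword is unambiguous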

@coding-famer (Contributor Author)

@coding-famer Oh, I got your point and also found the issue. Actually, the first argument to processor is images, so my colab is running into an error. We need to make sure the text part goes to the text argument, but could using a keyword argument break for some tokenizers?

Oh yes, you are right. I incorrectly set tokenizer=processor.tokenizer when I initialized the trainer (though that works around this particular issue, let's make it handle general cases). What do you think is the best way to solve it? Set the tokenizer here for VLMs?

tokenizer = trainer.processing_class if hasattr(trainer, "processing_class") else trainer.tokenizer

@patel-zeel commented Mar 5, 2025

@coding-famer IMO, we can continue with your logic of using has_tokenized to detect a VLM and add an if-else condition that calls processor(text=...) explicitly when a VLM is detected. We can also record what we have discovered so far in code comments to save a few hours for future devs!

If we want to make VLM detection stricter, I used the following logic in my earlier (incorrect) PR:

IS_VISION_MODEL = False
# This approach should support all kinds of models irrespective
# of the depth of the vision layer in the parent module.
for module_name, _ in trainer.model.named_modules():
    if any(name in module_name for name in ["visual", "vision_tower", "vision_model"]):
        IS_VISION_MODEL = True
        break
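
The same heuristic could be packaged as a small helper (a sketch; is_vision_model is not an existing Unsloth function):

def is_vision_model(model) -> bool:
    # Same module-name heuristic as above, wrapped for reuse.
    vision_markers = ("visual", "vision_tower", "vision_model")
    return any(
        any(marker in module_name for marker in vision_markers)
        for module_name, _ in model.named_modules()
    )

# e.g. is_vision_model(trainer.model) -> True for Llava/Qwen2-VL style models, False for text-only LLMs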

@coding-famer (Contributor Author)

@patel-zeel After thinking about this, I hope this PR can handle the general case, where inputs haven't been tokenized when the trainer is initialized, instead of just handling VLMs. So I think my has_tokenized flag is good.

@patel-zeel

Yes, I agree, @coding-famer. After resolving the merge conflicts with @danielhanchen's new edits, this PR will be ready for review.

@patel-zeel

@coding-famer A small edit I thought of: changing the tokenizer in place might create an issue if the full tokenizer is needed later. How about something like the following to make it more explicit and readable?

# Get the text tokenizer from the combined tokenizer
if not has_tokenized:
    text_tokenizer = tokenizer.tokenizer
else:
    text_tokenizer = tokenizer
pass

# Get most common tokens since tokenizers can tokenize stuff differently!
Q_must, Q_left, Q_right = _find_common_token_ids(instruction_part, text_tokenizer)
A_must, A_left, A_right = _find_common_token_ids(response_part,    text_tokenizer)

@patel-zeel

@coding-famer A few more thoughts:

  1. I'm not sure what the following code does, but shall we keep it outside of the if-else condition so it applies to VLMs too?

     # Check if all labels randomly got masked to nothing - maybe wrong chat template?
     from .training_utils import fix_zero_training_loss
     fix_zero_training_loss(None, tokenizer, trainer.train_dataset)

  2. I was thinking about what would happen if someone calls train_on_response_only multiple times (say, executing the same cell of a Jupyter notebook multiple times). Will it wrap one more time? I was thinking of something like this to avoid it:
class CustomDataCollator:
    def __init__(self, collator, modifier_fn):
        self.collator = collator
        self.modifier_fn = modifier_fn

    def __call__(self, examples):
        batch = self.collator(examples)
        batch["labels"] = self.modifier_fn(batch)["labels"]
        return batch

from unsloth_zoo.vision_utils import UnslothVisionDataCollator
import warnings

if not hasattr(trainer, "data_collator"):
    warnings.warn(
        "Vision model detected, but data_collator is not set. Setting it to UnslothVisionDataCollator."
    )
    trainer.data_collator = UnslothVisionDataCollator(model=trainer.model, processor=tokenizer)
if not isinstance(trainer.data_collator, CustomDataCollator):
    trainer.data_collator = CustomDataCollator(trainer.data_collator, _train_on_responses_only)

pass

@coding-famer (Contributor Author)

@patel-zeel Good points! For point 1, I'm not sure either; let me have a look at the logic of that function. I'll add point 2.

Inline review thread on these lines of the diff:

    if not isinstance(trainer.data_collator, DataCollatorForSeq2Seq):
        trainer.data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)
    from unsloth_zoo.vision_utils import UnslothVisionDataCollator
    if hasattr(trainer, "data_collator"):

@coding-famer (Contributor Author)

Do we need to ensure the trainer has a data_collator?

@patel-zeel

If people follow the Unsloth notebooks, it's unlikely to be missed, but I thought it wouldn't hurt existing functionality. Also, it attaches the collator with a warning. In case an advanced user does not want this behavior, they can modify it after applying train_on_response_only.

This was mainly inspired by the recent DataCollatorForSeq2Seq change a few lines above.

@coding-famer (Contributor Author) commented Mar 10, 2025

How about leaving this for @danielhanchen to review?
To be clear, I didn't add your

if not hasattr(trainer, "data_collator"):
    warnings.warn(
        "Vision model detected, but data_collator is not set. Setting it to UnslothVisionDataCollator."
    )
    trainer.data_collator = UnslothVisionDataCollator(model=trainer.model, processor=tokenizer)

part. Also, the recent DataCollatorForSeq2Seq change doesn't ensure the trainer has a data_collator either.
It would be good to add this, but the old logic doesn't do it (or maybe it's better to do it outside train_on_response_only, since it's not really related).

@patel-zeel

Yes, I agree. It's not directly related and not having it is not hurting anything.

@patel-zeel

@coding-famer This looks mostly good, with one caveat: if someone changes instruction_part and response_part and reapplies the function, it'll silently execute and have no effect, because we will simply see that trainer.data_collator already has a collator and keep the same object. One possible solution is:

if hasattr(trainer, "data_collator"):
    # If `UnslothResponseOnlyCollator` is found, extract internal collator to avoid double wrapping
    if hasattr(trainer.data_collator, "collator"):
        trainer.data_collator = trainer.data_collator.collator
    trainer.data_collator = UnslothResponseOnlyCollator(trainer.data_collator, _train_on_responses_only)
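
A tiny self-contained check of that behavior, with toy collator and modifier functions (purely illustrative):

class UnslothResponseOnlyCollator:
    def __init__(self, collator, modifier_fn):
        self.collator = collator
        self.modifier_fn = modifier_fn
    def __call__(self, examples):
        batch = self.collator(examples)
        batch["labels"] = self.modifier_fn(batch)["labels"]
        return batch

base_collator = lambda examples: {"input_ids": examples, "labels": examples}
mask_v1 = lambda batch: {"labels": "masked with parts v1"}
mask_v2 = lambda batch: {"labels": "masked with parts v2"}

collator = UnslothResponseOnlyCollator(base_collator, mask_v1)
# Reapply with different instruction/response parts: unwrap first, then rewrap.
if hasattr(collator, "collator"):
    collator = collator.collator
collator = UnslothResponseOnlyCollator(collator, mask_v2)
assert collator.collator is base_collator and collator.modifier_fn is mask_v2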

@patel-zeel commented Mar 10, 2025

Amazing @coding-famer!

Hi @danielhanchen, this PR is ready for review. This colab is for testing this feature. This is one of the issues given in the Unsloth puzzles colab.

@mikeknapp

@coding-famer @patel-zeel I believe we can get text-only and image examples to work correctly if we do something like this:

class UnslothResponseOnlyCollator:
    def __init__(self, collator, modifier_fn):
        self.collator = collator
        self.modifier_fn = modifier_fn

    def __call__(self, examples):
        image_examples = []
        text_only_examples = []
        for example in examples:
            msgs = example.get("messages", [])
            if any(msg.get("type") == "image" for msg in msgs):
                image_examples.append(example)
            else:
                text_only_examples.append(example)
        batch = {}
        if len(image_examples) > 0:
            image_batch = self.collator(image_examples)
            image_batch["labels"] = self.modifier_fn(image_batch)["labels"]
            batch = image_batch
        if len(text_only_examples) > 0:
            text_batch = self.collator(text_only_examples)
            text_batch["labels"] = self.modifier_fn(text_batch)["labels"]
            batch = {**batch, **text_batch}
        return batch

This probably just needs some massaging to deal with the case where people have a 2 column dataset, with the second column being an image.

@coding-famer (Contributor Author)

For me, I think handling a mixture of text-only and text-image examples should be done in a new data collator, while this PR is trying to handle the train_on_response_only case (imagine someone wants to train on a mixture of text and text-image data but doesn't want to train on responses only). You can open a new PR and I'm willing to contribute to it, too.

@danielhanchen (Contributor)

Oh my wait - apologies I totally missed this

@danielhanchen (Contributor)

I think I actually did this accidentally in parallel during the Gemma 3 release: https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/vision_utils.py#L259

@danielhanchen (Contributor)

I can accept the PR and provide the bounty for the parts which are not in the current code base - the extra checks, names, etc.

Successfully merging this pull request may close this issue: Train on completions only by fixing the collator inquiry.