
Error in DataCollatorSpeechSeq2SeqWithPadding (Unit 5) #85

Open
tonywu71 opened this issue Jul 9, 2023 · 3 comments

Comments

tonywu71 commented Jul 9, 2023

In Unit 5 of the audio course, the following code is used:

import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if the bos token was prepended in the previous tokenization step,
        # cut it here as it's prepended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

However, according to a related issue (cc @ArthurZucker), bos_token_id shouldn't be used here. In my opinion, the check should compare against self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>") or against model.config.decoder_start_token_id instead. What do you think?
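
For illustration, here is a minimal sketch of the check I have in mind (not the course's code; it assumes a Whisper processor, and the model variable is only needed for the config-based variant):

# Sketch of the proposed replacement for the bos_token_id check
# inside DataCollatorSpeechSeq2SeqWithPadding.__call__:
decoder_start_token_id = self.processor.tokenizer.convert_tokens_to_ids(
    "<|startoftranscript|>"
)
# or, if the model is in scope:
# decoder_start_token_id = model.config.decoder_start_token_id

# if the start-of-transcript token was already prepended during tokenization,
# cut it here, since forward() prepends it again when shifting the labels
if (labels[:, 0] == decoder_start_token_id).all().cpu().item():
    labels = labels[:, 1:]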

Note that if this is true, there would be a similar error in @sanchit-gandhi's fine-tuning tutorial too.

Thanks for your attention.

Regards,
Tony

@ArthurZucker

If self.processor.tokenizer.bos_token_id is correctly set, then I don't really see a problem. (It should not be used in the sense that it is not actually used: if forced_decoder_ids is set, it will be taken instead of this one.)

@tonywu71

If you look at the issue I mentioned, it seems that you stated the opposite. Here's what you wrote for context:

I'll update the documentation to make it less confusing. The token used to store the "<|startoftranscript|>" token is decoder_start_token_id. The bos_token is pretty much unused, which is why it was set to the same as eos_token.

I might be wrong, but forced_decoder_ids is only used by generate, not by forward. And I checked the forward source code: with the current collator, batch["labels"] would always start with <|startoftranscript|>. However, when forward is called with the labels argument and decoder_input_ids is None, it builds the decoder inputs by shifting the labels and prepending decoder_start_token_id, so a redundant <|startoftranscript|> would end up at the start of the decoder inputs (see source code). Let me know if you think I'm wrong.
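
To make this concrete, here is a quick sanity check one could run (a sketch; it assumes the openai/whisper-small checkpoint rather than the exact one used in the course):

import torch
from transformers import WhisperConfig, WhisperTokenizer
from transformers.models.whisper.modeling_whisper import shift_tokens_right

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
config = WhisperConfig.from_pretrained("openai/whisper-small")

# bos_token_id is set to the same id as eos_token_id ("<|endoftext|>"),
# while the real start-of-transcript id lives in decoder_start_token_id.
print(tokenizer.bos_token_id, tokenizer.eos_token_id)            # 50257 50257
print(config.decoder_start_token_id)                             # 50258
print(tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))  # 50258

# Toy label sequence that already begins with "<|startoftranscript|>":
labels = torch.tensor([[50258, 7, 8, 9, tokenizer.eos_token_id]])

# forward() builds decoder_input_ids from the labels this way when
# decoder_input_ids is None, which duplicates the start token:
decoder_input_ids = shift_tokens_right(
    labels, config.pad_token_id, config.decoder_start_token_id
)
print(decoder_input_ids)  # tensor([[50258, 50258, 7, 8, 9]])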

@MKhalusova

cc @sanchit-gandhi
