
Check for tokenizer_config.json before downloading tokenizer from HF Hub #834

Closed. AmgadHasan wanted to merge 3 commits.


AmgadHasan

Check for a tokenizer_config.json file in the model directory. If present, infer the Whisper model type from the tokenizer config and download the matching tokenizer for that model type. This change adds support for inferring whisper-large-v3 from the tokenizer config.

Previously, if the model directory had no tokenizer.json file but did have a tokenizer_config.json file, the loader ignored the config and automatically downloaded the tokenizer from whisper-tiny.en or whisper-tiny. This caused issues for models derived from whisper-large-v3, whose tokenizer differs from the tokenizer of those two models. For example, the model would translate even when the user set the task to "transcribe", because the task tokens have different token ids in large-v3.
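The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `pick_tokenizer_source` is hypothetical, and the large-v3 check (looking for a `<|yue|>` language token under `added_tokens_decoder`) is an assumed heuristic about how the config is laid out, not a documented rule.

```python
import json
import os


def pick_tokenizer_source(model_dir, is_multilingual):
    """Return ("file", path) or ("hub", repo_id) for the tokenizer to load.

    Hypothetical helper: it only decides where the tokenizer should come
    from and does not download or load anything.
    """
    # 1. A local tokenizer.json always wins.
    tokenizer_file = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(tokenizer_file):
        return ("file", tokenizer_file)

    # 2. Otherwise, try to infer the model family from tokenizer_config.json.
    config_file = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.isfile(config_file):
        with open(config_file, encoding="utf-8") as f:
            config = json.load(f)
        # Assumption: a large-v3 config lists the "<|yue|>" language token
        # among its added tokens, while older configs do not.
        added = config.get("added_tokens_decoder", {})
        entries = added.values() if isinstance(added, dict) else []
        contents = {e.get("content") for e in entries if isinstance(e, dict)}
        if "<|yue|>" in contents:
            return ("hub", "openai/whisper-large-v3")

    # 3. Previous behaviour: fall back to the tiny tokenizers.
    suffix = "" if is_multilingual else ".en"
    return ("hub", "openai/whisper-tiny" + suffix)
```

With this ordering, a directory that has only a tokenizer_config.json no longer falls straight through to the whisper-tiny tokenizer, which is the case the description identifies as the source of the wrong task token ids.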

…F Hub

Fix space-after-comma formatting issues
Add Dockerfile example (#828)
Collaborator

trungkienbkhn commented May 13, 2024

@AmgadHasan, hello. Currently, most converted FW models have a tokenizer.json file in the model path and do not have a tokenizer_config.json file, so I believe your case is not very common. If your large-v3 model is missing the tokenizer.json file, I think adding this file to the model path is simpler than changing the FW code to handle that case (which would also require adding the tokenizer_config.json file).
For the conversion command, you can add the option --copy_files tokenizer.json to include the tokenizer file in the model path.
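As a rough sketch of that suggestion, the snippet below assembles a `ct2-transformers-converter` invocation that passes `--copy_files tokenizer.json`. The helper name `build_convert_command`, the model id, and the output directory are illustrative assumptions; the command is only built and printed here, not executed.

```python
import shlex


def build_convert_command(model, output_dir, copy_files=("tokenizer.json",)):
    """Build the argv list for a conversion that also copies tokenizer files.

    Hypothetical helper for illustration only.
    """
    cmd = ["ct2-transformers-converter", "--model", model, "--output_dir", output_dir]
    if copy_files:
        # --copy_files accepts one or more file names to copy into output_dir.
        cmd += ["--copy_files", *copy_files]
    return cmd


# Illustrative values; substitute your own model and output directory.
command = build_convert_command("openai/whisper-large-v3", "whisper-large-v3-ct2")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once CTranslate2 and transformers are installed.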

@AmgadHasan AmgadHasan closed this by deleting the head repository May 29, 2024