Check for tokenizer_config.json before downloading tokenizer from HF Hub #834
Check for a tokenizer_config.json file in the model directory. If present, infer the Whisper model type from the tokenizer config and download the correct tokenizer for that model type. This change is made to support inferring whisper-large-v3 from the tokenizer config.

Previously, if there was no tokenizer.json file but there was a tokenizer_config.json file, the loader ignored the config and automatically downloaded the tokenizer from whisper-tiny.en or whisper-tiny. This caused issues for models derived from whisper-large-v3, which has a different tokenizer from these two models. For example, the model would translate even when the user specified the task as "transcribe", because the task tokens have different token IDs in large-v3.
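
The lookup order described here can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `pick_tokenizer_source`, the use of a `name_or_path` key in tokenizer_config.json, and the `openai/` repository prefix are all assumptions made for the example. The actual field used to infer the model type may differ.

```python
import json
import os
from typing import Tuple


def pick_tokenizer_source(model_dir: str, is_multilingual: bool) -> Tuple[str, str]:
    """Decide where the tokenizer should come from (hypothetical helper).

    Returns ("file", path) for a local tokenizer.json,
    or ("hub", repo_id) for a tokenizer to download.
    """
    # 1. A local tokenizer.json always wins.
    tokenizer_file = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(tokenizer_file):
        return ("file", tokenizer_file)

    # 2. Otherwise, try to infer the model type from tokenizer_config.json.
    #    Assumption: the config records the source model under "name_or_path".
    config_file = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.isfile(config_file):
        with open(config_file, encoding="utf-8") as f:
            config = json.load(f)
        name = config.get("name_or_path", "")
        if isinstance(name, str) and "whisper-" in name:
            return ("hub", name)

    # 3. Fall back to the previous behaviour: the tiny tokenizer.
    suffix = "" if is_multilingual else ".en"
    return ("hub", "openai/whisper-tiny" + suffix)
```

With this ordering, a directory that holds only a tokenizer_config.json pointing at a large-v3 model resolves to that model's tokenizer instead of silently falling back to whisper-tiny.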