Check for `tokenizer_config.json` before downloading tokenizer from HF Hub #834

AmgadHasan · 2024-05-13T09:30:23Z

Check for a `tokenizer_config.json` file in the model directory. If present, infer the Whisper model type from the tokenizer config and download the matching tokenizer for that model type. This change is made to support inferring whisper-large-v3 from the tokenizer config.

Previously, if there was no `tokenizer.json` file but there was a `tokenizer_config.json` file, the loader ignored the latter and automatically downloaded the tokenizer from whisper-tiny.en or whisper-tiny. This caused issues for models derived from whisper-large-v3, which has a different tokenizer from these two models. For example, the model would translate even when the user specified the task as "transcribe", because the task tokens have different token ids in large-v3.
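For readers who want the lookup order spelled out, here is a minimal sketch of the behaviour described above. It is not the code from this PR: the function name `resolve_tokenizer_source` is hypothetical, and detecting large-v3 by looking for a `<|yue|>` entry under `added_tokens_decoder` is an assumed heuristic, not something the description confirms.

```python
import json
import os


def resolve_tokenizer_source(model_dir, multilingual=True):
    """Return ("local", path) or ("hub", repo_id) for the tokenizer to load.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    # 1. A local tokenizer.json always wins.
    tokenizer_file = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(tokenizer_file):
        return ("local", tokenizer_file)

    # 2. Otherwise, try to infer the model family from tokenizer_config.json.
    config_file = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.isfile(config_file):
        with open(config_file, encoding="utf-8") as f:
            config = json.load(f)
        added = config.get("added_tokens_decoder", {})
        contents = set()
        if isinstance(added, dict):
            contents = {
                entry.get("content")
                for entry in added.values()
                if isinstance(entry, dict)
            }
        # ASSUMPTION: a "<|yue|>" language token marks a large-v3 tokenizer.
        if "<|yue|>" in contents:
            return ("hub", "openai/whisper-large-v3")

    # 3. Fall back to the previous behaviour described above.
    return ("hub", "openai/whisper-tiny" + ("" if multilingual else ".en"))
```

With this ordering, a directory that has only a `tokenizer_config.json` derived from large-v3 would no longer fall through to the whisper-tiny tokenizer.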


trungkienbkhn · 2024-05-13T10:41:19Z

@AmgadHasan, hello. Currently, most FW conversion models have a `tokenizer.json` file in the model path and no `tokenizer_config.json` file, so I believe your case is not very common. If your large-v3 model is missing the `tokenizer.json` file, I think adding this file to the model path is simpler than changing the FW code to handle it (which would also require adding the `tokenizer_config.json` file).
For the conversion command, you can add the option `--copy_files tokenizer.json` to include the tokenizer file in the model path.
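As a rough illustration of that suggestion, the sketch below assembles a conversion command that passes `--copy_files tokenizer.json`. The helper `build_conversion_command` is hypothetical, and the `ct2-transformers-converter` entry point with its `--model`, `--output_dir` and `--quantization` flags is assumed from the CTranslate2 CLI rather than taken from this thread; the model name and output directory are placeholders.

```python
import shlex


def build_conversion_command(model, output_dir,
                             copy_files=("tokenizer.json",),
                             quantization=None):
    """Assemble the argv list for a CTranslate2 conversion (hypothetical helper)."""
    cmd = ["ct2-transformers-converter", "--model", model, "--output_dir", output_dir]
    if copy_files:
        # Files listed here are copied next to the converted model.
        cmd += ["--copy_files", *copy_files]
    if quantization:
        cmd += ["--quantization", quantization]
    return cmd


# Placeholder model name and output directory, for illustration only.
cmd = build_conversion_command("openai/whisper-large-v3", "whisper-large-v3-ct2")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the converter is installed; building it separately keeps the example runnable without downloading a model.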

AmgadHasan added 3 commits May 13, 2024 12:26

Fix formatting issues

06496bf

Fix space-after-comma formatting issues

Merge pull request #1 from SYSTRAN/master

3844b30

Add Dockerfile example (#828)

AmgadHasan mentioned this pull request May 13, 2024

Faster whisper loads the wrong tokenizer for whisper-large-v3 derivatives #835

Closed

AmgadHasan closed this by deleting the head repository May 29, 2024
