Improve Tokenizer New Type Onboarding #1536

Open
zhenyan-zhang-meta opened this issue Apr 28, 2025 · 3 comments · May be fixed by #1540
Assignees
Labels
actionable Items in the backlog waiting for an appropriate impl/fix · good first issue Good for newcomers · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zhenyan-zhang-meta
Contributor

zhenyan-zhang-meta commented Apr 28, 2025

🚀 The feature, motivation and pitch


As a sequel to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we need to further improve it to simplify onboarding of new tokenizer types:

Tasks


  • Move TokenizerType to a centralized place
  • Check all getters of tokenizer types
  • Add documentation for future tokenizer onboarding.
    • We may need to point people to update the model validation logic:
```python
def validate_model(
    self,
    model: Optional[Model],
    model_description: str = "model",
) -> None:
    if model is None:
        return

    if self.tokenizer_type == TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

    is_tiktoken = self.is_tiktoken()
    is_sentencepiece = self.is_sentencepiece()
    is_hf_tokenizer = self.is_hf_tokenizer()

    use_tiktoken = model.config.use_tiktoken
    use_hf_tokenizer = model.config.use_hf_tokenizer
    use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

    if (
        (is_tiktoken and not use_tiktoken) or
        (is_hf_tokenizer and not use_hf_tokenizer) or
        (is_sentencepiece and not use_sentencepiece)
    ):
        raise RuntimeError(
            "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                model_description,
            )
        )

    return
```

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code
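
As a starting point for the "centralized `TokenizerType`" task, one possible shape is an enum that owns its own type-check helpers, so onboarding a new tokenizer means adding one member plus one getter in a single file. This is a hypothetical sketch, not the actual torchchat API; the member names mirror the checks used in `validate_model` above.

```python
# Hypothetical sketch of a centralized TokenizerType enum.
# Member and method names are illustrative assumptions, not torchchat's API.
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()

    def is_tiktoken(self) -> bool:
        return self is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self is TokenizerType.HF_TOKENIZER


# Onboarding a new tokenizer type would then only require a new enum
# member and its getter here, leaving callers like validate_model and
# TokenizerArgs.__post_init__ untouched.
print(TokenizerType.TIKTOKEN.is_tiktoken())  # prints True
```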

cc @Jack-Khuu @byjlw

@srikary12

Would like to take this up.

@zhenyan-zhang-meta zhenyan-zhang-meta added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 5, 2025
@zhenyan-zhang-meta
Contributor Author

@srikary12 Nice, thanks for taking this up. I've just assigned you to this issue. Let us know when there's a PR to review, and chat in #torchchat-contributors if there are any questions.

@srikary12 srikary12 linked a pull request May 11, 2025 that will close this issue
@srikary12

@zhenyan-zhang-meta I've made the changes; documentation updates are still pending. If the PR looks okay, I'll add the documentation changes.
