Improve Tokenizer New Type Onboarding #1536

Open
zhenyan-zhang-meta opened this issue Apr 28, 2025 · 3 comments · May be fixed by #1540
Assignees
Labels
actionable Items in the backlog waiting for an appropriate impl/fix · good first issue Good for newcomers · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zhenyan-zhang-meta
Contributor

zhenyan-zhang-meta commented Apr 28, 2025

🚀 The feature, motivation and pitch


As a sequel to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we need to further improve it to simplify onboarding of new tokenizer types:

Tasks


  • Move TokenizerType to a centralized place
  • Check all getters of tokenizer types
  • Add documentation for future tokenizer onboarding.
    • We may need to point people to update the model validation logic:
```python
def validate_model(
    self,
    model: Optional[Model],
    model_description: str = "model",
) -> None:
    if model is None:
        return

    if self.tokenizer_type == TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

    is_tiktoken = self.is_tiktoken()
    is_sentencepiece = self.is_sentencepiece()
    is_hf_tokenizer = self.is_hf_tokenizer()

    use_tiktoken = model.config.use_tiktoken
    use_hf_tokenizer = model.config.use_hf_tokenizer
    use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

    if (
        (is_tiktoken and not use_tiktoken) or
        (is_hf_tokenizer and not use_hf_tokenizer) or
        (is_sentencepiece and not use_sentencepiece)
    ):
        raise RuntimeError(
            "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                model_description,
            )
        )

    return
```

To test, run a model with each tokenizer type:

  • python torchchat.py generate llama2
  • python torchchat.py generate llama3
  • python torchchat.py generate granite-code
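
As a starting point for the "centralized `TokenizerType`" task, one possible shape is an enum that owns its own type-check helpers, so onboarding a new tokenizer means adding one member plus one getter in a single file. This is a hypothetical sketch, not the actual torchchat API; the member names mirror the checks used in `validate_model` above.

```python
# Hypothetical sketch of a centralized TokenizerType enum.
# Member and method names are illustrative assumptions, not torchchat's API.
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()

    def is_tiktoken(self) -> bool:
        return self is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self is TokenizerType.HF_TOKENIZER


# Onboarding a new tokenizer type would then only require a new enum
# member and its getter here, leaving callers like validate_model and
# TokenizerArgs.__post_init__ untouched.
print(TokenizerType.TIKTOKEN.is_tiktoken())  # prints True
```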

cc @Jack-Khuu @byjlw

@srikary12

Would like to take this up.

@zhenyan-zhang-meta zhenyan-zhang-meta added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 5, 2025
@zhenyan-zhang-meta
Contributor Author

@srikary12 Nice, thanks for taking this up. I've just assigned you to this issue. Let us know when there's a PR to review, and chat in #torchchat-contributors if there are any questions.

@srikary12 srikary12 linked a pull request May 11, 2025 that will close this issue
@srikary12

@zhenyan-zhang-meta I've made the changes; documentation updates are still pending. If the PR looks okay, I'll add the documentation changes.
