Simplify TokenizerArgs.__post_init__ with Enum Tokenizer Type #1535

Open · wants to merge 5 commits into main

Conversation

zhenyan-zhang-meta
Contributor

Summary:
Simplify `TokenizerArgs.__post_init__` with an enum tokenizer type, since only one of the tokenizer types can be true.

We want to touch as little code outside of `__post_init__` as possible at the moment.

Test Plan:
python torchchat.py generate llama2|llama3|granite-code

Reviewers:
@Jack-Khuu

Subscribers:

Issue:
#1518
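The enum approach described above can be sketched as follows. This is an illustrative reconstruction, not the exact PR diff; the member names and `is_*` helpers are assumptions based on the review comments below.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TokenizerType(Enum):
    """Exactly one tokenizer kind can apply, which a single enum field encodes directly."""
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


@dataclass
class TokenizerArgs:
    tokenizer_path: str = ""
    tokenizer_type: TokenizerType = TokenizerType.NONE
    t: Optional[Any] = None  # the loaded tokenizer instance, if any

    # The former boolean flags become trivial queries against the enum field.
    def is_tiktoken(self) -> bool:
        return self.tokenizer_type is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self.tokenizer_type is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self.tokenizer_type is TokenizerType.HF_TOKENIZER
```

Because the enum can hold only one value at a time, the "only one type can be true" invariant holds by construction instead of by convention.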


pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1535

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c846de9 with merge base 5f8f35d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 25, 2025
Comment on lines 281 to 282
self.tokenizer_type = TokenizerType.NONE
self.t = None
Contributor Author
Do we really have to set these to `None` again, since we already set them at the very top?

Contributor
We can actually drop all the logic here after the HF tokenizer check; `tokenizer_type` and `.t` are already set to these values by default.
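In other words, when the dataclass defaults already establish `TokenizerType.NONE` and `None`, a trailing reset is a no-op. A minimal sketch, with a made-up detection rule standing in for torchchat's real path checks:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TokenizerType(Enum):
    NONE = auto()
    SENTENCEPIECE = auto()


@dataclass
class TokenizerArgs:
    tokenizer_path: str = ""
    # Defaults set here make an explicit "else: reset to NONE/None"
    # branch in __post_init__ redundant.
    tokenizer_type: TokenizerType = TokenizerType.NONE
    t: Optional[Any] = None

    def __post_init__(self):
        # Hypothetical detection rule; the real code inspects the tokenizer files.
        if self.tokenizer_path.endswith(".model"):
            self.tokenizer_type = TokenizerType.SENTENCEPIECE
            self.t = object()  # stand-in for the loaded tokenizer
        # No else branch needed: the fields keep their declared defaults.
```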

is_sentencepiece = self.is_sentencepiece()
is_hf_tokenizer = self.is_hf_tokenizer()

if sum([is_tiktoken, is_hf_tokenizer, is_sentencepiece]) != 1:
Contributor
We can replace this by just checking whether the tokenizer enum is still `TokenizerType.NONE`.
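Concretely, the suggestion amounts to something like the following (a sketch under assumed names, not the merged diff): once the enum is the single source of truth, the three-boolean tally collapses to one comparison.

```python
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


def check_tokenizer_found(tokenizer_type: TokenizerType, tokenizer_path: str) -> None:
    # Before: if sum([is_tiktoken, is_hf_tokenizer, is_sentencepiece]) != 1: ...
    # After: the enum can hold only one value, so "exactly one kind matched"
    # reduces to "the type was set at all".
    if tokenizer_type is TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer found at {tokenizer_path}")
```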

zhenyanzhang and others added 4 commits April 25, 2025 16:27
Labels: CLA Signed (managed by the Meta Open Source bot)
Projects: None yet
3 participants