Simplify TokenizerArgs.__post_init__ with Enum Tokenizer Type #1535

Open · wants to merge 5 commits into main

Conversation

zhenyan-zhang-meta
Contributor

Summary:
Simplify `TokenizerArgs.__post_init__` with an enum tokenizer type, since only one of the tokenizer types can be true.

We want to touch as little code outside of `__post_init__` as possible at the moment.

Test Plan:
python torchchat.py generate llama2|llama3|granite-code

Reviewers:
@Jack-Khuu

Subscribers:

Issue:
#1518
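The enum approach described above can be sketched as follows. This is an illustrative reconstruction, not the exact PR diff; the member names and `is_*` helpers are assumptions based on the review comments below.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TokenizerType(Enum):
    """Exactly one tokenizer kind can apply, which a single enum field encodes directly."""
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


@dataclass
class TokenizerArgs:
    tokenizer_path: str = ""
    tokenizer_type: TokenizerType = TokenizerType.NONE
    t: Optional[Any] = None  # the loaded tokenizer instance, if any

    # The former boolean flags become trivial queries against the enum field.
    def is_tiktoken(self) -> bool:
        return self.tokenizer_type is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self.tokenizer_type is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self.tokenizer_type is TokenizerType.HF_TOKENIZER
```

Because the enum can hold only one value at a time, the "only one type can be true" invariant holds by construction instead of by convention.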


pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1535

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c846de9 with merge base 5f8f35d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 25, 2025
Comment on lines 281 to 282
self.tokenizer_type = TokenizerType.NONE
self.t = None
Contributor Author
Do we really have to set these to `None` again, since we already set them at the very top?

Contributor
We can actually drop all the logic here after the HF tokenizer check; `tokenizer_type` and `.t` are already set to these values by default.
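In other words, when the dataclass defaults already establish `TokenizerType.NONE` and `None`, a trailing reset is a no-op. A minimal sketch, with a made-up detection rule standing in for torchchat's real path checks:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TokenizerType(Enum):
    NONE = auto()
    SENTENCEPIECE = auto()


@dataclass
class TokenizerArgs:
    tokenizer_path: str = ""
    # Defaults set here make an explicit "else: reset to NONE/None"
    # branch in __post_init__ redundant.
    tokenizer_type: TokenizerType = TokenizerType.NONE
    t: Optional[Any] = None

    def __post_init__(self):
        # Hypothetical detection rule; the real code inspects the tokenizer files.
        if self.tokenizer_path.endswith(".model"):
            self.tokenizer_type = TokenizerType.SENTENCEPIECE
            self.t = object()  # stand-in for the loaded tokenizer
        # No else branch needed: the fields keep their declared defaults.
```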

is_sentencepiece = self.is_sentencepiece()
is_hf_tokenizer = self.is_hf_tokenizer()

if sum([is_tiktoken, is_hf_tokenizer, is_sentencepiece]) != 1:
Contributor
We can replace this by just checking whether the tokenizer enum is still `TokenizerType.NONE`.
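Concretely, the suggestion amounts to something like the following (a sketch under assumed names, not the merged diff): once the enum is the single source of truth, the three-boolean tally collapses to one comparison.

```python
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


def check_tokenizer_found(tokenizer_type: TokenizerType, tokenizer_path: str) -> None:
    # Before: if sum([is_tiktoken, is_hf_tokenizer, is_sentencepiece]) != 1: ...
    # After: the enum can hold only one value, so "exactly one kind matched"
    # reduces to "the type was set at all".
    if tokenizer_type is TokenizerType.NONE:
        raise RuntimeError(f"no tokenizer found at {tokenizer_path}")
```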

zhenyanzhang and others added 4 commits April 25, 2025 16:27
Labels: CLA Signed (managed by the Meta Open Source bot)
Projects: None yet
3 participants