What is the reason for setting pad_token to an unused token? #83


Open
blackcherry88 opened this issue Feb 25, 2025 · 4 comments

Comments

@blackcherry88

Any special reason for it? This is regarding the following code in sft.py: `tokenizer.pad_token = "<|fim_pad|>"`

@Muennighoff
Contributor

Just to be sure that, when some code tries to mask all padding tokens, it does not accidentally mask tokens that are actually used.

@sangmandu

@Muennighoff Can you explain a bit more? Will using an existing pad token cause problems with masking?

@Muennighoff
Contributor

I think you should use a token for padding that you don't expect to appear in the regular prompt / completion.

@sangmandu

Sorry, but why? What is the reason?
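To illustrate the answers above: SFT pipelines typically mask loss at every position whose token id equals `pad_token_id` (by setting the label to -100). If the pad token is one that also occurs in real data (a common shortcut is `pad_token = eos_token`), those real occurrences get masked too. A hypothetical sketch in plain Python (no transformers dependency; the toy token ids are made up):

```python
# Toy reproduction of the usual SFT label-masking step: any position whose
# token id equals pad_token_id is replaced with IGNORE_INDEX so the loss
# function skips it.
IGNORE_INDEX = -100

def mask_padding(input_ids, pad_token_id):
    """Return labels with padding positions replaced by IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]

# Toy vocabulary: 1 = regular token, 2 = EOS, 99 = an unused token
# standing in for <|fim_pad|>.
seq = [1, 1, 1, 2, 99, 99]  # completion ends with EOS, then padding

# Padding with a token that never occurs in real text: only pads are masked.
print(mask_padding(seq, pad_token_id=99))
# [1, 1, 1, 2, -100, -100]  -> EOS still contributes to the loss

# Shortcut pad_token = eos_token: the real EOS is masked along with the
# padding, so the model gets no training signal for emitting EOS and may
# never learn to stop generating.
seq_eos_pad = [1, 1, 1, 2, 2, 2]  # EOS doubles as padding
print(mask_padding(seq_eos_pad, pad_token_id=2))
# [1, 1, 1, -100, -100, -100]
```

Using a token like `<|fim_pad|>` that is guaranteed absent from prompts and completions avoids this collision entirely.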
