Add support for encoding pretokenized sequences #42

Open
wants to merge 3 commits into
base: main

kabachuha
Contributor

Useful for batch processing and for building an embedding cache over many documents with dataloaders.

The results for the dict input and the plain list of strings are identical. For the raw tokenized 'transformers' encoding they differ slightly, but I think that's just how that library behaves.

Screenshot 2024-06-16 124103
Screenshot 2024-06-16 124134
Screenshot 2024-06-16 124039
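
One way to put a number on "differ slightly" is to compare the two embedding batches element by element. The following is a minimal sketch in plain Python and is not part of this PR: `max_abs_diff` and the sample values are hypothetical. With real tensors, `torch.allclose` would be the more usual check.

```python
def max_abs_diff(a, b):
    """Return the largest element-wise absolute difference between two
    equally shaped batches of embeddings given as nested lists."""
    if len(a) != len(b):
        raise ValueError("batches have different sizes")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("embeddings have different dimensions")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


# Hypothetical values standing in for real model outputs.
from_strings = [[0.5, -0.25], [1.0, 0.0]]
from_tokens = [[0.5, -0.25], [1.0, 0.125]]
```

A result of exactly `0.0` means the two input paths agree; anything larger points at a difference in tokenization or model state rather than at the comparison itself.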

@Muennighoff
Collaborator

Nice! It is odd that it differs. How do you instantiate the tokenizer? Maybe a special token is missing, or something similar.

@kabachuha
Contributor Author

from gritlm import GritLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")

tokenizer_max_length = 300

# the part with docs
...

tokenizer_output_x = tokenizer(
    documents,
    padding='max_length',
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)

Nothing unusual, but I do set a max length to enable batch encoding.
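
For readers unfamiliar with why a fixed length helps batching, here is a simplified illustration of what `padding='max_length'` combined with `truncation=True` produces. It is a sketch only, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, right padding is an assumption, and the token ids and `pad_id=0` are made up for the example.

```python
def pad_and_truncate(batch, max_length, pad_id):
    """Truncate or right-pad each list of token ids to max_length and
    build the matching attention mask (1 = real token, 0 = padding)."""
    input_ids, attention_mask = [], []
    for ids in batch:
        kept = ids[:max_length]          # truncation
        n_pad = max_length - len(kept)   # how much padding is needed
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; pad_id=0 is an arbitrary choice for the example.
encoded = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 9]], max_length=4, pad_id=0)
```

Every row ends up with the same length, so the batch can be stacked into one rectangular tensor, and the attention mask records which positions are padding.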

@Muennighoff
Collaborator

Can you try without the max length and see if you get the same results? I think the results should be exactly the same.

@kabachuha
Contributor Author

Alright, thank you for noticing! I've found the problem:

I ran a generation-only test earlier in the notebook, and it logged:

Setting pad_token_id to eos_token_id:2 for open-end generation.

Now, without running a generation cell first, the results with the dictionary and with the tokenizer output class are exactly the same.

image

image
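
The thread does not pin down which object the generation cell changed, so the following is only a sketch of how such a notebook state leak could be caught: record a few special-token attributes before a cell runs and diff them afterwards. `snapshot`, `changed`, and the stand-in object are hypothetical. The attribute names match those on `transformers` tokenizers, `eos_token_id=2` comes from the log line above, and the other values are made up.

```python
from types import SimpleNamespace

TRACKED = ("pad_token_id", "eos_token_id", "bos_token_id")


def snapshot(obj, names=TRACKED):
    """Record the current value of each tracked attribute (None if absent)."""
    return {name: getattr(obj, name, None) for name in names}


def changed(before, after):
    """Return {name: (old, new)} for every attribute whose value differs."""
    return {k: (before[k], after.get(k)) for k in before if before[k] != after.get(k)}


# Stand-in for a tokenizer or config object; not a real transformers class.
tok = SimpleNamespace(pad_token_id=None, eos_token_id=2, bos_token_id=1)

before = snapshot(tok)
tok.pad_token_id = 2  # simulate an earlier cell altering shared state
diff = changed(before, snapshot(tok))
```

An empty `diff` after running a cell suggests the tracked attributes were left alone. A non-empty one shows which value moved and is a reason to restart the kernel before comparing embeddings.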
