
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] #48

Open
bo-jpg opened this issue Jul 24, 2024 · 6 comments

Comments


bo-jpg commented Jul 24, 2024

Thank you for your contribution. I encountered the following error when training with toy data:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I read online that the following may be the causes:

  1. The tokenizer's maximum length is not set;
  2. There are blank lines in the JSONL file;
  3. The installed version of the transformers library is too new and incompatible;
  4. There are NaN values in the data.

However, I tried the fixes corresponding to all four of these, and the error is still raised. I would like to know why. Thank you very much!
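For reasons 2 and 4, a quick sanity check over the JSONL file can be sketched like this (the field names are an assumption; adapt the keys to your data):

```python
import json

def check_jsonl(path):
    """Scan a JSONL file for blank lines and null/empty values."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            record = json.loads(line)
            # json.loads maps JSON null to Python None; flag any field that is
            # None or an empty string, since both break the tokenizer.
            for key, value in record.items():
                if value is None or value == "":
                    problems.append((lineno, f"empty value for {key!r}"))
    return problems
```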
@Muennighoff
Collaborator

I just checked, and the command under GRIT here https://github.com/ContextualAI/gritlm?tab=readme-ov-file#run works fine for me


bo-jpg commented Jul 24, 2024

I just checked, and the command under GRIT here https://github.com/ContextualAI/gritlm?tab=readme-ov-file#run works fine for me

Thanks for the quick response!

Here is my config:

--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--per_device_generative_bs 1

I printed out my toy data before entering the tokenizer:

[default6]:['He He Me It I You You You You You', 'Me I He You Me He It Me It She']
[default6]:['Me He She He She It He She She Me', 'It He It She I I It He You She', 'Me You It Me Me She You I It He', 'It She She He Me It I You It You']
[default6]:['大人你大人大人大人他享受大人享受你', None]
[default7]:['Me He Me She I She You I It She', 'She It She She Me Me Me Me She Me']
[default7]:['You He I I She He I I He It', 'Me He It Me It He He She I You', 'I She He You He It You She It He', 'He It Me You He She I It Me He']
[default7]:['我我是享受是他你我他你', None]

We can see that there is an extra None in the batch of generative data, which is presumably what triggers the error. Why does this happen? Is it related to the following warning?

[default4]:/home/code/.python_libs/conda_env/myenv/lib/python3.9/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
[default4]:dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
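As an aside, one common way a None can sneak into a regrouped batch is length padding: Python's itertools.zip_longest fills the shorter sequence with None by default. This is only an illustration of the mechanism, not the actual GritLM collation code:

```python
from itertools import zip_longest

# Two streams of generative examples with unequal lengths:
batch_a = ['sample 1', 'sample 2']
batch_b = ['sample 3']

# zip_longest pads the shorter stream with its fillvalue (None by default),
# so the regrouped batch ends up containing a None entry.
regrouped = list(zip_longest(batch_a, batch_b))
# regrouped == [('sample 1', 'sample 3'), ('sample 2', None)]
```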

@Muennighoff
Collaborator

It seems like you're using your own custom data? Maybe you have Nones in your data.


bo-jpg commented Jul 24, 2024

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)

I checked my toy data and it does not contain any None values. In addition, I tried the toy data you provided: it raises the same error, and the batch likewise contains None:

[default1]:['What is the difference between a raspberry pi and an esp32? What is better suited for interfacing with a SD card? The Raspberry Pi is a single-board computer that runs a full-fledged operating system, while the ESP32 is a microcontroller that is typically used for IoT applications. The Raspberry Pi is better suited for interfacing with an SD card as it has a full-fledged operating system and a large number of libraries available for interfacing with various peripherals, including SD cards. The ESP32, on the other hand, has limited memory and processing power, and may require more effort to interface with an SD card.', None]


bo-jpg commented Jul 25, 2024

@Muennighoff
If I set --mode embedding, training works fine. But if I set --mode unified, the generative data batch contains None and the error is raised. I want to know why there are extra None values in the generative data batch.


bo-jpg commented Jul 26, 2024

Hi, I had a rough look at the code in gritlm/gritlm/training/data.py, and I think the None in the generative batch data is produced by lines 90, 131, 140, and 141. Looking forward to your reply!
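While debugging, a temporary guard that drops None entries before the batch reaches the tokenizer avoids the TypeError; this is only a workaround sketch, not a fix for the root cause in data.py:

```python
def drop_none(batch):
    """Filter out None entries before tokenization: a fast tokenizer only
    accepts strings or (str, str) pairs, and raises TypeError on None."""
    return [x for x in batch if x is not None]

cleaned = drop_none(['sample 1', None, 'sample 2'])
# cleaned == ['sample 1', 'sample 2']
```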
