Invalid UTF8 bytes in default TAGS.txt #5406

Open
s-kganz opened this issue May 10, 2024 · 0 comments

s-kganz commented May 10, 2024

Short description
Some of the language tags in the default TAGS.txt cause a UnicodeDecodeError.

Environment information

  • Operating System: Windows 11

  • Python version: 3.10.13

  • tensorflow-datasets/tfds-nightly version: tfds-nightly 4.9.4.dev202405100044

  • tensorflow/tf-nightly version: tensorflow 2.10.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes

Reproduction instructions
Create a toy dataset by running "tfds new test", then try to instantiate the builder.

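# Import the builder class generated by "tfds new test".
from test.test_dataset_builder import Builder

# Instantiating the builder raises the UnicodeDecodeError.
b = Builder()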

Link to logs
Stack trace here

Expected behavior
The builder instantiates without error.

Additional context
Deleting lines 73, 79, 126, 128, 156, and 173 in TAGS.txt fixes the problem. These are all language tags.
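
For anyone who wants to check which lines of their own copy are affected, here is a minimal sketch that reports lines that fail to decode. The function name find_undecodable_lines and the file path passed to it are hypothetical, not part of tensorflow-datasets. The stack trace is not reproduced here, so which codec actually raised the error is an assumption; the encoding parameter lets you try "utf-8" or the Windows default "cp1252".

```python
from pathlib import Path


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) pairs for lines that fail to decode.

    Hypothetical helper for illustration only. Line numbers are 1-based.
    Splitting on raw bytes is only safe for ASCII-compatible encodings
    such as utf-8 or cp1252, not for utf-16 or similar.
    """
    bad = []
    data = Path(path).read_bytes()
    # bytes.splitlines() splits on \n, \r and \r\n only.
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad
```

Calling find_undecodable_lines("TAGS.txt") with the path to the local copy of the file (location assumed, adjust as needed) returns the failing line numbers together with their raw bytes, which can then be compared against the lines listed above.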

s-kganz added the bug label (Something isn't working) on May 10, 2024
pierrot0 self-assigned this on May 13, 2024