Incorrectly parsed Unicode char #28

Open
euamotubaina opened this issue Sep 26, 2023 · 3 comments

@euamotubaina

euamotubaina commented Sep 26, 2023

This private tracker torrent file has a file path that includes a Unicode character which is being parsed incorrectly:

\x00BD chr(189) Vulgar Fraction One Half (½)

I noticed it because after loading the file with the Torrent class, the calculated info_hash was different from the original torrent.

Hex editor screenshots of the original torrent file and of a new one created with Torrent.to_file from the same data:

Original:
Screenshot 2023-09-26 142545

Created with Torrent class
Screenshot 2023-09-26 142612

When using the Bencode class to read and write the torrent, the char is correctly parsed and the hashes match.
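To illustrate why the hashes can diverge, here is a minimal sketch using only the standard library. It does not use the Torrent or Bencode classes; `bencode_bytes` and the path component `raw_name` are hypothetical stand-ins. It assumes the text layer decodes path bytes as UTF-8 with replacement and then re-encodes them, which changes both the bytes and the bencoded length prefix, so any SHA-1 computed over them changes too.

```python
import hashlib

# Hypothetical path component: "1" + byte 0xBD (½ in Latin-1) + ".txt"
raw_name = b"1\xbd.txt"


def bencode_bytes(value: bytes) -> bytes:
    """Bencode a byte string as <length>:<bytes> (illustrative helper)."""
    return str(len(value)).encode("ascii") + b":" + value


# Byte-preserving path: what a raw bencode round trip would write back.
original = bencode_bytes(raw_name)

# Lossy path (assumed): decode as UTF-8 with replacement, then encode again.
lossy_text = raw_name.decode("utf-8", errors="replace")
round_tripped = bencode_bytes(lossy_text.encode("utf-8"))

print(original)       # b'6:1\xbd.txt'
print(round_tripped)  # b'8:1\xef\xbf\xbd.txt'

# The digests differ because the hashed bytes differ.
same_hash = hashlib.sha1(original).digest() == hashlib.sha1(round_tripped).digest()
print(same_hash)      # False
```

Once the original byte has been replaced, it cannot be recovered from the decoded string, so writing the torrent back out cannot reproduce the original info dictionary.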

Here's a version of the original torrent without the tracker url

431f76f60e05250df162c90a73ab8377dc4ca9c8.zip

Screenshot of the terminal output when reading the file with the Torrent class (the file name is the correct SHA-1 hash):
Screenshot 2023-09-26 151205

@idlesign
Owner

EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character. It means the filename contains a byte sequence that is not valid UTF-8, which we tried to parse as UTF-8 anyway.
What's the encoding used in your filesystem for filenames?
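The EF BF BD sequence can be reproduced with plain Python, independent of the library. This is only a sketch of the general decoding behaviour, assuming the original path stored ½ as the single Latin-1 byte 0xBD:

```python
# ½ (U+00BD) is one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes = "\u00bd".encode("latin-1")
utf8_bytes = "\u00bd".encode("utf-8")
print(latin1_bytes.hex(" "))  # bd
print(utf8_bytes.hex(" "))    # c2 bd

# A lone 0xBD is a UTF-8 continuation byte, so strict decoding fails.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)          # True

# With errors="replace" the bad byte becomes U+FFFD, which encodes to EF BF BD.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "\ufffd")                  # True
print(replaced.encode("utf-8").hex(" "))     # ef bf bd
```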

@euamotubaina
Author

I'm on Windows 11, which uses Unicode to encode file paths, if I understood correctly.

I think this specific torrent used latin-1 encoding for the file paths, so I guess this is very much a corner case

Screenshot 2023-09-27 123011
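If the Latin-1 guess is right, one possible workaround is to try UTF-8 first and fall back to Latin-1. The sketch below is an assumption about how that could look, not how the library currently behaves; `decode_path` is a hypothetical helper. Returning the codec that succeeded lets the caller re-encode to the exact original bytes, which keeps the info hash stable.

```python
from typing import Sequence, Tuple


def decode_path(raw: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")) -> Tuple[str, str]:
    """Decode raw path bytes with the first codec that succeeds.

    Returns the decoded text and the codec name used (hypothetical helper).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the given encodings could decode the path")


# Hypothetical path component containing the Latin-1 byte for ½.
raw = b"1\xbd.txt"
text, used = decode_path(raw)
print(used)                      # latin-1
print(text == "1\u00bd.txt")     # True
print(text.encode(used) == raw)  # True: the round trip preserves the bytes
```

Latin-1 maps every byte to a character, so the fallback never fails, but it can silently mis-decode paths written in some other single-byte encoding. Only the byte-level round trip is guaranteed, not the displayed text.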

@idlesign
Owner

> I think this specific torrent used latin-1 encoding for the file paths, so I guess this is very much a corner case

Hm, latin-1... This comment seems to be relevant
#2 (comment)
