Description
🐛 Bug Report
Tartufo does not scan files with alternative encodings, such as UTF-16 LE.
This is important because Windows PowerShell writes redirected output as UTF-16 LE (I discovered this accidentally). If you use a command like openssl genrsa > private_key.pem, it will save private_key.pem as a UTF-16 LE-encoded text file. Tartufo will NOT scan such files at all.
If you create a UTF-16 LE-encoded file (you can do this by re-encoding a file in VS Code) and open it in a hex editor, you'll see two extra bytes at the beginning of the file (the Byte-Order Mark / BOM), and for ASCII text every other byte will be null (0x00). Because of those null bytes, Git treats the file as a "binary" file. Tartufo ignores binary files, so this file will be ignored.
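The byte layout described above can be checked with a few lines of Python. This is only a sketch: looks_binary is a hypothetical helper that approximates the "NUL byte near the start of the file" heuristic Git is generally understood to use, and it is not the exact check that Git or tartufo performs.

```python
import codecs


def looks_binary(data: bytes, window: int = 8000) -> bool:
    """Hypothetical stand-in for a NUL-byte binary heuristic (an assumption)."""
    return b"\x00" in data[:window]


text = "-----BEGIN RSA PRIVATE KEY-----\n"
utf8_bytes = text.encode("utf-8")
# UTF-16 LE with an explicit BOM, as a UTF-16 LE text file on disk would have.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(utf16_bytes[:8].hex(" "))   # ff fe 2d 00 2d 00 2d 00
print(looks_binary(utf8_bytes))   # False
print(looks_binary(utf16_bytes))  # True
```

The first two bytes (ff fe) are the BOM, and each ASCII character is followed by a 00 byte, which is what trips the heuristic.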
Non-Solution
Do not use the GIT_DIFF_FORCE_TEXT flag to fix this. It will cause problems when a file really is binary, and files that really are text will still be converted to Python strings that contain aberrant characters. Those strings will not match regular expressions and may not trigger entropy checks either.
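A small illustration of why forcing the diff to text does not help. It assumes the forced bytes end up decoded as UTF-8 with replacement characters; how tartufo actually decodes them is not confirmed here, and the pattern is only an example, not one of tartufo's rules.

```python
import codecs
import re

# Example pattern only; not taken from tartufo's rule set.
pattern = re.compile(r"BEGIN RSA PRIVATE KEY")

raw = codecs.BOM_UTF16_LE + "-----BEGIN RSA PRIVATE KEY-----".encode("utf-16-le")

# Assumption: UTF-16 bytes are treated as UTF-8 text. The BOM becomes
# replacement characters and every ASCII character is followed by "\x00".
forced = raw.decode("utf-8", errors="replace")

# Decoding with the right codec consumes the BOM and yields the real text.
proper = raw.decode("utf-16")

print(repr(forced[:6]))                    # '\ufffd\ufffd-\x00-\x00'
print(pattern.search(forced) is not None)  # False
print(pattern.search(proper) is not None)  # True
```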
Possible solution
Attempt to detect the text encoding of a file. If it is text, but it is not UTF-8, re-encode the chunk to be scanned (not the original file itself!) into UTF-8 prior to scanning it. I think you will want to change this here:
Lines 571 to 580 in 3f075ab
    for chunk in self.chunks:  # pylint: disable=too-many-nested-blocks
        # Run regex scans first to trigger a potential fast fail for bad config
        if self.global_options.regex and self.rules_regexes:
            for issue in self.scan_regex(chunk):
                self.store_issue(issue)
                yield issue
        if self.global_options.entropy:
            for issue in self.scan_entropy(chunk):
                self.store_issue(issue)
                yield issue
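A minimal sketch of the detection step proposed above, assuming the raw bytes of a file or chunk are available before scanning. decode_text is a hypothetical helper, not part of tartufo. It only recognises files that start with a BOM, so UTF-16 text without a BOM would still need a real encoding detector.

```python
import codecs
from typing import Optional

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# (ff fe 00 00) begins with the UTF-16 LE BOM (ff fe).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_text(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes do not look like text."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            try:
                # Strip the BOM and decode with the encoding it announces.
                return raw[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                return None
    try:
        # No BOM: accept the data only if it is valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

The returned str could then be passed to the regex and entropy scans in place of the undecoded chunk, while a None result would keep today's behaviour of skipping the file. The original file is never modified.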
To Reproduce
- In PowerShell, run openssl genrsa > private_key.pem. This will generate a UTF-16 LE-encoded private key, which should get flagged by tartufo, but does not.
- Run tartufo in the repo. It will not catch this.
- Open private_key.pem in VS Code. On the bottom right, you'll see "UTF-16 LE." Click on this and save it with "UTF-8" encoding instead.
- Run tartufo again and observe that it scans this file.
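For anyone without PowerShell at hand, a file with the same encoding can be produced from Python. This is a sketch, not part of the original report: write_utf16le is a hypothetical helper, and the contents are a placeholder that only mimics a PEM header, so whether tartufo flags it after conversion to UTF-8 depends on the rules in use.

```python
import codecs
from pathlib import Path


def write_utf16le(path: Path, text: str) -> None:
    """Write text as UTF-16 LE with a BOM, like the file from the first step."""
    path.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


if __name__ == "__main__":
    # Placeholder contents; not a real key.
    write_utf16le(
        Path("private_key.pem"),
        "-----BEGIN RSA PRIVATE KEY-----\n"
        "placeholder\n"
        "-----END RSA PRIVATE KEY-----\n",
    )
```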
Expected Behavior
Tartufo should be able to scan files that are text, but not encoded with UTF-8.
Environment
A Windows machine.