Tartufo does not scan files with alternative encodings, such as UTF-16 LE. #353

@jwilbur-godaddy

🐛 Bug Report

Tartufo does not scan files with alternative encodings, such as UTF-16 LE.

This matters (and I discovered it by accident) because PowerShell writes redirected standard output as UTF-16 LE, so a command like openssl genrsa > private_key.pem saves private_key.pem as a UTF-16 LE-encoded text file. Tartufo will NOT scan such files at all.

If you create a UTF-16 LE-encoded file (for example, by re-encoding a file in VS Code) and open it in a hex editor, you'll see two extra bytes at the beginning of the file (the byte-order mark, or BOM), and for ASCII text every other byte will be null (0x00). In the eyes of Git, this makes the file count as a "binary" file. Tartufo ignores binary files, so the file is never scanned.
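The byte layout described above can be seen without a hex editor. The following is a minimal sketch: the sample text is a stand-in, and `looks_binary` is a hypothetical helper that only approximates Git's heuristic (a NUL byte near the start of the content, with the 8000-byte probe size being an assumption here), not tartufo's or Git's actual code.

```python
import codecs

# Hypothetical sample text; any ASCII content shows the same pattern.
text = "-----BEGIN RSA PRIVATE KEY-----\n"

# Write the BOM and little-endian bytes explicitly so the result does not
# depend on the platform's native byte order.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(raw[:8])  # b'\xff\xfe-\x00-\x00-\x00'


def looks_binary(data: bytes, probe: int = 8000) -> bool:
    """Approximation of Git's check: a NUL byte near the start means 'binary'."""
    return b"\x00" in data[:probe]


print(looks_binary(raw))                   # True
print(looks_binary(text.encode("utf-8")))  # False
```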

Non-Solution

Do not use the GIT_DIFF_FORCE_TEXT flag to fix this. It would cause problems for files that really are binary, and files that really are text would still be converted to Python strings containing aberrant characters, so they would fail to match regular expressions and possibly entropy checks as well.
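To illustrate why forcing text mode is not enough, here is a small sketch. The regular expression is a hypothetical stand-in for a configured rule, and decoding with UTF-8 and `errors="replace"` is an assumption about what a forced-text path might do, not a statement about tartufo's internals.

```python
import re

# Hypothetical pattern, standing in for one of the configured regex rules.
pattern = re.compile(r"-----BEGIN RSA PRIVATE KEY-----")

# UTF-16 LE bytes with a BOM, as described in the report.
raw = b"\xff\xfe" + "-----BEGIN RSA PRIVATE KEY-----\n".encode("utf-16-le")

# Assumed forced-text behaviour: the bytes are decoded with the wrong codec,
# leaving replacement characters and interleaved NULs in the string.
forced = raw.decode("utf-8", errors="replace")

# Decoding with the real encoding; the "utf-16" codec consumes the BOM.
proper = raw.decode("utf-16")

print(pattern.search(forced) is not None)  # False
print(pattern.search(proper) is not None)  # True
```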

Possible solution

Attempt to detect the text encoding of a file. If it is text, but it is not UTF-8, re-encode the chunk to be scanned (not the original file itself!) into UTF-8 prior to scanning it. I think you will want to change this here:

tartufo/tartufo/scanner.py

Lines 571 to 580 in 3f075ab

for chunk in self.chunks:  # pylint: disable=too-many-nested-blocks
    # Run regex scans first to trigger a potential fast fail for bad config
    if self.global_options.regex and self.rules_regexes:
        for issue in self.scan_regex(chunk):
            self.store_issue(issue)
            yield issue
    if self.global_options.entropy:
        for issue in self.scan_entropy(chunk):
            self.store_issue(issue)
            yield issue
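One way the proposed detection step might look is sketched below. `decode_text` and `_BOMS` are hypothetical names, not part of tartufo, and the sketch assumes the raw bytes of a chunk are available before scanning. In Python, decoding the bytes to a `str` plays the role of "re-encoding to UTF-8", since the scanners operate on strings.

```python
import codecs
from typing import Optional

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_text(raw: bytes) -> Optional[str]:
    """Hypothetical helper: return text to scan, or None if `raw` looks binary."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            try:
                # These codecs read the BOM to pick the byte order and drop it.
                return raw.decode(codec)
            except UnicodeDecodeError:
                return None
    if b"\x00" in raw:
        # No BOM but NUL bytes present: treat as binary.
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

This only recognizes encodings announced by a BOM, so UTF-16 text without a BOM would still be skipped, and a binary file that happens to start with a BOM could be misread as text. A dedicated detection library might handle more cases, at the cost of an extra dependency.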

To Reproduce

  1. In PowerShell, run openssl genrsa > private_key.pem. This will generate a UTF-16 LE-encoded private key, which should get flagged by tartufo, but does not.
  2. Run tartufo in the repo. It will not catch this.
  3. Open private_key.pem in VS Code. On the bottom right, you'll see "UTF-16 LE." Click on this and save it with "UTF-8" encoding instead.
  4. Run tartufo again and observe that it scans this file.
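For anyone without PowerShell or VS Code at hand, steps 1 and 3 can be approximated in Python. This is a sketch only: `write_utf16_le` and `convert_to_utf8` are hypothetical helpers, and it has not been verified that a file with placeholder contents triggers the same rules as real openssl output.

```python
import codecs
from pathlib import Path


def write_utf16_le(path: Path, text: str) -> None:
    """Write `text` as UTF-16 LE with a BOM, the encoding described in step 1."""
    path.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


def convert_to_utf8(path: Path) -> None:
    """Rewrite a BOM-prefixed UTF-16 file as UTF-8, like re-saving it in step 3."""
    text = path.read_bytes().decode("utf-16")  # the codec consumes the BOM
    path.write_bytes(text.encode("utf-8"))
```

Call `write_utf16_le(Path("private_key.pem"), contents)` before the first tartufo run and `convert_to_utf8(Path("private_key.pem"))` before the second, where `contents` is the key text to test with.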

Expected Behavior

Tartufo should be able to scan files that are text, but not encoded with UTF-8.

Environment

A Windows machine.

Metadata

Labels: bug
