Skip binaries files on filesystem scan #201

Open
baruchiro opened this issue Feb 11, 2024 · 5 comments
Labels
bug, help wanted

Comments

@baruchiro
Contributor

baruchiro commented Feb 11, 2024

Steps to reproduce:

  1. Build 2ms with go build -o 2ms main.go
  2. Run a filesystem scan with ./2ms filesystem --path . --log-level debug
  3. One of the scanned files is the ./2ms executable itself.
  4. After ~4 minutes (it is a long time!) you will receive a lot of results from the binary.

There are two problems here:

  • The scan takes a very long time
  • There are a lot of false positives, because the binary content contains byte sequences that look like secrets.
baruchiro added the bug and help wanted labels Feb 11, 2024
@nargov
Contributor

nargov commented Feb 22, 2024

Hi,

I was thinking of tackling this one using this library.
While the http package has a mime type sniffing function, this has the benefit of the hierarchy of mime types, meaning the determination between binary/text is provided.

What do you think?

@baruchiro
Contributor Author

> I was thinking of tackling this one using this library. While the http package has a mime type sniffing function, this has the benefit of the hierarchy of mime types, meaning the determination between binary/text is provided.

@nargov from their documentation:

> Only use libraries like mimetype as a last resort. Content type detection using magic numbers is slow, inaccurate, and non-standard

I don't want to hurt our performance; at the very least, this library would make us read each file twice.

I'm looking for a way to reduce the scanning of binaries, without a large performance cost on the one hand, and without doing magic behind the user's back on the other.
For example, the last time we saw this problem, we added the max-target-megabytes flag to skip large files.
Here, the only thing I can think of is to somehow measure how long the task for a specific file takes, and log a warning about a potential performance issue.

What do you think?

By the way, I'm sorry for the late response, I was sick. I appreciate your help!

@nargov
Contributor

nargov commented Feb 28, 2024

As an alternative, I see that https://pkg.go.dev/net/http#DetectContentType considers at most the first 512 bytes to detect the MIME type. Do you think it's good enough?
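For reference, a minimal sketch of what such a check might look like. The helper name looksLikeText is hypothetical; http.DetectContentType is the standard library function linked above, which falls back to "application/octet-stream" when it recognizes nothing.

```go
package main

import (
	"net/http"
	"strings"
)

// looksLikeText reports whether the given bytes sniff as a text/* MIME type.
// The name is hypothetical; only http.DetectContentType is a real API here.
func looksLikeText(head []byte) bool {
	// DetectContentType considers at most the first 512 bytes and always
	// returns a valid MIME type, e.g. "text/plain; charset=utf-8".
	return strings.HasPrefix(http.DetectContentType(head), "text/")
}
```

With this, an ELF or PNG header is classified as non-text, while a plain configuration file is classified as text. Formats that are textual but not reported as text/* would need separate handling.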

@baruchiro
Contributor Author

OK, I think we can create a POC for that. Here is what I'm thinking:

  • We should avoid reading the file twice! We need to reuse the []byte.
  • We need to decide which MIME types are ignored.
  • We need to be sure the MIME type identification does not lead to unexpected results (files being skipped unexpectedly).
  • Do we want to allow controlling which MIME types will be skipped?
  • We need to test how it affects the performance.
  • Can we check if and how KICS handled this situation?

You don't have to answer all the questions before you start developing.

@baruchiro
Contributor Author

Another option would be to ignore lines that are too long. On the one hand, such lines might come from a binary file; on the other hand, they could come from a minified JS file.

2 participants