[BUG] HTML/JavaScript recursion #2

Open
jshlbrd opened this issue Sep 25, 2018 · 2 comments
Labels
bug Something isn't working

Comments

@jshlbrd
Contributor

jshlbrd commented Sep 25, 2018

Describe the bug
We've identified a bug in the HTML/JavaScript identification and extraction code. libmagic can incorrectly identify a file as "text/html" while YARA correctly identifies the same file as "javascript_file". When this happens, the ScanHtml scanner is applied to the JavaScript file and enters a recursive file extraction loop that runs until the maximum depth is hit.
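
To make the failure mode concrete, here is a toy model of the loop, not Strelka's actual code. The functions `taste`, `scan_html`, and `distribute` are hypothetical stand-ins, and the tasting logic is deliberately crude. It assumes the extracted child is tasted as "text/html" again, which is what keeps the loop going until the depth limit (15 is the default mentioned later in this thread).

```python
MAX_DEPTH = 15  # assumed default depth limit


def taste(data: bytes) -> set:
    """Hypothetical tasting step: returns every flavor that matches."""
    flavors = set()
    if b"<html" in data.lower():
        flavors.add("text/html")        # stands in for the libmagic false positive
    if b"function" in data:
        flavors.add("javascript_file")  # stands in for the correct YARA match
    return flavors


def scan_html(data: bytes) -> list:
    """Hypothetical ScanHtml: the 'extracted' child is the same script again."""
    return [data]


def distribute(data: bytes, depth: int = 0, trail=None) -> list:
    """Records the depth of every file scanned, stopping past MAX_DEPTH."""
    trail = [] if trail is None else trail
    if depth > MAX_DEPTH:
        return trail
    trail.append(depth)
    if "text/html" in taste(data):
        for child in scan_html(data):
            distribute(child, depth + 1, trail)
    return trail


sample = b'function render() { return "<html><body></body></html>"; }'
depths = distribute(sample)
```

In this sketch the same content is scanned once per depth level, so `depths` runs from 0 up to the limit.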

Steps to reproduce
Steps to reproduce the behavior:

  1. Find an HTML file that contains embedded JavaScript that gets tasted as "text/html" by libmagic
  2. Run the file through Strelka
  3. Check for Python logs that describe "exceeded maximum depth" or scan results where the same HTML file is being repeatedly extracted

Expected behavior
JavaScript should not be tasted as HTML.

Screenshots
N/A

Server and project version

  • OS: Ubuntu Bionic
  • Commit Hash: N/A (first release)

Additional context
N/A

@jshlbrd jshlbrd added the bug Something isn't working label Sep 25, 2018
@jshlbrd jshlbrd added bug Something isn't working and removed bug Something isn't working labels Dec 12, 2018
@ryanohoro
Collaborator

ryanohoro commented Jan 12, 2023

I identified cases where this recursion was happening by searching for file.depth:15 (the default limit). The frequency is extremely low (0.00003%). The attached file, a Vim macro, triggers this bug.

less.vim.txt
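
A minimal sketch of that kind of search over exported events, assuming one JSON object per line with a nested `file` object that carries `depth`. The event layout, the `name` field, and the helper `events_at_depth` are assumptions for illustration, not Strelka's documented output format.

```python
import json


def events_at_depth(lines, limit=15):
    """Return parsed events whose file.depth is at or above the limit."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue
        file_info = event.get("file")
        if not isinstance(file_info, dict):
            continue
        depth = file_info.get("depth")
        if isinstance(depth, int) and depth >= limit:
            hits.append(event)
    return hits


# Made-up sample lines, only to show the call.
sample_log = [
    '{"file": {"depth": 0, "name": "a.html"}}',
    '{"file": {"depth": 15, "name": "less.vim"}}',
    "not json",
]
hits = events_at_depth(sample_log)
```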

@ryanohoro
Collaborator

ryanohoro commented Jan 12, 2023

After analyzing a large volume of events, it's apparent that the MIME type matching for text/html is overzealous.

html_file: 1.05
text/html: 1.6
both: 1

I see two solutions:

  1. Remove the text/html mime type from the default ScanHtml configuration.

    While analyzing the data on this problem, it seems that most of what text/html catches but html_file does not is either not HTML or broken HTML (from split or partial responses). Exceptions include HTML files that start with whitespace or comments, which can be addressed by improving the html_file YARA rule.

  2. Prevent ScanHtml from being a child (source) of itself.

    This will prevent the recursion problem, and may be applicable in some other situations if implemented as a configuration option, since some scanners should normally recurse. However, it won't prevent mostly unhelpful analysis of files that will not yield interesting results.

    e.g.

  'ScanHtml':
    - positive:
        flavors:
          - 'hta_file'
          - 'text/html'
          - 'html_file'
      exclude_sources:
        - ScanHtml
      priority: 5
      options:
        parser: "html5lib"
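
As a rough illustration of how an `exclude_sources` check could work during scanner assignment, here is a sketch under stated assumptions. `exclude_sources` is the option proposed above, and `assign_scanners` is a hypothetical helper, not Strelka's real assignment code. The config dict mirrors the YAML example, and `source` is assumed to be the name of the scanner that extracted the file (`None` for a root file).

```python
def assign_scanners(flavors, source, config):
    """Return scanner names whose mapping matches and does not exclude the source."""
    assigned = []
    for scanner, mappings in config.items():
        for mapping in mappings:
            wanted = set(mapping.get("positive", {}).get("flavors", []))
            if not wanted & set(flavors):
                continue  # no flavor overlap, mapping does not apply
            if source in mapping.get("exclude_sources", []):
                continue  # file was extracted by an excluded scanner
            assigned.append(scanner)
            break  # one matching mapping per scanner is enough
    return assigned


config = {
    "ScanHtml": [
        {
            "positive": {"flavors": ["hta_file", "text/html", "html_file"]},
            "exclude_sources": ["ScanHtml"],
            "priority": 5,
            "options": {"parser": "html5lib"},
        }
    ]
}

root_scanners = assign_scanners(["text/html"], None, config)
child_scanners = assign_scanners(["text/html"], "ScanHtml", config)
```

With this check, a root file tasted as text/html is still handed to ScanHtml, while a child that ScanHtml itself extracted is not, which breaks the loop after one level.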

The attached file triggers the JavaScript variety of this bug.

search.js.txt

@phutelmyer phutelmyer changed the title HTML/JavaScript recursion [BUG] HTML/JavaScript recursion Feb 7, 2023