[BUG] HTML/JavaScript recursion #2

jshlbrd · 2018-09-25T16:23:21Z

Describe the bug
We've identified a bug in the HTML/JavaScript identification and extraction code. libmagic can incorrectly identify a file as "text/html" while YARA correctly identifies the same file as "javascript_file". When this happens, the ScanHtml scanner is applied to the JavaScript file and enters a recursive file extraction loop until the maximum depth is hit.

Steps to reproduce
Steps to reproduce the behavior:

1. Find an HTML file that contains embedded JavaScript that gets tasted as "text/html" by libmagic
2. Run the file through Strelka
3. Check for Python logs that describe "exceeded maximum depth" or scan results where the same HTML file is being repeatedly extracted

Expected behavior
JavaScript should not be tasted as HTML.

Screenshots
N/A

Server and project version

OS: Ubuntu Bionic
Commit Hash: N/A (first release)

Additional context
N/A

ryanohoro · 2023-01-12T00:14:38Z

I identified cases where this recursion was happening by searching for file.depth:15 (the default limit). The frequency is extremely low (0.00003%). The attached file, a Vim macro, triggers this bug.
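
One way to run the same kind of search offline is sketched below. It assumes events are stored as newline-delimited JSON with the depth under a nested `file.depth` field. That layout, the `at_depth_limit` helper, and the sample file names are illustrative assumptions, so adjust them to the actual event schema.

```python
import json

MAX_DEPTH = 15  # default depth limit mentioned above


def at_depth_limit(lines, max_depth=MAX_DEPTH):
    """Yield parsed events whose file.depth has reached max_depth.

    Assumes one JSON object per line with a nested "file" -> "depth" field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        depth = event.get("file", {}).get("depth")
        if depth is not None and depth >= max_depth:
            yield event


# Hypothetical sample events; real events carry many more fields.
sample = [
    '{"file": {"name": "macro.txt", "depth": 15}}',
    '{"file": {"name": "index.html", "depth": 1}}',
]
hits = list(at_depth_limit(sample))
hit_names = [event["file"]["name"] for event in hits]
print(hit_names)
```

In practice `lines` would be an open log file rather than an in-memory list.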

less.vim.txt

ryanohoro · 2023-01-12T02:40:50Z

After analyzing a large volume of events, it's apparent that the mime type matching for text/html is overzealous.

html_file: 1.05
text/html: 1.6
both: 1
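
How these figures were computed is not stated. As a rough sketch, counts of this kind could be gathered by tallying which events carry each flavor. The code assumes each event exposes its flavors under `file.flavors` with separate `mime` and `yara` lists. That layout and the `flavor_overlap` helper are assumptions for illustration.

```python
from collections import Counter


def flavor_overlap(events):
    """Count events tasted as html_file (YARA), text/html (MIME), or both.

    Assumes each event looks like
    {"file": {"flavors": {"mime": [...], "yara": [...]}}}.
    """
    counts = Counter()
    for event in events:
        flavors = event.get("file", {}).get("flavors", {})
        is_yara = "html_file" in flavors.get("yara", [])
        is_mime = "text/html" in flavors.get("mime", [])
        if is_yara:
            counts["html_file"] += 1
        if is_mime:
            counts["text/html"] += 1
        if is_yara and is_mime:
            counts["both"] += 1
    return counts


# Hypothetical sample events covering the three cases.
sample = [
    {"file": {"flavors": {"mime": ["text/html"], "yara": ["html_file"]}}},
    {"file": {"flavors": {"mime": ["text/html"], "yara": ["javascript_file"]}}},
    {"file": {"flavors": {"mime": ["text/plain"], "yara": ["html_file"]}}},
]
counts = flavor_overlap(sample)
print(dict(counts))
```

The second sample event is the problematic case: the MIME flavor says HTML while the YARA flavor says JavaScript.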

I see two solutions:

1. Remove the text/html mime type from the default ScanHtml configuration.

   While analyzing the data on this problem, it seems most of what text/html catches but html_file does not is either not HTML or broken HTML (from split or partial responses). Some exceptions are things like HTML files that start with white space or comments, which can be addressed by improving the html_file YARA rule.

2. Prevent ScanHtml from being a child (source) of itself.

   This will prevent the recursion problem, and may be applicable in some other situations if implemented as a configuration option. Some scanners should normally recurse. However, it won't prevent mostly unhelpful analysis of files that will not yield interesting results.

e.g.

  'ScanHtml':
    - positive:
        flavors:
          - 'hta_file'
          - 'text/html'
          - 'html_file'
      exclude_sources:
        - 'ScanHtml'
      priority: 5
      options:
        parser: "html5lib"
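
To illustrate the intent of the proposed `exclude_sources` key, here is a toy model of scanner assignment. It is not Strelka's actual dispatch code: `should_assign`, `simulate`, and the flat rule dictionaries are hypothetical, and the model assumes every extracted child is tasted with the same flavors as its parent, which is the failure mode described in this issue.

```python
def should_assign(rule, flavors, source):
    """Decide whether a scanner should run on a file.

    rule    -- dict with "flavors" and an optional "exclude_sources" list
    flavors -- set of flavors tasted for the file
    source  -- name of the scanner that extracted the file (None for the root)
    """
    if source in rule.get("exclude_sources", []):
        return False  # never scan files extracted by an excluded scanner
    return bool(flavors & set(rule["flavors"]))


def simulate(rule, max_depth=15):
    """Return the depth reached when each child is tasted like its parent."""
    depth, source = 0, None
    flavors = {"text/html", "javascript_file"}  # the mis-tasted JavaScript file
    while depth < max_depth and should_assign(rule, flavors, source):
        # ScanHtml extracts a child that is tasted the same way again.
        depth, source = depth + 1, "ScanHtml"
    return depth


base = {"flavors": ["hta_file", "text/html", "html_file"]}
guarded = dict(base, exclude_sources=["ScanHtml"])

print(simulate(base))     # runs until the depth limit
print(simulate(guarded))  # stops after the first extraction
```

Without the exclusion the loop only ends at the depth limit. With it, ScanHtml runs once on the root file and is skipped for its own children.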

The attached file triggers the JavaScript variety of this bug.

search.js.txt

jshlbrd added the bug label Sep 25, 2018

jshlbrd added bug and removed bug labels Dec 12, 2018

phutelmyer changed the title ~~HTML/JavaScript recursion~~ [BUG] HTML/JavaScript recursion Feb 7, 2023
