Improve URL boundary with quotations and parentheses #254

fhightower · 2022-07-09T13:25:53Z

Fixes #130

Key changes:

If the URL contains a ) but no (, we remove the ) (as we did previously) along with everything after it
We remove the ) and everything after it before stripping trailing quotation marks, so http://example.com/foo') is correctly parsed as http://example.com/foo

codecov-commenter · 2022-07-09T13:28:35Z

Codecov Report

Merging #254 (725a113) into main (f46e970) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #254   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines          408       408           
=========================================
  Hits           408       408

Impacted Files	Coverage Δ
ioc_finder/ioc_finder.py	`100.00% <100.00%> (ø)`
ioc_finder/ioc_grammars.py	`100.00% <100.00%> (ø)`


fhightower · 2022-07-11T09:10:58Z

ioc_finder/ioc_grammars.py

 # some urls are written with the ":" unencoded so we include it below
-url_path_word = Word(alphanums + "-._~!$&'()*+,;:=%")
+url_path_word = Word(alphanums + "-._~!$&'()*+,;=:%")


Switch order of characters to align w/ RFC.
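As a quick sanity check on the reordering, the sketch below compares the old and new character strings against the RFC 3986 character classes. The constant names are made up for illustration, and "%" stands in for the RFC's pct-encoded production. As far as I know, `Word` treats its argument as a set of allowed characters, so the reorder should be cosmetic rather than behavioral.

```python
# Punctuation from RFC 3986 "unreserved" (section 2.3) and "sub-delims" (section 2.2).
# These constant names are hypothetical; they do not exist in ioc_grammars.py.
UNRESERVED_PUNCT = "-._~"
SUB_DELIMS = "!$&'()*+,;="

old_chars = "-._~!$&'()*+,;:=%"
new_chars = "-._~!$&'()*+,;=:%"

# The new string reads as: unreserved punctuation, sub-delims, ":", then "%".
rfc_ordered = UNRESERVED_PUNCT + SUB_DELIMS + ":" + "%"

# Same characters in both versions; only the order differs.
same_set = set(old_chars) == set(new_chars)
matches_rfc_order = new_chars == rfc_ordered

print(same_set, matches_rfc_order)
```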

fhightower · 2022-07-11T09:14:53Z

tests/find_iocs_cases/urls.py

+    param(
+        "DownloadString('https://example[.]com/rdp.ps1');g $I DownloadString(\"https://example[.]com/rdp.ps2\");g $I",
+        {
+            "urls": ["https://example.com/rdp.ps1", "https://example.com/rdp.ps2"],
+        },
+        {"parse_domain_from_url": False},
+        id="URL boundary w/ single or double quotes handled properly",
+    ),
+    param(
+        "url,https://example.com/g/foo,Malicious Google Groups discussion",
+        {
+            "urls": ["https://example.com/g/foo"],
+        },
+        {"parse_domain_from_url": False},
+        id="URL boundary w/ comma handled properly",
+    ),


To resolve these two failing cases, my approach will be to:

Identify the character before the URL

If the preceding character is not whitespace, either:

Stop parsing once that character is hit again

Remove that character and anything after it from the URL

This will not handle cases like #261, where the URL is not preceded by a comma (it is only followed by one).

I am comfortable removing punctuation from the end of a URL, or cutting the URL at a character if the URL is preceded by that same character, but beyond that we would have to break significantly with the RFC (which is something I need to think about more).
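A minimal sketch of the "remove that character and anything after it" option described above. The helper name `trim_at_preceding_char` and its signature are hypothetical and are not part of ioc_finder; it assumes the caller already knows where the raw URL match starts in the text.

```python
def trim_at_preceding_char(text: str, url_start: int, url: str) -> str:
    """Cut `url` at the first repeat of the character that precedes it in `text`.

    Hypothetical helper for illustration only.
    """
    if url_start == 0:
        # Nothing precedes the URL, so there is no boundary character to use.
        return url
    preceding = text[url_start - 1]
    if preceding.isspace():
        # Whitespace-delimited URLs are left untouched.
        return url
    # Drop the boundary character and everything after it.
    return url.split(preceding, 1)[0]


csv_text = "url,https://example.com/g/foo,Malicious Google Groups discussion"
csv_raw = "https://example.com/g/foo,Malicious"
csv_trimmed = trim_at_preceding_char(csv_text, csv_text.find(csv_raw), csv_raw)

quoted_text = 'DownloadString("https://example.com/rdp.ps2");g $I'
quoted_raw = 'https://example.com/rdp.ps2");g'
quoted_trimmed = trim_at_preceding_char(quoted_text, quoted_text.find(quoted_raw), quoted_raw)

print(csv_trimmed)     # https://example.com/g/foo
print(quoted_trimmed)  # https://example.com/rdp.ps2
```

Taken literally, "not whitespace" would also treat characters such as ":" or "/" as boundaries (for example in `URL:https://...`), so a real implementation would probably need to restrict which preceding characters qualify.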

fhightower · 2022-08-20T08:32:18Z

ioc_finder/ioc_grammars.py

-    + Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action(
-        pyparsing_common.downcase_tokens
-    )
+    + Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action(pyparsing_common.downcase_tokens)


Lint update

fhightower · 2022-08-20T08:32:24Z

ioc_finder/ioc_grammars.py

-mac_address_16_bit_section = Combine(
-    (Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2)
-)
+mac_address_16_bit_section = Combine((Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2))


Lint update

fhightower · 2022-08-20T08:34:11Z

tests/test_ioc_finder.py

+    assert sorted(
+        [
+            r"HKEY_LOCAL_MACHINE\Software\Microsoft\Windows",
+            r"HKLM\Software\Microsoft\Windows",
+            r"HKCC\Software\Microsoft\Windows",
+        ]
+    ) == sorted(iocs["registry_key_paths"])


Lint update

fhightower · 2022-08-20T09:11:15Z

ioc_finder/ioc_finder.py

+def _clean_url(url: str) -> str:
+    """Clean the given URL, removing common, unwanted characters which are usually not part of the URL."""
+    # if there is a ")" in the URL and not a "(", remove everything including and after the ")"
+    if ")" in url and "(" not in url:
+        url = url.split(")")[0]
+
+    # remove `"` and `'` characters from the end of a URL
+    url = url.rstrip('"').rstrip("'")
+
+    # remove `'/>` and `"/>` from the end of a URL (this character string occurs at the end of an HTML tag)
+    url = string_remove_from_end(url, "'/>")
+    url = string_remove_from_end(url, '"/>')
+
+    return url


Three changes to note:

Moving this into its own function for easier extensibility in the future

Removing ) before rstripping quotation marks so cases like http://example.com/foo') are properly parsed as http://example.com/foo

Previously, we were just doing url.rstrip(")") to remove a closing parenthesis at the end of a URL; now we do url = url.split(")")[0] to remove the parenthesis and everything after it.
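To illustrate why the ordering matters, here is a small standalone comparison. Both function names are hypothetical. The first is only an assumption about what a "strip quotes, then `rstrip(")")`" ordering would look like, not a copy of the previous code. The second mirrors the first two steps of `_clean_url` above.

```python
def strip_quotes_then_paren(url: str) -> str:
    """Assumed older-style ordering: trailing quotes first, then a trailing ')'."""
    url = url.rstrip('"').rstrip("'")
    if ")" in url and "(" not in url:
        url = url.rstrip(")")
    return url


def cut_paren_then_strip_quotes(url: str) -> str:
    """Ordering described in this PR: cut at ')' first, then strip trailing quotes."""
    if ")" in url and "(" not in url:
        url = url.split(")")[0]
    return url.rstrip('"').rstrip("'")


sample = "http://example.com/foo')"
print(strip_quotes_then_paren(sample))      # http://example.com/foo'
print(cut_paren_then_strip_quotes(sample))  # http://example.com/foo

# A URL with balanced parentheses is left alone by both orderings.
balanced = "http://example.com/a_(b)"
print(cut_paren_then_strip_quotes(balanced))  # http://example.com/a_(b)
```

With the quote strip running first, the trailing `)` hides the quote from `rstrip`, so the quote survives; cutting at `)` first exposes it.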

Update changelog

6c6bce5

Add failing test case

0e54956

Improve handling of ')' character

20b3aa5

fhightower mentioned this pull request Aug 18, 2022

Improve URL boundary regarding commas (Old name: Wrong URL Pattern being Parsed) #261

Open

Fix lint errors

5115f04

Improve url cleaning

1a81f93

fhightower changed the title ~~Improve URL boundary~~ Improve URL boundary with quotations and parentheses Aug 20, 2022

fhightower added 2 commits August 20, 2022 05:19

Clarify changelog

7536ed9

Merge branch 'main' into 130-url-boundary-fix

8d22840

fhightower marked this pull request as ready for review August 20, 2022 09:28

Clarify changelog

725a113

fhightower merged commit ee715ec into main Aug 20, 2022

fhightower deleted the 130-url-boundary-fix branch August 20, 2022 09:32
