-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve URL boundary with quotations and parentheses #254
Conversation
Codecov Report
@@ Coverage Diff @@
## main #254 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 4 4
Lines 408 408
=========================================
Hits 408 408
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
# some urls are written with the ":" unencoded so we include it below | ||
url_path_word = Word(alphanums + "-._~!$&'()*+,;:=%") | ||
url_path_word = Word(alphanums + "-._~!$&'()*+,;=:%") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switch order of characters to align w/ RFC.
tests/find_iocs_cases/urls.py
Outdated
param( | ||
"DownloadString('https://example[.]com/rdp.ps1');g $I DownloadString(\"https://example[.]com/rdp.ps2\");g $I", | ||
{ | ||
"urls": ["https://example.com/rdp.ps1", "https://example.com/rdp.ps2"], | ||
}, | ||
{"parse_domain_from_url": False}, | ||
id="URL boundary w/ single or double quotes handled properly", | ||
), | ||
param( | ||
"url,https://example.com/g/foo,Malicious Google Groups discussion", | ||
{ | ||
"urls": ["https://example.com/g/foo"], | ||
}, | ||
{"parse_domain_from_url": False}, | ||
id="URL boundary w/ comma handled properly", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To resolve these two failing cases, my approach will be to:
- Identify the character before the URL
- If the preceding character is not whitespace, either:
- Stop parsing once that character is hit again
- Remove that character and anything after it from the URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not handle cases like #261 where the url is not preceded by a comma (it is only postceded by one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am comfortable removing punctuation from the end of a URL or it the URL is preceded by the same character, but other than that, we would have to break significantly with the RFC (which is something I need to think about more).
+ Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action( | ||
pyparsing_common.downcase_tokens | ||
) | ||
+ Combine(one_of("pub- PUB-") + Word(nums, exact=16)).set_parse_action(pyparsing_common.downcase_tokens) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint update
mac_address_16_bit_section = Combine( | ||
(Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2) | ||
) | ||
mac_address_16_bit_section = Combine((Word(hexnums, exact=2) + one_of("- :")) * 5 + Word(hexnums, exact=2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint update
assert sorted( | ||
[ | ||
r"HKEY_LOCAL_MACHINE\Software\Microsoft\Windows", | ||
r"HKLM\Software\Microsoft\Windows", | ||
r"HKCC\Software\Microsoft\Windows", | ||
] | ||
) == sorted(iocs["registry_key_paths"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lint update
def _clean_url(url: str) -> str: | ||
"""Clean the given URL, removing common, unwanted characters which are usually not part of the URL.""" | ||
# if there is a ")" in the URL and not a "(", remove everything including and after the ")" | ||
if ")" in url and "(" not in url: | ||
url = url.split(")")[0] | ||
|
||
# remove `"` and `'` characters from the end of a URL | ||
url = url.rstrip('"').rstrip("'") | ||
|
||
# remove `'/>` and `"/>` from the end of a URL (this character string occurs at the end of an HMTL tag with ) | ||
url = string_remove_from_end(url, "'/>") | ||
url = string_remove_from_end(url, '"/>') | ||
|
||
return url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three changes to note:
- Moving this into its own function for easier extensibility in the future
- Removing
)
before rstriping quotation marks so cases likehttp://example.com/foo')
are properly parsed ashttp://example.com/foo
- Previously, we were just doing
url.rstrip(")")
to remove a closing paren. at the end of a URL; now we dourl = url.split(")")[0]
to remove the paren and everything after it.
Fixes #130
Key changes:
)
and not a(
, we remove the)
(which we previously did) and everything after it)
and everything after it before we remove trailing quotation marks sohttp://example.com/foo')
is properly parsed ashttp://example.com/foo