# investigate optimization of rule matching (May, 2024) #2063
Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread and isn't something we can easily improve. For example, use the sandwich diagram to identify routines that take up a lot of runtime.
### lots of time spent in `__instancecheck__`

5.5% of runtime is spent in `isinstance` checks (Line 393 in 824e852).
Here's what's happening: in order to evaluate a bytes feature, capa scans linearly through every feature in the feature set, using `isinstance` to find the candidates of the right type. This is not good, because given that there are dozens to thousands of features at some scopes (like a large function), and we might look for multiple bytes features, we spend a lot of time scanning through all the features repeatedly. Some things we could do to fix this: (1) index the scanned feature types (like bytes) separately so candidates can be found without `isinstance` checks, or (2) redesign the FeatureSet structure itself.
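For illustration, here is a toy sketch of why the current scan is costly. The feature classes are hypothetical, simplified stand-ins for capa's real ones:

```python
# hypothetical, simplified stand-ins for capa's feature classes
class Feature: ...

class Bytes(Feature):
    def __init__(self, value: bytes):
        self.value = value

class Mnemonic(Feature):
    def __init__(self, value: str):
        self.value = value

# a feature set: feature -> addresses where it was observed
features = {
    Mnemonic("mov"): {0x401000},
    Bytes(b"\x90\x90"): {0x401005},
}

def evaluate_bytes(wanted: bytes) -> set:
    # current style: touch every feature and isinstance-check it,
    # even though only the Bytes entries can ever match
    locations = set()
    for feature, addrs in features.items():
        if isinstance(feature, Bytes) and feature.value.startswith(wanted):
            locations |= addrs
    return locations

assert evaluate_bytes(b"\x90") == {0x401005}
```

Every bytes feature in a rule repeats this full scan, which is where the `__instancecheck__` time goes.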
We might be able to squeeze (1) in without any major breaking changes, to demonstrate this strategy is worthwhile. But it's likely to make the code more complex. With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.

Today:

```python
FeatureSet = Dict[Feature, Set[Address]]
```

or more concretely:

```python
{
    API("CreateFile"): [0x401000, 0x401005],
    API("ExitProcess"): [0x40100A],
    String("foo"): [0x40100B],
    Mnemonic("mov"): [0x401000, ...],
    ...
}
```

When we have a rule that requires a feature, we can do a dict lookup to see if the feature is present, and where it's found:

```python
assert API("CreateFile") in features
```

We could move to a more explicit structure like:

```python
@dataclass
class FeatureSet:
    api: Dict[str, List[Address]]
    string: Dict[str, List[Address]]
    mnemonic: Dict[str, List[Address]]
    number: Dict[int, List[Address]]
    ...
```

or more concretely:

```python
features = FeatureSet(
    api={
        "CreateFile": [0x401000, 0x401005],
        "ExitProcess": [0x40100A],
    },
    string={
        "foo": [0x40100B],
    },
    mnemonic={
        "mov": [0x401000, ...],
    },
)
```

The primary benefit is that features that don't match by hash lookup, like string/bytes/regex/substring, can be matched much more efficiently by only considering the candidates of the right type. On the other hand, this would be yet another structure that has to be updated every time we introduce a new feature (uncommon). It might be possible to do a hybrid, like:

```python
@dataclass
class FeatureSet:
    bytes: Dict[bytes, Set[Address]]
    string: Dict[str, List[Address]]
    other: Dict[Feature, List[Address]]
```

Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason. Note that we can't use evaluation counts to demonstrate performance improvements here. Rather, we need to use benchmarking and wall clock time to show we're going in the right direction. See also the discussion below on making Bytes hashable, for common prefix lengths.
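To make the hybrid tradeoff concrete, here is a minimal, self-contained sketch. The `HybridFeatureSet` name and the simplified types are hypothetical; capa's real Feature and Address types are richer:

```python
from dataclasses import dataclass, field
from typing import Dict, Set

Address = int  # simplified stand-in for capa's Address type

@dataclass
class HybridFeatureSet:
    # type-specific buckets for the features that require scanning today
    bytes_: Dict[bytes, Set[Address]] = field(default_factory=dict)
    string: Dict[str, Set[Address]] = field(default_factory=dict)
    # everything else still matches by plain hash lookup
    other: Dict[str, Set[Address]] = field(default_factory=dict)

    def match_string(self, wanted: str) -> Set[Address]:
        # exact strings resolve via one dict lookup instead of a
        # linear scan over every feature in the set
        return self.string.get(wanted, set())

features = HybridFeatureSet(
    string={"foo": {0x40100B}},
    other={"api(CreateFile)": {0x401000, 0x401005}},
)
assert features.match_string("foo") == {0x40100B}
assert features.match_string("bar") == set()
```

Substring/regex evaluation would then iterate only over `features.string`, a much smaller candidate set than the full feature dict.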
### pre-filter strings, bytes based on what's found in the file

To avoid searching for strings/bytes that won't ever be found at a particular scope, we could first check that each string/bytes is present somewhere in the file. If it's not, then we can partially evaluate some rule logic, pruning branches that can never match. For example, we have HTTP User-Agent rules that contain tons of strings that match under a single `or`. We'd want to ensure that the up-front scan to find the file matches doesn't take much time, and that it doesn't outweigh any performance improvements. Remember we may have hundreds or thousands of terms to look for. We can also use evaluation counts to show that less logic needs to be matched when some branches are pruned.
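A minimal sketch of the up-front scan, assuming a simple UTF-8 substring check (a real implementation would also consider UTF-16LE encodings and bytes features):

```python
def prefilter_strings(file_data: bytes, wanted: list[str]) -> set[str]:
    """Return the subset of wanted strings present anywhere in the file.

    Strings absent from the whole file can never match at any scope,
    so rule branches that require them can be pruned before evaluation.
    """
    present = set()
    for s in wanted:
        # hypothetical simplification: UTF-8 only; real files also
        # contain UTF-16LE strings that would need a second check
        if s.encode("utf-8") in file_data:
            present.add(s)
    return present

data = b"...Mozilla/5.0...curl/8.0..."
hits = prefilter_strings(data, ["Mozilla/5.0", "curl/8.0", "Wget/1.21"])
assert hits == {"Mozilla/5.0", "curl/8.0"}
```

The scan is one pass per term over the raw file, so with thousands of terms a multi-pattern algorithm (e.g. Aho-Corasick) might be needed to keep the up-front cost low.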
### prune rule logic using global features

Once the global features are known (arch, os, format, etc.), then prune logic from the rules that won't ever match. This way, the same global features are not re-evaluated over and over again, for each instance of the scope.
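A sketch of the idea as a one-time fold over a rule's logic tree. The node classes here are hypothetical simplifications of capa's statement types:

```python
from typing import Dict, Optional

class Node:
    pass

class GlobalFeature(Node):
    # e.g. GlobalFeature("os", "windows")
    def __init__(self, name: str, value: str):
        self.name, self.value = name, value

class Other(Node):
    """stand-in for any non-global feature checked per scope"""

class And(Node):
    def __init__(self, *children: Node):
        self.children = list(children)

class Or(Node):
    def __init__(self, *children: Node):
        self.children = list(children)

NEVER = None  # marker: subtree can never match given these globals

def prune(node: Node, globals_: Dict[str, str]) -> Optional[Node]:
    # fold the known global features out of the tree once, up front
    if isinstance(node, GlobalFeature):
        # a satisfied global stays (it trivially matches);
        # a mismatched one kills its branch
        return node if globals_.get(node.name) == node.value else NEVER
    if isinstance(node, And):
        kids = [prune(c, globals_) for c in node.children]
        if any(k is NEVER for k in kids):
            return NEVER
        return And(*kids)
    if isinstance(node, Or):
        kids = [k for c in node.children if (k := prune(c, globals_)) is not NEVER]
        return Or(*kids) if kids else NEVER
    return node

rule = Or(
    And(GlobalFeature("os", "windows"), Other()),
    And(GlobalFeature("os", "linux"), Other()),
)
pruned = prune(rule, {"os": "windows"})
assert isinstance(pruned, Or) and len(pruned.children) == 1
```

After pruning, scope evaluation never re-checks `os`/`arch`/`format`, and entire rules whose tree folds to `NEVER` can be skipped outright.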
This logic is used to evaluate other features, including Substring, Regex, and OS, leading me to think that (2) may be a better approach.
@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time; we'll discuss more in our upcoming meetings 😄.
### tighten rule pre-selection

Today, capa indexes the features referenced by each rule and uses them to pre-select which rules to evaluate against a given feature set.

(aside: what is the distribution of feature counts by scope? Are there optimizations that only make sense at small/large scopes?)

I think we can do even better, at least in some common scenarios, and save more time. Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective". For example, consider the rule:

```yaml
- instruction:
  - and:
    - number: 6789
    - mnemonic: mov
```

Today, anytime we encounter a feature set with any single one of those features, we evaluate the full rule. Because `mnemonic: mov` is extremely common, this rule ends up being fully evaluated all the time.

Here, I propose that we only index one of the features (the "best, most selective one") that must be present for the full rule to match. The intuition is that some features will be somewhat common, and if we index those, then we end up evaluating a lot more rules. If we only index features that are uncommon, then we're less likely to have false positives in this pre-filtering step, and we save time. I think `number: 6789` is the better candidate here.

Note that for rules like the following, we have to pick a feature from each branch of the `or`:

```yaml
- or:
  - and:
    - number: 1
    - number: 2
  - and:
    - offset: 3
    - offset: 4
```

We should be able to use raw evaluation counts to demonstrate that this technique is working. Note that this will introduce a bit more logic to the initial RuleSet constructor: in addition to walking all the features, we have to track the minimal set of features and their scores. But this is a one-time initialization cost that will pay dividends across every scope evaluation (every instruction, every bb, every function, ...).
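A sketch of the selection step, using hypothetical rarity scores (higher = rarer = more selective) that a real implementation would derive from corpus statistics:

```python
# hypothetical rarity scores; in practice these would be measured
# across a corpus of feature extractions
RARITY = {
    "mnemonic(mov)": 1,     # extremely common, a poor selector
    "number(6789)": 100,    # rare, a good selector
}

def pick_selector(mandatory_features: list[str]) -> str:
    # index only the feature least likely to occur by chance, so
    # fewer rules are pre-selected for (expensive) full evaluation
    return max(mandatory_features, key=lambda f: RARITY.get(f, 10))

assert pick_selector(["mnemonic(mov)", "number(6789)"]) == "number(6789)"
```

For an `or` at the top of a rule, this selection would run once per branch, and the rule is indexed under the selector of each branch.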
### slow mimikatz function

Just 300 features take 1.5s to evaluate! Why is this so? Maybe if we investigate this one function, we can make fixes that help all functions. Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated. 6964 total features.
### FeatureSet size distributions

From mimikatz using the Ghidra BinExport2 backend. These plots show the distribution of sizes of FeatureSets by scope. In summary, instructions usually have fewer than 10 features, basic blocks fewer than 20, and functions fewer than 100.

(plots: feature-count distributions at instruction, basic block, and function scope)

We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers). (notes for future willi)
### hash lookup common bytes length prefixes

Today, we match bytes by doing a prefix search against encountered bytes (up to 0x100 long). Since many sequences of bytes we search for have some structure (well, a common length), like a GUID or a cryptographic S-box, we can optimize some of these searches by indexing the bytes by their prefix (for common lengths, like 8, 16, 32, and 64 bytes). Then, when the wanted bytes feature has this same length, we can do a hash lookup instead of a linear scan. This can also help the rule logic planner, since it can pre-filter more rules when the hashable features are known. The tradeoff is that we generate N (probably 4-5) more features per bytes feature.

We should definitely do 16 (the size of a GUID). 8, 256, and 64 also look nice and round (and probably not domain-specific), so consider those. 9 comes from OpenSSL SHA constants. 171 comes from Tiger S-boxes.

Against mimikatz with the changes in #2080, we have the following evaluation counts by Bytes feature size:

Indexing the power-of-2 lengths would save about 49% of the scanning evaluations. I'm not sure what this costs in runtime. Will investigate before going deeper.
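A sketch of the prefix index, assuming the common lengths proposed above (function and variable names are illustrative, not capa's):

```python
# common prefix lengths to index; 16 == sizeof(GUID)
PREFIX_LENGTHS = (8, 16, 32, 64)

def index_bytes(found: list[tuple[bytes, int]]) -> dict:
    # found: (byte sequence extracted at an address)
    # build one dict per indexed length, keyed by the prefix of that length
    index: dict = {n: {} for n in PREFIX_LENGTHS}
    for buf, addr in found:
        for n in PREFIX_LENGTHS:
            if len(buf) >= n:
                index[n].setdefault(buf[:n], set()).add(addr)
    return index

def lookup(index: dict, wanted: bytes) -> set:
    # O(1) hash lookup when the wanted length is an indexed prefix length;
    # other lengths would fall back to the existing linear prefix scan
    if len(wanted) in index:
        return index[len(wanted)].get(wanted, set())
    return set()

idx = index_bytes([(b"\x01" * 64, 0x401000)])
assert lookup(idx, b"\x01" * 16) == {0x401000}
assert lookup(idx, b"\x02" * 16) == set()
```

This also makes such Bytes features hashable for the pre-selection index discussed above.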
### hash lookup case insensitive strings

Today, we match case-insensitive strings by invoking the regex matching engine. But we have quite a few of these features in our rule set (409???). Like with bytes above, consider emitting lowercased strings as features, and matching case-insensitive strings via hash lookup. The tradeoff is one additional feature emitted per string, and a bit more code complexity. Again, I think this may be able to help the rule logic planner, since it can pre-filter more rules when the hashable features are known.
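A sketch of the emission side, using plain dicts in place of capa's feature types:

```python
def emit_string_features(extracted: dict[str, set]) -> dict[str, set]:
    # emit each extracted string twice: once verbatim (for exact-match
    # rules) and once lowercased (for case-insensitive rules), so both
    # resolve via a single hash lookup instead of the regex engine
    features: dict[str, set] = {}
    for s, addrs in extracted.items():
        features.setdefault(s, set()).update(addrs)          # exact match
        features.setdefault(s.lower(), set()).update(addrs)  # /i match
    return features

feats = emit_string_features({"User-Agent": {0x401000}})
# a case-insensitive rule string is lowercased once at rule-load time,
# then looked up directly
assert feats["user-agent"] == {0x401000}
```

When the string is already lowercase, the two keys coincide, so the worst case is one extra feature per string.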
### lots of time spent in `__instancecheck__`

As you mentioned @williballenthin, this would require a redesign of the FeatureSet structure:

```python
from typing import Dict, Set, Union

FeatureType = Union[Bytes, Regex, String, Substring, Arch, OS, Format]
FeatureSet = Dict[str, Dict[FeatureType, Set[Address]]]

feature_set: FeatureSet = {
    "bytes": {},
    "regex": {},
    "string": {},
    ...
}

# Add a Bytes feature
bytes_feature = Bytes(b"\x90\x90")
feature_set["bytes"][bytes_feature] = {absolute(0x401000), absolute(0x402000)}
```

With this, we could avoid iterating over all features.

### prune rule logic using global features
Here, do you mean that, for example, if the extracted `os` global feature is `windows`, any rule branch requiring a different OS can be pruned before per-scope evaluation begins?
This is correct. For example, the mnemonic frequency distribution can be computed like:

```shell
objdump -dj .text --no-show-raw-insn --no-addresses mimikatz.exe_ | grep -P '^\t' | awk -F '\t' -F ' ' '{ count[$1]++ } END { total = 0; for (i in count) total += count[i]; for (i in count) { printf("%d\t%0.2f%%\t%s\n", count[i], count[i] * 100 / total, i) } }' | sort -nr | head
```
As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means that if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.
In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.