
investigate optimization of rule matching (May, 2024) #2063

Open
williballenthin opened this issue May 6, 2024 · 12 comments

@williballenthin
Collaborator

As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means, if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.

In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.

@williballenthin
Collaborator Author

Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread, and isn't something we can easily improve.

profile-be2.speedscope.zip

For example, use the sandwich diagram to identify routines that take up a lot of runtime:

[screenshot: speedscope sandwich view highlighting the hottest routines]

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

lots of time spent in instancecheck

5.5% of runtime is spent in __instancecheck__, including about 2.5% of total runtime on the line here:
[screenshot: profiler attribution for the isinstance line below]

if not isinstance(feature, (Bytes,)):

Here's what's happening: in order to evaluate a bytes feature at some scope, we enumerate all the features at the scope, using isinstance to find those that are Bytes, and then do a prefix match.

This is not good: there can be dozens to thousands of features at some scopes (like a large function), and we might look for multiple bytes features, so we spend a lot of time scanning through all the features repeatedly.

Some things we could do to fix this:

  1. collect all the bytes features once, and re-use this list when evaluating each bytes feature
  2. store each "type" of feature in its own list, so that we can quickly find the features to evaluate

We might be able to squeeze (1) in without any major breaking changes to demonstrate this strategy is worthwhile. But it's likely to make the code more complex.

With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.
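
For illustration, here's a minimal sketch of option (1), using a stand-in Bytes class rather than capa's real feature types:

from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass(frozen=True)
class Bytes:  # stand-in for capa's Bytes feature
    value: bytes

def collect_bytes_features(features: Dict[object, Set[int]]) -> List[Tuple[bytes, Set[int]]]:
    # one isinstance pass per scope, instead of one pass per queried bytes feature
    return [(f.value, addrs) for f, addrs in features.items() if isinstance(f, Bytes)]

def evaluate_bytes(wanted: bytes, cached: List[Tuple[bytes, Set[int]]]) -> Set[int]:
    # prefix match against only the cached bytes candidates
    hits: Set[int] = set()
    for value, addrs in cached:
        if value.startswith(wanted):
            hits |= addrs
    return hits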


Today, FeatureSet looks like:

FeatureSet = Dict[Feature, Set[Address]]

or more concretely:

{
    API("CreateFile"): [0x401000, 0x401005],
    API("ExitProcess"): [0x40100A],
    String("foo"): [0x40100B],
    Mnemonic("mov"): [0x401000, ...]
   ...
}

When we have a rule that requires a feature, we can do a dict lookup to see if the feature is present, and where it's found:

assert API("CreateFile") in features

We could move to a more explicit structure like:

@dataclass
class FeatureSet:
  api: Dict[str, List[Address]]
  string: Dict[str, List[Address]]
  mnemonic: Dict[str, List[Address]]
  number: Dict[int, List[Address]]
  ...

or more concretely:

features = FeatureSet(
  api={
    "CreateFile": [0x401000, 0x401005],
    "ExitProcess": [0x40100A],
  },
  string={
    "foo": [0x40100B],
  },
  mnemonic={
    "mov": [0x401000, ...]
  }
)

The primary benefit is that features that don't match by hash lookup, like string/bytes/regex/substring, can be much more efficiently matched by only considering the candidates of the right type.

On the other hand, this would be yet another structure that has to be updated every time we introduce a new feature (uncommon).

It might be possible to do a hybrid, like:

@dataclass
class FeatureSet:
  bytes: Dict[bytes, Set[Address]]
  string: Dict[str, List[Address]]
  other: Dict[Feature, List[Address]]

Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason.
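
Here's a minimal sketch of evaluation against the hybrid, with illustrative names rather than capa's actual API:

from __future__ import annotations  # keep the `bytes` field from shadowing the builtin in annotations

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class HybridFeatureSet:
    bytes: Dict[bytes, Set[int]] = field(default_factory=dict)
    string: Dict[str, Set[int]] = field(default_factory=dict)
    other: Dict[object, Set[int]] = field(default_factory=dict)

    def match_bytes_prefix(self, wanted: bytes) -> Set[int]:
        # scan only the bytes candidates, not every feature at the scope
        hits: Set[int] = set()
        for value, addrs in self.bytes.items():
            if value.startswith(wanted):
                hits |= addrs
        return hits

    def match_hashable(self, feature: object) -> Set[int]:
        # hashable features keep their O(1) dict lookup
        return self.other.get(feature, set())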

Note that we can't use evaluation counts to demonstrate performance improvements here. Rather, we need to use benchmarking and wall clock time to show we're going in the right direction.


See also discussion below on making Bytes hashable, for common prefix lengths.

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

pre-filter strings, bytes based on what's found in the file

To avoid searching for strings/bytes that won't ever be found at a particular scope, we could first check that each string/bytes is present somewhere in the file.

If it's not, then we can partially evaluate some rule logic (like and statements) to see if further logic can be pruned and/or rules skipped.

For example, we have HTTP User-Agent rules that contain tons of strings that match under a single or. If none are present in the file, we can skip the whole rule.

We'd want to ensure that the up-front scan to find the file matches doesn't take much time, and that it doesn't outweigh any performance improvements. Remember we may have hundreds or thousands of terms to look for. We can also use evaluation counts to show that less logic needs to be matched when some branches are pruned.
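
Here's a minimal sketch of the pre-filtering pass, assuming we can enumerate all string and bytes terms across the rule set up front:

from typing import Set, Tuple

def prefilter_terms(buf: bytes, strings: Set[str], byte_seqs: Set[bytes]) -> Tuple[Set[str], Set[bytes]]:
    # one up-front pass over the raw file; a term absent here can never match
    # at any scope, so its leaf evaluates to False and or/and branches above
    # it can be pruned before per-scope matching begins.
    present_strings = {s for s in strings
                       if s.encode("utf-8") in buf or s.encode("utf-16-le") in buf}
    present_bytes = {b for b in byte_seqs if b in buf}
    return present_strings, present_bytes

With hundreds or thousands of terms, repeated substring scans like this may dominate; a single multi-pattern pass (e.g. Aho-Corasick) would keep the up-front cost near one scan of the file.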

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

prune rule logic using global features

Once the global features are known (arch, os, format, etc.), prune logic from the rules that won't ever match. This way, the same global features are not re-evaluated over and over again, for each instance of the scope.

Edit:
I think it will be fine to prune logic away from the trees, but I don't think it's easy to "pre-evaluate" global features. We would have to find a way to extend the logic tree nodes to cache a "pre-evaluated" result, which I think might be tricky/complex/fragile.
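
Here's a minimal sketch of the tree pruning, using stand-in And/Or/GlobalFeature node types rather than capa's real statement classes:

from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class GlobalFeature:  # stand-in for capa's os/arch/format features
    kind: str
    value: str

@dataclass
class And:
    children: List[object]

@dataclass
class Or:
    children: List[object]

def prune(node: object, globals_: Set[GlobalFeature]) -> object:
    # returns a simplified node, or True/False when the subtree is fully decided
    if isinstance(node, GlobalFeature):
        return node in globals_
    if isinstance(node, And):
        kept = []
        for child in (prune(c, globals_) for c in node.children):
            if child is False:
                return False  # one impossible conjunct kills the whole branch
            if child is not True:
                kept.append(child)
        return And(kept) if kept else True
    if isinstance(node, Or):
        kept = []
        for child in (prune(c, globals_) for c in node.children):
            if child is True:
                return True
            if child is not False:
                kept.append(child)
        return Or(kept) if kept else False
    return node  # non-global leaves are left for per-scope evaluation

A rule whose tree prunes to False can be dropped from the run entirely; statements like not and optional would need extra care.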

@mike-hunhoff
Collaborator

(quoting the "lots of time spent in instancecheck" comment above)

This logic is used to evaluate other features, including Substring, Regex, and OS, leading me to think that (2) may be the better approach.

@mike-hunhoff
Collaborator

@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time; we'll discuss more in our upcoming meetings 😄.

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

tighten rule pre-selection

In RuleSet we have a match routine that uses knowledge of the complete set of rules to more efficiently match features. Essentially, it indexes all features across all the rules, and only attempts to match rules that share a feature with the given feature set. That is, if an instruction's feature set contains only the single feature api(CreateFile), then the matcher will only check rules that have that same feature (of which there are probably about 5, out of ~700 total rules). This saves quite a bit of computation, since most rules don't have the features found in small feature sets.

(aside: what is the distribution of feature counts by scope? Are there optimizations that only make sense at small/large scopes?)

I think we can do even better, at least in some common scenarios, and save more time. Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".

For example, consider the rule:

- instruction:
  - and:
    - number: 6789
    - mnemonic: mov

Today, anytime we encounter a feature set with any single one of those features, we evaluate the full rule. Because mov is a common instruction, this rule will be evaluated frequently, in around 20% of all instructions, even though the number 6789 is seen extremely rarely and must also be present for the rule to match.

Here, I propose that we only index one of the features (the "best, most selective one") that must be present for the full rule to match.

The intuition is that some features will be somewhat common, and if we index those, then we end up evaluating a lot more rules. If we only index features that are uncommon, then we're less likely to have false positives in this pre-filtering step, and we save time.

I think mnemonic: mov is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large-scale run to collect features seen in the wild and score by global prevalence. I think either will work.

Note that for rules like the following, we have to pick a feature from each branch of the or. That's ok - we just want to index a minimal number of features.

- or:
  - and:
    - number: 1
    - number: 2
  - and:
    - offset: 3
    - offset: 4

We should be able to use raw evaluation counts to demonstrate that this technique is working.

Note that this will introduce a bit more logic into the initial RuleSet constructor: in addition to walking all the features, we have to track the minimal set of features and their scores. But this is a one-time initialization cost that will pay dividends across every scope evaluation (every instruction, every bb, every function, ...).
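
Here's a minimal sketch of picking the features to index, with hypothetical hand-tuned selectivity scores and (kind, value) tuples standing in for real feature objects:

from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class And:
    children: List[object]

@dataclass
class Or:
    children: List[object]

Feature = Tuple[str, object]  # e.g. ("mnemonic", "mov")
SELECTIVITY = {"api": 100, "string": 90, "bytes": 80, "number": 30, "offset": 20, "mnemonic": 1}

def score(feature: Feature) -> int:
    return SELECTIVITY.get(feature[0], 10)

def pick_index_features(node: object) -> Set[Feature]:
    # returns a small set of features, at least one of which must be present
    # in any feature set that could match this subtree
    if isinstance(node, Or):
        # a match may come through any branch, so take one pick per branch
        picked: Set[Feature] = set()
        for child in node.children:
            picked |= pick_index_features(child)
        return picked
    if isinstance(node, And):
        # every child is individually required; keep only the most selective pick
        return max((pick_index_features(c) for c in node.children),
                   key=lambda feats: min(score(f) for f in feats))
    return {node}  # leaf feature

rule = And([("number", 6789), ("mnemonic", "mov")])
assert pick_index_features(rule) == {("number", 6789)}

As with pruning above, not and optional statements would need special handling, since their children aren't strictly required.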

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

slow mimikatz function

DEBUG:capa.capabilities.static:analyzed function 0x420a81 and extracted 325 features, 2 matches in 1.57s                                                                                            

Just 325 features take 1.57s to evaluate! Why is this so? Maybe if we investigate this one function, we can make fixes that help all functions.


Ok, well that is gross:
[screenshot: control flow graph of 0x420a81, made up of hundreds of tiny basic blocks]

Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated.

6964 total features.
852 basic blocks!
4442 instructions.

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

FeatureSet size distributions

From mimikatz using Ghidra BinExport2 backend. These plots show the distribution of sizes of FeatureSets by scope. In summary, instructions usually have less than 10 features, basic blocks less than 20, and functions less than 100.

instruction

[histogram: instruction-scope FeatureSet sizes; usually fewer than 10 features]

basic block

[histogram: basic-block-scope FeatureSet sizes; usually fewer than 20 features]

function

[histogram: function-scope FeatureSet sizes; usually fewer than 100 features]

We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers).


(notes for future willi)

[screenshot: gnuplot histogram produced by the commands below]

bat /tmp/1.txt | grep "^perf: match" | grep "FUNC" | choose 4 | sort | uniq -c | awk '{print $2" "$1}' | sort --human-numeric-sort
gnuplot> plot "/tmp/func.txt" using 1:2 with boxes

@williballenthin
Collaborator Author

williballenthin commented May 13, 2024

hash lookup common bytes length prefixes

Today, we match bytes by doing a prefix search against encountered bytes (up to 0x100 long). Since many sequences of bytes we search for have some structure (well, common length), like a GUID or cryptographic S-Box, we can optimize some of these searches by indexing the bytes by their prefix (for common lengths, like 8, 16, 32, and 64 bytes). Then, when the wanted bytes feature has this same length, we can do if feature in features rather than for bytes in features: if bytes.startswith(feature).

This can also help the rule logic planner, since it can pre-filter more rules when the hashable features are known.

The tradeoff is that we generate N (probably 4-5) more features per bytes feature.
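
Here's a minimal sketch of the prefix index, with illustrative names:

from typing import Dict, Set, Tuple

COMMON_LENGTHS = (8, 16, 32, 64)  # tune against the histogram below

def index_bytes(value: bytes, addrs: Set[int], index: Dict[Tuple[int, bytes], Set[int]]) -> None:
    # at extraction time, also index each encountered bytes value by its
    # prefixes of the common lengths
    for n in COMMON_LENGTHS:
        if len(value) >= n:
            index.setdefault((n, value[:n]), set()).update(addrs)

def match_bytes(wanted: bytes, index: Dict[Tuple[int, bytes], Set[int]], scan) -> Set[int]:
    # a wanted bytes feature of a common length matches by hash lookup;
    # other lengths fall back to the prefix scan we do today
    if len(wanted) in COMMON_LENGTHS:
        return index.get((len(wanted), wanted), set())
    return scan(wanted)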

[histogram: frequency of Bytes feature lengths across the rule set]

Definitely do 16 (the size of a GUID).

8, 256, and 64 also look nice and round (and probably not domain-specific), so consider those. 9 comes from OpenSSL SHA constants. 171 comes from Tiger S-Boxes.


Against mimikatz with the changes in #2080, we have the following evaluation counts by Bytes feature size:

feature class evaluation count
evaluate.feature.bytes 261,464
evaluate.feature.bytes.171 71,400
evaluate.feature.bytes.64 35,794
evaluate.feature.bytes.256 34,002
evaluate.feature.bytes.16 24,226
evaluate.feature.bytes.9 18,837
evaluate.feature.bytes.128 17,002
evaluate.feature.bytes.8 10,576
evaluate.feature.bytes.56 10,200
evaluate.feature.bytes.28 7,176
evaluate.feature.bytes.48 6,800
evaluate.feature.bytes.32 6,091
evaluate.feature.bytes.7 3,588
evaluate.feature.bytes.5 3,588
evaluate.feature.bytes.20 3,400
evaluate.feature.bytes.72 3,400
evaluate.feature.bytes.121 1,794
evaluate.feature.bytes.40 897
evaluate.feature.bytes.6 897
evaluate.feature.bytes.4 897
evaluate.feature.bytes.12 897
evaluate.feature.bytes.232 2

Indexing the power-of-2 lengths would save about 49% of the scanning evaluations. I'm not sure what this costs in runtime. Will investigate before going deeper.

@williballenthin
Collaborator Author

hash lookup case-insensitive strings

Today, we match case-insensitive strings by invoking the regex matching engine. But we have quite a few of these features in our rule set (409???). Like with bytes above, consider emitting lowercased strings as features, and matching case-insensitive strings via hash lookup.

The tradeoff is one additional feature emitted per string, and a bit more code complexity.

Again, I think this may be able to help the rule logic planner, since it can pre-filter more rules when the hashable features are known.
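
Here's a minimal sketch, using (kind, value) tuples as stand-in feature keys:

from typing import Dict, Set, Tuple

def emit_string(value: str, addrs: Set[int], features: Dict[Tuple[str, str], Set[int]]) -> None:
    # emit the exact string plus one lowercased companion feature
    features.setdefault(("string", value), set()).update(addrs)
    features.setdefault(("string/i", value.lower()), set()).update(addrs)

def match_case_insensitive(needle: str, features: Dict[Tuple[str, str], Set[int]]) -> Set[int]:
    # hash lookup instead of invoking the regex engine
    return features.get(("string/i", needle.lower()), set())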

@s-ff
Collaborator

s-ff commented May 15, 2024

(quoting the "lots of time spent in instancecheck" comment and FeatureSet redesign discussion above)

As you mentioned @williballenthin, this would require a redesign of FeatureSet. I imagine something like this:

# assuming capa's feature and address classes (module paths may differ):
from typing import Dict, Set, Union

from capa.features.common import OS, Arch, Bytes, Regex, Format, String, Substring
from capa.features.address import Address, AbsoluteVirtualAddress

FeatureType = Union[Bytes, Regex, String, Substring, Arch, OS, Format]
FeatureSet = Dict[str, Dict[FeatureType, Set[Address]]]

feature_set: FeatureSet = {
    "bytes": {},
    "regex": {},
    "string": {},
    # ... one bucket per feature type
}

# add a Bytes feature
bytes_feature = Bytes(b"\x90\x90")
feature_set["bytes"][bytes_feature] = {AbsoluteVirtualAddress(0x401000), AbsoluteVirtualAddress(0x402000)}

With this, we could avoid iterating over all features.

(quoting the "prune rule logic using global features" comment above)

Here, do you mean that, for example, if the extracted os feature for a binary is known to be windows, then any rule that requires os: linux can be removed, since it will never match? If yes, this is a must-have in my opinion.

(quoting the feature-selectivity paragraph from "tighten rule pre-selection" above)

This is correct: for mimikatz.exe_, around 50% of all ops are either mov, push, or call.

[screenshot: top mnemonic counts for mimikatz.exe_, produced by the command below]

objdump -dj .text --no-show-raw-insn --no-addresses mimikatz.exe_ | grep -P '^\t' | awk -F '\t' -F ' ' '{ count[$1]++ } END { total = 0; for (i in count) total += count[i]; for (i in count) { printf("%d\t%0.2f%%\t%s\n", count[i], count[i] * 100 / total, i) } }' | sort -nr | head
