
investigate optimization of rule matching (May, 2024) #2063

Open
williballenthin opened this issue May 6, 2024 · 12 comments

@williballenthin
Collaborator

As shown in #2061, perhaps 70% of capa runtime is spent evaluating rule logic. That means, if we want to make capa run faster, improvements to the logic evaluation code may have a bigger impact than code analysis changes.

In this thread I'll record findings and ideas for performance enhancements. We can close out this issue when we feel we have a good handle on performance and whether it's worthwhile to make changes to capa.

@williballenthin
Collaborator Author

Here's a speedscope trace captured by py-spy when using capa with the BinExport2 backend against mimikatz. Using this backend removes a bunch of noise related to vivisect analysis, which isn't relevant to this thread, and isn't something we can easily improve.

profile-be2.speedscope.zip

For example, use the sandwich diagram to identify routines that take up a lot of runtime:

[screenshot: speedscope sandwich view highlighting the hottest routines]

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

lots of time spent in instancecheck

5.5% of runtime is spent in __instancecheck__, including about 2.5% of total runtime on the line here:
[screenshot: profiler attribution for the isinstance line below]

if not isinstance(feature, (Bytes,)):

Here's what's happening: in order to evaluate a bytes feature at some scope, we enumerate all the features at the scope, using isinstance to find those that are Bytes, and then do a prefix match.

This is not good: there can be dozens to thousands of features at some scopes (like a large function), and we might look for multiple bytes features, so we spend a lot of time scanning through all the features repeatedly.

Some things we could do to fix this:

  1. collect all the bytes features once, and re-use this list when evaluating each bytes feature
  2. store each "type" of feature in its own list, so that we can quickly find the features to evaluate

We might be able to squeeze (1) in without any major breaking changes to demonstrate this strategy is worthwhile. But it's likely to make the code more complex.

With a bit more investigation, we might be able to show the larger effort of (2) is worthwhile. In particular, this problem only gets worse as we add more rules.
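
For illustration, here's a minimal sketch of option (1), using a stand-in Bytes class rather than capa's real feature types:

from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass(frozen=True)
class Bytes:  # stand-in for capa's Bytes feature
    value: bytes

def collect_bytes_features(features: Dict[object, Set[int]]) -> List[Tuple[bytes, Set[int]]]:
    # one isinstance pass per scope, instead of one pass per queried bytes feature
    return [(f.value, addrs) for f, addrs in features.items() if isinstance(f, Bytes)]

def evaluate_bytes(wanted: bytes, cached: List[Tuple[bytes, Set[int]]]) -> Set[int]:
    # prefix match against only the cached bytes candidates
    hits: Set[int] = set()
    for value, addrs in cached:
        if value.startswith(wanted):
            hits |= addrs
    return hits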


Today, FeatureSet looks like:

FeatureSet = Dict[Feature, Set[Address]]

or more concretely:

{
    API("CreateFile"): [0x401000, 0x401005],
    API("ExitProcess"): [0x40100A],
    String("foo"): [0x40100B],
    Mnemonic("mov"): [0x401000, ...]
   ...
}

When we have a rule that requires a feature, we can do a dict lookup to see if the feature is present, and where it's found:

assert API("CreateFile") in features

We could move to a more explicit structure like:

@dataclass
class FeatureSet:
  api: Dict[str, List[Address]]
  string: Dict[str, List[Address]]
  mnemonic: Dict[str, List[Address]]
  number: Dict[int, List[Address]]
  ...

or more concretely:

features = FeatureSet(
  api={
    "CreateFile": [0x401000, 0x401005],
    "ExitProcess": [0x40100A],
  },
  string={
    "foo": [0x40100B],
  },
  mnemonic={
    "mov": [0x401000, ...]
  }
)

The primary benefit is that features that don't match by hash lookup, like string/bytes/regex/substring, can be much more efficiently matched by only considering the candidates of the right type.

On the other hand, this would be yet another structure that has to be updated every time we introduce a new feature (uncommon).

It might be possible to do a hybrid, like:

@dataclass
class FeatureSet:
  bytes: Dict[bytes, Set[Address]]
  string: Dict[str, List[Address]]
  other: Dict[Feature, List[Address]]

Perhaps we should start with this hybrid to show that it works, and then do the full changes if there's a compelling reason.
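
Here's a minimal sketch of evaluation against the hybrid, with illustrative names rather than capa's actual API:

from __future__ import annotations  # keep the `bytes` field from shadowing the builtin in annotations

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class HybridFeatureSet:
    bytes: Dict[bytes, Set[int]] = field(default_factory=dict)
    string: Dict[str, Set[int]] = field(default_factory=dict)
    other: Dict[object, Set[int]] = field(default_factory=dict)

    def match_bytes_prefix(self, wanted: bytes) -> Set[int]:
        # scan only the bytes candidates, not every feature at the scope
        hits: Set[int] = set()
        for value, addrs in self.bytes.items():
            if value.startswith(wanted):
                hits |= addrs
        return hits

    def match_hashable(self, feature: object) -> Set[int]:
        # hashable features keep their O(1) dict lookup
        return self.other.get(feature, set())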

Note that we can't use evaluation counts to demonstrate performance improvements here. Rather, we need to use benchmarking and wall clock time to show we're going in the right direction.


See also discussion below on making Bytes hashable, for common prefix lengths.

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

pre-filter strings, bytes based on what's found in the file

To avoid searching for strings/bytes that won't ever be found at a particular scope, we could first check that each string/bytes is present somewhere in the file.

If it's not, then we can partially evaluate some rule logic (like and statements) to see if further logic can be pruned and/or rules skipped.

For example, we have HTTP User-Agent rules that contain tons of strings that match under a single or. If none are present in the file, we can skip the whole rule.

We'd want to ensure that the up-front scan to find the file matches doesn't take much time, and that it doesn't outweigh any performance improvements. Remember we may have hundreds or thousands of terms to look for. We can also use evaluation counts to show that less logic needs to be matched when some branches are pruned.
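
Here's a minimal sketch of the pre-filtering pass, assuming we can enumerate all string and bytes terms across the rule set up front:

from typing import Set, Tuple

def prefilter_terms(buf: bytes, strings: Set[str], byte_seqs: Set[bytes]) -> Tuple[Set[str], Set[bytes]]:
    # one up-front pass over the raw file; a term absent here can never match
    # at any scope, so its leaf evaluates to False and or/and branches above
    # it can be pruned before per-scope matching begins.
    present_strings = {s for s in strings
                       if s.encode("utf-8") in buf or s.encode("utf-16-le") in buf}
    present_bytes = {b for b in byte_seqs if b in buf}
    return present_strings, present_bytes

With hundreds or thousands of terms, repeated substring scans like this may dominate; a single multi-pattern pass (e.g. Aho-Corasick) would keep the up-front cost near one scan of the file.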

@williballenthin
Collaborator Author

williballenthin commented May 6, 2024

prune rule logic using global features

Once the global features are known (arch, os, format, etc.), prune logic from the rules that won't ever match. This way, the same global features are not re-evaluated over and over again, for each instance of the scope.

Edit:
I think it will be fine to prune logic away from the trees, but I don't think it's easy to "pre-evaluate" global features. We would have to find a way to extend the logic tree nodes to cache a "pre-evaluated" result, which I think might be tricky/complex/fragile.
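
Here's a minimal sketch of the tree pruning, using stand-in And/Or/GlobalFeature node types rather than capa's real statement classes:

from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class GlobalFeature:  # stand-in for capa's os/arch/format features
    kind: str
    value: str

@dataclass
class And:
    children: List[object]

@dataclass
class Or:
    children: List[object]

def prune(node: object, globals_: Set[GlobalFeature]) -> object:
    # returns a simplified node, or True/False when the subtree is fully decided
    if isinstance(node, GlobalFeature):
        return node in globals_
    if isinstance(node, And):
        kept = []
        for child in (prune(c, globals_) for c in node.children):
            if child is False:
                return False  # one impossible conjunct kills the whole branch
            if child is not True:
                kept.append(child)
        return And(kept) if kept else True
    if isinstance(node, Or):
        kept = []
        for child in (prune(c, globals_) for c in node.children):
            if child is True:
                return True
            if child is not False:
                kept.append(child)
        return Or(kept) if kept else False
    return node  # non-global leaves are left for per-scope evaluation

A rule whose tree prunes to False can be dropped from the run entirely; statements like not and optional would need extra care.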

@mike-hunhoff
Collaborator

(quoting the "lots of time spent in instancecheck" comment above)

This logic is used to evaluate other features, including Substring, Regex, and OS, leading me to think that (2) may be the better approach.

@mike-hunhoff
Collaborator

@s-ff take a look at the research and discussion in this issue to get you thinking about our GSoC project. No action beyond reviewing (and posting any thoughts you have) is needed at this time; we'll discuss more in our upcoming meetings 😄.

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

tighten rule pre-selection

In RuleSet we have a match routine that uses knowledge of the complete set of rules to more efficiently match features. Essentially, it indexes all features across all the rules, and only attempts to match rules that share a feature with the given feature set. That is, if an instruction's feature set contains only the single feature api(CreateFile), then the matcher will only check rules that have that same feature (of which there are probably about 5, out of ~700 total rules). This saves quite a bit of computation, since most rules don't have the features found in small feature sets.

(aside: what is the distribution of feature counts by scope? Are there optimizations that only make sense at small/large scopes?)

I think we can do even better, at least in some common scenarios, and save more time. Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".

For example, consider the rule:

- instruction:
  - and:
    - number: 6789
    - mnemonic: mov

Today, anytime we encounter a feature set with any single one of those features, we evaluate the full rule. Because mov is a common instruction, this rule will be evaluated frequently, in around 20% of all instructions, even though the number 6789 is seen extremely rarely and must also be present for the rule to match.

Here, I propose that we only index one of the features (the "best, most selective one") that must be present for the full rule to match.

The intuition is that some features will be somewhat common, and if we index those, then we end up evaluating a lot more rules. If we only index features that are uncommon, then we're less likely to have false positives in this pre-filtering step, and we save time.

I think mnemonic: mov is common and not selective. Same with small numbers and offsets. But APIs and strings are very selective, so we should prefer these when possible. We could craft some handmade rules to score features, or do a large-scale run to collect features seen in the wild and score by global prevalence. I think either will work.

Note that for rules like the following, we have to pick a feature from each branch of the or. That's ok - we just want to index a minimal number of features.

- or:
  - and:
    - number: 1
    - number: 2
  - and:
    - offset: 3
    - offset: 4

We should be able to use raw evaluation counts to demonstrate that this technique is working.

Note that this will introduce a bit more logic into the initial RuleSet constructor: in addition to walking all the features, we have to track the minimal set of features and their scores. But this is a one-time initialization cost that will pay dividends across every scope evaluation (every instruction, every bb, every function, ...).
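
Here's a minimal sketch of picking the features to index, with hypothetical hand-tuned selectivity scores and (kind, value) tuples standing in for real feature objects:

from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class And:
    children: List[object]

@dataclass
class Or:
    children: List[object]

Feature = Tuple[str, object]  # e.g. ("mnemonic", "mov")
SELECTIVITY = {"api": 100, "string": 90, "bytes": 80, "number": 30, "offset": 20, "mnemonic": 1}

def score(feature: Feature) -> int:
    return SELECTIVITY.get(feature[0], 10)

def pick_index_features(node: object) -> Set[Feature]:
    # returns a small set of features, at least one of which must be present
    # in any feature set that could match this subtree
    if isinstance(node, Or):
        # a match may come through any branch, so take one pick per branch
        picked: Set[Feature] = set()
        for child in node.children:
            picked |= pick_index_features(child)
        return picked
    if isinstance(node, And):
        # every child is individually required; keep only the most selective pick
        return max((pick_index_features(c) for c in node.children),
                   key=lambda feats: min(score(f) for f in feats))
    return {node}  # leaf feature

rule = And([("number", 6789), ("mnemonic", "mov")])
assert pick_index_features(rule) == {("number", 6789)}

As with pruning above, not and optional statements would need special handling, since their children aren't strictly required.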

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

slow mimikatz function

DEBUG:capa.capabilities.static:analyzed function 0x420a81 and extracted 325 features, 2 matches in 1.57s                                                                                            

Just 325 features take 1.57s to evaluate! Why is this so? Maybe if we investigate this one function, we can make fixes that help all functions.


Ok, well that is gross:
[screenshot: control flow graph of 0x420a81, made up of hundreds of tiny basic blocks]

Lots of small basic blocks means there are going to be many instruction scopes and many basic block scopes to evaluate before the function scope is evaluated.

6964 total features.
852 basic blocks!
4442 instructions.

@williballenthin
Collaborator Author

williballenthin commented May 8, 2024

FeatureSet size distributions

From mimikatz using Ghidra BinExport2 backend. These plots show the distribution of sizes of FeatureSets by scope. In summary, instructions usually have less than 10 features, basic blocks less than 20, and functions less than 100.

instruction

[histogram: instruction-scope FeatureSet sizes; usually fewer than 10 features]

basic block

[histogram: basic-block-scope FeatureSet sizes; usually fewer than 20 features]

function

[histogram: function-scope FeatureSet sizes; usually fewer than 100 features]

We can also use this technique to investigate the number of rules selected to be evaluated at each of these scope instances (and then attempt to minimize these numbers).


(notes for future willi)

[screenshot: gnuplot histogram produced by the commands below]

bat /tmp/1.txt | grep "^perf: match" | grep "FUNC" | choose 4 | sort | uniq -c | awk '{print $2" "$1}' | sort --human-numeric-sort
gnuplot> plot "/tmp/func.txt" using 1:2 with boxes

@williballenthin
Collaborator Author

williballenthin commented May 13, 2024

hash lookup common bytes length prefixes

Today, we match bytes by doing a prefix search against encountered bytes (up to 0x100 long). Since many sequences of bytes we search for have some structure (well, common length), like a GUID or cryptographic S-Box, we can optimize some of these searches by indexing the bytes by their prefix (for common lengths, like 8, 16, 32, and 64 bytes). Then, when the wanted bytes feature has this same length, we can do if feature in features rather than for bytes in features: if bytes.startswith(feature).

This can also help the rule logic planner, since it can pre-filter more rules when the hashable features are known.

The tradeoff is that we generate N (probably 4-5) more features per bytes feature.
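
Here's a minimal sketch of the prefix index, with illustrative names:

from typing import Dict, Set, Tuple

COMMON_LENGTHS = (8, 16, 32, 64)  # tune against the histogram below

def index_bytes(value: bytes, addrs: Set[int], index: Dict[Tuple[int, bytes], Set[int]]) -> None:
    # at extraction time, also index each encountered bytes value by its
    # prefixes of the common lengths
    for n in COMMON_LENGTHS:
        if len(value) >= n:
            index.setdefault((n, value[:n]), set()).update(addrs)

def match_bytes(wanted: bytes, index: Dict[Tuple[int, bytes], Set[int]], scan) -> Set[int]:
    # a wanted bytes feature of a common length matches by hash lookup;
    # other lengths fall back to the prefix scan we do today
    if len(wanted) in COMMON_LENGTHS:
        return index.get((len(wanted), wanted), set())
    return scan(wanted)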

[histogram: frequency of Bytes feature lengths across the rule set]

Definitely do 16 (the size of a GUID).

8, 256, and 64 also look nice and round (and probably not domain-specific), so consider those. 9 comes from OpenSSL SHA constants. 171 comes from Tiger S-Boxes.


Against mimikatz with the changes in #2080, we have the following evaluation counts by Bytes feature size:

feature class evaluation count
evaluate.feature.bytes 261,464
evaluate.feature.bytes.171 71,400
evaluate.feature.bytes.64 35,794
evaluate.feature.bytes.256 34,002
evaluate.feature.bytes.16 24,226
evaluate.feature.bytes.9 18,837
evaluate.feature.bytes.128 17,002
evaluate.feature.bytes.8 10,576
evaluate.feature.bytes.56 10,200
evaluate.feature.bytes.28 7,176
evaluate.feature.bytes.48 6,800
evaluate.feature.bytes.32 6,091
evaluate.feature.bytes.7 3,588
evaluate.feature.bytes.5 3,588
evaluate.feature.bytes.20 3,400
evaluate.feature.bytes.72 3,400
evaluate.feature.bytes.121 1,794
evaluate.feature.bytes.40 897
evaluate.feature.bytes.6 897
evaluate.feature.bytes.4 897
evaluate.feature.bytes.12 897
evaluate.feature.bytes.232 2

Indexing the power-of-2 lengths would save about 49% of the scanning evaluations. I'm not sure what this costs in runtime. Will investigate before going deeper.

@williballenthin
Collaborator Author

hash lookup case-insensitive strings

Today, we match case-insensitive strings by invoking the regex matching engine. But we have quite a few of these features in our rule set (409???). Like with bytes above, consider emitting lowercased strings as features, and matching case-insensitive strings via hash lookup.

The tradeoff is one additional feature emitted per string, and a bit more code complexity.

Again, I think this may be able to help the rule logic planner, since it can pre-filter more rules when the hashable features are known.
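
Here's a minimal sketch, using (kind, value) tuples as stand-in feature keys:

from typing import Dict, Set, Tuple

def emit_string(value: str, addrs: Set[int], features: Dict[Tuple[str, str], Set[int]]) -> None:
    # emit the exact string plus one lowercased companion feature
    features.setdefault(("string", value), set()).update(addrs)
    features.setdefault(("string/i", value.lower()), set()).update(addrs)

def match_case_insensitive(needle: str, features: Dict[Tuple[str, str], Set[int]]) -> Set[int]:
    # hash lookup instead of invoking the regex engine
    return features.get(("string/i", needle.lower()), set())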

@s-ff
Collaborator

s-ff commented May 15, 2024

(quoting the "lots of time spent in instancecheck" comment and FeatureSet redesign discussion above)

As you mentioned @williballenthin, this would require a redesign of FeatureSet. I imagine something like this:

# assuming capa's feature and address classes (module paths may differ):
from typing import Dict, Set, Union

from capa.features.common import OS, Arch, Bytes, Regex, Format, String, Substring
from capa.features.address import Address, AbsoluteVirtualAddress

FeatureType = Union[Bytes, Regex, String, Substring, Arch, OS, Format]
FeatureSet = Dict[str, Dict[FeatureType, Set[Address]]]

feature_set: FeatureSet = {
    "bytes": {},
    "regex": {},
    "string": {},
    # ... one bucket per feature type
}

# add a Bytes feature
bytes_feature = Bytes(b"\x90\x90")
feature_set["bytes"][bytes_feature] = {AbsoluteVirtualAddress(0x401000), AbsoluteVirtualAddress(0x402000)}

With this, we could avoid iterating over all features.

(quoting the "prune rule logic using global features" comment above)

Here, do you mean that, for example, if the extracted os feature for a binary is known to be windows, then any rule that requires os: linux can be removed, since it will never match? If yes, this is a must-have in my opinion.

(quoting the feature-selectivity paragraph from "tighten rule pre-selection" above)

This is correct: for mimikatz.exe_, around 50% of all ops are either mov, push, or call.

[screenshot: top mnemonic counts for mimikatz.exe_, produced by the command below]

objdump -dj .text --no-show-raw-insn --no-addresses mimikatz.exe_ | grep -P '^\t' | awk -F '\t' -F ' ' '{ count[$1]++ } END { total = 0; for (i in count) total += count[i]; for (i in count) { printf("%d\t%0.2f%%\t%s\n", count[i], count[i] * 100 / total, i) } }' | sort -nr | head
