[TOC]
We are trying to build Centipede based on our experience with libFuzzer (and, to a lesser extent, with AFL and others). We keep what worked well and change what didn't.
See README for user documentation and most of the terminology.
Centipede reasons about execution feedback in terms of features. A feature is some unique behavior of the target on a given input. So, executing an input is a way to compute the input's features.
The currently supported features (see feature.h for details) are:
- Control flow edges with 8-bit counters.
- Simplified data flow edges: either {store-PC, load-PC} or {global-address, load-PC}.
- Bounded control flow paths.
- Instrumented CMP instructions.
- ... more coming.
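To illustrate how an edge with a counter can become a feature, here is a sketch of libFuzzer-style counter bucketing. The bucket boundaries and the feature encoding below are illustrative assumptions, not Centipede's actual scheme; see feature.h for the real one.

```cpp
#include <cstdint>

// Bucketize an 8-bit edge counter so that e.g. "edge executed 4 times" and
// "edge executed 7 times" map to the same feature, while 1, 2, 3, 4-7,
// 8-15, ... remain distinct. Boundaries here are illustrative only.
uint32_t CounterToBucket(uint8_t counter) {
  if (counter == 0) return 0;  // edge not executed: no feature
  if (counter == 1) return 1;
  if (counter == 2) return 2;
  if (counter == 3) return 3;
  if (counter <= 7) return 4;
  if (counter <= 15) return 5;
  if (counter <= 31) return 6;
  if (counter <= 127) return 7;
  return 8;
}

// A feature number can then combine the edge index and the bucket.
constexpr uint32_t kNumBuckets = 9;
uint32_t EdgeFeature(uint32_t edge_index, uint8_t counter) {
  return edge_index * kNumBuckets + CounterToBucket(counter);
}
```

The point of bucketing is to keep the feature space bounded: without it, every distinct counter value would spawn a new feature and bloat the corpus.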
However, the target may generate features of its own type, without Centipede having to support them explicitly.
Centipede aims at large and slow targets, such that even a minimized corpus may consist of 100K-1M inputs, and executing every input takes 1ms-1s. We also aim at supporting many more feature types than most other engines, which will in turn bloat the corpus further.
At the same time, we aim at executing on cheap ("preemptible") cloud VMs, and so we need to minimize the startup computations.
So, we try hard to eliminate all redundant executions.
Centipede state consists of the following:
- Corpus. A set of inputs. The corpus is a property of a group of fuzz targets sharing the same input data format.
- Feature sets. For every corpus element we preserve the set of its features. Features are a property of a specific target binary. A different binary (e.g. from a different revision, with different build options, or with different code) will have its own persistent feature set. A feature set is associated with an input via the input's hash.
On startup, Centipede loads the corpus and checks which corpus elements have their corresponding feature sets. Only when the feature set is missing for an input does Centipede recompute it.
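The startup logic can be sketched as follows. All names here are hypothetical stand-ins, not Centipede's actual API; the feature-set map models the persisted per-binary state keyed by input hash.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Input = std::vector<uint8_t>;
using FeatureSet = std::unordered_set<uint64_t>;

// Stand-in for a real cryptographic hash of the input bytes.
std::string Hash(const Input &input) {
  std::string h;
  for (uint8_t b : input) h += std::to_string(b) + ",";
  return h;
}

size_t num_executions = 0;  // counts actual target executions

// Stand-in for running the target; returns a dummy feature set.
FeatureSet ExecuteTarget(const Input &input) {
  ++num_executions;
  return {input.size()};  // dummy feature: input length
}

// Recompute features only for inputs whose hash has no persisted entry.
void LoadCorpus(const std::vector<Input> &corpus,
                std::unordered_map<std::string, FeatureSet> &features) {
  for (const auto &input : corpus) {
    const std::string hash = Hash(input);
    if (features.count(hash)) continue;  // features already known: skip
    features[hash] = ExecuteTarget(input);
  }
}
```

Running `LoadCorpus` a second time over the same corpus executes nothing, which is the whole point: redundant executions are eliminated across restarts.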
Centipede jobs run concurrently (in separate processes, potentially on different machines). They peek at each other's corpora (and feature sets) periodically. Every job writes only to its own persistent state, but can read any other job's state concurrently with that job writing to it.
Centipede implements this via an appendable storage format. A very simple and inefficient home-brewed appendable data format is currently used; see PackBytesForAppendFile() in util.h. We may need to replace it with something more efficient when this one stops scaling.
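For illustration, a minimal appendable format could be a length-prefixed encoding like the one below. This is an assumed sketch, not the actual encoding of PackBytesForAppendFile(); it only shows why appending is safe: a reader can always recover whole records from a file that multiple appends produced.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append one record: a 4-byte little-endian payload size, then the payload.
void AppendRecord(std::string &file, const std::string &payload) {
  const uint32_t size = payload.size();
  for (int i = 0; i < 4; ++i) file.push_back((size >> (8 * i)) & 0xff);
  file += payload;
}

// Read all complete records back, in order.
std::vector<std::string> ReadRecords(const std::string &file) {
  std::vector<std::string> records;
  size_t pos = 0;
  while (pos + 4 <= file.size()) {
    uint32_t size = 0;
    for (int i = 0; i < 4; ++i)
      size |= uint32_t(uint8_t(file[pos + i])) << (8 * i);
    pos += 4;
    records.push_back(file.substr(pos, size));
    pos += size;
  }
  return records;
}
```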
Centipede executes the target out of process. In order to minimize the process startup costs, it passes inputs to the target in batches, and receives features in batches as well.
The specific mechanism of executing the target and passing the data between processes can be overridden. The default implementation is CentipedeCallbacks::Execute() in centipede_interface.h.
It is possible to override the execution to run in-process, but then Centipede loses the ability to set RAM and time limits, and it will not tolerate crashes in the target.
Centipede is decoupled from the mechanism that collects the execution feedback. Any source of feedback can be used: compiler instrumentation, run-time instrumentation, simulation, hardware-based tracing, etc. The default implementation in runner_main.cc and other runner_*.cc files relies on SanitizerCoverage.
Centipede is decoupled from mutations: they are provided by the user via CentipedeCallbacks::Mutate(). Centipede provides a default implementation of mutations via ByteArrayMutator.
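A minimal byte-array mutator in the spirit of ByteArrayMutator might look like the following. This is a simplified sketch under assumed behavior; the real mutator is much richer (dictionaries, cross-over, etc.).

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Apply one randomly chosen basic mutation to a copy of the input.
std::vector<uint8_t> Mutate(std::vector<uint8_t> data, std::mt19937 &rng) {
  if (data.empty()) data.push_back(rng() % 256);
  switch (rng() % 3) {
    case 0:  // flip a random bit of a random byte
      data[rng() % data.size()] ^= 1 << (rng() % 8);
      break;
    case 1:  // insert a random byte at a random position
      data.insert(data.begin() + rng() % (data.size() + 1), rng() % 256);
      break;
    case 2:  // erase a random byte
      data.erase(data.begin() + rng() % data.size());
      break;
  }
  return data;
}
```

Taking the input by value keeps the original intact, so the engine can mutate the same corpus element many times.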
One of the important heuristics during fuzzing is which corpus elements to mutate, and which to discard.
Centipede tries to mutate only those corpus elements that have rare features.
TODO(kcc): [design] explain how this works.
- [Entropic] Boosting fuzzer efficiency: an information theoretic perspective. https://dl.acm.org/doi/abs/10.1145/3368089.3409748
- [Nezha] Efficient Domain-Independent Differential Testing. https://www.cs.columbia.edu/~suman/docs/nezha.pdf
- Centipede doesn't do most of that, but aspires to do more than that :)