Pastelito uses fuzz testing to ensure the parsing, tokenization and part of speech tagging stages are robust against invalid input. The ARCHITECTURE.md file gives a full overview of how Pastelito works but the flow of data when parsing a document is roughly:
1. Parse the markdown input
2. Transform parts of the markdown document into a series of "blocks", using pulldown-cmark
3. Each block is then tokenized into "words"
4. Each word is then tagged with a part of speech tag
It's simple to set up a fuzzing harness:
use fuzz_target;
use ;
fuzz_target!;
In this harness we're fuzzing steps (1), (2), (3) and (4). Step (1) is third-party code that is already being fuzzed upstream, though. When fuzzing a project, we're more concerned about bugs in our own code: steps (2), (3) and (4) in this example.
This was an issue while developing Pastelito because the latest version of the pulldown-cmark crate would panic on some malformed Markdown. These issues had been fixed in the main branch but had not yet made it to a release branch. I had two options: switch the pulldown-cmark dependency to track main, which adds some complexity to the dependency management, or find a way to ignore panics from pulldown-cmark while fuzzing.
Ignoring panics
Panics in Rust are similar to exceptions, in that they unwind the stack when triggered and can be caught by handlers. This isn't true in all instances -- a program can be compiled with -C panic=abort to make panics terminate the program immediately -- but it's the default behaviour.
std::panic::catch_unwind and std::panic::update_hook can be combined to filter panics to a subset that we care about. Here's how it's implemented in Pastelito.
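Before the Pastelito specifics, here is a small standalone illustration of the unwinding behaviour described above. It is a sketch rather than code from Pastelito: `panic_message` is a hypothetical helper that runs a closure under `catch_unwind` and, if the closure panicked, recovers the panic message from the payload. It assumes the default unwinding panic strategy, not `-C panic=abort`.

```rust
use std::panic::{self, UnwindSafe};

/// Hypothetical helper: run `f` and return the panic message if it panicked.
/// Only works when panics unwind (the default), not with `-C panic=abort`.
fn panic_message(f: impl FnOnce() + UnwindSafe) -> Option<String> {
    match panic::catch_unwind(f) {
        // The closure ran to completion without panicking.
        Ok(()) => None,
        Err(payload) => {
            // `panic!("literal")` carries a `&'static str`; a formatted
            // `panic!("{}", x)` carries a `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "<non-string payload>".to_string());
            Some(message)
        }
    }
}
```

Note that the active panic hook still runs before the unwind reaches `catch_unwind`, which is why the hook needs attention as well as the catch site.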
Matching panic sources
Panic hooks are called with a PanicHookInfo argument which gives information about the source of a panic. We want to filter out panics that originate from the pulldown-cmark crate, based on the filename:
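A minimal sketch of such a matcher could look like the following. The function names are hypothetical, and the check rests on an assumption: that the source path reported for a panic inside the dependency contains the crate's directory name, since Cargo unpacks registry crates into directories named `<crate>-<version>`.

```rust
use std::panic::PanicHookInfo;

/// Hypothetical helper: does this source path belong to pulldown-cmark?
/// Assumes the crate's directory name appears somewhere in the path.
fn is_pulldown_cmark_path(file: &str) -> bool {
    file.contains("pulldown-cmark")
}

/// Hypothetical matcher: true if the panic was raised from a file inside
/// the pulldown-cmark crate. Panics without a location are not matched.
fn is_pulldown_cmark_panic(info: &PanicHookInfo<'_>) -> bool {
    info.location()
        .is_some_and(|location| is_pulldown_cmark_path(location.file()))
}
```

Splitting the path check from the `PanicHookInfo` handling keeps the string logic easy to test on its own, since a `PanicHookInfo` can only be obtained inside a panic hook.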
Installing a panic hook
We can then combine that with update_hook to replace the default panic hook with our own. The libfuzzer-sys crate already installs its own hook at initialization time so we don't want to ignore the existing hook. Instead, we skip the existing hook if the panic is not one we're interested in:
static REPORT_ALL_PANICS: = new;
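The hook installation can be sketched roughly as below. Two things here are assumptions rather than Pastelito's code. First, as far as I know `update_hook` still needs a nightly toolchain, so this sketch swaps in the stable `take_hook` and `set_hook` pair, which chains onto the previous hook in the same way. Second, `PANIC_HOOK` and `is_pulldown_cmark_panic` are hypothetical names, and `REPORT_ALL_PANICS` is assumed to be enabled whenever the environment variable is set to any value.

```rust
use std::env;
use std::panic::{self, PanicHookInfo};
use std::sync::LazyLock;

/// Assumption: the flag counts as enabled if the variable is set at all.
static REPORT_ALL_PANICS: LazyLock<bool> =
    LazyLock::new(|| env::var_os("REPORT_ALL_PANICS").is_some());

/// Hypothetical matcher: true if the panic location is inside pulldown-cmark.
fn is_pulldown_cmark_panic(info: &PanicHookInfo<'_>) -> bool {
    info.location()
        .is_some_and(|location| location.file().contains("pulldown-cmark"))
}

/// Forcing this value installs the filtering hook exactly once.
static PANIC_HOOK: LazyLock<()> = LazyLock::new(|| {
    if *REPORT_ALL_PANICS {
        // Leave the existing hook untouched so every panic is reported.
        return;
    }
    // Stable stand-in for `update_hook`: take the existing hook and wrap it.
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Only panics from our own code reach the previous hook. Panics
        // from pulldown-cmark skip it and are left to unwind quietly.
        if !is_pulldown_cmark_panic(info) {
            previous(info);
        }
    }));
});
```

A harness would call `LazyLock::force(&PANIC_HOOK)` at the top of every run; only the first call does any work.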
A few notes:
- There is no global initialization for libfuzzer harnesses so we use a LazyLock to install the panic hook once.
- By default we want to ignore panics from pulldown-cmark but it might be useful to see them in some cases. We use the REPORT_ALL_PANICS environment variable to control this behaviour.
Running the fuzz harness
We can then combine these helpers into a top-level function that runs a testcase and filters panics:
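As a rough sketch of the shape of that function, the block below compiles with only the standard library. To make that possible it takes some liberties that are assumptions, not Pastelito's code: `Corpus` is a local stand-in mirroring `libfuzzer_sys::Corpus`, the `REPORT_ALL_PANICS` setting is passed in as a plain `bool` parameter, and no panic hook is installed. In a real harness the hook would already have aborted the process for panics in our own code, so only pulldown-cmark panics would ever reach the `Err` arm.

```rust
use std::panic::{self, UnwindSafe};

/// Local stand-in for `libfuzzer_sys::Corpus`, so the sketch needs no crates.
#[derive(Debug, PartialEq, Eq)]
pub enum Corpus {
    Keep,
    Reject,
}

/// Run one testcase. `report_all_panics` stands in for the
/// REPORT_ALL_PANICS environment variable.
pub fn fuzz_markdown<F>(data: &[u8], report_all_panics: bool, cb: F) -> Corpus
where
    F: FnOnce(&str) -> Corpus + UnwindSafe,
{
    // Reject inputs that are not valid UTF-8 so the fuzzer steers away from them.
    let Ok(markdown) = std::str::from_utf8(data) else {
        return Corpus::Reject;
    };

    if report_all_panics {
        // No filtering: any panic propagates and crashes the process as usual.
        return cb(markdown);
    }

    // With the filtering hook installed, only ignored panics get this far.
    let err = panic::catch_unwind(move || cb(markdown));
    match err {
        Ok(corpus) => corpus,
        Err(_) => Corpus::Reject,
    }
}
```

In an actual harness the callback is where the code under test runs, for example a closure that builds a document from `markdown` and returns `Corpus::Keep`.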
Firstly, we take the raw input from the fuzzer and convert it to a UTF-8 string. The Corpus::Reject return value tells the fuzzer to ignore this testcase and not use it for generating future inputs. This helps us to steer the fuzzer towards valid UTF-8 input. You could also use Arbitrary::arbitrary(...) which will return the prefix of data that is valid UTF-8.
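To make the "valid prefix" alternative concrete, here is a standard-library sketch of the same idea. `valid_utf8_prefix` is a hypothetical helper, not part of the arbitrary crate or of Pastelito; it keeps whatever leading part of the input decodes cleanly instead of rejecting the whole testcase.

```rust
/// Hypothetical helper: return the longest prefix of `data` that is valid UTF-8.
fn valid_utf8_prefix(data: &[u8]) -> &str {
    match std::str::from_utf8(data) {
        Ok(text) => text,
        // `valid_up_to` is the index of the first byte that failed to decode,
        // so everything before it is guaranteed to be valid UTF-8.
        Err(error) => std::str::from_utf8(&data[..error.valid_up_to()])
            .expect("bytes before valid_up_to are valid UTF-8"),
    }
}
```

The trade-off is that fewer inputs are thrown away, at the cost of the fuzzer receiving no signal that the trailing bytes were ignored.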
Next, we have two slightly different codepaths depending on the REPORT_ALL_PANICS env var. If we're reporting all panics, then we will not have installed our panic hook. We call cb(markdown) as usual and let the process panic as usual.
If we are not reporting all panics, then our custom panic hook will have been installed at this point. We call cb(markdown) inside a catch_unwind block. If err is Ok then the testcase was successful and we propagate the return value of the callback. If an Err(...) is returned we know that a panic was triggered in pulldown-cmark. We return Corpus::Reject to tell the fuzzer to ignore this testcase.
Using the fuzz harness
The fuzz_markdown function is generic and is used in a couple of different fuzz harnesses. For document parsing, we simply call the Document::new constructor:
fuzz_target!;