Pastelito uses fuzz testing to ensure the parsing, tokenization and part-of-speech tagging stages are robust against invalid input. The ARCHITECTURE.md file gives a full overview of how Pastelito works, but the flow of data when parsing a document is roughly:

  1. Parse the markdown input, using pulldown-cmark
  2. Transform parts of the markdown document into a series of "blocks"
  3. Each block is then tokenized into "words"
  4. Each word is then tagged with a part-of-speech tag

It's simple to set up a fuzzing harness:

#![no_main]
use libfuzzer_sys::fuzz_target;
use pastelito_core::{parsers::MarkdownParser, Document};

fuzz_target!(|markdown: String| {
    let _doc = Document::new(&MarkdownParser::default(), &markdown);
});

In this harness we're fuzzing steps (1), (2), (3) and (4). Step (1) is third-party code that is already being fuzzed upstream, though. When fuzzing a project, we're more concerned about bugs in our own code: steps (2), (3) and (4) in this example.

This was an issue while developing Pastelito because the latest version of the pulldown-cmark crate would panic on some malformed Markdown. These issues had been fixed in the main branch but had not yet made it to a release branch. I had two options: switch the pulldown-cmark dependency to track main — which adds some complexity to the dependency management — or find a way to ignore panics from pulldown-cmark while fuzzing.

Ignoring panics

Panics in Rust are similar to exceptions, in that they unwind the stack when triggered and can be caught by handlers. This isn't true in all instances -- a program can be compiled with -C panic=abort to make panics terminate the program immediately -- but it's the default behaviour.
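To make the exception analogy concrete, here is a minimal sketch that uses only the standard library. The helper name run_guarded is made up for illustration and is not part of Pastelito, and the sketch assumes the default unwinding panic strategy rather than -C panic=abort.

```rust
use std::panic;

/// Runs `f`, turning an unwinding panic into an `Err` that carries the
/// panic message when the payload is a string.
/// `run_guarded` is a hypothetical helper, not part of Pastelito.
pub fn run_guarded<F, T>(f: F) -> Result<T, String>
where
    F: FnOnce() -> T + panic::UnwindSafe,
{
    panic::catch_unwind(f).map_err(|payload| {
        // `panic!("literal")` carries a `&'static str`, formatted panics
        // carry a `String`. Other payload types are possible as well.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "non-string panic payload".to_string()
        }
    })
}
```

Note that catch_unwind only stops the unwind. The installed panic hook still runs first, which is why the hook has to be filtered as well.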

std::panic::catch_unwind and std::panic::update_hook can be combined to filter panics to a subset that we care about. Here's how it's implemented in Pastelito.

Matching panic sources

Panic hooks are called with a PanicHookInfo argument which gives information about the source of a panic. We want to filter out panics that originate in the pulldown-cmark crate, based on the path of the source file:

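use std::panic::PanicHookInfo;

/// Returns true if the panic originated in the pulldown-cmark crate.
fn should_ignore_panic(panic_info: &PanicHookInfo<'_>) -> bool {
    if let Some(location) = panic_info.location() {
        // `location.file()` is the file where the panic occurred, e.g.: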
        // "/home/ferris/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pulldown-cmark-0.12.2/src/firstpass.rs"
        //
        // It is _not_ a `Path`, so we just do a simple string match to
        // determine if the panic came from the pulldown-cmark crate.
        location.file().contains("/pulldown-cmark-")
    } else {
        false
    }
}
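The substring match is deliberately simple. As a sketch of a slightly stricter variant, the hypothetical helper below (is_from_registry_crate is not a Pastelito function) matches whole path components, accepts both path separators, and requires a version digit after the crate name. The last rule means a sibling crate such as pulldown-cmark-escape does not match, which may or may not be what you want.

```rust
/// Returns true if `file` looks like it lives inside a Cargo registry
/// checkout of `crate_name`, e.g. ".../pulldown-cmark-0.12.2/src/lib.rs".
/// `is_from_registry_crate` is a hypothetical helper for illustration.
pub fn is_from_registry_crate(file: &str, crate_name: &str) -> bool {
    // Split on both separators so Windows-style paths are handled too.
    file.split(|c: char| c == '/' || c == '\\').any(|component| {
        component
            .strip_prefix(crate_name)
            .and_then(|rest| rest.strip_prefix('-'))
            // Require a digit after "<crate>-" so that a directory like
            // "pulldown-cmark-escape-0.11.0" is not treated as a match.
            .map_or(false, |rest| rest.starts_with(|c: char| c.is_ascii_digit()))
    })
}
```

This only recognises registry-style directory names. A git or path dependency would be laid out differently and would need its own rule.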

Installing a panic hook

We can then combine that with update_hook to replace the default panic hook with our own. The libfuzzer-sys crate already installs its own hook at initialization time, so we don't want to discard the existing hook. Instead, we wrap it: our hook skips the existing hook when the panic comes from pulldown-cmark and calls it for every other panic:

use std::sync::LazyLock;

static REPORT_ALL_PANICS: LazyLock<bool> = LazyLock::new(|| {
    let report_all_panics = std::env::var("REPORT_ALL_PANICS").is_ok();

    if !report_all_panics {
        std::panic::update_hook(move |prev, panic_info| {
            if should_ignore_panic(panic_info) {
                // Ignore the panic. The call to `catch_unwind(...)` in
                // `fuzz_markdown` will catch the panic and ignore it
            } else {
                // Trigger the original libfuzzer-sys panic hook. This will
                // abort the process after dumping a backtrace
                prev(panic_info);
            }
        });
    }

    report_all_panics
});

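A few notes:

  1. std::panic::update_hook is a nightly-only API, gated behind the panic_update_hook feature.
  2. The hook is installed inside the LazyLock initializer, so it is installed at most once, the first time REPORT_ALL_PANICS is read.
  3. REPORT_ALL_PANICS is checked with is_ok(), so setting the environment variable to any valid Unicode value, even an empty one, turns filtering off.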
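For readers on stable Rust, roughly the same wrapping can be sketched with take_hook and set_hook. This is an approximation rather than Pastelito's code: install_filtering_hook and the FORWARDED counter are hypothetical names, and the counter exists only so the behaviour can be observed.

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts panics that were passed on to the previous hook.
/// Demonstration only, a real harness would not need this.
pub static FORWARDED: AtomicUsize = AtomicUsize::new(0);

/// Wraps the currently installed panic hook. Panics whose source file
/// path contains `ignore_marker` are swallowed by the hook; all others
/// are forwarded to the previous hook.
/// `install_filtering_hook` is a hypothetical helper name.
pub fn install_filtering_hook(ignore_marker: &'static str) {
    // Take ownership of whatever hook is installed right now.
    let prev = panic::take_hook();

    panic::set_hook(Box::new(move |info| {
        let ignored = info
            .location()
            .map_or(false, |loc| loc.file().contains(ignore_marker));

        if !ignored {
            FORWARDED.fetch_add(1, Ordering::SeqCst);
            prev(info);
        }
    }));
}
```

Unlike update_hook, the take_hook and set_hook pair is two separate steps, so a hook installed by another thread in between could be lost. For a hook installed once at start-up that is usually acceptable.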

Running the fuzz harness

We can then combine these helpers into a top-level function that runs a testcase and filters panics:

use libfuzzer_sys::Corpus;
use std::panic;

pub fn fuzz_markdown<F, C>(data: &[u8], cb: F) -> Corpus
where
    F: FnOnce(&str) -> C + panic::UnwindSafe,
    C: Into<Corpus>,
{
    let Ok(markdown) = std::str::from_utf8(data) else {
        return Corpus::Reject;
    };

    let report_all_panics = *REPORT_ALL_PANICS;

    if report_all_panics {
        cb(markdown).into()
    } else {
        let err = panic::catch_unwind(|| cb(markdown));

        match err {
            Ok(corpus) => corpus.into(),
            Err(_) => {
                // We only reach this point if the panic was ignored. Reject
                // this testcase so the fuzzer does not waste time generating
                // similar testcases
                Corpus::Reject
            }
        }
    }
}

First, we take the raw input from the fuzzer and check that it is valid UTF-8. If it isn't, the Corpus::Reject return value tells the fuzzer to ignore this testcase and not use it for generating future inputs. This steers the fuzzer towards valid UTF-8 input. You could also use Arbitrary::arbitrary(...), which returns the prefix of data that is valid UTF-8.
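The difference between rejecting invalid input and keeping its valid prefix can be sketched with the standard library alone. The function names below are made up for illustration, and the second function only demonstrates the prefix idea; it is not the arbitrary crate's implementation.

```rust
use std::str;

/// Strict: accept the input only if all of it is valid UTF-8.
/// This mirrors the `from_utf8` check at the top of `fuzz_markdown`.
pub fn strict_utf8(data: &[u8]) -> Option<&str> {
    str::from_utf8(data).ok()
}

/// Lenient: return the longest prefix of `data` that is valid UTF-8.
pub fn valid_utf8_prefix(data: &[u8]) -> &str {
    match str::from_utf8(data) {
        Ok(s) => s,
        Err(e) => {
            // `valid_up_to()` is the length of the longest valid prefix,
            // so this second `from_utf8` call is not expected to fail.
            str::from_utf8(&data[..e.valid_up_to()]).unwrap_or("")
        }
    }
}
```

The strict version wastes fewer cycles on inputs that are mostly garbage, while the lenient version lets a mutation that corrupts the tail of an input still exercise the parser.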

Next, we have two slightly different codepaths depending on the REPORT_ALL_PANICS env var. If we're reporting all panics, then we will not have installed our panic hook. We call cb(markdown) directly and let any panic take down the process as usual.

If we are not reporting all panics, then our custom panic hook will have been installed at this point. We call cb(markdown) inside a catch_unwind block. If err is Ok then the testcase was successful and we propagate the return value of the callback. If an Err(...) is returned, we know that the panic was triggered in pulldown-cmark, because any other panic would have reached the libfuzzer-sys hook and aborted the process. We return Corpus::Reject to tell the fuzzer to ignore this testcase.

Using the fuzz harness

The fuzz_markdown function is generic over its callback and is used in a couple of different fuzz harnesses. For document parsing, we simply call the Document::new constructor:

fn fuzz_one(markdown: &str) {
    let _doc = Document::new(&MarkdownParser::default(), markdown);
}

fuzz_target!(|data: &[u8]| -> Corpus { fuzz_markdown(data, fuzz_one) });
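The C: Into<Corpus> bound is what allows a callback like fuzz_one to return nothing while other callbacks return a Corpus value directly. The sketch below shows the pattern with a local stand-in enum so that it compiles without the fuzzing crate. Verdict, run_case, parse_only and skip_empty are hypothetical names, and the sketch assumes that libfuzzer-sys converts () into Corpus::Keep in the same way.

```rust
/// Stand-in for `libfuzzer_sys::Corpus`, so the sketch compiles without
/// the fuzzing crate. `Verdict` is a hypothetical name.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Keep,
    Reject,
}

/// A callback that returns nothing means "keep this input".
impl From<()> for Verdict {
    fn from(_: ()) -> Self {
        Verdict::Keep
    }
}

/// Runs one testcase and converts whatever the callback returns.
pub fn run_case<F, C>(input: &str, cb: F) -> Verdict
where
    F: FnOnce(&str) -> C,
    C: Into<Verdict>,
{
    cb(input).into()
}

/// Harness body that never rejects: it returns `()`.
pub fn parse_only(input: &str) {
    let _word_count = input.split_whitespace().count();
}

/// Harness body that rejects inputs it considers uninteresting.
pub fn skip_empty(input: &str) -> Verdict {
    if input.trim().is_empty() {
        Verdict::Reject
    } else {
        Verdict::Keep
    }
}
```

Both callbacks can be passed to the same runner, for example run_case("# hello", parse_only) and run_case("   ", skip_empty), which is the property that lets one helper serve several harnesses.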