Pastelito is a VSCode extension to help you write better markdown documents. It aims to highlight grammatical features in your writing as you type. As such, it needs to be fast and responsive. In this post, I'll run through some of the optimizations I made to the perceptron model to make it ~10x faster.
- Overview
- Core tagging loop
- Setup
- Optimizations
  - Remove `HashMap` lookups when calculating scores
  - Use array indexes when calculating scores
  - Switch to `FxHashMap`
  - Use an enum for features
  - Do not look up the "bias" feature
  - Inline `Scores::update`
  - Accumulate weights, not features
  - Re-use `WordWeights` allocation
- Conclusion
Overview
Pastelito uses a perceptron model to predict the part of speech of each word in a document. *A Good Part-of-Speech Tagger in about 200 Lines of Python* gives a good overview of how a perceptron model works. Perceptron models are not the most accurate, but they are fast.
The high-level architecture of Pastelito is:
- Parse a markdown document into a series of "blocks" using pulldown-cmark. Titles, paragraphs, and list items are all blocks
- Each block is then tokenized into "words"
- Each word is then tagged with a part of speech tag
It is the final tagging step that we're going to optimize.
Core tagging loop
The core loop of the tagging logic is handled by `Tagger`, which uses a `Perceptron` to do the actual prediction for a block of words.
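A sketch of that outer loop, with assumed names and signatures:

```rust
// The Tagger walks the parsed blocks and hands each block of words to
// the Perceptron. Names here are assumptions for illustration.
impl Tagger {
    pub fn tag(&self, doc: &mut Document) {
        for block in doc.blocks_mut() {
            self.perceptron.predict(block.words_mut());
        }
    }
}
```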
The `Perceptron` then predicts the tag for each individual word, taking the surrounding context into account.
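A sketch of the prediction driver, again with assumed names; the two most recently predicted tags travel along as context for each word:

```rust
// predict_one's signature matches the diff at the end of this post;
// the surrounding names and types are assumptions.
impl Perceptron {
    pub fn predict(&self, block: &mut [Word]) {
        let context = Context::new(block); // assumed to copy the token data it needs
        let (mut p1, mut p2) = (POS::Start, POS::Start2); // assumed padding tags
        for (word_index, word) in block.iter_mut().enumerate() {
            let pos = self.predict_one(&context, word_index, word.as_str(), p1, p2);
            word.set_pos(pos);
            p2 = p1;
            p1 = pos;
        }
    }
}
```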
Setup
Before we dive in and start restructuring the code, let's set up a few things.
1. Regression tests
The codebase already includes various unit tests, but let's add a larger test to prevent any subtle bugs from creeping in. I used a copy of LogLog Games's *Leaving Rust gamedev after 3 years* post, converted to markdown. It contains ~18,000 words, which should be large enough to catch any unintended changes.
We store a pristine copy of the tagged words (`expected_tags`) and compare it against the output from the model (`actual_tags`).
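A sketch of the test, with a hypothetical fixture path and helper names:

```rust
// The fixture path and both helpers are placeholders for illustration.
#[test]
fn large_document_regression() {
    let markdown = include_str!("leaving-rust-gamedev.md"); // hypothetical path
    let expected_tags = load_pristine_tags(); // hypothetical: stored, known-good tags
    let actual_tags = tag_all_words(markdown); // hypothetical: parse + tag the document
    assert_eq!(expected_tags, actual_tags);
}
```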
2. Benchmarks
Let's also set up some benchmarks to ensure that the changes we make actually increase performance. Here's the core of the benchmark, using criterion, with a few clarifying comments:
```rust
const BLOG_POST: &str = include_str!(/* fixture path elided */);
...
```
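Filling in the elided parts, the benchmark could look roughly like this; the fixture path, the `Document` and `Tagger` APIs, and the word-count helper are assumptions rather than the real names:

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

// Hypothetical fixture path; the real one is elided above.
const BLOG_POST: &str = include_str!("leaving-rust-gamedev.md");

fn bench_tag(c: &mut Criterion) {
    // Parse once, outside the measurement loop.
    let doc = Document::parse(BLOG_POST);
    let tagger = Tagger::default();

    let mut group = c.benchmark_group("Tagger");
    // Report throughput in words/second, which criterion prints as Kelem/s.
    group.throughput(Throughput::Elements(doc.num_words() as u64));
    group.bench_function("tag", |b| {
        b.iter(|| {
            // Clone so each iteration tags a fresh, untagged copy.
            let mut doc = doc.clone();
            tagger.tag(&mut doc);
            doc
        })
    });
    group.finish();
}

criterion_group!(benches, bench_tag);
criterion_main!(benches);
```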
Running this benchmark on a MacBook Air M2 gives us our baseline of ~50ms to tag a large blog post:
```
$ cargo bench
...
Tagger/tag              time:   [50.349 ms 50.657 ms 50.992 ms]
                        thrpt:  [408.05 Kelem/s 410.74 Kelem/s 413.26 Kelem/s]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
...
```
Optimizations
Let's run through the various optimizations and measure the improvements. What follows is a neat, linear list of changes that sped up the main tagging loop; as with most optimization work, though, there was plenty of trial and error and a few dead ends along the way.
The perceptron code in Pastelito is not general-purpose and is heavily tied to the current model. This is a trade-off I made that allowed me to make various assumptions about the data and model. It is highly unlikely I will ever change the model, so this trade-off made sense for this project.
1. Remove `HashMap` lookups when calculating scores
A perceptron works by summing the weights of various features and taking the maximum to predict the part of speech of a word. The original loop, preserved in the diff at the end of this post, looks like this:
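```rust
// Original scoring: a HashMap from POS to accumulated weight.
let mut scores = HashMap::<POS, f32>::new();

for feature in features {
    if let Some(weights) = self.model.get_feature(&feature) {
        for (pos, weight) in weights {
            scores
                .entry(*pos)
                .and_modify(|s| *s += *weight)
                .or_insert(*weight);
        }
    }
}

// Take the maximum score to pick the predicted part of speech.
scores
    .into_iter()
    .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
    .map(|(pos, _)| pos)
    .unwrap()
```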
The model contains a fixed list of ~50 parts of speech. Instead of storing the scores in a `HashMap`, we can use a linear array.
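A minimal sketch of that layout, assuming the field names:

```rust
const POS_COUNT: usize = 50; // the model's fixed list of parts of speech

pub struct Scores {
    scores: [(POS, f32); POS_COUNT],
}

impl Scores {
    pub fn update(&mut self, pos: POS, weight: f32) {
        // A linear scan instead of a hash lookup; optimization 2 below
        // replaces this scan with direct indexing.
        for (p, score) in self.scores.iter_mut() {
            if *p == pos {
                *score += weight;
                break;
            }
        }
    }
}
```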
Implementing this gives us a 12% speedup against the baseline:
```
Tagger/tag              time:   [44.364 ms 44.540 ms 44.745 ms]
                        thrpt:  [465.02 Kelem/s 467.15 Kelem/s 469.00 Kelem/s]
                 change:
                        time:   [-12.764% -12.075% -11.426%] (p = 0.00 < 0.05)
                        thrpt:  [+12.900% +13.733% +14.631%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
```
2. Use array indexes when calculating scores
In retrospect, this is a pretty obvious optimization. Instead of using a linear search in `Scores`, `POS` can be used as an index by adding a `#[repr(..)]` attribute. I'm using some of strum's macros (`POS::COUNT`, `POS::iter()`) to make this easier.
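A sketch of how those pieces fit together; the variant names are illustrative:

```rust
use strum::{EnumCount, EnumIter, IntoEnumIterator};

// Give POS a stable integer representation so it can index directly
// into the scores array.
#[derive(Clone, Copy, PartialEq, Eq, EnumCount, EnumIter)]
#[repr(u8)]
pub enum POS {
    CoordinatingConjunction,
    Determiner,
    Noun,
    Verb,
    // ...~50 variants in total
}

pub struct Scores {
    scores: [f32; POS::COUNT], // POS::COUNT comes from strum's EnumCount
}

impl Scores {
    pub fn update(&mut self, pos: POS, weight: f32) {
        self.scores[pos as usize] += weight; // direct index, no search
    }

    pub fn max(&self) -> POS {
        // POS::iter() comes from strum's EnumIter
        POS::iter()
            .max_by(|a, b| {
                self.scores[*a as usize]
                    .partial_cmp(&self.scores[*b as usize])
                    .unwrap()
            })
            .unwrap()
    }
}
```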
This gives us a huge speedup:
```
Tagger/tag              time:   [20.320 ms 20.404 ms 20.498 ms]
                        thrpt:  [1.0151 Melem/s 1.0198 Melem/s 1.0239 Melem/s]
                 change:
                        time:   [-54.470% -54.190% -53.871%] (p = 0.00 < 0.05)
                        thrpt:  [+116.78% +118.29% +119.64%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
```
At this point, we're ~2.5x faster than the baseline just by tweaking the inner loop.
3. Switch to `FxHashMap`
We were able to use array indexes for the inner loop, but the outer loop still relies on `HashMap`s: the model maps features to weights using a `HashMap`, and since there are up to 14 possible features for a word, the outer loop performs up to 14 `HashMap` lookups per word. A quick look at the Rust Performance Book leads us to the section on Hashing. HashDoS attacks are not a concern for Pastelito, so we can switch to `FxHashMap`, which gives us a small speedup:
```
Tagger/tag              time:   [19.412 ms 19.687 ms 20.003 ms]
                        thrpt:  [1.0402 Melem/s 1.0569 Melem/s 1.0718 Melem/s]
                 change:
                        time:   [-5.0613% -3.5157% -1.9949%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0355% +3.6438% +5.3312%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
```
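The change itself is mostly a type swap; a sketch, with an assumed value type for the weights:

```rust
use rustc_hash::FxHashMap;

pub struct Model {
    // Before: std::collections::HashMap<String, Vec<(POS, f32)>>, which
    // defaults to the DoS-resistant but slower SipHash.
    // After: the same API with the much cheaper Fx hash function.
    features: FxHashMap<String, Vec<(POS, f32)>>,
}
```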
4. Use an enum for features
Let's take a quick look at the underlying model. It contains ~75,000 features that are encoded as strings, such as:
"i pref1 l"
: The current word begins with "l""i word flora"
: The current word is "flora""i suffix ity"
: The current word has a suffix of "ity""i-1 suffix ity"
: The previous word has a suffix of "ity""i+1 suffix cly"
: The next word has a suffix of "cly""i-1 tag+i word CC freight"
: The previous word's tag is a Coordinating Conjunction and the current word is "freight"
In the prediction loop, we build these features as strings based on the current word and its context, which leads to up to 14 allocations per word.
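A sketch of that string building, with illustrative variable names:

```rust
// Every format! call allocates a fresh String, up to 14 times per word.
let mut features: Vec<String> = Vec::new();
features.push("bias".to_string());
features.push(format!("i pref1 {}", first_char));
features.push(format!("i word {}", word));
features.push(format!("i suffix {}", suffix));
features.push(format!("i-1 suffix {}", prev_suffix));
features.push(format!("i-1 tag+i word {} {}", prev_tag, word));
// ...and so on for the remaining feature kinds
```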
But the format of these strings is highly structured:

- There are only 14 different kinds of features (`i word`, `i-1 suffix`, `i+1 suffix`, etc.)
- Suffixes are limited to one, two, or three characters
- All words are 23 characters or less
- All characters are ASCII

Instead of using strings, we can encode this as an enum which fits into a snug 24 bytes. A sketch, with variant names taken from the diff at the end of this post; the compact `Word` and `Suffix` types, and the exact packing that lands on 24 bytes, are assumptions:
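```rust
// Assumed compact, Copy-able representations (the real packing is more
// careful than this sketch):
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Suffix([u8; 3]); // 1-3 ASCII bytes, zero-padded

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Word([u8; 23]); // up to 23 ASCII bytes, zero-padded

// One variant per kind of feature: 14 in total, including the bias.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub enum Feature {
    Bias,
    Pref1(u8), // first ASCII byte of the current word
    Suffix(Suffix),
    IMinus1Suffix(Suffix),
    IPlus1Suffix(Suffix),
    IMinus1Tag(POS),
    IMinus2Tag(POS),
    ITagPlusIMinus2Tag(POS, POS),
    IWord(Word),
    IMinus1Word(Word),
    IMinus2Word(Word),
    IPlus1Word(Word),
    IPlus2Word(Word),
    IMinus1TagPlusIWord(POS, Word),
}
```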
The `build.rs` script in `pastelito-model` takes the JSON-formatted model from prose and converts it to this more efficient format. The model now stores a mapping of `Feature` to weights, and our main loop looks almost identical.
This gives us another nice speedup and we're ~6.5x faster than the baseline:
```
Tagger/tag              time:   [7.5560 ms 7.5750 ms 7.5968 ms]
                        thrpt:  [2.7389 Melem/s 2.7468 Melem/s 2.7537 Melem/s]
                 change:
                        time:   [-62.129% -61.522% -60.967%] (p = 0.00 < 0.05)
                        thrpt:  [+156.19% +159.89% +164.06%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
```
5. Do not look up the "bias" feature
The `Feature::Bias` feature represents the initial bias of the perceptron and should be included for every word.
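The algorithm looks something like this sketch, using the names from the earlier sections:

```rust
let mut features = Vec::new();
// The bias feature is pushed unconditionally for every word...
features.push(Feature::Bias);
// ...build the remaining features...

let mut scores = Scores::default();
for feature in features {
    // ...so every word pays for one extra hash lookup here.
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```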
Instead, we can modify our `build.rs` script to generate a default set of scores which includes the `Feature::Bias` data. This saves us one hash lookup per predicted word.
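In sketch form:

```rust
// build.rs bakes the bias weights into a precomputed Scores value, so
// prediction starts from a copy of it. initial_scores() is the method
// name that appears in the final diff.
let mut scores = self.model.initial_scores();
// Feature::Bias is no longer pushed, so its hash lookup disappears.
```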
This gives us a ~5% speedup compared to the previous commit:
```
Tagger/tag              time:   [7.1422 ms 7.1571 ms 7.1749 ms]
                        thrpt:  [2.9000 Melem/s 2.9072 Melem/s 2.9132 Melem/s]
                 change:
                        time:   [-5.8425% -5.5177% -5.1983%] (p = 0.00 < 0.05)
                        thrpt:  [+5.4834% +5.8399% +6.2050%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
```
6. Inline `Scores::update`
At this point we're starting to run out of simple algorithmic changes, so I started to look at the generated assembly. Looking at the main prediction loop, I noticed that the `Scores::update` function was not being inlined.
Here is the scoring loop in Rust:

```rust
for feature in features {
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```
And here is the corresponding disassembly:
```
$ gobjdump -d target/release/deps/pastelito-72b4e8c44e83f86f | rustfilt
...
10003b758: f9401fe0  ldr   x0, [sp, #56]
10003b75c: 94008436  bl    10005c834 <_pastelito_data::Model::get>
10003b760: b4fffd80  cbz   x0, 10003b710 <_pastelito_core::perceptron::Perceptron::predict+0x914>
10003b764: d37df03a  lsl   x26, x1, #3
10003b768: b4fffd5a  cbz   x26, 10003b710 <_pastelito_core::perceptron::Perceptron::predict+0x914>
10003b76c: 91002013  add   x19, x0, #0x8
10003b770: bd400400  ldr   s0, [x0, #4]
10003b774: d100235a  sub   x26, x26, #0x8
10003b778: 39400001  ldrb  w1, [x0]
10003b77c: 910443e0  add   x0, sp, #0x110
10003b780: 9400826d  bl    10005c134 <_pastelito_data::Scores::update>
10003b784: aa1303e0  mov   x0, x19
10003b788: 17fffff8  b     10003b768 <_pastelito_core::perceptron::Perceptron::predict+0x96c>
10003b78c: b4ffbed7  cbz   x23, 10003af64 <_pastelito_core::perceptron::Perceptron::predict+0x168>
...
```
The call to `Scores::update(...)` is not being inlined! It's just an array lookup and an addition, so you would expect it to be inlined:
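```rust
// Scores::update is just an index and an add (field name assumed):
pub fn update(&mut self, pos: POS, weight: f32) {
    self.scores[pos as usize] += weight;
}
```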
This is possibly because `Scores::update(...)` is implemented in the `pastelito_data` crate but the prediction logic is implemented in the `pastelito_core` crate. Adding an `#[inline]` hint fixes the issue: `Scores::update(...)` is now inlined as the `ldr`/`fadd`/`str` sequence:
```
10003a2b8: 9400895c  bl    10005c828 <_pastelito_data::Model::get>
10003a2bc: b4fffd80  cbz   x0, 10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2c0: b4fffd61  cbz   x1, 10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2c4: 8b010c08  add   x8, x0, x1, lsl #3
10003a2c8: 39400009  ldrb  w9, [x0]
10003a2cc: bc697a60  ldr   s0, [x19, x9, lsl #2]
10003a2d0: bd400401  ldr   s1, [x0, #4]
10003a2d4: 1e202820  fadd  s0, s1, s0
10003a2d8: bc297a60  str   s0, [x19, x9, lsl #2]
10003a2dc: 91002000  add   x0, x0, #0x8
10003a2e0: eb08001f  cmp   x0, x8
10003a2e4: 54ffff21  b.ne  10003a2c8 <_pastelito_core::perceptron::Perceptron::predict+0x95c>  // b.any
10003a2e8: 17ffffe1  b     10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2ec: b40000d7  cbz   x23, 10003a304 <_pastelito_core::perceptron::Perceptron::predict+0x998>
```
This gives us a ~5% speedup compared to the previous commit and we're now ~7.5x faster than the baseline:
```
Tagger/tag              time:   [6.7355 ms 6.7525 ms 6.7725 ms]
                        thrpt:  [3.0723 Melem/s 3.0814 Melem/s 3.0892 Melem/s]
                 change:
                        time:   [-5.9900% -5.6528% -5.3410%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6424% +5.9914% +6.3716%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
```
7. Accumulate weights, not features
What other micro-optimizations can we make? The prediction loop now looks like this:
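(This listing is reconstructed from the diff in the conclusion.)

```rust
for feature in features {
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```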
So the scoring loop ping-pongs between `Model::get(...)` and `Scores::update(...)`. Instead, what would happen if we grouped all the `Model::get(...)` calls together and then updated the scores in a single loop? The `WordWeights` struct wraps up some of this logic. A sketch, with method names matching how it is used in the final diff and assumed fields:
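```rust
// Gather the weight slices for all of a word's features first, then let
// the caller fold them into Scores in one tight loop.
pub struct WordWeights<'a> {
    model: &'a Model,
    weights: Vec<&'a [(POS, f32)]>,
}

impl<'a> WordWeights<'a> {
    pub fn new(model: &'a Model) -> Self {
        WordWeights { model, weights: Vec::new() }
    }

    pub fn clear(&mut self) {
        self.weights.clear();
    }

    /// Look up a feature and stash its weights for later accumulation.
    pub fn push(&mut self, feature: &Feature) {
        if let Some(weights) = self.model.get(feature) {
            self.weights.push(weights);
        }
    }

    pub fn as_slice(&self) -> &[&'a [(POS, f32)]] {
        &self.weights
    }
}
```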
Would this make a measurable difference? Surprisingly, yes! This is a whopping ~28% speedup compared to the previous commit:
```
Tagger/tag              time:   [4.8053 ms 4.8241 ms 4.8478 ms]
                        thrpt:  [4.2920 Melem/s 4.3131 Melem/s 4.3300 Melem/s]
                 change:
                        time:   [-28.894% -28.558% -28.159%] (p = 0.00 < 0.05)
                        thrpt:  [+39.196% +39.973% +40.635%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
```
8. Re-use `WordWeights` allocation
The `WordWeights` struct is allocated for each word in `Perceptron::predict_one`. Instead, we can allocate it once per block and re-use it for each word in the block.
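In sketch form (the names outside the `predict_one` signature are assumptions), the code becomes:

```rust
// One WordWeights per block, created outside the per-word loop...
let mut weights = WordWeights::new(&self.model);
let (mut p1, mut p2) = (POS::Start, POS::Start2);
for (word_index, word) in block.iter_mut().enumerate() {
    // ...and re-used by every call; predict_one clears it instead of
    // allocating a fresh one.
    let pos = self.predict_one(&mut weights, &context, word_index, word.as_str(), p1, p2);
    word.set_pos(pos);
    p2 = p1;
    p1 = pos;
}
```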
This gives another ~5% speedup compared to the previous commit:
```
Tagger/tag              time:   [4.6106 ms 4.6176 ms 4.6259 ms]
                        thrpt:  [4.4980 Melem/s 4.5060 Melem/s 4.5129 Melem/s]
                 change:
                        time:   [-4.7761% -4.2819% -3.8686%] (p = 0.00 < 0.05)
                        thrpt:  [+4.0243% +4.4734% +5.0157%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```
This isn't actually a huge speedup, and the reason lies in how allocators are structured. We were constantly freeing and re-allocating the `Vec` storage for `WordWeights` without any other allocations in between, which is a common pattern that allocators optimize for.

Looking specifically at the macOS allocator, the `Vec` allocation is small enough to fall under the "tiny" threshold defined by the allocator. As an optimization, the allocator keeps a per-CPU cache of the most recently freed allocations, and the prediction loop constantly hits this fast path.
Conclusion
We started with a baseline of ~50ms to tag a large blog post. After a series of small optimizations, we reduced the time to ~4.6ms — a ~10x speedup. This involved a variety of techniques, some algorithmic and some micro-optimizations.
After all of those changes, it is nice to see that the code of the tagger, `Perceptron::predict_one`, hasn't changed much. We can have confidence that the algorithm is equivalent to the original:
```diff
diff --git a/pastelito-core/src/perceptron.rs b/pastelito-core/src/perceptron.rs
index 5bea678..7531ed1 100644
     /// Predict the POS tag for a single word.
     fn predict_one(
         &self,
+        weights: &mut WordWeights,
         context: &Context,
         word_index: usize,
         token: &str,
         p1: POS,
         p2: POS,
     ) -> POS {
         let context_index = word_index + 2;
-        let mut features = Vec::new();
-
-        features.push(pastelito_data::bias());
+        weights.clear();
         if let Ok(suffix) = token.try_into() {
-            features.push(pastelito_data::suffix(suffix));
+            weights.push(&Feature::Suffix(suffix));
         }
         if let Some(c) = token.chars().next().unwrap().as_ascii() {
-            features.push(pastelito_data::pref1(c.to_u8()));
+            weights.push(&Feature::Pref1(c.to_u8()));
         }
-        features.push(pastelito_data::iminus1tag(p1));
-        features.push(pastelito_data::iminus2tag(p2));
-        features.push(pastelito_data::itagplusiminus2tag(p1, p2));
+        weights.push(&Feature::IMinus1Tag(p1));
+        weights.push(&Feature::IMinus2Tag(p2));
+        weights.push(&Feature::ITagPlusIMinus2Tag(p1, p2));
         if let Some(word) = context.word(context_index) {
-            features.push(pastelito_data::iword(word));
-            features.push(pastelito_data::iminus1tagplusiword(p1, word));
+            weights.push(&Feature::IWord(word));
+            weights.push(&Feature::IMinus1TagPlusIWord(p1, word));
         }
         if let Some(word) = context.word(context_index - 1) {
-            features.push(pastelito_data::iminus1word(word));
-            features.push(pastelito_data::iminus1suffix(word.suffix()));
+            weights.push(&Feature::IMinus1Word(word));
+            weights.push(&Feature::IMinus1Suffix(word.suffix()));
         }
         if let Some(word) = context.word(context_index - 2) {
-            features.push(pastelito_data::iminus2word(word));
+            weights.push(&Feature::IMinus2Word(word));
         }
         if let Some(word) = context.word(context_index + 1) {
-            features.push(pastelito_data::iplus1word(word));
-            features.push(pastelito_data::iplus1suffix(word.suffix()));
+            weights.push(&Feature::IPlus1Word(word));
+            weights.push(&Feature::IPlus1Suffix(word.suffix()));
         }
         if let Some(word) = context.word(context_index + 2) {
-            features.push(pastelito_data::iplus2word(word));
+            weights.push(&Feature::IPlus2Word(word));
         }
-        let mut scores = HashMap::<POS, f32>::new();
+        let mut scores = self.model.initial_scores();
-        for feature in features {
-            if let Some(weights) = self.model.get_feature(&feature) {
-                for (pos, weight) in weights {
-                    scores
-                        .entry(*pos)
-                        .and_modify(|s| *s += *weight)
-                        .or_insert(*weight);
-                }
+        for w in weights.as_slice() {
+            for (pos, weight) in *w {
+                scores.update(*pos, *weight);
             }
         }
-        scores
-            .into_iter()
-            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
-            .map(|(pos, _)| pos)
-            .unwrap()
+        scores.max()
     }
```