Pastelito is a VSCode extension to help you write better markdown documents. It aims to highlight grammatical features in your writing as you type. As such, it needs to be fast and responsive. In this post, I'll run through some of the optimizations I made to the perceptron model to make it ~10x faster.
- Overview
- Core tagging loop
- Setup
- Optimizations
  - Remove `HashMap` lookups when calculating scores
  - Use array indexes when calculating scores
  - Switch to `FxHashMap`
  - Use an enum for features
  - Do not look up the "bias" feature
  - Inline `Scores::update`
  - Accumulate weights, not features
  - Re-use `WordWeights` allocation
- Conclusion
Overview
Pastelito uses a perceptron model to predict the part of speech of each word in a document. *A Good Part-of-Speech Tagger in about 200 Lines of Python* gives a good overview of how a perceptron model works. Perceptron models are not the most accurate, but they are fast.
The high-level architecture of Pastelito is:
- Parse a markdown document into a series of "blocks" using pulldown-cmark. Titles, paragraphs, and list items are all blocks
- Each block is then tokenized into "words"
- Each word is then tagged with a part of speech tag
It is the final tagging step that we're going to optimize.
Core tagging loop
The core loop of the tagging logic is handled by `Tagger`, which uses a `Perceptron` to do the actual prediction for a block of words.
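A sketch of that outer loop, with assumed names and signatures:

```rust
// The Tagger walks the parsed blocks and hands each block of words to
// the Perceptron. Names here are assumptions for illustration.
impl Tagger {
    pub fn tag(&self, doc: &mut Document) {
        for block in doc.blocks_mut() {
            self.perceptron.predict(block.words_mut());
        }
    }
}
```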
The `Perceptron` then predicts the tag for each individual word, taking the surrounding context into account.
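A sketch of the prediction driver, again with assumed names; the two most recently predicted tags travel along as context for each word:

```rust
// predict_one's signature matches the diff at the end of this post;
// the surrounding names and types are assumptions.
impl Perceptron {
    pub fn predict(&self, block: &mut [Word]) {
        let context = Context::new(block); // assumed to copy the token data it needs
        let (mut p1, mut p2) = (POS::Start, POS::Start2); // assumed padding tags
        for (word_index, word) in block.iter_mut().enumerate() {
            let pos = self.predict_one(&context, word_index, word.as_str(), p1, p2);
            word.set_pos(pos);
            p2 = p1;
            p1 = pos;
        }
    }
}
```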
Setup
Before we dive in and start restructuring the code, let's set up a few things.
1. Regression tests
The codebase already includes various unit tests, but let's add a larger test to prevent any subtle bugs from creeping in. I used a copy of LogLog Games's *Leaving Rust gamedev after 3 years* post, converted to markdown. It contains ~18,000 words, which should be large enough to catch any unintended changes.
We store a pristine copy of the tagged words (`expected_tags`) and compare it against the output from the model (`actual_tags`).
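A sketch of the test, with a hypothetical fixture path and helper names:

```rust
// The fixture path and both helpers are placeholders for illustration.
#[test]
fn large_document_regression() {
    let markdown = include_str!("leaving-rust-gamedev.md"); // hypothetical path
    let expected_tags = load_pristine_tags(); // hypothetical: stored, known-good tags
    let actual_tags = tag_all_words(markdown); // hypothetical: parse + tag the document
    assert_eq!(expected_tags, actual_tags);
}
```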
2. Benchmarks
Let's also set up some benchmarks to ensure that the changes we make actually increase performance. Here's the core of the benchmark, using criterion, with a few clarifying comments:
```rust
const BLOG_POST: &str = include_str!(/* fixture path elided */);
...
```
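Filling in the elided parts, the benchmark could look roughly like this; the fixture path, the `Document` and `Tagger` APIs, and the word-count helper are assumptions rather than the real names:

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

// Hypothetical fixture path; the real one is elided above.
const BLOG_POST: &str = include_str!("leaving-rust-gamedev.md");

fn bench_tag(c: &mut Criterion) {
    // Parse once, outside the measurement loop.
    let doc = Document::parse(BLOG_POST);
    let tagger = Tagger::default();

    let mut group = c.benchmark_group("Tagger");
    // Report throughput in words/second, which criterion prints as Kelem/s.
    group.throughput(Throughput::Elements(doc.num_words() as u64));
    group.bench_function("tag", |b| {
        b.iter(|| {
            // Clone so each iteration tags a fresh, untagged copy.
            let mut doc = doc.clone();
            tagger.tag(&mut doc);
            doc
        })
    });
    group.finish();
}

criterion_group!(benches, bench_tag);
criterion_main!(benches);
```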
Running this benchmark on a MacBook Air M2 gives us our baseline of ~50ms to tag a large blog post:
```
$ cargo bench
...
Tagger/tag              time:   [50.349 ms 50.657 ms 50.992 ms]
                        thrpt:  [408.05 Kelem/s 410.74 Kelem/s 413.26 Kelem/s]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
...
```
Optimizations
Let's run through the various optimizations and measure the improvements. What follows is a neat, linear list of changes that sped up the main tagging loop; as with most optimization work, though, there was plenty of trial and error and a few dead ends along the way.
The perceptron code in Pastelito is not general-purpose and is heavily tied to the current model. This is a trade-off I made that allowed me to make various assumptions about the data and model. It is highly unlikely I will ever change the model, so this trade-off made sense for this project.
1. Remove `HashMap` lookups when calculating scores
A perceptron works by summing the weights of various features and taking the maximum to predict the part of speech of a word. The original loop, preserved in the diff at the end of this post, looks like this:
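```rust
// Original scoring: a HashMap from POS to accumulated weight.
let mut scores = HashMap::<POS, f32>::new();

for feature in features {
    if let Some(weights) = self.model.get_feature(&feature) {
        for (pos, weight) in weights {
            scores
                .entry(*pos)
                .and_modify(|s| *s += *weight)
                .or_insert(*weight);
        }
    }
}

// Take the maximum score to pick the predicted part of speech.
scores
    .into_iter()
    .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
    .map(|(pos, _)| pos)
    .unwrap()
```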
The model contains a fixed list of ~50 parts of speech. Instead of storing the scores in a `HashMap`, we can use a linear array.
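A minimal sketch of that layout, assuming the field names:

```rust
const POS_COUNT: usize = 50; // the model's fixed list of parts of speech

pub struct Scores {
    scores: [(POS, f32); POS_COUNT],
}

impl Scores {
    pub fn update(&mut self, pos: POS, weight: f32) {
        // A linear scan instead of a hash lookup; optimization 2 below
        // replaces this scan with direct indexing.
        for (p, score) in self.scores.iter_mut() {
            if *p == pos {
                *score += weight;
                break;
            }
        }
    }
}
```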
Implementing this gives us a 12% speedup against the baseline:
```
Tagger/tag              time:   [44.364 ms 44.540 ms 44.745 ms]
                        thrpt:  [465.02 Kelem/s 467.15 Kelem/s 469.00 Kelem/s]
                 change:
                        time:   [-12.764% -12.075% -11.426%] (p = 0.00 < 0.05)
                        thrpt:  [+12.900% +13.733% +14.631%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
```
2. Use array indexes when calculating scores
In retrospect, this is a pretty obvious optimization. Instead of using a linear search in `Scores`, `POS` can be used as an index by adding a `#[repr(..)]` attribute. I'm using some of strum's macros (`POS::COUNT`, `POS::iter()`) to make this easier.
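A sketch of how those pieces fit together; the variant names are illustrative:

```rust
use strum::{EnumCount, EnumIter, IntoEnumIterator};

// Give POS a stable integer representation so it can index directly
// into the scores array.
#[derive(Clone, Copy, PartialEq, Eq, EnumCount, EnumIter)]
#[repr(u8)]
pub enum POS {
    CoordinatingConjunction,
    Determiner,
    Noun,
    Verb,
    // ...~50 variants in total
}

pub struct Scores {
    scores: [f32; POS::COUNT], // POS::COUNT comes from strum's EnumCount
}

impl Scores {
    pub fn update(&mut self, pos: POS, weight: f32) {
        self.scores[pos as usize] += weight; // direct index, no search
    }

    pub fn max(&self) -> POS {
        // POS::iter() comes from strum's EnumIter
        POS::iter()
            .max_by(|a, b| {
                self.scores[*a as usize]
                    .partial_cmp(&self.scores[*b as usize])
                    .unwrap()
            })
            .unwrap()
    }
}
```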
This gives us a huge speedup:
```
Tagger/tag              time:   [20.320 ms 20.404 ms 20.498 ms]
                        thrpt:  [1.0151 Melem/s 1.0198 Melem/s 1.0239 Melem/s]
                 change:
                        time:   [-54.470% -54.190% -53.871%] (p = 0.00 < 0.05)
                        thrpt:  [+116.78% +118.29% +119.64%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
```
At this point, we're ~2.5x faster than the baseline just by tweaking the inner loop.
3. Switch to `FxHashMap`
We were able to use array indexes for the inner loop, but the outer loop still relies on `HashMap`s: the model maps features to weights using a `HashMap`, and since there are up to 14 possible features for a word, the outer loop performs up to 14 `HashMap` lookups per word. A quick look at the Rust Performance Book leads us to the section on Hashing. HashDoS attacks are not a concern for Pastelito, so we can switch to `FxHashMap`, which gives us a small speedup:
```
Tagger/tag              time:   [19.412 ms 19.687 ms 20.003 ms]
                        thrpt:  [1.0402 Melem/s 1.0569 Melem/s 1.0718 Melem/s]
                 change:
                        time:   [-5.0613% -3.5157% -1.9949%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0355% +3.6438% +5.3312%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
```
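The change itself is mostly a type swap; a sketch, with an assumed value type for the weights:

```rust
use rustc_hash::FxHashMap;

pub struct Model {
    // Before: std::collections::HashMap<String, Vec<(POS, f32)>>, which
    // defaults to the DoS-resistant but slower SipHash.
    // After: the same API with the much cheaper Fx hash function.
    features: FxHashMap<String, Vec<(POS, f32)>>,
}
```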
4. Use an enum for features
Let's take a quick look at the underlying model. It contains ~75,000 features that are encoded as strings, such as:
"i pref1 l"
: The current word begins with "l""i word flora"
: The current word is "flora""i suffix ity"
: The current word has a suffix of "ity""i-1 suffix ity"
: The previous word has a suffix of "ity""i+1 suffix cly"
: The next word has a suffix of "cly""i-1 tag+i word CC freight"
: The previous word's tag is a Coordinating Conjunction and the current word is "freight"
In the prediction loop, we build these features as strings based on the current word and its context, which leads to up to 14 allocations per word.
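A sketch of that string building, with illustrative variable names:

```rust
// Every format! call allocates a fresh String, up to 14 times per word.
let mut features: Vec<String> = Vec::new();
features.push("bias".to_string());
features.push(format!("i pref1 {}", first_char));
features.push(format!("i word {}", word));
features.push(format!("i suffix {}", suffix));
features.push(format!("i-1 suffix {}", prev_suffix));
features.push(format!("i-1 tag+i word {} {}", prev_tag, word));
// ...and so on for the remaining feature kinds
```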
But the format of these strings is highly structured:

- There are only 14 different kinds of features (`i word`, `i-1 suffix`, `i+1 suffix`, etc.)
- Suffixes are limited to one, two, or three characters
- All words are 23 characters or less
- All characters are ASCII

Instead of using strings, we can encode this as an enum which fits into a snug 24 bytes. A sketch, with variant names taken from the diff at the end of this post; the compact `Word` and `Suffix` types, and the exact packing that lands on 24 bytes, are assumptions:
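```rust
// Assumed compact, Copy-able representations (the real packing is more
// careful than this sketch):
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Suffix([u8; 3]); // 1-3 ASCII bytes, zero-padded

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Word([u8; 23]); // up to 23 ASCII bytes, zero-padded

// One variant per kind of feature: 14 in total, including the bias.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub enum Feature {
    Bias,
    Pref1(u8), // first ASCII byte of the current word
    Suffix(Suffix),
    IMinus1Suffix(Suffix),
    IPlus1Suffix(Suffix),
    IMinus1Tag(POS),
    IMinus2Tag(POS),
    ITagPlusIMinus2Tag(POS, POS),
    IWord(Word),
    IMinus1Word(Word),
    IMinus2Word(Word),
    IPlus1Word(Word),
    IPlus2Word(Word),
    IMinus1TagPlusIWord(POS, Word),
}
```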
The `build.rs` script in `pastelito-model` takes the JSON-formatted model from prose and converts it to this more efficient format. The model now stores a mapping of `Feature` to weights, and our main loop looks almost identical.
This gives us another nice speedup and we're ~6.5x faster than the baseline:
```
Tagger/tag              time:   [7.5560 ms 7.5750 ms 7.5968 ms]
                        thrpt:  [2.7389 Melem/s 2.7468 Melem/s 2.7537 Melem/s]
                 change:
                        time:   [-62.129% -61.522% -60.967%] (p = 0.00 < 0.05)
                        thrpt:  [+156.19% +159.89% +164.06%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
```
5. Do not look up the "bias" feature
The `Feature::Bias` feature represents the initial bias of the perceptron and should be included for every word.
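The algorithm looks something like this sketch, using the names from the earlier sections:

```rust
let mut features = Vec::new();
// The bias feature is pushed unconditionally for every word...
features.push(Feature::Bias);
// ...build the remaining features...

let mut scores = Scores::default();
for feature in features {
    // ...so every word pays for one extra hash lookup here.
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```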
Instead, we can modify our `build.rs` script to generate a default set of scores which includes the `Feature::Bias` data. This saves us one hash lookup per predicted word.
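In sketch form:

```rust
// build.rs bakes the bias weights into a precomputed Scores value, so
// prediction starts from a copy of it. initial_scores() is the method
// name that appears in the final diff.
let mut scores = self.model.initial_scores();
// Feature::Bias is no longer pushed, so its hash lookup disappears.
```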
This gives us a ~5% speedup compared to the previous commit:
```
Tagger/tag              time:   [7.1422 ms 7.1571 ms 7.1749 ms]
                        thrpt:  [2.9000 Melem/s 2.9072 Melem/s 2.9132 Melem/s]
                 change:
                        time:   [-5.8425% -5.5177% -5.1983%] (p = 0.00 < 0.05)
                        thrpt:  [+5.4834% +5.8399% +6.2050%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
```
6. Inline `Scores::update`
At this point we're starting to run out of simple algorithmic changes, so I started to look at the generated assembly. Looking at the main prediction loop, I noticed that the `Scores::update` function was not being inlined.
Here is the scoring loop in Rust:

```rust
for feature in features {
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```
And here is the corresponding disassembly:
```
$ gobjdump -d target/release/deps/pastelito-72b4e8c44e83f86f | rustfilt
...
10003b758: f9401fe0  ldr   x0, [sp, #56]
10003b75c: 94008436  bl    10005c834 <_pastelito_data::Model::get>
10003b760: b4fffd80  cbz   x0, 10003b710 <_pastelito_core::perceptron::Perceptron::predict+0x914>
10003b764: d37df03a  lsl   x26, x1, #3
10003b768: b4fffd5a  cbz   x26, 10003b710 <_pastelito_core::perceptron::Perceptron::predict+0x914>
10003b76c: 91002013  add   x19, x0, #0x8
10003b770: bd400400  ldr   s0, [x0, #4]
10003b774: d100235a  sub   x26, x26, #0x8
10003b778: 39400001  ldrb  w1, [x0]
10003b77c: 910443e0  add   x0, sp, #0x110
10003b780: 9400826d  bl    10005c134 <_pastelito_data::Scores::update>
10003b784: aa1303e0  mov   x0, x19
10003b788: 17fffff8  b     10003b768 <_pastelito_core::perceptron::Perceptron::predict+0x96c>
10003b78c: b4ffbed7  cbz   x23, 10003af64 <_pastelito_core::perceptron::Perceptron::predict+0x168>
...
```
The call to `Scores::update(...)` is not being inlined! It's just an array lookup and an addition, so you would expect it to be inlined:
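```rust
// Scores::update is just an index and an add (field name assumed):
pub fn update(&mut self, pos: POS, weight: f32) {
    self.scores[pos as usize] += weight;
}
```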
This is possibly because `Scores::update(...)` is implemented in the `pastelito_data` crate but the prediction logic is implemented in the `pastelito_core` crate. Adding an `#[inline]` hint fixes the issue: `Scores::update(...)` is now inlined as the `ldr`/`fadd`/`str` sequence:
```
10003a2b8: 9400895c  bl    10005c828 <_pastelito_data::Model::get>
10003a2bc: b4fffd80  cbz   x0, 10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2c0: b4fffd61  cbz   x1, 10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2c4: 8b010c08  add   x8, x0, x1, lsl #3
10003a2c8: 39400009  ldrb  w9, [x0]
10003a2cc: bc697a60  ldr   s0, [x19, x9, lsl #2]
10003a2d0: bd400401  ldr   s1, [x0, #4]
10003a2d4: 1e202820  fadd  s0, s1, s0
10003a2d8: bc297a60  str   s0, [x19, x9, lsl #2]
10003a2dc: 91002000  add   x0, x0, #0x8
10003a2e0: eb08001f  cmp   x0, x8
10003a2e4: 54ffff21  b.ne  10003a2c8 <_pastelito_core::perceptron::Perceptron::predict+0x95c>  // b.any
10003a2e8: 17ffffe1  b     10003a26c <_pastelito_core::perceptron::Perceptron::predict+0x900>
10003a2ec: b40000d7  cbz   x23, 10003a304 <_pastelito_core::perceptron::Perceptron::predict+0x998>
```
This gives us a ~5% speedup compared to the previous commit and we're now ~7.5x faster than the baseline:
```
Tagger/tag              time:   [6.7355 ms 6.7525 ms 6.7725 ms]
                        thrpt:  [3.0723 Melem/s 3.0814 Melem/s 3.0892 Melem/s]
                 change:
                        time:   [-5.9900% -5.6528% -5.3410%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6424% +5.9914% +6.3716%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
```
7. Accumulate weights, not features
What other micro-optimizations can we make? The prediction loop now looks like this:
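(This listing is reconstructed from the diff in the conclusion.)

```rust
for feature in features {
    if let Some(weights) = self.model.get(&feature) {
        for (pos, weight) in weights {
            scores.update(*pos, *weight);
        }
    }
}
```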
So the scoring loop ping-pongs between `Model::get(...)` and `Scores::update(...)`. Instead, what would happen if we grouped all the `Model::get(...)` calls together and then updated the scores in a single loop? The `WordWeights` struct wraps up some of this logic. A sketch, with method names matching how it is used in the final diff and assumed fields:
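```rust
// Gather the weight slices for all of a word's features first, then let
// the caller fold them into Scores in one tight loop.
pub struct WordWeights<'a> {
    model: &'a Model,
    weights: Vec<&'a [(POS, f32)]>,
}

impl<'a> WordWeights<'a> {
    pub fn new(model: &'a Model) -> Self {
        WordWeights { model, weights: Vec::new() }
    }

    pub fn clear(&mut self) {
        self.weights.clear();
    }

    /// Look up a feature and stash its weights for later accumulation.
    pub fn push(&mut self, feature: &Feature) {
        if let Some(weights) = self.model.get(feature) {
            self.weights.push(weights);
        }
    }

    pub fn as_slice(&self) -> &[&'a [(POS, f32)]] {
        &self.weights
    }
}
```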
Would this make a measurable difference? Surprisingly, yes! This is a whopping ~28% speedup compared to the previous commit:
```
Tagger/tag              time:   [4.8053 ms 4.8241 ms 4.8478 ms]
                        thrpt:  [4.2920 Melem/s 4.3131 Melem/s 4.3300 Melem/s]
                 change:
                        time:   [-28.894% -28.558% -28.159%] (p = 0.00 < 0.05)
                        thrpt:  [+39.196% +39.973% +40.635%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
```
8. Re-use `WordWeights` allocation
The `WordWeights` struct is allocated for each word in `Perceptron::predict_one`. Instead, we can allocate it once per block and re-use it for each word in the block.
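In sketch form (the names outside the `predict_one` signature are assumptions), the code becomes:

```rust
// One WordWeights per block, created outside the per-word loop...
let mut weights = WordWeights::new(&self.model);
let (mut p1, mut p2) = (POS::Start, POS::Start2);
for (word_index, word) in block.iter_mut().enumerate() {
    // ...and re-used by every call; predict_one clears it instead of
    // allocating a fresh one.
    let pos = self.predict_one(&mut weights, &context, word_index, word.as_str(), p1, p2);
    word.set_pos(pos);
    p2 = p1;
    p1 = pos;
}
```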
This gives another ~5% speedup compared to the previous commit:
```
Tagger/tag              time:   [4.6106 ms 4.6176 ms 4.6259 ms]
                        thrpt:  [4.4980 Melem/s 4.5060 Melem/s 4.5129 Melem/s]
                 change:
                        time:   [-4.7761% -4.2819% -3.8686%] (p = 0.00 < 0.05)
                        thrpt:  [+4.0243% +4.4734% +5.0157%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
```
This isn't actually a huge speedup, and the reason lies in how allocators are structured. We were constantly freeing and re-allocating the `Vec` storage for `WordWeights` without any other allocations in between, which is a common pattern that allocators optimize for.

Looking specifically at the macOS allocator, the `Vec` allocation is small enough to fall under the "tiny" threshold defined by the allocator. As an optimization, the allocator keeps a per-CPU cache of the most recently freed allocations, and the prediction loop constantly hits this fast path.
Conclusion
We started with a baseline of ~50ms to tag a large blog post. After a series of small optimizations, we reduced the time to ~4.6ms — a ~10x speedup. This involved a variety of techniques, some algorithmic and some micro-optimizations.
After all of those changes, it is nice to see that the code of the tagger, `Perceptron::predict_one`, hasn't changed much. We can have confidence that the algorithm is equivalent to the original:
```diff
diff --git a/pastelito-core/src/perceptron.rs b/pastelito-core/src/perceptron.rs
index 5bea678..7531ed1 100644
     /// Predict the POS tag for a single word.
     fn predict_one(
         &self,
+        weights: &mut WordWeights,
         context: &Context,
         word_index: usize,
         token: &str,
         p1: POS,
         p2: POS,
     ) -> POS {
         let context_index = word_index + 2;
-        let mut features = Vec::new();
-
-        features.push(pastelito_data::bias());
+        weights.clear();
         if let Ok(suffix) = token.try_into() {
-            features.push(pastelito_data::suffix(suffix));
+            weights.push(&Feature::Suffix(suffix));
         }
         if let Some(c) = token.chars().next().unwrap().as_ascii() {
-            features.push(pastelito_data::pref1(c.to_u8()));
+            weights.push(&Feature::Pref1(c.to_u8()));
         }
-        features.push(pastelito_data::iminus1tag(p1));
-        features.push(pastelito_data::iminus2tag(p2));
-        features.push(pastelito_data::itagplusiminus2tag(p1, p2));
+        weights.push(&Feature::IMinus1Tag(p1));
+        weights.push(&Feature::IMinus2Tag(p2));
+        weights.push(&Feature::ITagPlusIMinus2Tag(p1, p2));
         if let Some(word) = context.word(context_index) {
-            features.push(pastelito_data::iword(word));
-            features.push(pastelito_data::iminus1tagplusiword(p1, word));
+            weights.push(&Feature::IWord(word));
+            weights.push(&Feature::IMinus1TagPlusIWord(p1, word));
         }
         if let Some(word) = context.word(context_index - 1) {
-            features.push(pastelito_data::iminus1word(word));
-            features.push(pastelito_data::iminus1suffix(word.suffix()));
+            weights.push(&Feature::IMinus1Word(word));
+            weights.push(&Feature::IMinus1Suffix(word.suffix()));
         }
         if let Some(word) = context.word(context_index - 2) {
-            features.push(pastelito_data::iminus2word(word));
+            weights.push(&Feature::IMinus2Word(word));
         }
         if let Some(word) = context.word(context_index + 1) {
-            features.push(pastelito_data::iplus1word(word));
-            features.push(pastelito_data::iplus1suffix(word.suffix()));
+            weights.push(&Feature::IPlus1Word(word));
+            weights.push(&Feature::IPlus1Suffix(word.suffix()));
         }
         if let Some(word) = context.word(context_index + 2) {
-            features.push(pastelito_data::iplus2word(word));
+            weights.push(&Feature::IPlus2Word(word));
         }
-        let mut scores = HashMap::<POS, f32>::new();
+        let mut scores = self.model.initial_scores();
-        for feature in features {
-            if let Some(weights) = self.model.get_feature(&feature) {
-                for (pos, weight) in weights {
-                    scores
-                        .entry(*pos)
-                        .and_modify(|s| *s += *weight)
-                        .or_insert(*weight);
-                }
+        for w in weights.as_slice() {
+            for (pos, weight) in *w {
+                scores.update(*pos, *weight);
             }
         }
-        scores
-            .into_iter()
-            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
-            .map(|(pos, _)| pos)
-            .unwrap()
+        scores.max()
     }
```