Understanding the performance of a piece of code can sometimes be difficult and require different approaches. I recently went down the rabbit-hole of optimizing a piece of Rust code. My first hunch was that cache locality was an issue but cachegrind didn't reveal anything obvious. I then turned to LLVM's Machine Code Analyzer, llvm-mca, to view the same code from a different perspective.

I learned a lot, and here are my notes about llvm-mca.

llvm-mca

llvm-mca simulates the execution of code on a specific processor, aiming for cycle-level accuracy. It is focused on understanding the bottlenecks in the most performance-critical parts of the code, not the whole program.

How does it do this? When compiling code, LLVM uses a scheduling model to choose which instructions to emit and in what order they should be arranged. llvm-mca uses those same definitions to simulate the execution of code.

As this is a simulation, llvm-mca can be used across different processor architectures. For example, you can simulate x86 code on an ARM processor and vice versa. Of course, the simulation is only as accurate as LLVM's model.
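
Concretely, analyzing x86 assembly from an ARM machine looks something like this (a sketch: foo.s is a placeholder file, the target and CPU names depend on which backends your copy of LLVM was built with, and -mcpu=help should list the CPUs a given build knows about):

$ llvm-mca -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake foo.s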

Example

Let's use this self-contained code as an example, which calculates the solutions to a quadratic equation. Assume that the QuadraticEquation::solutions method is our bottleneck and we want to analyze it:

/// A quadratic equation of the form `ax² + bx + c = 0`
struct QuadraticEquation {
    a: f64,
    b: f64,
    c: f64,
}

impl QuadraticEquation {
    fn new(a: f64, b: f64, c: f64) -> Self {
        Self { a, b, c }
    }

    fn solutions(&self) -> Option<(f64, f64)> {
        let d = (self.b * self.b - 4.0 * self.a * self.c).sqrt();
        let r0 = (-self.b + d) / 2.0 * self.a;
        let r1 = (-self.b - d) / 2.0 * self.a;

        if r0.is_nan() || r1.is_nan() {
            None
        } else {
            // Sort solutions in ascending order
            Some(if r0 < r1 { (r0, r1) } else { (r1, r0) })
        }
    }
}

impl std::fmt::Display for QuadraticEquation {
    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(f, "{}x² + {}x + {}", self.a, self.b, self.c)
    }
}

fn main() {
    // Read the equation from the command line
    let equation = {
        let args = std::env::args()
            .skip(1)
            .map(|arg| arg.parse::<i32>().unwrap() as f64)
            .collect::<Vec<f64>>();

        QuadraticEquation::new(args[0], args[1], args[2])
    };

    // Calculate the solutions
    let solutions = equation.solutions();

    // Print 'em
    match solutions {
        Some(solutions) => {
            println!("The solutions to '{}' are:", equation);
            println!("  {}", solutions.0);
            println!("and");
            println!("  {}", solutions.1);
        }
        None => {
            println!("The equation '{}' has no real solutions", equation)
        }
    }
}

This gives the expected output:

$ cargo run --quiet -- 1 -1 -2
The solutions to '1x² + -1x + -2' are:
  -1
and
  2
$ cargo run --quiet -- 1 -1 2
The equation '1x² + -1x + 2' has no real solutions

llvm-mca parses and interprets assembly files, not compiled code. There are a couple of options for generating assembly from Rust code. cargo-show-asm is probably the most user-friendly but I'm going to stick with plain cargo to show how the various pieces fit together.
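
For comparison, cargo-show-asm collapses the steps below into a single command; the invocation looks roughly like this, although the exact flags depend on the version you have installed:

$ cargo asm --rust "qe::QuadraticEquation::solutions"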

We first need to tell rustc to output assembly by setting our RUSTFLAGS:

$ RUSTFLAGS="--emit asm" cargo build --release
   Compiling qe v0.1.0
    Finished `release` profile [optimized] target(s) in 0.44s
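
This writes the assembly to the target/release/deps directory with a metadata hash in the file name, so find is a convenient way to locate it:

$ find target/release/deps -name '*.s'
target/release/deps/qe-b832971ea91f2efc.s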

We can then demangle the symbols with rustfilt. Here is the beginning of the method we're interested in:

$ rustfilt -i target/release/deps/qe-b832971ea91f2efc.s | \
  grep -B1 -A12 ::solutions:
    .p2align        2
_qe::QuadraticEquation::solutions:
        .cfi_startproc
        ldp     d1, d0, [x1]
        fmul    d2, d0, d0
        fmov    d3, #-4.00000000
        fmul    d3, d1, d3
        ldr     d4, [x1, #16]
        fmul    d3, d3, d4
        fadd    d2, d2, d3
        fsqrt   d2, d2
        fneg    d3, d0
        fsub    d0, d2, d0
        fmov    d4, #0.50000000

In addition to the instructions generated by the compiler, the assembly also includes directives that tell the assembler how to align the code (.p2align) and how to generate Call Frame Information (CFI) (.cfi_startproc).

By default, llvm-mca will analyze every instruction in the assembly listing which usually isn't what we want. Placing # LLVM-MCA-BEGIN and # LLVM-MCA-END comments in the assembly allows us to limit the range of analysis. We could compile our crate, manually add these markers to the assembly, and then run llvm-mca.

Instead of that tedious process, though, the asm!(...) macro can add those markers for us:

impl QuadraticEquation {
    ...

    #[inline(never)]
    fn solutions(&self) -> Option<(f64, f64)> {
        unsafe {
            std::arch::asm!(";# LLVM-MCA-BEGIN");
        }

        let d = (self.b * self.b - 4.0 * self.a * self.c).sqrt();
        let r0 = (-self.b + d) / 2.0 * self.a;
        let r1 = (-self.b - d) / 2.0 * self.a;

        let ret = if r0.is_nan() || r1.is_nan() {
            None
        } else {
            // Sort solutions in ascending order
            Some(if r0 < r1 { (r0, r1) } else { (r1, r0) })
        };

        unsafe {
            std::arch::asm!(";# LLVM-MCA-END");
        }

        ret
    }
}

There are a few things to note. Firstly, we had to restructure the end of the method slightly to store the return value in ret: the method returns a value, so asm!(...) can't be the final statement.

Secondly, the marker comments must begin with a # character. The asm!(...) macro accepts either ; or # for comments but, in my testing, the compilation process normalizes both forms to ;. The ;# form ensures that the # character is preserved in the generated assembly.

Lastly, we're also going to add an #[inline(never)] attribute to the method. Without it, the compiler may inline the method into its callers, scattering multiple copies of it throughout the assembly listing, each with its own pair of LLVM-MCA-BEGIN and LLVM-MCA-END markers.

Stack frames

After rebuilding the code, the markers appear in the assembly alongside some unexpected changes:

_qe::QuadraticEquation::solutions:
        .cfi_startproc
        stp     x29, x30, [sp, #-16]!
        .cfi_def_cfa_offset 16
        mov     x29, sp
        .cfi_def_cfa w29, 16
        .cfi_offset w30, -8
        .cfi_offset w29, -16
        ; InlineAsm Start
        ;# LLVM-MCA-BEGIN
        ; InlineAsm End
        ldp     d1, d0, [x1]
        fmul    d2, d0, d0
        ...

The asm!(...) statements appear at the beginning and end of the method, so the markers are placed after the function prologue and before the function epilogue. This may or may not be what you want, depending on what you're trying to analyze. For this example, this is fine though.

The prologue is slightly different now, with some extra CFI directives and some manipulation of the stack pointer (sp). We are using inline assembly blocks to insert the marker comments, and their contents are largely opaque to the compiler. By default, the compiler assumes that hand-written assembly can do anything -- push values to the stack, call other functions, etc. -- so it generates a stack frame.

Passing the options(nostack) flag to asm!(...) prevents this:

    unsafe {
        std::arch::asm!(";# LLVM-MCA-BEGIN", options(nostack));
    }

We now have our markers in the assembly output without affecting anything else:

_qe::QuadraticEquation::solutions:
        .cfi_startproc
        ; InlineAsm Start
        ;# LLVM-MCA-BEGIN
        ; InlineAsm End
        ldp     d1, d0, [x1]
        fmul    d2, d0, d0
        fmov    d3, #-4.00000000
        fmul    d3, d1, d3
        ldr     d4, [x1, #16]
        fmul    d3, d3, d4
        fadd    d2, d2, d3
        fsqrt   d2, d2
        ...

Analysis

My copy of llvm-mca has scheduling information for ~70 different ARM processors. I'm specifying -mcpu=exynos-m5 to simulate an older Samsung Exynos processor, running one iteration, and disabling all output except for the timeline. The assembly output from rustc is not in the exact format that llvm-mca expects, so I'm also skipping any instructions that can't be parsed:

llvm-mca-19 \
    -mcpu=exynos-m5 \
    -iterations=1 \
    -timeline \
    -instruction-info=false \
    -resource-pressure=false \
    -dispatch-stats=false \
    -skip-unsupported-instructions=parse-failure \
    ./target/release/deps/qe-b832971ea91f2efc.s

This gives a timeline output which summarizes how the instructions are scheduled. Time flows horizontally while our instructions are arranged vertically. It takes 42 cycles to calculate the solutions to our quadratic equation on this particular processor:

Timeline view:
                    0123456789          0123456789
Index     0123456789          0123456789          01

[0,0]     DeeeeeeER .    .    .    .    .    .    ..   ldp	d1, d0, [x1]
[0,1]     D======eeeER   .    .    .    .    .    ..   fmul	d2, d0, d0
[0,2]     DeE--------R   .    .    .    .    .    ..   fmov	d3, #-4.00000000
[0,3]     D======eeeER   .    .    .    .    .    ..   fmul	d3, d1, d3
[0,4]     DeeeeeeE---R   .    .    .    .    .    ..   ldr	d4, [x1, #16]
[0,5]     D=========eeeER.    .    .    .    .    ..   fmul	d3, d3, d4
[0,6]     .D===========eeER   .    .    .    .    ..   fadd	d2, d2, d3
[0,7]     .D=============eeeeeeeeeeeeER .    .    ..   fsqrt	d2, d2
[0,8]     .D=====eeE------------------R .    .    ..   fneg	d3, d0
[0,9]     .D=========================eeER    .    ..   fsub	d0, d2, d0
[0,10]    .DeE--------------------------R    .    ..   fmov	d4, #0.50000000
[0,11]    .D===========================eeeER .    ..   fmul	d0, d0, d4
[0,12]    . D=============================eeeER   ..   fmul	d0, d1, d0
[0,13]    . D========================eeE------R   ..   fsub	d2, d3, d2
[0,14]    . D==========================eeeE---R   ..   fmul	d2, d2, d4
[0,15]    . D=============================eeeER   ..   fmul	d1, d1, d2
[0,16]    . D================================eeER ..   fcmp	d0, d1
[0,17]    . D==================================eER..   cset	w8, vc
[0,18]    .  D=================================eeER.   fcsel	d2, d1, d0, mi
[0,19]    .  D==================================eeER   fcsel	d0, d0, d1, mi

The timeline shows how each instruction flows through a number of stages, one character per cycle:

    D : instruction dispatched
    e : instruction executing
    E : instruction executed (results written back)
    R : instruction retired

In addition, some instructions can spend cycles stalled:

    = : already dispatched, waiting to be executed
    - : already executed, waiting to be retired

The overall shape of the timeline shows that our linear set of instructions is not so linear after all: they are executed out-of-order. On the first clock cycle, six instructions are dispatched. On the second clock cycle, another six instructions are dispatched. By the fourth clock cycle, all twenty instructions have been dispatched and are either executing or waiting to execute.

It's useful to compare the timeline against a high-level block diagram of the Exynos M5 processor to understand the flow of instructions. llvm-mca simulates the Execution Engine block in the center of the diagram. After decoding, up to six μOPs flow into the Dispatch Queue where they are sent to the appropriate execution unit. Most of the instructions in our example handle floating-point numbers so they will flow to the FP Cluster on the right:

[Figure: Exynos M5 block diagram, from WikiChip]

Various effects of out-of-order execution are visible in the first five instructions. The two loads and the fmov of a constant begin executing immediately; the two fmul instructions sit dispatched (=) until the ldp delivers their operands; and the fmov and ldr finish early but wait (-) to retire in program order:

[0,0]     DeeeeeeER .    .    .    .    .    .    ..   ldp	d1, d0, [x1]
[0,1]     D======eeeER   .    .    .    .    .    ..   fmul	d2, d0, d0
[0,2]     DeE--------R   .    .    .    .    .    ..   fmov	d3, #-4.00000000
[0,3]     D======eeeER   .    .    .    .    .    ..   fmul	d3, d1, d3
[0,4]     DeeeeeeE---R   .    .    .    .    .    ..   ldr	d4, [x1, #16]

We then reach the multiply/add/square-root core of the method:

[0,5]     D=========eeeER.    .    .    .    .    ..   fmul	d3, d3, d4
[0,6]     .D===========eeER   .    .    .    .    ..   fadd	d2, d2, d3
[0,7]     .D=============eeeeeeeeeeeeER .    .    ..   fsqrt	d2, d2

These instructions all have data dependencies on previous instructions so they spend a lot of cycles stalled waiting to execute. The timeline also shows the latency difference between multiplication (4 cycles) and square root (13 cycles).

As you might expect for such a simple piece of code, there isn't really anything to optimize here. Everything before and after our bottleneck — fsqrt — is dispatched optimally.
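
The timeline is only one of llvm-mca's reports. When the cause of a slowdown is less obvious, flags such as -bottleneck-analysis and -all-stats layer additional statistics on top of the simulation, for example (output omitted here):

llvm-mca-19 \
    -mcpu=exynos-m5 \
    -bottleneck-analysis \
    -skip-unsupported-instructions=parse-failure \
    ./target/release/deps/qe-b832971ea91f2efc.s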

Other processors

Does the timeline differ much between processors? Here is the timeline from simulating an ampere1 processor instead (the same command, with -mcpu=ampere1). The exynos-m5 is designed for consumer devices, while the ampere1 is designed for servers. The fsqrt instruction accounts for most of the difference between them, taking 13 cycles on the exynos-m5 and 63 cycles on the ampere1:

Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789          0
Index     0123456789          0123456789          0123456789          0123456789          0123456789          0123456789

[0,0]     DeeeeeER  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   ldp	d1, d0, [x0]
[0,1]     D=====eeeeeER  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   fmul	d2, d0, d0
[0,2]     DeeE--------R  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   fmov	d3, #-4.00000000
[0,3]     .D====eeeeeER  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   fmul	d3, d1, d3
[0,4]     .DeeeeE-----R  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   ldr	d4, [x0, #16]
[0,5]     .D=========eeeeeER  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   fmul	d3, d3, d4
[0,6]     .D==============eeER.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   fadd	d2, d2, d3
[0,7]     . D===============eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeER   .    .    .    .    .    .   fsqrt	d2, d2
[0,8]     . D====eeE-----------------------------------------------------------------------R   .    .    .    .    .    .   fneg	d3, d0
[0,9]     . D=============================================================================eeER .    .    .    .    .    .   fsub	d0, d2, d0
[0,10]    . DeeE-----------------------------------------------------------------------------R .    .    .    .    .    .   fmov	d4, #0.50000000
[0,11]    .  D==============================================================================eeeeeER .    .    .    .    .   fmul	d0, d0, d4
[0,12]    .  D===================================================================================eeeeeER .    .    .    .   fmul	d0, d1, d0
[0,13]    .  D============================================================================eeE----------R .    .    .    .   fsub	d2, d3, d2
[0,14]    .  D==============================================================================eeeeeE-----R .    .    .    .   fmul	d2, d2, d4
[0,15]    .   D==================================================================================eeeeeER .    .    .    .   fmul	d2, d1, d2
[0,16]    .   D=======================================================================================eeeeeER .    .    .   fcmp	d0, d2
[0,17]    .    D===========================================================================================eeeeeeeeeeeER.   fcsel	d1, d2, d0, mi
[0,18]    .    .D===========================================================================================eeeeeeeeeeeER   fcsel	d0, d0, d2, mi

Wrapping it all up

Manually adding the LLVM-MCA-BEGIN and LLVM-MCA-END markers to the code is tedious and error-prone. We can wrap up all the logic in a procedural macro to abstract away the details:

use proc_macro::{Span, TokenStream};
use quote::{quote, ToTokens as _};
use syn::{parse_quote, Item, ItemFn};

// Define a new macro, `llvm_mca`
#[proc_macro_attribute]
pub fn llvm_mca(attrs: TokenStream, input: TokenStream) -> TokenStream {
    // Get the element that this macro is applied to
    let function = match syn::parse(input) {
        Ok(Item::Fn(function)) => function,
        _ => {
            return syn::Error::new(
                Span::call_site().into(),
                "llvm_mca can only be applied to functions",
            )
            .to_compile_error()
            .into()
        }
    };

    // Take the original block and wedge it between the two markers while
    // handling the return value
    let original_block = function.block;
    let block: syn::Block = syn::parse(
        quote! {{
            unsafe {
                std::arch::asm!(";# LLVM-MCA-BEGIN", options(nostack));
            }
            let ret = #original_block;
            unsafe {
                std::arch::asm!(";# LLVM-MCA-END", options(nostack));
            }
            ret
        }}
        .into(),
    )
    .unwrap();

    // Add `#[inline(never)]` to the function attributes
    let attrs = function
        .attrs
        .into_iter()
        .chain(std::iter::once(parse_quote! {
            #[inline(never)]
        }))
        .collect();

    // Replace the attributes and block of the original function
    let function = ItemFn {
        attrs,
        block: Box::new(block),
        ..function
    };

    // Convert the new function into a token stream
    function.into_token_stream().into()
}

After wrapping the macro in a crate, functions tagged with #[llvm_mca] will include the LLVM-MCA-{BEGIN,END} markers in the generated assembly automatically.

We can now tag any method:

impl QuadraticEquation {
    #[llvm_mca]
    fn new(a: f64, b: f64, c: f64) -> Self {
        Self { a, b, c }
    }

    ...
}

Dump the assembly:

_qe::QuadraticEquation::new:
        .cfi_startproc
        ; InlineAsm Start
        ;# LLVM-MCA-BEGIN
        ; InlineAsm End
        stp     d0, d1, [x0]
        str     d2, [x0, #16]
        ; InlineAsm Start 
        ;# LLVM-MCA-END
        ; InlineAsm End
        ret     
        .cfi_endproc

And view the timeline:

Timeline view:
Index     0123

[0,0]     DeER   stp	d0, d1, [x0]
[0,1]     DeER   str	d2, [x0, #16]

If you'd like to follow along at home, I published the llvm-mca and llvm-mca-macros crates.

The llvm-mca docs also have great information about analyzing bottlenecks and diagnosing performance issues.