Understanding the performance of a piece of code can sometimes be difficult and require different approaches. I recently went down the rabbit-hole of optimizing a piece of Rust code. My first hunch was that cache locality was an issue, but `cachegrind` didn't reveal anything obvious. I then turned to LLVM's Machine Code Analyzer, `llvm-mca`, to view the same code from a different perspective. I learned a lot, and here are my notes about `llvm-mca`.
llvm-mca
`llvm-mca` simulates the execution of code on a specific processor, aiming for cycle-level accuracy. It is focused on understanding the bottlenecks in the most performance-critical parts of the code, not the whole program.
How does it do this? When compiling code, LLVM uses a scheduling model to choose which instructions to emit and in what order they should be arranged. `llvm-mca` uses those same definitions to simulate the execution of code.
As this is a simulation, `llvm-mca` can be used across different processor architectures. For example, you can simulate x86 code on an ARM processor and vice versa. Of course, the simulation is only as accurate as LLVM's model.
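For instance, an `llvm-mca` binary running on an x86-64 machine can analyze AArch64 assembly if you pass the target explicitly (the file name here is just a placeholder):

$ llvm-mca -mtriple=aarch64-unknown-linux-gnu -mcpu=neoverse-n1 foo.s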
Example
Let's use this self-contained code as an example, which calculates the solutions to a quadratic equation. Assume that the `QuadraticEquation::solutions` method is our bottleneck and we want to analyze it.
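A sketch of what that code could look like follows; the field names, the `Option<(f64, f64)>` return type, and the `main` wrapper are my assumptions rather than the post's exact code:

/// A quadratic equation of the form `ax² + bx + c = 0`
pub struct QuadraticEquation {
    a: f64,
    b: f64,
    c: f64,
}

impl QuadraticEquation {
    pub fn new(a: f64, b: f64, c: f64) -> QuadraticEquation {
        QuadraticEquation { a, b, c }
    }

    /// The real solutions, if any, via the quadratic formula.
    pub fn solutions(&self) -> Option<(f64, f64)> {
        let discriminant = self.b * self.b - 4.0 * self.a * self.c;
        if discriminant < 0.0 {
            return None;
        }
        let root = discriminant.sqrt();
        Some((
            (-self.b - root) / (2.0 * self.a),
            (-self.b + root) / (2.0 * self.a),
        ))
    }
}

fn main() {
    // Parse `a`, `b` and `c` from the command line.
    let args: Vec<f64> = std::env::args()
        .skip(1)
        .map(|arg| arg.parse().expect("expected a number"))
        .collect();
    let equation = QuadraticEquation::new(args[0], args[1], args[2]);
    match equation.solutions() {
        Some((x1, x2)) => println!(
            "The solutions for '{}x² + {}x + {}' are:\n{}\nand\n{}",
            args[0], args[1], args[2], x1, x2
        ),
        None => println!(
            "The equation '{}x² + {}x + {}' has no real solutions",
            args[0], args[1], args[2]
        ),
    }
}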
This gives the expected output:
$ cargo run --quiet -- 1 -1 -2
The solutions for '1x² + -1x + -2' are:
-1
and
2
$ cargo run --quiet -- 1 -1 2
The equation '1x² + -1x + 2' has no real solutions
`llvm-mca` parses and interprets assembly files, not compiled code. There are a couple of options for generating assembly from Rust code. `cargo-show-asm` is probably the most user-friendly, but I'm going to stick with plain `cargo` to show how the various pieces fit together.
We first need to tell `rustc` to output assembly by setting our `RUSTFLAGS`:
$ RUSTFLAGS="--emit asm" cargo build --release
Compiling qe v0.1.0
Finished `release` profile [optimized] target(s) in 0.44s
This dumps the assembly into the `target/release` directory, which we can then demangle with `rustfilt`. Here is the beginning of the method we're interested in:
$ rustfilt -i target/release/deps/qe-b832971ea91f2efc.s | \
grep -B1 -A12 ::solutions:
.p2align 2
_qe::QuadraticEquation::solutions:
.cfi_startproc
ldp d1, d0, [x1]
fmul d2, d0, d0
fmov d3, #-4.00000000
fmul d3, d1, d3
ldr d4, [x1, #16]
fmul d3, d3, d4
fadd d2, d2, d3
fsqrt d2, d2
fneg d3, d0
fsub d0, d2, d0
fmov d4, #0.50000000
In addition to the instructions generated by the compiler, the assembly also includes directives that tell the assembler how to align the code (`.p2align`) and how to generate Call Frame Information (CFI) (`.cfi_startproc`).
By default, `llvm-mca` will analyze every instruction in the assembly listing, which usually isn't what we want. Placing `# LLVM-MCA-BEGIN` and `# LLVM-MCA-END` comments in the assembly allows us to limit the range of analysis. We could compile our crate, manually add these markers to the assembly, and then run `llvm-mca`.
Instead of that tedious process, though, the `asm!(...)` macro can add those markers for us:
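Here's roughly what the marked-up method ends up looking like (the body below is my reconstruction based on the sketch above, not necessarily the post's exact code):

impl QuadraticEquation {
    #[inline(never)]
    pub fn solutions(&self) -> Option<(f64, f64)> {
        // The marker comments use the `;#` form so the `#` survives compilation.
        unsafe { std::arch::asm!(";# LLVM-MCA-BEGIN") };
        let discriminant = self.b * self.b - 4.0 * self.a * self.c;
        let ret = if discriminant < 0.0 {
            None
        } else {
            let root = discriminant.sqrt();
            Some((
                (-self.b - root) / (2.0 * self.a),
                (-self.b + root) / (2.0 * self.a),
            ))
        };
        unsafe { std::arch::asm!(";# LLVM-MCA-END") };
        // `asm!(...)` can't be the final statement of a method that returns
        // a value, so the result is stored in `ret` and returned here.
        ret
    }
}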
There are a few things to note. Firstly, we had to restructure the end of the method slightly to store the return value in `ret`. This method returns a value, so `asm!(...)` can't be the final statement.
Secondly, the marker comments must begin with a `#` character. The `asm!(...)` macro accepts either `;` or `#` for comments but, in my testing, the compilation process normalizes both forms to `;`. The `;#` form ensures that the `#` character is preserved in the generated assembly.
Lastly, we're also going to add an `#[inline(never)]` attribute to the method. This stops the compiler from inlining the method, which would otherwise produce multiple copies of it in the assembly listing, each with its own `LLVM-MCA-BEGIN` and `LLVM-MCA-END` markers.
Stack frames
After rebuilding the code, the markers appear in the assembly alongside some unexpected changes:
_qe::QuadraticEquation::solutions:
.cfi_startproc
stp x29, x30, [sp, #-16]!
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
.cfi_offset w30, -8
.cfi_offset w29, -16
; InlineAsm Start
;# LLVM-MCA-BEGIN
; InlineAsm End
ldp d1, d0, [x1]
fmul d2, d0, d0
...
The `asm!(...)` statements appear at the beginning and end of the method, so the markers are placed after the function prologue and before the function epilogue. This may or may not be what you want, depending on what you're trying to analyze. For this example, this is fine though.
The prologue is slightly different now, with some extra CFI directives and some manipulation of the stack pointer (`sp`). We are using inline assembly blocks to insert the marker comments, and their contents are largely opaque to the compiler. By default, the compiler assumes that your hand-written assembly can do anything -- push values to the stack, call other functions, etc. -- so it generates a stack frame.
Passing the `options(nostack)` flag to `asm!(...)` prevents this:

unsafe { std::arch::asm!(";# LLVM-MCA-BEGIN", options(nostack)) };
unsafe { std::arch::asm!(";# LLVM-MCA-END", options(nostack)) };
We now have our markers in the assembly output without affecting anything else:
_qe::QuadraticEquation::solutions:
.cfi_startproc
; InlineAsm Start
;# LLVM-MCA-BEGIN
; InlineAsm End
ldp d1, d0, [x1]
fmul d2, d0, d0
fmov d3, #-4.00000000
fmul d3, d1, d3
ldr d4, [x1, #16]
fmul d3, d3, d4
fadd d2, d2, d3
fsqrt d2, d2
...
Analysis
My copy of `llvm-mca` has scheduling information for ~70 different ARM processors. I'm specifying `-mcpu=exynos-m5` to simulate an older Samsung Exynos processor, running one iteration, and disabling all output except for the timeline. The assembly output from `rustc` is not in the exact format that `llvm-mca` expects, so I'm also skipping any instructions that can't be parsed:
llvm-mca-19 \
-mcpu=exynos-m5 \
-iterations=1 \
-timeline \
-instruction-info=false \
-resource-pressure=false \
-dispatch-stats=false \
-skip-unsupported-instructions=parse-failure \
./target/release/deps/qe-b832971ea91f2efc.s
This gives a timeline output which summarizes how the instructions are scheduled. Time flows horizontally while our instructions are arranged vertically. It takes 42 cycles to calculate the solutions to our quadratic equation on this particular processor:
Timeline view:
0123456789 0123456789
Index 0123456789 0123456789 01
[0,0] DeeeeeeER . . . . . . .. ldp d1, d0, [x1]
[0,1] D======eeeER . . . . . .. fmul d2, d0, d0
[0,2] DeE--------R . . . . . .. fmov d3, #-4.00000000
[0,3] D======eeeER . . . . . .. fmul d3, d1, d3
[0,4] DeeeeeeE---R . . . . . .. ldr d4, [x1, #16]
[0,5] D=========eeeER. . . . . .. fmul d3, d3, d4
[0,6] .D===========eeER . . . . .. fadd d2, d2, d3
[0,7] .D=============eeeeeeeeeeeeER . . .. fsqrt d2, d2
[0,8] .D=====eeE------------------R . . .. fneg d3, d0
[0,9] .D=========================eeER . .. fsub d0, d2, d0
[0,10] .DeE--------------------------R . .. fmov d4, #0.50000000
[0,11] .D===========================eeeER . .. fmul d0, d0, d4
[0,12] . D=============================eeeER .. fmul d0, d1, d0
[0,13] . D========================eeE------R .. fsub d2, d3, d2
[0,14] . D==========================eeeE---R .. fmul d2, d2, d4
[0,15] . D=============================eeeER .. fmul d1, d1, d2
[0,16] . D================================eeER .. fcmp d0, d1
[0,17] . D==================================eER.. cset w8, vc
[0,18] . D=================================eeER. fcsel d2, d1, d0, mi
[0,19] . D==================================eeER fcsel d0, d0, d1, mi
The timeline shows how instructions flow through a number of stages:
- D The instruction is dispatched to an execution unit.
- e The instruction is executing.
- E The instruction has finished executing.
- R The instruction has been retired. The result of the instruction is now visible in registers/memory/etc. Not every instruction is retired. For example, speculative execution can result in instructions not being retired if they are from a branch the processor didn't take.
In addition, some instructions can also be stalled:
- = The instruction can't start executing yet. It has a dependency on a previous instruction that has not finished executing yet.
- - The instruction has finished executing but can't be retired yet. It must wait until previous instructions have been retired.
The overall shape of the timeline shows that our linear set of instructions is not so linear after all — they are executed out-of-order. On the first clock cycle, five instructions are dispatched. On the second clock cycle, another six instructions are dispatched. By the fifth clock cycle, all of the instructions have been dispatched and are either executing or waiting to execute.
It's useful to compare the timeline against a high-level block diagram of the Exynos M5 processor to understand the flow of instructions. `llvm-mca` simulates the Execution Engine block in the center of the diagram. After decoding, up to six μOPs flow into the Dispatch Queue, where they are sent to the appropriate execution unit. Most of the instructions in our example handle floating-point numbers, so they will flow to the FP Cluster on the right:
[Image: Exynos M5 block diagram, from WikiChip]
Various effects of out-of-order execution are visible in the first five instructions:
[0,0] DeeeeeeER . . . . . . .. ldp d1, d0, [x1]
[0,1] D======eeeER . . . . . .. fmul d2, d0, d0
[0,2] DeE--------R . . . . . .. fmov d3, #-4.00000000
[0,3] D======eeeER . . . . . .. fmul d3, d1, d3
[0,4] DeeeeeeE---R . . . . . .. ldr d4, [x1, #16]
- `[0,0]`: Load two values from memory into `d1` and `d0`. This can start executing (e) immediately.
- `[0,1]`: Square `d0` and store the result in `d2`. This can't start executing until the load from `[0,0]` has finished though, so it is initially stalled (=).
- `[0,2]`: Store `-4` in `d3`. Even though this instruction is after `[0,1]`, it can start executing because it has no dependencies. However, it can't be retired until `[0,1]` has finished executing, to maintain the illusion that our instructions are executed in-order.
- `[0,3]`: Multiply `d3` and `d1` and store the result in `d3`. Similar to `[0,1]`, this is stalled until the load in `[0,0]` has finished executing.
- `[0,4]`: Load our third value from memory into `d4`. This can start executing immediately. Again, it can't be retired until previous instructions have finished executing.
We then reach the multiply/add/square-root core of the method:
[0,5] D=========eeeER. . . . . .. fmul d3, d3, d4
[0,6] .D===========eeER . . . . .. fadd d2, d2, d3
[0,7] .D=============eeeeeeeeeeeeER . . .. fsqrt d2, d2
These instructions all have data dependencies on previous instructions, so they spend a lot of cycles stalled waiting to execute. The timeline also shows the speed difference between multiplication (4 cycles) and square root (13 cycles) on this processor.
As you might expect for such a simple piece of code, there isn't really anything to optimize here. Everything before and after our bottleneck — `fsqrt` — is dispatched optimally.
Other processors
Does the timeline differ much between processors? Here is the timeline from simulating an `ampere1` processor instead. The `exynos-m5` is designed for consumer devices, while the `ampere1` is designed for servers. The `fsqrt` instruction accounts for most of the difference between them — 13 cycles for `exynos-m5` and 63 cycles for `ampere1`:
Timeline view:
0123456789 0123456789 0123456789 0123456789 0123456789 0
Index 0123456789 0123456789 0123456789 0123456789 0123456789 0123456789
[0,0] DeeeeeER . . . . . . . . . . . . . . . . . . . . . ldp d1, d0, [x0]
[0,1] D=====eeeeeER . . . . . . . . . . . . . . . . . . . . fmul d2, d0, d0
[0,2] DeeE--------R . . . . . . . . . . . . . . . . . . . . fmov d3, #-4.00000000
[0,3] .D====eeeeeER . . . . . . . . . . . . . . . . . . . . fmul d3, d1, d3
[0,4] .DeeeeE-----R . . . . . . . . . . . . . . . . . . . . ldr d4, [x0, #16]
[0,5] .D=========eeeeeER . . . . . . . . . . . . . . . . . . . fmul d3, d3, d4
[0,6] .D==============eeER. . . . . . . . . . . . . . . . . . . fadd d2, d2, d3
[0,7] . D===============eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeER . . . . . . fsqrt d2, d2
[0,8] . D====eeE-----------------------------------------------------------------------R . . . . . . fneg d3, d0
[0,9] . D=============================================================================eeER . . . . . . fsub d0, d2, d0
[0,10] . DeeE-----------------------------------------------------------------------------R . . . . . . fmov d4, #0.50000000
[0,11] . D==============================================================================eeeeeER . . . . . fmul d0, d0, d4
[0,12] . D===================================================================================eeeeeER . . . . fmul d0, d1, d0
[0,13] . D============================================================================eeE----------R . . . . fsub d2, d3, d2
[0,14] . D==============================================================================eeeeeE-----R . . . . fmul d2, d2, d4
[0,15] . D==================================================================================eeeeeER . . . . fmul d2, d1, d2
[0,16] . D=======================================================================================eeeeeER . . . fcmp d0, d2
[0,17] . D===========================================================================================eeeeeeeeeeeER. fcsel d1, d2, d0, mi
[0,18] . .D===========================================================================================eeeeeeeeeeeER fcsel d0, d0, d2, mi
Wrapping it all up
Manually adding the `LLVM-MCA-BEGIN` and `LLVM-MCA-END` markers to the code is tedious and error-prone. We can wrap up all the logic in a procedural macro to abstract away the details:
use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, ItemFn};

// Define a new macro, `llvm_mca`
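What follows is a sketch of how such an attribute macro could work. It is my reconstruction, assuming a proc-macro crate that depends on `syn` (with the "full" feature) and `quote`, with the usual imports filled in above; the published crate may well differ:

#[proc_macro_attribute]
pub fn llvm_mca(_attr: TokenStream, item: TokenStream) -> TokenStream {
    // Parse the annotated function so its body can be rewritten.
    let mut func = parse_macro_input!(item as ItemFn);
    let body = func.block;

    // Wrap the original body so its value is only returned after the END
    // marker, then emit BEGIN, the body, and END. `options(nostack)` avoids
    // the extra stack frame discussed earlier. (An early `return` inside the
    // body would skip the END marker in this simple version.)
    func.block = Box::new(syn::parse_quote!({
        unsafe { ::std::arch::asm!(";# LLVM-MCA-BEGIN", options(nostack)) };
        let ret = #body;
        unsafe { ::std::arch::asm!(";# LLVM-MCA-END", options(nostack)) };
        ret
    }));

    quote!(#func).into()
}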
After wrapping the macro in a crate, functions tagged with `#[llvm_mca]` will include the `LLVM-MCA-{BEGIN,END}` markers in the generated assembly automatically.
We can now tag any method:
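For example, we can tag the constructor (its signature here is inferred from the assembly below rather than taken from the original listing):

impl QuadraticEquation {
    #[llvm_mca]
    #[inline(never)] // keep a single, out-of-line copy, as before
    pub fn new(a: f64, b: f64, c: f64) -> QuadraticEquation {
        QuadraticEquation { a, b, c }
    }
}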
Dump the assembly:
_qe::QuadraticEquation::new:
.cfi_startproc
; InlineAsm Start
;# LLVM-MCA-BEGIN
; InlineAsm End
stp d0, d1, [x0]
str d2, [x0, #16]
; InlineAsm Start
;# LLVM-MCA-END
; InlineAsm End
ret
.cfi_endproc
And view the timeline:
Timeline view:
Index 0123
[0,0] DeER stp d0, d1, [x0]
[0,1] DeER str d2, [x0, #16]
If you'd like to follow along at home, I published the `llvm-mca` and `llvm-mca-macros` crates:

- `llvm-mca-macros` contains the `#[llvm_mca]` procedural macro.
- `llvm-mca` contains the `llvm_mca_begin!(...)` and `llvm_mca_end!(...)` macros, which give finer control over placing markers.
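I haven't reproduced their signatures here, but usage would look roughly like this; treat the call syntax as an assumption and check the crate docs:

fn sum_of_square_roots(values: &[f64]) -> f64 {
    let mut total = 0.0;
    // Hypothetical usage: wrap just the hot loop in markers rather than
    // the whole function.
    llvm_mca::llvm_mca_begin!();
    for value in values {
        total += value.sqrt();
    }
    llvm_mca::llvm_mca_end!();
    total
}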
The `llvm-mca` docs also have great information about analyzing bottlenecks and diagnosing performance issues.