In this blog, I will introduce how I integrate a precise, relocation GC with LLVM using its statepoint api. We will also talk about some optimization techniques, debug support and so on.
LLVM statepoint is an eperimental api to support precise relocation GC. Comparing
with the old gc.root
api, it has the following advantages:
Also, it has some disadvantages:
If you have tried to use gc.root
to support a relocation GC,
you will find that it is really hard to cover all stack roots.
And it is even harder to play with all those derived pointers--A
relocation GC may change the address of objects during collection,
so after any points that may trigger a GC(those points were called safepoints),
we have to reload all GC pointers that we may use later from stack or register, include base pointers and
derived pointers(base pointer with offset, like the result of getelementptr of a base pointer).
The gc.root
api is so difficult to get right for a compiler frontend, that I've
spent days and nights to make it work. And until its retirement, it still has some
bugs that I never figured out(which crash the program from time to time). Also, it's
performance is really poor, probably because I'm generating it from frontend, and
those instructions prevent many optimizations in LLVM.
The statepoint api, on the other hand, doesn't need frontend to do so much. There's a
builtin pass called RewriteStatepointsForGC
, which can help us to generate statepoint
instructions.
The statepoint
api is a combination of gc.statepoint
\ gc.result
and gc.relocate
instructions. The gc.statepoint
instruction is used to mark a safepoint, it's basically
a wrapper around a call instruction, but it doesn't directly return the result of the
call, instead, it returns a token
the pointers that may relocate after the statepoint is placed in the gc-live
operand bundle just at the end of the instruction. The gc.result
instruction is used to get the
result of the call by accepting the token
as its arguments. The gc.relocate
instruction
is used to get the relocated address of a GC pointer, it accepts the token
and the index
of the pointer in the gc-live
operand bundle as its arguments, and produce the relocated
pointer. The exaple usage is as follow[1]:
%statepoint_token46 = call token (i64, i32, ptr, i32, i32, ...) @llvm.experimental.gc.statepoint.p0(i64 2882400000, i32 0, ptr elementtype(ptr addrspace(1) (i64, i8, i64)) @DioGC__malloc, i32 3, i32 0, i64 32, i8 2, i64 %3, i32 0, i32 0) [ "gc-live"(ptr addrspace(1) %heapptr_retvalue3.relocated45) ]
%heapptr_string47 = call ptr addrspace(1) @llvm.experimental.gc.result.p1(token %statepoint_token46)
%heapptr_retvalue3.relocated48 = call coldcc ptr addrspace(1) @llvm.experimental.gc.relocate.p1(token %statepoint_token46, i32 0, i32 0) ; (%heapptr_retvalue3.relocated45, %heapptr_retvalue3.relocated45)
Apparently, the gc.statepoint
instruction is really complicated, so I won't recommend
you to generate it by yourself. Instead, you shall use the builtin pass RewriteStatepointsForGC
,
which treats all call instructions as safepoints, and generate gc.statepoint
instructions
for them. The RewriteStatepointsForGC
pass will analyze the liveness of all GC pointers
and generate gc.relocate
instructions as well.
Now, we know the basis about the statepoint api, but how could we use it in GC?
The first thing we need to do is to choose or implement a GCStraegy. The GCStraegy
can be used to distinguish the GC pointers from other pointers. However, as the
statepoint api is still experimental, before LLVM 18, only the builtin GCStraegy
statepoint-example
and coreclr
can be used with RewriteStatepointsForGC
pass.
So in this blog, we will choose the statepoint-example
GCStraegy as an example.
Both of two builtin GCStraegies assume that all GC pointers are in address space 1, so we need to make sure that all GC pointers are correctly annotated in the frontend.
[!TIP] Should I use stack alloca?
Of course you can, but I don't recommend you to do so. The
RewriteStatepointsForGC
pass will generate stack roots automatically for all GC pointers it identified, so you don't need to do it by yourself. If you still did so, and store the GC pointers in stack alloca, remember your alloca won't be treated as a GC root, so you must manually tell your GC about this root or the heap pointer in it won't be correct after relocation. Generally, you should never use stack alloca when you are using LLVM statepoint api. The allocation added byRewriteStatepointsForGC
pass is enough for you, and can be optimized better by LLVM.
After you have successfully generated the llvm ir and run the RewriteStatepointsForGC
pass, next you need to implement a stackmap parser.
The llvm stackmap is a binary format that contains all the information about the
GC roots at safepoints, you need to parse it to something like a hashtable for runtime usage.
The format is pretty simple[2] [3], you can also refer to
my stackmap parser[4] for more details. Basically, the parsed stackmap will
become a hashtable with the safepoint(call site) address as the key, and the
GC roots offset(from sp
) as the value.
Now, you have all the information you need at runtime, theoretically, you can use it to locate all GC roots, but how?
To locate all GC roots, you need to walk the stack, and find all the call site addresses, then use the stackmap to find all the GC roots at those call sites.
I'll provide you two ways to do so, one is easy to implement but not so fast, the other is more hacky but much more efficient.
The first way is to use an unwind library to walk the stack, there are many
choices, like libunwind, libbacktrace, libgcc, etc. I'm using Rust so I choose the
backtrace-rs
crate, which is merely a wrapper around those C libraries.
Those libraries can walk the stack by themselves, with some unwinding information in
the binary (like eh_frame) or register, the can provide you every stack frame's ip
and sp
value. The key of our parsed stackmap hashtable is expected to be the same as the
ip
register at call site, so it's really easy to use them to locate all
stack roots. But most of them are pretty heavy, and
may introduse some synchronization overheads(I believe there's a global lock in
backtrace-rs
), so if your GC is designed to be parallel, you may not want to use
them.
If you look at the stackmap format carefully, you will find that besides the root offset,
it also contains the stack size at the call site. With this information, and some basic
knowledge about the stack frame layout, we can manually walk the stack using the
sp
register's value at the beginning of the GC.
The stack frame layout is different on different platforms, but they are all similar if you are using C calling convention. The stack frame layout of a function on aarch64 is like this:
+-------------------+ high address
| main |
| .... |
| | <- previous sp ----
| return addr | ↑
| func a frame... | frame size
| .... | ↓
| | <- sp // safe point call site, before the GC logic was called, the sp is poining here
| return addr | // return address, key of the stackmap hashtable
| GC frames ... | // GC function frames
The algorithm to walk the stack is pretty simple:
\begin{aligned}
&\text{let } sp = \text{current sp} \\
&\text{loop } \\
&\quad \text{let } ip = *(sp - 8) \\
&\quad \text{if } ip \text{ is not in stackmaptable} \\
&\quad\quad \text{break} \\
&\quad \text{let } safepoint = \text{stackmaptable[ip]} \\
&\quad \text{for each root in safepoint.roots} \\
&\quad\quad \text{let } root = \text{sp + root.offset} \\
&\quad\quad \text{do GC logic} \\
&\quad sp = sp + safepoint.stack\_size \\
&\text{end loop}
\end{aligned}
The stack layout on x86_64 is a little bit different, it's actual frame size is
frame_size + 8
, but the algorithm is the same.
[!NOTE] Linking and stackmap
The stackmap is really low level, and may not work well with some linker options. For example, if you are targeting x86_64 windows, you must specify the
/INCREMENTAL:NO
option to the linker, or the stackmap would not work correctly. The reason is that the incremental linking will change the function address, but the stackmap is generated before linking, so it will become invalid after linking.
[!TIP] How to get sp pointer value in LLVM
You can use inline assembly to get the sp pointer value in LLVM, for example, on aarch64:
%sp = call ptr asm alignstack "mov $0, sp", "=r"()
On x86_64:
%sp = call ptr asm alignstack "mov %rsp, $0", "=r"()
[!NOTE] Call convention
The stack walking algorithm above is based on C calling convention, if you are using other calling convention, you may need to adjust the algorithm a little bit. I personally recommend you to use C calling convention for it's simplicity and compatibility.
LLVM's most optimization passes are not aware of the statepoint api, so
we should do as much optimization as possible before the RewriteStatepointsForGC
pass.
LLVM has a builtin capture tracking pass, which is just like the escape analysis
in Go, it can automatically determine whether a pointer is escaping or not, and remove
unnecessary heap allocations, but you need to give it some hints. Basically, add
allockind("alloc")
[5] to your gcmalloc
function would be enough.
The relocation GC is really hard to debug, because the GC may move things around during collection.
I recommend you to disable the relocation during debugging, and only enable it
when you are sure that your program is correct. Also, I've wrote a simple
pass [6] to remove the relocation instructions, which remove the usage of relocated
pointers. It's based on the origin StripGCRelocates
pass, but it make sure
the stackmap is still valid after stripping.
There's several bugs about passing large struct as argument or return value:
https://github.com/llvm/llvm-project/issues/75162
https://github.com/llvm/llvm-project/issues/74612
The LLVM statepoint api is really hard to use, and I've really learned a lot during the integration. I hope this blog can help you to integrate your GC with LLVM.
本文章使用limfx的vscode插件快速发布