LLVM and GC

In this blog, I will introduce how I integrate a precise, relocation GC with LLVM using its statepoint api. We will also talk about some optimization techniques, debug support and so on.

LLVM gc.root and statepoint

LLVM statepoint is an eperimental api to support precise relocation GC. Comparing with the old gc.root api, it has the following advantages:

can be generate by builtin pass
works well with other optimization passes

Also, it has some disadvantages:

not stable, which means it may change in the future, and have some bugs
not well documented--makes it really hard to use
cannot customize the stack map -- at least in LLVM 16

If you have tried to use gc.root to support a relocation GC, you will find that it is really hard to cover all stack roots. And it is even harder to play with all those derived pointers--A relocation GC may change the address of objects during collection, so after any points that may trigger a GC(those points were called safepoints), we have to reload all GC pointers that we may use later from stack or register, include base pointers and derived pointers(base pointer with offset, like the result of getelementptr of a base pointer).

The gc.root api is so difficult to get right for a compiler frontend, that I've spent days and nights to make it work. And until its retirement, it still has some bugs that I never figured out(which crash the program from time to time). Also, it's performance is really poor, probably because I'm generating it from frontend, and those instructions prevent many optimizations in LLVM.

The statepoint api, on the other hand, doesn't need frontend to do so much. There's a builtin pass called RewriteStatepointsForGC, which can help us to generate statepoint instructions.

LLVM statepoint usage

The statepoint api is a combination of gc.statepoint \ gc.result and gc.relocate instructions. The gc.statepoint instruction is used to mark a safepoint, it's basically a wrapper around a call instruction, but it doesn't directly return the result of the call, instead, it returns a token the pointers that may relocate after the statepoint is placed in the gc-live operand bundle just at the end of the instruction. The gc.result instruction is used to get the result of the call by accepting the token as its arguments. The gc.relocate instruction is used to get the relocated address of a GC pointer, it accepts the token and the index of the pointer in the gc-live operand bundle as its arguments, and produce the relocated pointer. The exaple usage is as follow[1]:

%statepoint_token46 = call token (i64, i32, ptr, i32, i32, ...) @llvm.experimental.gc.statepoint.p0(i64 2882400000, i32 0, ptr elementtype(ptr addrspace(1) (i64, i8, i64)) @DioGC__malloc, i32 3, i32 0, i64 32, i8 2, i64 %3, i32 0, i32 0) [ "gc-live"(ptr addrspace(1) %heapptr_retvalue3.relocated45) ]
%heapptr_string47 = call ptr addrspace(1) @llvm.experimental.gc.result.p1(token %statepoint_token46)
%heapptr_retvalue3.relocated48 = call coldcc ptr addrspace(1) @llvm.experimental.gc.relocate.p1(token %statepoint_token46, i32 0, i32 0) ; (%heapptr_retvalue3.relocated45, %heapptr_retvalue3.relocated45)

Apparently, the gc.statepoint instruction is really complicated, so I won't recommend you to generate it by yourself. Instead, you shall use the builtin pass RewriteStatepointsForGC, which treats all call instructions as safepoints, and generate gc.statepoint instructions for them. The RewriteStatepointsForGC pass will analyze the liveness of all GC pointers and generate gc.relocate instructions as well.

How to integrate a GC with LLVM statepoint

Now, we know the basis about the statepoint api, but how could we use it in GC?

GC strategy

The first thing we need to do is to choose or implement a GCStraegy. The GCStraegy can be used to distinguish the GC pointers from other pointers. However, as the statepoint api is still experimental, before LLVM 18, only the builtin GCStraegy statepoint-example and coreclr can be used with RewriteStatepointsForGC pass. So in this blog, we will choose the statepoint-example GCStraegy as an example.

Frontend support

Both of two builtin GCStraegies assume that all GC pointers are in address space 1, so we need to make sure that all GC pointers are correctly annotated in the frontend.

[!TIP] Should I use stack alloca?

Of course you can, but I don't recommend you to do so. The RewriteStatepointsForGC pass will generate stack roots automatically for all GC pointers it identified, so you don't need to do it by yourself. If you still did so, and store the GC pointers in stack alloca, remember your alloca won't be treated as a GC root, so you must manually tell your GC about this root or the heap pointer in it won't be correct after relocation. Generally, you should never use stack alloca when you are using LLVM statepoint api. The allocation added by RewriteStatepointsForGC pass is enough for you, and can be optimized better by LLVM.

Parse stackmap

After you have successfully generated the llvm ir and run the RewriteStatepointsForGC pass, next you need to implement a stackmap parser.

The llvm stackmap is a binary format that contains all the information about the GC roots at safepoints, you need to parse it to something like a hashtable for runtime usage. The format is pretty simple[2] [3], you can also refer to my stackmap parser[4] for more details. Basically, the parsed stackmap will become a hashtable with the safepoint(call site) address as the key, and the GC roots offset(from sp) as the value.

Use stackmap during GC

Now, you have all the information you need at runtime, theoretically, you can use it to locate all GC roots, but how?

To locate all GC roots, you need to walk the stack, and find all the call site addresses, then use the stackmap to find all the GC roots at those call sites.

I'll provide you two ways to do so, one is easy to implement but not so fast, the other is more hacky but much more efficient.

Use an unwind library to walk the stack

The first way is to use an unwind library to walk the stack, there are many choices, like libunwind, libbacktrace, libgcc, etc. I'm using Rust so I choose the backtrace-rs crate, which is merely a wrapper around those C libraries.

Those libraries can walk the stack by themselves, with some unwinding information in the binary (like eh_frame) or register, the can provide you every stack frame's ip and sp value. The key of our parsed stackmap hashtable is expected to be the same as the ip register at call site, so it's really easy to use them to locate all stack roots. But most of them are pretty heavy, and may introduse some synchronization overheads(I believe there's a global lock in backtrace-rs), so if your GC is designed to be parallel, you may not want to use them.

Manually walk the stack

If you look at the stackmap format carefully, you will find that besides the root offset, it also contains the stack size at the call site. With this information, and some basic knowledge about the stack frame layout, we can manually walk the stack using the sp register's value at the beginning of the GC.

Stack frame layout & stack walking algorithm

The stack frame layout is different on different platforms, but they are all similar if you are using C calling convention. The stack frame layout of a function on aarch64 is like this:

+-------------------+ high address
|        main       |
|        ....       |
|                   | <- previous sp  ----
|    return addr    |                   ↑
|  func a frame...  |               frame size
|        ....       |                   ↓
|                   | <- sp  // safe point call site, before the GC logic was called, the sp is poining here
|    return addr    |        // return address, key of the stackmap hashtable
|   GC frames  ...  |        // GC function frames

The algorithm to walk the stack is pretty simple:

\begin{aligned}
&\text{let } sp = \text{current sp} \\
&\text{loop } \\
&\quad \text{let } ip = *(sp - 8) \\
&\quad \text{if } ip \text{ is not in stackmaptable} \\
&\quad\quad \text{break} \\
&\quad \text{let } safepoint = \text{stackmaptable[ip]} \\
&\quad \text{for each root in safepoint.roots} \\
&\quad\quad \text{let } root = \text{sp + root.offset} \\
&\quad\quad \text{do GC logic} \\
&\quad sp = sp + safepoint.stack\_size \\
&\text{end loop}
\end{aligned}

The stack layout on x86_64 is a little bit different, it's actual frame size is frame_size + 8, but the algorithm is the same.

[!NOTE] Linking and stackmap

The stackmap is really low level, and may not work well with some linker options. For example, if you are targeting x86_64 windows, you must specify the /INCREMENTAL:NO option to the linker, or the stackmap would not work correctly. The reason is that the incremental linking will change the function address, but the stackmap is generated before linking, so it will become invalid after linking.

[!TIP] How to get sp pointer value in LLVM

You can use inline assembly to get the sp pointer value in LLVM, for example, on aarch64:

%sp = call ptr asm alignstack "mov $0, sp", "=r"()

On x86_64:

%sp = call ptr asm alignstack "mov %rsp, $0", "=r"()

[!NOTE] Call convention

The stack walking algorithm above is based on C calling convention, if you are using other calling convention, you may need to adjust the algorithm a little bit. I personally recommend you to use C calling convention for it's simplicity and compatibility.

Optimization

LLVM's most optimization passes are not aware of the statepoint api, so we should do as much optimization as possible before the RewriteStatepointsForGC pass.

LLVM has a builtin capture tracking pass, which is just like the escape analysis in Go, it can automatically determine whether a pointer is escaping or not, and remove unnecessary heap allocations, but you need to give it some hints. Basically, add allockind("alloc")[5] to your gcmalloc function would be enough.

Debug support

The relocation GC is really hard to debug, because the GC may move things around during collection.

I recommend you to disable the relocation during debugging, and only enable it when you are sure that your program is correct. Also, I've wrote a simple pass [6] to remove the relocation instructions, which remove the usage of relocated pointers. It's based on the origin StripGCRelocates pass, but it make sure the stackmap is still valid after stripping.

Statepoint Bugs

There's several bugs about passing large struct as argument or return value:

https://github.com/llvm/llvm-project/issues/75162

https://github.com/llvm/llvm-project/issues/74612

Conclusion

The LLVM statepoint api is really hard to use, and I've really learned a lot during the integration. I hope this blog can help you to integrate your GC with LLVM.

本文章使用limfx的vscode插件快速发布