Suggest using preserve_none to improve tail call optimization in interpreter #601

@yangyi625

Description

I’ve been exploring rv32emu's interpreter mode and found that the RVOP macro is designed to chain instruction handlers through MUST_TAIL tail calls:

MUST_TAIL return next->impl(rv, next, cycle, PC);

In practice, however, the compiler often emits a full stack frame and callee-saved register spills in these handlers, which erodes the benefit of tail call optimization (TCO). This happens even though the control flow would otherwise be eligible for a plain tail jump.
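To make the dispatch pattern concrete, here is a minimal sketch of handlers chained through a MUST_TAIL call. It is not rv32emu's actual code: the `toy_*` types, handlers, and `run_toy_chain` are hypothetical stand-ins, and the MUST_TAIL definition below is an assumption that falls back to an ordinary call when the compiler lacks `musttail`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: use musttail where available, otherwise expand to
 * nothing (an ordinary call with no tail-call guarantee). */
#ifndef MUST_TAIL
#  if defined(__has_attribute)
#    if __has_attribute(musttail)
#      define MUST_TAIL __attribute__((musttail))
#    endif
#  endif
#endif
#ifndef MUST_TAIL
#  define MUST_TAIL
#endif

/* Hypothetical stand-ins for the emulator state and decoded instruction. */
typedef struct toy_rv toy_rv_t;
typedef struct toy_insn toy_insn_t;
typedef bool (*toy_impl_t)(toy_rv_t *rv, const toy_insn_t *ir,
                           uint64_t cycle, uint32_t PC);

struct toy_rv {
    uint32_t X[32];
    uint32_t PC;
    uint64_t cycle;
};

struct toy_insn {
    toy_impl_t impl;
    const toy_insn_t *next;
    uint8_t rd, rs1;
    int32_t imm;
};

/* Leaf-like handler: no non-tail calls, so nothing must survive a call. */
static bool toy_addi(toy_rv_t *rv, const toy_insn_t *ir,
                     uint64_t cycle, uint32_t PC)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + (uint32_t) ir->imm;
    const toy_insn_t *next = ir->next;
    MUST_TAIL return next->impl(rv, next, cycle + 1, PC + 4);
}

/* Terminator: writes the threaded state back and ends the chain. */
static bool toy_halt(toy_rv_t *rv, const toy_insn_t *ir,
                     uint64_t cycle, uint32_t PC)
{
    (void) ir;
    rv->cycle = cycle;
    rv->PC = PC;
    return true;
}

/* Runs "addi x1, x0, 40; addi x1, x1, 2; halt" and returns x1. */
uint32_t run_toy_chain(void)
{
    toy_rv_t rv = {.PC = 0};
    toy_insn_t prog[3] = {
        {.impl = toy_addi, .rd = 1, .rs1 = 0, .imm = 40},
        {.impl = toy_addi, .rd = 1, .rs1 = 1, .imm = 2},
        {.impl = toy_halt},
    };
    prog[0].next = &prog[1];
    prog[1].next = &prog[2];

    bool ok = prog[0].impl(&rv, &prog[0], 0, 0);
    assert(ok && rv.cycle == 2 && rv.PC == 8);
    (void) ok;
    return rv.X[1];
}
```

Because every handler shares one signature, each `MUST_TAIL return` can in principle compile to a single indirect jump, with `cycle` and `PC` staying in argument registers across the whole chain.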

For example, in a simple handler like do_addi, the compiler emits a proper tail call via jmp *%rax, and avoids saving any callee-saved registers:

af38:   ff e0    jmp *%rax   # tail call to next->impl

However, do_lw contains a non-tail call (e.g., a memory access through a function pointer). Values such as rv and the next instruction pointer must stay live across that call, so the compiler saves callee-saved registers and sets up a full frame:

aaa0:   55        push   %rbp
aaa1:   41 57     push   %r15
...
ab75:   c3        ret

This overhead defeats the benefits of using MUST_TAIL, especially in a hot loop of instruction dispatch.

## Suggestion

Clang provides the __attribute__((preserve_none)) attribute (available since 19.1.0), which changes the calling convention so that the callee does not have to preserve any general-purpose registers; they are all treated as caller-saved. It’s suitable when the function is only ever tail-called and doesn’t return via ret.

This attribute allows better register usage and avoids unnecessary stack setup around tail calls, which makes it a good fit for interpreter dispatch functions.
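As a rough sketch of how this could be applied, the example below marks both the handlers and the handler pointer type with a PRESERVE_NONE macro. This is not the project's code: the `pn_*` names, the `mem_read_w` callback, and `run_pn_lw` are hypothetical, and both macros are assumptions that expand to nothing on compilers without the attributes. The calling convention is part of the function type, so the pointer typedef has to carry it as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed feature guards: fall back to plain calls on older compilers. */
#if defined(__has_attribute)
#  if __has_attribute(preserve_none)
#    define PRESERVE_NONE __attribute__((preserve_none))
#  endif
#  if __has_attribute(musttail) && !defined(MUST_TAIL)
#    define MUST_TAIL __attribute__((musttail))
#  endif
#endif
#ifndef PRESERVE_NONE
#  define PRESERVE_NONE
#endif
#ifndef MUST_TAIL
#  define MUST_TAIL
#endif

typedef struct pn_rv pn_rv_t;
typedef struct pn_insn pn_insn_t;

/* The convention must appear on the pointer type too, otherwise the
 * caller and callee of the tail call would disagree. */
typedef bool (PRESERVE_NONE *pn_impl_t)(pn_rv_t *rv, const pn_insn_t *ir,
                                        uint64_t cycle, uint32_t PC);

struct pn_rv {
    uint32_t X[32];
    uint32_t PC;
    uint64_t cycle;
    /* Hypothetical memory callback using the default calling convention. */
    uint32_t (*mem_read_w)(pn_rv_t *rv, uint32_t addr);
    uint32_t mem[4];
};

struct pn_insn {
    pn_impl_t impl;
    const pn_insn_t *next;
    uint8_t rd, rs1;
    int32_t imm;
};

static uint32_t pn_read_w(pn_rv_t *rv, uint32_t addr)
{
    return rv->mem[(addr >> 2) & 3]; /* toy 4-word memory */
}

/* Handler with a non-tail call followed by a tail call, like do_lw. */
PRESERVE_NONE static bool pn_lw(pn_rv_t *rv, const pn_insn_t *ir,
                                uint64_t cycle, uint32_t PC)
{
    uint32_t addr = rv->X[ir->rs1] + (uint32_t) ir->imm;
    rv->X[ir->rd] = rv->mem_read_w(rv, addr); /* non-tail call */
    const pn_insn_t *next = ir->next;
    MUST_TAIL return next->impl(rv, next, cycle + 1, PC + 4);
}

PRESERVE_NONE static bool pn_halt(pn_rv_t *rv, const pn_insn_t *ir,
                                  uint64_t cycle, uint32_t PC)
{
    (void) ir;
    rv->cycle = cycle;
    rv->PC = PC;
    return true;
}

/* Runs "lw x5, 8(x0); halt" against the toy memory and returns x5. */
uint32_t run_pn_lw(void)
{
    pn_rv_t rv = {.mem_read_w = pn_read_w, .mem = {7, 11, 13, 17}};
    pn_insn_t prog[2] = {
        {.impl = pn_lw, .rd = 5, .rs1 = 0, .imm = 8},
        {.impl = pn_halt},
    };
    prog[0].next = &prog[1];

    bool ok = prog[0].impl(&rv, &prog[0], 0, 0);
    assert(ok && rv.cycle == 1 && rv.PC == 4);
    (void) ok;
    return rv.X[5];
}
```

The entry point `run_pn_lw` keeps the default convention and simply calls into the first handler. Whether the prologue in a handler like `pn_lw` actually shrinks would need to be confirmed by comparing the disassembly with and without the attribute.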

References

Clang docs: preserve_none

A Tail Calling Interpreter For Python (And Other Updates)
