https://news.ycombinator.com/item?id=18692133

I suspect that for many languages, what you really want is a compact SSA distribution format, and at program / library installation time, compile each extended basic block to a size-optimized block of straight-line native code. Functions would compile to stubs that pass arrays of basic block addresses to a common dispatch loop. In effect, you get a direct-threaded interpreter, where each “opcode” is a piece of straight-line native code from the application. Each basic block would need to set a known register to the address of the next basic block. In other words, the code would be compiled to native continuation-passing style.
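
A minimal C sketch of that dispatch scheme, under some loud assumptions: C functions stand in for the blocks of straight-line native code, a `next` field in a hypothetical `State` struct stands in for the "known register", and the names `State`, `BlockFn`, `dispatch`, and `sum_to` are made up for illustration. Real generated code would presumably end each block with an indirect jump rather than a call and return.

```c
#include <stddef.h>

typedef struct State State;
typedef void (*BlockFn)(State *);

/* Hypothetical machine state; `next` plays the role of the known register. */
struct State {
    BlockFn next;           /* address of the next basic block, NULL = return */
    const BlockFn *blocks;  /* the function's array of basic block addresses */
    long i, n, acc;         /* locals of the example function */
};

enum { BLK_TEST, BLK_BODY, BLK_COUNT };

/* Loop header: pick the successor block, or NULL to leave the function. */
static void blk_test(State *s) {
    s->next = (s->i <= s->n) ? s->blocks[BLK_BODY] : NULL;
}

/* Loop body: straight-line work, then continue at the loop header. */
static void blk_body(State *s) {
    s->acc += s->i;
    s->i += 1;
    s->next = s->blocks[BLK_TEST];
}

/* Common dispatch loop shared by every compiled function. */
static void dispatch(const BlockFn *blocks, State *s) {
    s->blocks = blocks;
    s->next = blocks[0];
    while (s->next != NULL) {
        BlockFn block = s->next;
        block(s);  /* each block must store its successor in s->next */
    }
}

/* Function stub: hands its block table to the dispatch loop. */
long sum_to(long n) {
    static const BlockFn blocks[BLK_COUNT] = { blk_test, blk_body };
    State s = { .i = 1, .n = n, .acc = 0 };
    dispatch(blocks, &s);
    return s.acc;
}
```

Calling `sum_to(n)` adds up the integers from 1 to n, with every control-flow decision routed through the dispatch loop. That single choke point is what makes it cheap to swap in a different loop later.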

This provides the basis for a very low-overhead tracing JIT for native code. When a profiling timer triggers, the signal handler can walk the stack and perform some hot code detection heuristics. If a hot section is detected, the handler can perform on-stack replacement of the dispatch loop return address with a tracing version of the same. Once a hot loop is detected, the SSA form for each of the involved basic blocks can be stitched together and passed to an optimizing compiler. This would give you the fast start-up of size-optimized native code along with the long-term profile-optimized, cross-library-inlined-and-optimized code of a high-performance JIT. The main downside would be the on-disk storage size of keeping both compressed SSA and size-optimized native code.
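
A rough C sketch of the tracing side, with several simplifications that are assumptions rather than part of the proposal: a flag (`want_trace`) that a real profiling-timer handler would set replaces the stack walk and on-stack replacement, "hot loop detected" is reduced to "the trace came back to its first block", and the recorded block addresses are merely collected rather than stitched into SSA. All names (`VmState`, `Block`, `Trace`, `dispatch_profiled`, `demo_trace`) are hypothetical.

```c
#include <signal.h>
#include <stddef.h>

typedef struct VmState VmState;
typedef void (*Block)(VmState *);
struct VmState { Block next; long i, n, acc; };

static void loop_test(VmState *s);
static void loop_body(VmState *s);
static void loop_test(VmState *s) { s->next = (s->i <= s->n) ? loop_body : NULL; }
static void loop_body(VmState *s) { s->acc += s->i; s->i += 1; s->next = loop_test; }

/* A real system would set this from the profiling-timer signal handler. */
static volatile sig_atomic_t want_trace = 0;

enum { TRACE_MAX = 64 };
typedef struct Trace {
    Block blocks[TRACE_MAX];  /* block addresses in execution order */
    size_t len;
    int closed;               /* 1 once the trace returns to its first block */
} Trace;

/* Tracing dispatch loop: record each block until the trace closes a loop. */
static void dispatch_tracing(VmState *s, Trace *t) {
    while (s->next != NULL && t->len < TRACE_MAX) {
        Block b = s->next;
        if (t->len > 0 && b == t->blocks[0]) {
            t->closed = 1;  /* the blocks recorded so far form one loop iteration */
            return;
        }
        t->blocks[t->len++] = b;
        b(s);
    }
}

/* Ordinary dispatch loop that hands over to the tracing loop on request. */
static void dispatch_profiled(VmState *s, Trace *t) {
    while (s->next != NULL) {
        if (want_trace) {
            want_trace = 0;
            dispatch_tracing(s, t);
            continue;
        }
        Block b = s->next;
        b(s);
    }
}

/* Runs the example loop with tracing requested up front.
   Returns the number of blocks in the closed loop trace, or 0 if none closed. */
size_t demo_trace(long n, long *sum_out) {
    VmState s = { .next = loop_test, .i = 1, .n = n, .acc = 0 };
    Trace t = { .len = 0, .closed = 0 };
    want_trace = 1;  /* pretend the profiling timer just fired */
    dispatch_profiled(&s, &t);
    *sum_out = s.acc;
    return t.closed ? t.len : 0;
}
```

Once `closed` is set, `blocks[0..len)` names the basic blocks whose SSA would be handed to the optimizing compiler; the program keeps running through the ordinary loop in the meantime.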

Alternatively, I could imagine processors with built-in support for native code tracing and optimization. If a trace register were non-zero, then every conditional or indirect branch would store the effective branch target at the location pointed to by the trace register, and increment the trace register by sizeof(size_t). If the trace register were equal to the trace limit register, or a performance timer had expired, then the CPU would trap to a userspace handler indicated by another register. Though I think you’d need the threaded interpreter version of the idea to become popular before CPU manufacturers started to consider the hardware-supported version.
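
To make the register semantics concrete, here is a small software model in C. It is only a sketch: the `Cpu` struct, `on_branch`, and `demo_hw_trace` are invented names, the limit check is assumed to happen right after the store (the description leaves the ordering open), and the example handler simply turns tracing off.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the proposed registers. */
typedef struct Cpu {
    size_t *trace;                       /* trace register, NULL (zero) = tracing off */
    size_t *trace_limit;                 /* trace limit register */
    void (*trap_handler)(struct Cpu *);  /* register naming the userspace handler */
    int timer_expired;                   /* stands in for the performance timer */
} Cpu;

/* What the CPU would do on every conditional or indirect branch. */
static void on_branch(Cpu *cpu, size_t target) {
    if (cpu->trace == NULL)
        return;
    *cpu->trace = target;  /* store the effective branch target */
    cpu->trace += 1;       /* pointer step, i.e. sizeof(size_t) bytes */
    if (cpu->trace == cpu->trace_limit || cpu->timer_expired)
        cpu->trap_handler(cpu);
}

/* Example handler: stop tracing; a JIT would consume the buffer here. */
static void stop_tracing(Cpu *cpu) {
    cpu->trace = NULL;
}

/* Feeds nbranches made-up branch targets through the model.
   Returns how many targets ended up in buf (cap must be at least 1). */
size_t demo_hw_trace(size_t *buf, size_t cap, size_t nbranches) {
    assert(cap > 0);
    Cpu cpu = { buf, buf + cap, stop_tracing, 0 };
    for (size_t k = 0; k < nbranches; k++)
        on_branch(&cpu, 0x1000 + 16 * k);
    return cpu.trace == NULL ? cap : (size_t)(cpu.trace - buf);
}
```

With a four-entry buffer and six branches, the first four targets are recorded, the trap fires once, and the remaining branches run untraced.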