
Experiment: Hybrid JIT - Pyston's DynASM JIT + CPython's Stencil JIT #146235

Draft
undingen wants to merge 2 commits into python:main from undingen:dynasm_clean4

Conversation


@undingen undingen commented Mar 20, 2026

Experimental proof-of-concept, currently only working on x86-64 Linux. It is not intended for merging or detailed review, but rather to demonstrate the direction CPython's JIT could take and make a case for why this approach is worth pursuing.

This PR replaces CPython's copy-and-patch JIT with one using DynASM while keeping Clang-compiled C stencil templates as the source of the assembly. No hand-written assembly opcodes are required - we get the best of both worlds: LLVM-quality machine code with DynASM's runtime flexibility for encoding, linking, and layout.

Background

While I haven't worked on Pyston in many years, I always had in the back of my mind that I wanted to port some of the things that worked well over to CPython so everyone could benefit. Pyston's JIT used DynASM with hand-written assembly for every micro-op - it gave precise control over code generation but was a large maintenance/porting burden.

I finally found some time over the last few weeks to explore this, and with the help of Claude Opus 4.6, I was able to carry out this experiment. I mostly described what should be implemented and how, gave it Pyston as a reference, and let it implement things. Before AI assistance this wouldn't have been realistic given the time constraints I have.

One thing I remember making Pyston fast: spend as much time as possible inside JIT-compiled code, while making sure that code size does not explode.
In practice this meant inline caches and inlining, combined with splitting hot and cold code apart - lots of tiny optimizations which individually don't really show up but together increase performance noticeably.
I found DynASM was the right tool for that, because it generates code very fast, is flexible, and is quite easy to use.
For example, it makes hot/cold splitting trivial, as you will see further down.

A lot has changed in CPython over the years, so I focused only on the things I know sped up Pyston and where I could see an easy way to integrate them into current CPython. There are definitely more things that could be ported, but for this experiment this should be enough.

Also, there are currently quite a few peephole optimizations which work directly on the asm instructions; I suspect some of them could be removed without any performance difference...

I go into more detail further down. I'm looking forward to feedback, but I don't know how much more time I'll be able to spend on this.

Note: Everything should work normally except that the reference tracer (_PyReftracerTrack) is disabled in the JIT paths.

Performance

Compiled with ./configure --enable-experimental-jit=yes on an Intel Core Ultra 7 155H laptop, turbo boost disabled.
pyperformance results sorted by speedup, showing only the significant ones:

| Benchmark                        | main (JIT)  | This PR         | Speedup      |
|----------------------------------|-------------|-----------------|--------------|
| nbody                            | 111 ms      | 61.7 ms         | 1.80x faster |
| spectral_norm                    | 119 ms      | 78.7 ms         | 1.51x faster |
| scimark_sparse_mat_mult          | 6.68 ms     | 4.72 ms         | 1.41x faster |
| scimark_fft                      | 370 ms      | 266 ms          | 1.39x faster |
| float                            | 80.3 ms     | 66.5 ms         | 1.21x faster |
| scimark_sor                      | 130 ms      | 110 ms          | 1.18x faster |
| scimark_monte_carlo              | 75.9 ms     | 64.9 ms         | 1.17x faster |
| chaos                            | 82.5 ms     | 70.6 ms         | 1.17x faster |
| base32_large                     | 672 ms      | 577 ms          | 1.16x faster |
| scimark_lu                       | 110 ms      | 96.0 ms         | 1.15x faster |
| fannkuch                         | 493 ms      | 429 ms          | 1.15x faster |
| html5lib                         | 84.0 ms     | 74.0 ms         | 1.14x faster |
| bench_mp_pool                    | 20.7 ms     | 18.3 ms         | 1.13x faster |
| raytrace                         | 441 ms      | 392 ms          | 1.12x faster |
| base32_small                     | 12.7 ms     | 11.4 ms         | 1.11x faster |
| tomli_loads                      | 2.41 sec    | 2.21 sec        | 1.09x faster |
| nqueens                          | 127 ms      | 116 ms          | 1.09x faster |
| dulwich_log                      | 86.9 ms     | 80.1 ms         | 1.09x faster |
| regex_compile                    | 183 ms      | 169 ms          | 1.08x faster |
| pyflate                          | 490 ms      | 455 ms          | 1.08x faster |
| go                               | 147 ms      | 136 ms          | 1.08x faster |
| crypto_pyaes                     | 100 ms      | 93.0 ms         | 1.08x faster |
| hexiom                           | 8.72 ms     | 8.19 ms         | 1.07x faster |
| base64_small                     | 331 us      | 312 us          | 1.06x faster |
| unpickle_pure_python             | 288 us      | 275 us          | 1.05x faster |
| richards                         | 28.5 ms     | 27.1 ms         | 1.05x faster |
| docutils                         | 4.15 sec    | 3.94 sec        | 1.05x faster |
| base16_small                     | 436 us      | 414 us          | 1.05x faster |
| urlsafe_base64_small             | 471 us      | 452 us          | 1.04x faster |
| sqlglot_v2_parse                 | 1.70 ms     | 1.64 ms         | 1.04x faster |
| many_optionals                   | 1.30 ms     | 1.25 ms         | 1.04x faster |
| mako                             | 14.5 ms     | 13.8 ms         | 1.04x faster |
| bpe_tokeniser                    | 6.51 sec    | 6.24 sec        | 1.04x faster |
| base85_small                     | 260 us      | 250 us          | 1.04x faster |
| xml_etree_iterparse              | 122 ms      | 118 ms          | 1.03x faster |
| xml_etree_generate               | 122 ms      | 119 ms          | 1.03x faster |
| xdsl_constant_fold               | 69.6 ms     | 67.7 ms         | 1.03x faster |
| tornado_http                     | 185 ms      | 181 ms          | 1.03x faster |
| pickle_pure_python               | 447 us      | 436 us          | 1.03x faster |
| meteor_contest                   | 157 ms      | 153 ms          | 1.03x faster |
| mdp                              | 2.10 sec    | 2.04 sec        | 1.03x faster |
| deltablue                        | 4.10 ms     | 3.98 ms         | 1.03x faster |
| async_tree_none_tg               | 362 ms      | 352 ms          | 1.03x faster |
| async_tree_io                    | 829 ms      | 805 ms          | 1.03x faster |
| richards_super                   | 36.1 ms     | 35.2 ms         | 1.02x faster |
| deepcopy_memo                    | 31.1 us     | 30.4 us         | 1.02x faster |
| bench_thread_pool                | 2.18 ms     | 2.13 ms         | 1.02x faster |
| async_tree_memoization           | 439 ms      | 430 ms          | 1.02x faster |
| ascii85_small                    | 771 us      | 753 us          | 1.02x faster |

Note:
this PR contains two commits; the second one modifies some float objects in place when refcnt==1 (Pyston did this too).
It's mainly here to show that the new JIT can generate decent code with the hot/cold splitting.
These are the performance changes from the first commit to the second:

| Benchmark                        | First commit    | Both commits | Change       |
|----------------------------------|-----------------|--------------|--------------|
| spectral_norm                    | 95.7 ms         | 78.7 ms      | 1.22x faster |
| nbody                            | 68.1 ms         | 61.7 ms      | 1.10x faster |
| float                            | 72.3 ms         | 66.5 ms      | 1.09x faster |
| chaos                            | 75.5 ms         | 70.6 ms      | 1.07x faster |
| python_startup_no_site           | 15.7 ms         | 15.1 ms      | 1.04x faster |
| raytrace                         | 404 ms          | 392 ms       | 1.03x faster |
| comprehensions                   | 22.8 us         | 23.4 us      | 1.03x slower |
| asyncio_tcp                      | 495 ms          | 507 ms       | 1.03x slower |
| unpickle_list                    | 6.63 us         | 6.78 us      | 1.02x slower |
| telco                            | 215 ms          | 219 ms       | 1.02x slower |
| pickle                           | 18.7 us         | 19.1 us      | 1.02x slower |

How to try it out:

You need to have luajit and llvm-21-dev installed:

sudo apt install luajit llvm-21-dev

I added LuaJIT as a git submodule for DynASM - it comes with minilua.c, which could be used instead of an installed luajit, but I did not hook that up in the Makefile, so installing luajit is necessary right now...
Note: Lua is only used at CPython build time to run the DynASM dasc preprocessor, which spits out C code; it's not used at runtime.

How It Works

Stencil Build Pipeline

clang .c → LLVM "fold" pass → .s asm → _optimizers.py (hot/cold splitting) → _asm_to_dasc.py → .dasc file → dynasm.lua converts to C → jit_stencils.h

What's new:

  • An LLVM pass (Tools/jit/jit_fold_pass.cpp) whose task is to replace uses of the _JIT_OPARG, _JIT_OPERAND0, ... values with more optimized code.
  • _asm_to_dasc.py transforms the assembly into a dasc file which DynASM understands, and performs some peephole optimizations.
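To give a feel for the kind of rewriting _asm_to_dasc.py does, here is a toy sketch of the simplest step, flipping AT&T operand order into Intel/DynASM order. This is my own illustration, not the script's actual code; the real converter also handles memory operands, immediates, size suffixes, and the peephole patterns:

```python
import re

def att_to_intel(line: str) -> str:
    """Toy AT&T -> Intel converter for simple two-register instructions.

    Handles only `op %src, %dst` forms; anything else passes through
    unchanged. The real _asm_to_dasc.py is far more thorough.
    """
    m = re.match(r"\s*(\w+?)q?\s+%(\w+),\s*%(\w+)\s*$", line)
    if not m:
        return line  # not a recognized two-register form
    op, src, dst = m.groups()
    # Intel syntax puts the destination first and drops the % prefixes
    return f"{op} {dst}, {src}"
```

For example, `att_to_intel("movq %rax, %rbx")` yields `mov rbx, rax`.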

The DynASM | Syntax inside the jit_stencils-x86_64-unknown-linux-gnu.dasc file

Lines with | emit machine code at runtime; everything else is normal C:

E.g. loading a value into a register:

static void emit_mov_imm(Jit *Dst, int reg, uintptr_t val) {
    if (val == 0) {
        | xor Rd(reg), Rd(reg)               // 2 bytes
    } else if (val <= UINT32_MAX) {
        | mov Rd(reg), (unsigned int)val     // 5 bytes
    } else {
        | mov64 Rq(reg), (unsigned long)val  // 10 bytes
    }
}

This is impossible with copy-and-patch: values aren't known at build time, so every load must be worst-case 10 bytes. With DynASM, we pick the optimal encoding at JIT compile time.

What a Stencil Looks Like

// LOAD_FAST - load a local variable onto the evaluation stack
static void emit__LOAD_FAST_r01(dasm_State **Dst, ...) {
    |=>uop_label:
    | mov rdi, qword [r13 + (instruction->oparg * 8 + 80)]
    | test dil, 1
    | jne =>L(0)
    | inc dword [rdi]
    |=>L(0):
}

The instruction->oparg * 8 + 80 offset is computed at JIT compile time and folded into a single addressing mode.
The current CPython JIT replicates some opcodes to generate optimized code for the most common opargs;
DynASM always lets us pick the best encoding.

What Changed vs. Copy-and-Patch

Removed

  • no manual relocation handling / patching anymore - DynASM does it for us
  • the machine code writer can be removed

DynASM handles this automatically

  • Label resolution: Branch targets via =>N PC labels
  • Section layout: .code (hot) / .cold with automatic placement
  • Relative calls: call rel32 with correct displacement
  • Linking: dasm_link() resolves everything in one pass

Key Optimizations

1. Hot/Cold Splitting

Error paths and deoptimization stubs go in the .cold section, which DynASM lays out after all hot code:

|.code
| cmp rdx, rax
| jne =>L(deopt)  // branch to cold section
// ... hot path ...

|.cold
|=>L(deopt):
// ... deopt handling (rarely executed) ...

This keeps hot code compact for better I-cache utilization.

2. Runtime-Optimal Immediates

// Value 0:     xor rax, rax          (2 bytes - also zeroes flags, fastest)
// ≤ 4GB:       mov eax, imm32        (5 bytes - zero-extends to 64 bit)
// Near JIT:    lea rax, [rip+disp32] (7 bytes - RIP-relative)
// > 4GB:       mov64 rax, imm64      (10 bytes - fallback)

Most type objects and runtime functions land in the 5 or 7-byte tier, saving 3–5 bytes each vs. the old fixed 10-byte movabs.
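The tier selection can be sketched in Python; the byte counts come from the comments above, and `rip_reachable` stands in for the RIP-relative reachability test, which in the real emitter depends on where the code buffer was mapped:

```python
def mov_imm_size(val: int, rip_reachable: bool = False) -> int:
    """Encoded size in bytes for loading `val` into a 64-bit register,
    following the tiers described above (a sketch, not the real emitter)."""
    if val == 0:
        return 2            # xor reg, reg
    if val <= 0xFFFFFFFF:
        return 5            # mov r32, imm32 (zero-extends to 64 bit)
    if rip_reachable:
        return 7            # lea r64, [rip+disp32]
    return 10               # mov64 r64, imm64 (movabs fallback)
```

Copy-and-patch must always reserve the worst-case 10 bytes; here the common cases cost 2, 5, or 7.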

3. LLVM Fold Pass + Peephole Optimizer

An LLVM pass (jit_fold_pass.cpp) folds JIT-time constants directly in IR - tracing SSA chains from _JIT_OPARG, _JIT_OPERAND0/1 and _PyRuntime, folding GEP address arithmetic, and emitting compact inline-asm markers. The downstream peephole optimizer then fuses assembly patterns:

LOAD_SMALL_INT - before: six instructions (movabs, movzwl, shll, movabs, addq, orq).
After: a single immediate load - the entire address is computed at JIT time:

before:
    movabsq $0x0, %rax # _JIT_OPARG
    movzwl  %ax, %eax
    shll    $0x5, %eax
    movabsq $0x0, %rsi #_PyRuntime+0x37c8
    addq    %rax, %rsi
    orq     $0x1, %rsi

now:
emit_mov_imm(Dst, JREG_RSI,
    (((uintptr_t)&_PyRuntime + 14120)
     + (uintptr_t)((instruction->oparg + 5)) * 32) | 1);

Things like this happen in a lot of stencils, e.g. emit__LIST_APPEND_r10...
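The folded constant in the snippet above is plain arithmetic at JIT-compile time. Here is the same computation in Python; `runtime_base` is a made-up address standing in for `&_PyRuntime`, and the offsets (14120, the +5 bias for the negative small ints, the 32-byte object stride, the `| 1` borrowed tag) are taken from the generated code above:

```python
def small_int_addr(runtime_base: int, oparg: int) -> int:
    """Tagged address of the cached small-int object for `oparg`,
    mirroring the folded LOAD_SMALL_INT computation shown above."""
    return ((runtime_base + 14120) + (oparg + 5) * 32) | 1
```

Each successive oparg lands exactly one 32-byte object further into the small-int cache, and the low bit is always set as the borrowed-reference tag.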

Convert the boolean result into Py_False or Py_True:

before:
    movabsq $0x0, %rax  #_Py_FalseStruct
    movabsq $0x0, %rdi  #_Py_TrueStruct
    cmoveq  %rax, %rdi
    orq     $0x1, %rdi  # set the borrowed flag

now:
    # folds the or into the constant
    emit_mov_imm_preserve_flags(Dst, JREG_RAX, ((uintptr_t)&_Py_FalseStruct | 1));
    emit_mov_imm_preserve_flags(Dst, JREG_RSI, ((uintptr_t)&_Py_TrueStruct | 1));
    | cmove rsi, rax

4. Direct Calls

; Before (GOT indirection):  call qword [rip + GOT_OFFSET]  - 6 bytes, memory load
; After (direct):            call rel32                     - 5 bytes, no memory access

We hint mmap() so that the JIT code is allocated within ±2 GB of CPython's text segment,
so nearly all calls can use the shorter direct form.
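Whether the short form is usable comes down to a signed 32-bit displacement check; a sketch (the displacement is measured from the end of the 5-byte call instruction, as x86-64 defines it):

```python
def fits_rel32(call_site: int, target: int) -> bool:
    """True if `target` is reachable via a 5-byte `call rel32` placed at
    `call_site`; otherwise the emitter must fall back to an indirect call."""
    disp = target - (call_site + 5)  # rel32 is relative to the next instruction
    return -(2**31) <= disp < 2**31
```

This is why mapping the JIT region near the text segment matters: with both within ±2 GB of each other, virtually every runtime-function call passes this check.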

5. Shared Trace Frame

All stencils in a trace share a single stack frame. At trace entry we do push rbp; mov rbp, rsp; sub rsp, N once; individual stencil prologues/epilogues are stripped at build time. Each trace exit inlines mov rsp, rbp; pop rbp; ret. This eliminates per-stencil frame setup overhead on the hottest paths.
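One plausible way to size that single shared frame (my assumption for illustration; the PR's actual frame-merging logic lives in _optimizers.py and jit.c) is to take the largest frame any stencil in the trace needs, rounded up for the 16-byte x86-64 ABI stack alignment:

```python
def shared_frame_size(stencil_frame_sizes: list[int]) -> int:
    """Bytes for the one `sub rsp, N` at trace entry: the maximum frame
    requirement over all stencils, 16-byte aligned (sketch, assumed scheme)."""
    need = max(stencil_frame_sizes, default=0)
    return (need + 15) & ~15  # round up to the next multiple of 16
```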

6. Float Inplace Operations

Experimental _BINARY_OP_INPLACE_ADD_FLOAT and friends modify float values in place when the refcount is 1.
The hot path reduces to not much more than a single SSE instruction:

; Hot path for x += y (when x has refcount 1):
addsd xmm0, qword [rsi + 16]    ; add y's value
movsd qword [rdi + 16], xmm0    ; store back in-place

I told the AI to implement it in a way that does not use up opcodes (because I assumed they are still very limited), but now I'm not sure this was necessary - maybe it could have been implemented in a more straightforward way...

What DynASM Enables for the Future

Beyond the current optimizations, DynASM gives us a flexible code generation foundation that copy-and-patch fundamentally cannot provide. Since we control instruction emission at runtime, future work can:

  • Allocate registers across stencil boundaries - track live values at join points, eliminate redundant loads between consecutive stencils
  • Propagate constants through traces - when a trace specializes on a known value, fold it into all downstream operations
  • Perform cross-stencil optimizations - e.g. removing the "is borrowed" checks

What This Cannot Solve

This work is purely a backend improvement. Higher-level optimizations must happen elsewhere:

  • Type specialization - decided by the trace optimizer
  • Float unboxing - requires changes to the optimizer's value representation
  • Escape analysis - stack-allocating objects that don't escape
  • Inline caching - faster attribute/method lookups

Profiling shows JIT-emitted code accounts for ~25% of runtime on numeric benchmarks. The remaining ~75% is C runtime functions (PyFloat_FromDouble, _Py_Dealloc, etc.). Backend optimizations can't produce dramatic speedups alone, but they ensure generated code is as tight as possible and remove the backend as a bottleneck for higher-level improvements.

Testing

  • All CPython tests pass (472/472; only pre-existing test_pyrepl failure)

Replace the copy-and-patch relocation engine with a DynASM-based pipeline.
Instead of manually copying pre-compiled stencil blobs and patching GOT
entries / trampolines at runtime, Clang-generated assembly is converted at
build time into DynASM .dasc source, which is then compiled into a C header
(jit_stencils_dynasm.h).  At runtime the DynASM assembler encodes native
x86-64 directly, resolving all labels, jumps, and data references in a
single pass.

Key changes:

Build pipeline (Tools/jit/):
  - _asm_to_dasc.py: New peephole optimizer that converts Clang AT&T asm
    to DynASM Intel-syntax .dasc.  Uses typed operand classes (Reg, Mem,
    Imm) with Python 3.10+ match/case for pattern matching.  Includes
    15+ optimization patterns (immediate narrowing, test-self elimination,
    indexed memory folding, ALU immediate folding, redundant stack reload
    elimination, dead label removal, etc.).
  - _dasc_writer.py: Generates jit_stencils.h with DynASM preamble,
    emit helpers (emit_mov_imm, emit_call_ext, emit_cmp_reg_imm,
    emit_test/and/or/xor_reg_imm), and per-stencil emit functions.
  - _targets.py: Reworked to drive the DynASM pipeline — compiles
    stencils, converts asm, generates .dasc, runs the DynASM preprocessor,
    and produces the final header.
  - _stencils.py: Adds COLD_CODE HoleValue for hot/cold section splitting.
  - _optimizers.py: Extended with stencil frame-size tracking and
    frame-group merging infrastructure.
  - build.py: Adds --peephole-stats flag for optimization statistics.
  - test_peephole.py: unit tests covering peephole patterns and
    the line classification infrastructure.
  - Lib/test/test_jit_peephole.py: Hooks peephole tests into make test.

Runtime (Python/jit.c):
  - Complete rewrite of _PyJIT_Compile: uses DynASM dasm_init / dasm_setup
    / per-stencil emit / dasm_link / dasm_encode instead of memcpy+patch.
  - Hot/cold code splitting: cold (error) paths are placed in a separate
    DynASM section after the hot code, improving i-cache locality.
  - Frame merging: stencils share a single prologue/epilogue, eliminating redundant rsp adjustments.
  - SET_IP delta encoding: incremental IP updates avoid redundant full
    address loads.
  - Hint-based mmap: jit_alloc() places JIT code near the CPython text
    segment for short (±2 GB) RIP-relative calls and LEAs.
  - jit_shrink(): releases unused pages at the end of each compiled trace.
  - emit_call_ext: emits direct RIP-relative call when target is within
    ±2 GB, otherwise falls back to indirect call through register.
  - emit_mov_imm: picks the shortest encoding (xor/mov32/mov64/lea rip)
    based on the runtime value.

Freelist inlining (Tools/jit/jit.h + template.c):
  - Macro overrides redirect float/int allocation and deallocation to
    JIT-inlined versions that directly access the thread-state freelists,
    avoiding function call overhead for the most common object types.
  - _PyJIT_FloatFromDouble / _PyJIT_FloatDealloc: inline float freelist.
  - _PyJIT_LongDealloc / _PyJIT_FastDealloc: inline int/generic dealloc.
  - _PyJIT_CompactLong_{Add,Subtract,Multiply}: inline compact long ops.
  - PyStackRef_CLOSE / Py_DECREF overrides use the fast dealloc path.

LuaJIT submodule:
  - Added as Tools/jit/LuaJIT for the DynASM assembler (dynasm/ only
    used at build time; no LuaJIT runtime code is linked).

This is an experimental port, currently tested on x86_64 Linux only.
The approach is a hybrid between Pyston's fully hand-written DynASM JIT
(https://github.com/pyston/pyston/blob/pyston_main/Python/aot_ceval_jit.c)
and CPython's Clang-generated stencils: Clang produces the stencil
assembly, and DynASM handles encoding and relocation at runtime.
…true divide, power, floor divide, modulo

Add a proper specialization-based approach for inplace float modification.
This uses the existing _PyBinaryOpCache.external_cache[0..2] to store
profiling hints from the specializer, which the optimizer reads to select
the best tier2 inplace variant.

The key insight: when a float binary operation produces a result and one
of the input operands has refcount 1 (or 2 when a STORE_FAST targets it),
we can modify that float object in-place instead of allocating a new one.
This eliminates allocation overhead in common patterns like:

    x += y       # STORE_FAST_LEFT: left operand is the local
    x = a + b    # generic: whichever operand has refcount 1
    total += a * b  # chained: intermediate has refcount 1

Specializer (specialize.c):
- binary_op_float_inplace_candidate(): checks if either operand has
  refcount 1 at specialization time
- binary_op_float_inplace_store_fast_hint(): checks if next instruction
  is STORE_FAST targeting left (source=1) or right (source=2) operand
- Stores hints in external_cache[0]=use_inplace, [1]=source, [2]=local_index

New tier2 ops:
- Float/int mixed: _BINARY_OP_{ADD,SUBTRACT,MULTIPLY,TRUE_DIVIDE}_FLOAT_INT
- Float true divide: _BINARY_OP_TRUE_DIVIDE_FLOAT (with zero check)
- Float power: _BINARY_OP_POWER_FLOAT (positive base, finite exponent)
- Int floor divide: _BINARY_OP_FLOOR_DIVIDE_INT (Python semantics)
- Int modulo: _BINARY_OP_MODULO_INT (Python semantics)
- Inplace float: _BINARY_OP_INPLACE_{TRUE_DIVIDE,POWER}_FLOAT
- Inplace int: _BINARY_OP_INPLACE_{ADD,SUBTRACT,MULTIPLY}_INT

Architecture: The specializer writes JIT hints to external_cache[3] (an enum
indicating which tier2 op to use) and calls unspecialize() — the interpreter
runs the generic BINARY_OP path while the JIT optimizer reads the hints and
emits the specialized tier2 op.  This avoids wasting interpreter opcode slots
while still getting full JIT specialization.

Also adds _PyCompactLong_FloorDivide() and _PyCompactLong_Modulo() helpers
in Objects/longobject.c with correct Python floor-division/modulo semantics
(sign correction for negative operands).
@python-cla-bot

The following commit authors need to sign the Contributor License Agreement:

CLA not signed

@bedevere-app

bedevere-app bot commented Mar 20, 2026

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@picnixz picnixz closed this Mar 20, 2026
@Fidget-Spinner
Member

@picnixz why did you close this? I think leaving it open just as a read is interesting.


@picnixz
Member

picnixz commented Mar 20, 2026

Do not create PRs for PoCs. In addition this looks too big for a change without an issue, a PEP and a discussion.

@picnixz
Member

picnixz commented Mar 20, 2026

why did you close this? I think leaving it open just as a read is interesting

You do not need a PR for PoCs. This wastes CI resources whenever there is a change and this diverts our attention to works we want to merge.

@Fidget-Spinner
Member

You do not need a PR for PoCs. This wastes CI resources whenever there is a change

We could just ask the author to not push any changes to this branch.

@picnixz
Member

picnixz commented Mar 20, 2026

No, please. We already have too many PRs. Or at least create an issue for that. But we really try to avoid making this in general. And IIRC we did ask people not to do that in the devguide.

If you do want to sponsor that PR though please create an issue so that the feedback is not lost in the PR.

@Fidget-Spinner
Member

If you do want to sponsor that PR though please create an issue so that the feedback is not lost in the PR.

No an issue is for something actionable. This PR is quite valuable in that it teaches a lot, but isn't an actionable item on its own.

I will close the PR next Wednesday. So you don't have to worry about that.

@Fidget-Spinner
Member

@undingen thanks a lot for running this experiment. It's really valuable info:

Could you please post the geometric mean speedup you get from pyperf compare_to? IIUC, this still copy-and-patches, but instead of targeting machine code it targets dasc.

  1. The speedups are impressive. However, for the effort and lines of handwritten code, it's surprisingly also less than I expected. I think the conclusion I'm deriving here is that the trace optimizer middle-end/frontend is a lot more important than the backend for perf. This is still really good info---it shows us what is a possible "upper bound" if we were to change to a different level of abstraction.
  2. We definitely want to do hot-cold splitting (linked issue), and we're planning on mmapping regions near each other already. This PR is really valuable to show us what we can expect from those.

./configure --enable-experimental-jit=yes

In that case, the real speedups are probably slightly lower, because PGO+LTO+TAILCALL generates a faster interpreter which means falling back to it is not as costly. However, this is still interesting.
