Paper Notes: Speculative Decoding

April 2, 2026 · Jiangneng Li · 5 min read

Papers covered in these notes: EAGLE and ConFu.

Industry Adoption: vLLM ships EAGLE as one of its main speculative decoding methods. See the vLLM Speculative Decoding Documentation.

Why Speculative Decoding Breaks PagedAttention (and How vLLM Fixes It)

1. Allocation Frequency: Steady Pruning vs. Burst-and-Kill

Beam Search allocates and frees KV cache blocks at the pace of actual generation. With beam width = 3, each forward pass produces 3 new tokens. When a branch is pruned, the system triggers a single deallocation — the churn rate is synchronized with real generation speed.

Speculative Decoding (Tree Attention) operates in explosive bursts. The draft model generates an entire speculation tree (e.g., 15 candidate tokens) in a single, ultra-fast forward pass (a few milliseconds). The verifier then instantly rejects most of them — say 12 out of 15.

The critical gap: Beam search prunes gradually. Speculative decoding forces the page table to insert 15 pointers, then execute 12 Free operations just ~10ms later — every single step. This high-frequency “instant garbage collection” creates severe lock contention on the CPU-side scheduler, and the management overhead can eat into the speedup gained from speculation.
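To make the contrast concrete, here is a toy back-of-the-envelope script. The numbers come from the scenario above, not from vLLM's actual allocator; it just counts page-table operations per decode step:

```python
# Toy model of page-table churn (hypothetical numbers from the text above,
# not vLLM's real allocator). It only counts map/free operations per step.

def beam_search_ops(beam_width: int = 3, pruned_per_step: int = 1) -> int:
    """One decode step: each beam appends one token; pruning frees one branch."""
    appends = beam_width          # at most beam_width new block mappings
    frees = pruned_per_step       # occasional single deallocation
    return appends + frees

def speculative_ops(tree_size: int = 15, accepted: int = 3) -> int:
    """One spec-decode step: map the whole tree, then free every rejected path."""
    appends = tree_size           # burst: 15 pointers inserted at once
    frees = tree_size - accepted  # 12 frees ~10 ms later, every step
    return appends + frees

print(beam_search_ops())   # 4 page-table operations per step
print(speculative_ops())   # 27 page-table operations per step
```

Even in this crude accounting, the speculative step touches the page table roughly 7x more often per decode step, and all of that traffic is serialized through the CPU-side scheduler.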

2. Copy-on-Write Fragmentation at Micro Scale

CoW is elegant for beam search branching, but becomes a nightmare at speculative decoding’s micro-granularity.

Consider a block of size 16 tokens with 3 empty slots remaining. Beam search fills one token at a time. But when the draft model forks into 3 parallel paths (A, B, C) each producing 3 tokens, CoW forces the system to immediately copy 3 independent physical blocks:

$$3 \times 16 = 48 \text{ token slots allocated, but only } 3 \times 3 = 9 \text{ draft tokens stored}$$

This is extreme memory waste: only 9 of the 48 newly allocated slots hold fresh draft tokens; the other 39 are redundant copies of the shared prefix. After verification kills paths B and C, those blocks must be freed immediately, adding to the allocation churn.
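A minimal sketch of that arithmetic (toy numbers, nothing vLLM-specific):

```python
# Copy-on-write fork arithmetic from the example above: forking copies whole
# 16-token blocks, so most allocated slots end up holding duplicated prefix
# rather than new draft tokens.

BLOCK_SIZE = 16  # tokens per KV cache block

def cow_fork_waste(num_paths: int, tokens_per_path: int) -> tuple[int, int]:
    allocated = num_paths * BLOCK_SIZE     # 3 paths * 16 slots = 48
    draft = num_paths * tokens_per_path    # 3 paths * 3 tokens = 9
    return allocated, draft

allocated, draft = cow_fork_waste(3, 3)
print(f"{draft}/{allocated} slots hold new tokens ({draft / allocated:.1%})")
# -> 9/48 slots hold new tokens (18.8%)
```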

3. The Engineering Solution: Volatile Draft Buffer

Because naive PagedAttention integration causes page table thrashing and internal fragmentation, production systems like vLLM and TensorRT-LLM do not let draft tokens enter the global PagedAttention memory pool.

Instead, they employ an isolation mechanism:

  • Volatile Draft Buffer: Each sequence gets a small, contiguous temporary buffer (a simple array, not managed by the block allocator). Draft tokens are written directly here — no block allocation, no CoW, no fragmentation.
  • In-place Overwrite: The speculation tree is written into this buffer each step. Rejected tokens are simply overwritten by the next round of drafts — no Free syscall needed.
  • Commit on Verification: Only after the verifier confirms tokens as correct are they “promoted” into the global PagedAttention KV cache as committed tokens, in a single batch write.

This is the real-world engineering answer: use a short-lived, lock-free ring buffer to absorb the high-frequency allocation/deallocation storm, and only touch the global page table for verified, permanent tokens.
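Here is a minimal sketch of that pattern. All names are my own invention and the real vLLM implementation is far more involved, but it shows the key property: drafts never touch the block allocator, and only verified tokens reach the global cache.

```python
# A hedged sketch of a volatile draft buffer (hypothetical names, not
# vLLM's code): drafts live in a plain per-sequence array, rejected drafts
# are overwritten in place, and commits batch-write into the paged cache.

class DraftBuffer:
    def __init__(self, capacity: int = 64):
        self.slots = [None] * capacity   # contiguous scratch, no block allocator

    def write_drafts(self, draft_kv: list) -> None:
        """In-place overwrite: rejected drafts from the last step need no Free."""
        self.slots[: len(draft_kv)] = draft_kv

    def commit(self, num_accepted: int, paged_cache: list) -> None:
        """Promote only verified tokens to the global cache in one batch write."""
        paged_cache.extend(self.slots[:num_accepted])

paged_kv = []                     # stands in for the PagedAttention pool
buf = DraftBuffer()
buf.write_drafts(["kv_t1", "kv_t2", "kv_t3", "kv_t4", "kv_t5"])  # 5 drafts
buf.commit(2, paged_kv)           # verifier accepted only the first 2
print(paged_kv)                   # ['kv_t1', 'kv_t2']; the rest just get overwritten
```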

ConFu: Can the Draft Model See a Little Further?

ConFu is interesting because it attacks speculative decoding from the draft quality side rather than the memory-management side. EAGLE already improves the draft model by using target-model features, but the draft still mostly conditions on the current prefix. If the draft model only sees “where we are now”, its hidden states can drift from the target model’s next-token distribution as the drafted chain gets longer.

The core ConFu idea is to add a small future-oriented signal:

  • Contemplate token: a synthetic “pause” token inserted into the target-model context so the target model produces a hidden state that represents where generation may go next.
  • Soft prompt: a learned prompt-like signal used to elicit this future-oriented hidden state from the target model.
  • Future token: the resulting hidden representation is passed to the draft model as an extra conditioning token, so the draft model predicts not only from the prefix but also from a target-derived hint about the future trajectory.

My current interpretation: this is basically improving the draft model’s approximation to the target distribution by adding a new hidden-state control channel. The token itself is not a normal vocabulary token that will be emitted to the user. It is closer to a learned internal handle: we can “invent” a token-like object, use it to trigger useful computation, and then route its hidden state into the draft model.
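To pin down that interpretation, here is a hedged PyTorch sketch. Every name here (`FutureHint`, `contemplate`, the assumed `target_model` interface returning [B, T, H] hidden states) is mine, not the paper's; it only illustrates the routing of a learned token's hidden state into the draft model's conditioning:

```python
import torch
import torch.nn as nn

# My reading of the ConFu idea, as a sketch: append a learned "contemplate"
# embedding to the target-model context, take the hidden state it produces,
# and hand that to the draft model as an extra conditioning token.

class FutureHint(nn.Module):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.contemplate = nn.Parameter(torch.randn(1, 1, hidden))  # soft "pause" token
        self.project = nn.Linear(hidden, hidden)                    # adapt it for the draft

    def forward(self, target_model, prefix_embeds: torch.Tensor) -> torch.Tensor:
        batch = prefix_embeds.size(0)
        # Run the target model over [prefix ; contemplate] and keep the hidden
        # state at the contemplate position as a future-oriented summary.
        x = torch.cat([prefix_embeds, self.contemplate.expand(batch, -1, -1)], dim=1)
        hidden_states = target_model(x)   # assumed to return [B, T, H]
        future = hidden_states[:, -1]     # state at the contemplate slot
        return self.project(future)       # the "future token" fed to the draft model

# Usage with a stand-in target model (identity, just to show the shapes):
hint = FutureHint(hidden=8)
prefix = torch.randn(2, 5, 8)
future_token = hint(lambda x: x, prefix)
print(future_token.shape)                 # torch.Size([2, 8])
```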

This still feels like a neural-network-design paper more than a pure systems trick. You need to decide how the contemplate token is represented, how the soft prompt is trained, and how the draft model consumes the future token. But it is a fun direction: speculative decoding is no longer only about making a smaller model guess faster; it is also about designing better intermediate hidden states so the guess lands closer to the target distribution.

One part I still want to revisit later is the soft-prompt mechanism. Intuitively, prepending a prompt should bias the target model into producing a useful “future thought” hidden state, but I do not yet have a clean mental model for why this prompt construction is the best interface.

Why vLLM Adopts EAGLE but Not DMC

The key difference from approaches like Dynamic Memory Compression (DMC) is decoupling. EAGLE does not modify the base model weights at all — it trains a lightweight external draft head as a plug-in. If you don’t want speculative decoding, you simply remove the EAGLE head and the original model remains a standard, unmodified checkpoint.

In contrast, DMC requires retrofitting the base model itself, continuing training on roughly 2% of the pre-training data and permanently altering its weights. That makes it impractical unless every model provider commits to the training cost.

With EAGLE, the training cost is absorbed by the open-source community: labs with compute (e.g., Tsinghua, Berkeley, LMSYS) pre-train EAGLE heads for popular models (Llama-3, Qwen, Mistral, etc.) and publish them on HuggingFace. End users simply download the plug-in weights and enjoy ~3x decoding speedup — no training required.
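For reference, enabling EAGLE in vLLM looks roughly like the sketch below. The speculative-decoding keyword arguments have changed across vLLM releases, so treat the exact names as assumptions and check the docs linked above; the EAGLE head repo shown is one published by the EAGLE authors on HuggingFace.

```python
from vllm import LLM, SamplingParams

# Hedged usage sketch: the speculative_config dict reflects recent vLLM
# versions; older releases used flat kwargs instead. Verify against the
# vLLM Speculative Decoding Documentation before relying on this.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # community-trained plug-in head
        "num_speculative_tokens": 5,
    },
)

out = llm.generate(["Explain speculative decoding."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```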

Jiangneng Li
PhD at NTU researching database systems, Data+AI, and multimedia data analytics.