<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML Systems | Jiangneng's Homepage</title><link>https://www.jiangnengli.com/tag/ml-systems/</link><atom:link href="https://www.jiangnengli.com/tag/ml-systems/index.xml" rel="self" type="application/rss+xml"/><description>ML Systems</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.jiangnengli.com/media/icon_hu_37c904991c0d686.png</url><title>ML Systems</title><link>https://www.jiangnengli.com/tag/ml-systems/</link></image><item><title>Paper Notes: SGLang and Safe Eviction</title><link>https://www.jiangnengli.com/post/llm-infra-sglang-eviction/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-sglang-eviction/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;SGLang&amp;rsquo;s key systems idea is &lt;strong&gt;RadixAttention&lt;/strong&gt;: instead of discarding KV cache after every request, it retains reusable prefixes in a radix tree and manages them as a cache. The paper explicitly combines three ideas here: &lt;strong&gt;tree-structured KV reuse&lt;/strong&gt;, &lt;strong&gt;leaf-first LRU eviction&lt;/strong&gt;, and &lt;strong&gt;reference-count-based protection for in-flight requests&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The most intuitive way to understand why this matters is to imagine a &lt;strong&gt;tree-search workload&lt;/strong&gt; such as Tree-of-Thought or an MCTS-style reasoning loop. The paper itself evaluates Tree-of-Thought and discusses dynamic tree structures; the MCTS framing below is my systems interpretation of what goes wrong if eviction is implemented naively.&lt;/p&gt;
&lt;h2 id="why-a-naive-lru-can-end-up-evicting-the-root"&gt;Why a Naive LRU Can End Up Evicting the Root&lt;/h2&gt;
&lt;p&gt;The problem is &lt;strong&gt;not&lt;/strong&gt; that SGLang&amp;rsquo;s own RadixAttention blindly evicts the root. The problem is that a naive recency-only cache, if it ignores the tree structure, can absolutely do that.&lt;/p&gt;
&lt;h3 id="1-temporal-blind-spot"&gt;1. Temporal Blind Spot&lt;/h3&gt;
&lt;p&gt;In a tree search, execution often spends a long stretch of time deep inside one branch. While the runtime is busy expanding, scoring, and decoding around a leaf, the root prefix may not be &amp;ldquo;touched&amp;rdquo; for several seconds.&lt;/p&gt;
&lt;p&gt;To a naive LRU queue, that root prefix now looks cold.&lt;/p&gt;
&lt;h3 id="2-physical-illusion"&gt;2. Physical Illusion&lt;/h3&gt;
&lt;p&gt;The root is usually also the &lt;strong&gt;largest shared prefix&lt;/strong&gt; in the system: system prompt, tool instructions, long context, prior reasoning trace, and so on. In practice, it can easily span thousands of tokens.&lt;/p&gt;
&lt;p&gt;So a naive LRU sees something that is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;large in memory,&lt;/li&gt;
&lt;li&gt;old in recency,&lt;/li&gt;
&lt;li&gt;and apparently inactive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is exactly the profile of what a traditional cache would want to evict first.&lt;/p&gt;
&lt;h3 id="3-catastrophic-recomputation"&gt;3. Catastrophic Recomputation&lt;/h3&gt;
&lt;p&gt;But in a tree workload, evicting the root is not like evicting a normal cold page from an OS cache. It is closer to deleting the load-bearing structure of the whole search tree.&lt;/p&gt;
&lt;p&gt;Suppose the runtime has been exploring a deep branch for a while and later needs to backtrack to an earlier decision point to open a sibling branch. If the shared prefix near the root has already been evicted, the model must recompute the entire path from scratch before it can continue. What looked like a &amp;ldquo;good&amp;rdquo; local eviction decision becomes a globally disastrous one.&lt;/p&gt;
&lt;p&gt;This is why naive LRU is too myopic for tree-shaped KV reuse.&lt;/p&gt;
&lt;h2 id="what-sglang-actually-does-differently"&gt;What SGLang Actually Does Differently&lt;/h2&gt;
&lt;h3 id="1-it-evicts-leaves-first"&gt;1. It Evicts Leaves First&lt;/h3&gt;
&lt;p&gt;The paper does not describe an arbitrary node-level LRU. It states that RadixAttention uses an eviction policy that removes the &lt;strong&gt;least recently used leaf first&lt;/strong&gt;. This is the crucial structural fix.&lt;/p&gt;
&lt;p&gt;If you evict leaves first, shared ancestors stay alive as long as they still support some subtree. Only after descendants disappear and an ancestor itself becomes a leaf does it become a candidate for eviction.&lt;/p&gt;
&lt;p&gt;That is exactly the behavior we want in tree search: prune the fringe before touching the trunk.&lt;/p&gt;
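&lt;p&gt;To make this concrete, here is a minimal leaf-first LRU sketch in Python. It is my own simplification, not SGLang&amp;rsquo;s code: the &lt;code&gt;Node&lt;/code&gt; class, the &lt;code&gt;(last_access, id, node)&lt;/code&gt; heap entries, and the lazy skipping of stale entries are all assumptions of this sketch.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import heapq
import time

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = set()
        self.last_access = time.monotonic()
        if parent is not None:
            parent.children.add(self)

def evict_leaf_first(root, leaves, num_to_evict):
    """Evict up to num_to_evict nodes, least recently used leaf first.

    leaves is a min-heap of (last_access, id(node), node) entries;
    entries that are no longer leaves are skipped lazily.
    """
    evicted = 0
    while leaves and evicted &amp;lt; num_to_evict:
        _, _, node = heapq.heappop(leaves)
        if node.children or node is root:
            continue                         # stale entry, or the root itself
        node.parent.children.discard(node)   # free this node's KV blocks here
        evicted += 1
        if not node.parent.children:
            # The parent just became a leaf, so it is now a candidate too,
            # keyed by its own last-access time.
            heapq.heappush(leaves,
                           (node.parent.last_access, id(node.parent), node.parent))
    return evicted

root = Node()
a, b = Node(root), Node(root)
leaves = [(n.last_access, id(n), n) for n in (a, b)]
heapq.heapify(leaves)
evict_leaf_first(root, leaves, num_to_evict=1)   # evicts the older leaf, never root
&lt;/code&gt;&lt;/pre&gt;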
&lt;h3 id="2-it-protects-in-flight-prefixes-with-reference-counts"&gt;2. It Protects In-Flight Prefixes with Reference Counts&lt;/h3&gt;
&lt;p&gt;The paper also states that in continuous batching, nodes used by the currently running batch cannot be evicted, so &lt;strong&gt;each node maintains a reference counter&lt;/strong&gt; and is evictable only when that counter is zero.&lt;/p&gt;
&lt;p&gt;Conceptually, this behaves like &lt;strong&gt;pinning&lt;/strong&gt;, even though the implementation is not a standalone &lt;code&gt;pinned=true&lt;/code&gt; flag.&lt;/p&gt;
&lt;p&gt;In the current SGLang codebase, this protection appears as &lt;code&gt;lock_ref&lt;/code&gt; in the radix cache implementation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;inc_lock_ref(...)&lt;/code&gt; walks from a node toward the root and protects the whole ancestor chain,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dec_lock_ref(...)&lt;/code&gt; releases that protection when the request finishes or moves on,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_update_leaf_status(...)&lt;/code&gt; removes locked nodes from the evictable leaf set,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evict(...)&lt;/code&gt; only evicts from evictable leaves and will not continue upward if the parent still has a positive lock reference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the runtime behavior is: if a request is actively decoding at a deep node, its parent, grandparent, and so on are all protected transitively.&lt;/p&gt;
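&lt;p&gt;A hedged sketch of that transitive protection, in the spirit of the functions above (assuming a node object with &lt;code&gt;parent&lt;/code&gt;, &lt;code&gt;children&lt;/code&gt;, and &lt;code&gt;lock_ref&lt;/code&gt; attributes; the real SGLang code differs in detail):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def inc_lock_ref(node):
    """Pin a node and its whole ancestor chain while a request uses it."""
    while node is not None:
        node.lock_ref += 1
        node = node.parent

def dec_lock_ref(node):
    """Release the pin when the request finishes or moves on."""
    while node is not None:
        node.lock_ref -= 1
        node = node.parent

def is_evictable(node):
    # A node may be evicted only if it is a leaf AND no in-flight request
    # holds a lock anywhere below it (any such request would have
    # incremented this node's counter on the way up).
    return not node.children and node.lock_ref == 0
&lt;/code&gt;&lt;/pre&gt;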
&lt;h3 id="3-it-schedules-requests-to-preserve-prefix-locality"&gt;3. It Schedules Requests to Preserve Prefix Locality&lt;/h3&gt;
&lt;p&gt;SGLang does not only fix the eviction rule. It also fixes the &lt;strong&gt;execution order&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The paper proposes a cache-aware scheduler based on &lt;strong&gt;longest-shared-prefix-first&lt;/strong&gt;, and proves that in the offline case this corresponds to a DFS order over the radix tree. Intuitively, once the runtime enters a subtree, it tries to stay there long enough to exploit the hot shared prefix instead of bouncing randomly across unrelated branches.&lt;/p&gt;
&lt;p&gt;This matters because even a good eviction policy can still suffer if the scheduler keeps forcing the cache to thrash between disjoint prefixes.&lt;/p&gt;
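&lt;p&gt;A minimal sketch of the longest-shared-prefix-first rule (the &lt;code&gt;cached_len&lt;/code&gt; callback is a hypothetical stand-in for a radix-tree prefix match, not an SGLang API):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def longest_prefix_first(waiting, cached_len):
    """Order the waiting queue so the request sharing the longest cached
    prefix runs first; cached_len(req) is assumed to return how many of
    the request's prompt tokens are already in the radix tree."""
    return sorted(waiting, key=cached_len, reverse=True)

reqs = [{"id": 1, "hit": 0}, {"id": 2, "hit": 512}, {"id": 3, "hit": 64}]
order = longest_prefix_first(reqs, cached_len=lambda r: r["hit"])
print([r["id"] for r in order])   # [2, 3, 1]: stay inside the hot subtree
&lt;/code&gt;&lt;/pre&gt;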
&lt;h2 id="the-engineering-intuition"&gt;The Engineering Intuition&lt;/h2&gt;
&lt;p&gt;Putting the pieces together:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A naive flat LRU may see the root as &amp;ldquo;old and big&amp;rdquo; and evict it.&lt;/li&gt;
&lt;li&gt;RadixAttention changes the object being managed from isolated pages to &lt;strong&gt;tree-structured prefixes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Leaf-first eviction preserves shared ancestors.&lt;/li&gt;
&lt;li&gt;Reference counts protect every prefix that is still needed by a live request.&lt;/li&gt;
&lt;li&gt;Prefix-aware scheduling keeps the active subtree hot.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the real lesson is not just &amp;ldquo;use LRU carefully.&amp;rdquo; It is that &lt;strong&gt;eviction must respect the geometry of the workload&lt;/strong&gt;. Once requests share prefixes in a branching tree, cache management has to become tree-aware as well.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;SGLang avoids the classic eviction disaster not by abandoning LRU entirely, but by embedding it inside the right structure. A naive LRU can indeed be &amp;ldquo;dumb enough&amp;rdquo; to evict the root of a reasoning tree. RadixAttention prevents that by combining &lt;strong&gt;radix-tree structure&lt;/strong&gt;, &lt;strong&gt;leaf-first eviction&lt;/strong&gt;, &lt;strong&gt;reference-protected ancestors&lt;/strong&gt;, and &lt;strong&gt;cache-aware scheduling&lt;/strong&gt;.&lt;/p&gt;</description></item><item><title>System Notes: Queueing Avalanche</title><link>https://www.jiangnengli.com/post/llm-infra-queueing-avalanche/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-queueing-avalanche/</guid><description>&lt;p&gt;This note explains a failure mode I like to call &lt;strong&gt;queueing avalanche&lt;/strong&gt;. It is not the cost of a single eviction that kills the system. It is the way a moderate control-plane stall gets amplified by a queue, then amplified again by concurrency, until the whole serving stack starts to look &amp;ldquo;frozen.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;You can think of it as the dynamic version of the safe-eviction story in the SGLang safe-eviction note: that post focuses on &lt;strong&gt;what&lt;/strong&gt; should be evicted, while this one focuses on &lt;strong&gt;how long the eviction machinery itself takes&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="1-the-scaling-trap"&gt;1. The Scaling Trap&lt;/h2&gt;
&lt;p&gt;The first trap is that microbenchmarks lie.&lt;/p&gt;
&lt;p&gt;Suppose we measure &lt;code&gt;std::make_heap&lt;/code&gt; on a tiny tree and see something like &lt;strong&gt;12,000 ns (12 us)&lt;/strong&gt;. That sounds negligible. The problem is that the number was measured on a structure with only a few thousand nodes.&lt;/p&gt;
&lt;p&gt;A real MCTS-style search can easily grow to &lt;strong&gt;100,000 nodes&lt;/strong&gt; or more. Once the heap maintenance work is proportional to that larger live set, the cost is no longer in microseconds. It can jump into the &lt;strong&gt;millisecond&lt;/strong&gt; range.&lt;/p&gt;
&lt;p&gt;That is the dangerous transition:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;12 us&lt;/code&gt; feels like metadata overhead,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;3-10 ms&lt;/code&gt; becomes a scheduling event,&lt;/li&gt;
&lt;li&gt;and repeated &lt;code&gt;3-10 ms&lt;/code&gt; stalls become a system-level failure mode.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the core mistake is extrapolating from a small-tree benchmark to a large-tree production workload.&lt;/p&gt;
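&lt;p&gt;Back-of-the-envelope arithmetic makes the trap visible. The base tree size below is an assumption; since &lt;code&gt;std::make_heap&lt;/code&gt; is $O(n)$, scaling the measured cost linearly is a lower bound that ignores cache effects:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;base_nodes = 2_000          # "a few thousand nodes" (assumed)
base_cost_ns = 12_000       # measured make_heap cost at that size

for nodes in (2_000, 100_000, 1_000_000):
    scaled_ms = base_cost_ns * (nodes / base_nodes) / 1e6   # O(n) scaling only
    print(f"{nodes:9} nodes: ~{scaled_ms:.2f} ms per rebuild")
# 2,000: 0.01 ms    100,000: 0.60 ms    1,000,000: 6.00 ms
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Linear scaling alone moves the cost by more than two orders of magnitude, and once the heap spills out of cache the constant factor grows as well. That is how a 12 us benchmark number turns into multi-millisecond stalls.&lt;/p&gt;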
&lt;h2 id="2-the-event-loop-choke-point"&gt;2. The Event-Loop Choke Point&lt;/h2&gt;
&lt;p&gt;In SGLang-like serving stacks, the control plane is orchestrated by a Python &lt;code&gt;asyncio&lt;/code&gt; event loop. Conceptually, that loop is the traffic controller for request admission, scheduling decisions, and GPU handoff.&lt;/p&gt;
&lt;p&gt;If a large eviction operation forces the control path to spend several milliseconds rebuilding or reordering a heap, that work can block the loop at exactly the wrong time.&lt;/p&gt;
&lt;p&gt;During that blocked window:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;new API requests cannot be admitted promptly,&lt;/li&gt;
&lt;li&gt;TCP accept and request handling begin to backlog,&lt;/li&gt;
&lt;li&gt;completed GPU batches come back to the CPU asking for the next step,&lt;/li&gt;
&lt;li&gt;but the CPU-side scheduler is still busy maintaining the priority structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the absurd outcome: the GPUs are not slow, and memory bandwidth is not the bottleneck. The expensive accelerators go idle because the control plane cannot answer the question, &amp;ldquo;What should I run next?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That is why a few milliseconds on the CPU can translate into &lt;strong&gt;0% effective GPU utilization&lt;/strong&gt; for that window.&lt;/p&gt;
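&lt;p&gt;A toy &lt;code&gt;asyncio&lt;/code&gt; demonstration of the choke point (illustrative only; the task structure and timings are invented, not SGLang code). A synchronous heap rebuild inside the scheduler freezes every other coroutine on the loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import asyncio
import time

async def admit_requests():
    # Stand-in for the admission path: should wake up every millisecond.
    while True:
        t = time.monotonic()
        await asyncio.sleep(0.001)
        gap_ms = (time.monotonic() - t) * 1000
        if gap_ms &amp;gt; 3:
            print(f"admission delayed by {gap_ms:.1f} ms")

def rebuild_heap_sync():
    # Synchronous CPU work standing in for a large heap rebuild: for
    # these ~5 ms, nothing else on the event loop can run at all.
    t0 = time.monotonic()
    while time.monotonic() - t0 &amp;lt; 0.005:
        pass

async def scheduler():
    for _ in range(50):
        rebuild_heap_sync()          # blocks admission and GPU handoff
        await asyncio.sleep(0.001)

async def main():
    admit = asyncio.create_task(admit_requests())
    await scheduler()
    admit.cancel()

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The mitigations follow directly from the demo: make the maintenance incremental so each loop iteration stays short, or push it off the loop entirely, e.g. via &lt;code&gt;loop.run_in_executor&lt;/code&gt;.&lt;/p&gt;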
&lt;h2 id="3-why-concurrency-turns-it-into-an-avalanche"&gt;3. Why Concurrency Turns It into an Avalanche&lt;/h2&gt;
&lt;p&gt;A single 5 ms stall is unpleasant but survivable. The real disaster appears when many clients arrive concurrently.&lt;/p&gt;
&lt;p&gt;Imagine &lt;strong&gt;256 concurrent MCTS clients&lt;/strong&gt;. While the event loop is blocked:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;existing requests cannot make progress,&lt;/li&gt;
&lt;li&gt;new requests keep arriving,&lt;/li&gt;
&lt;li&gt;completed work cannot be dispatched into the next batch,&lt;/li&gt;
&lt;li&gt;and queue length starts increasing faster than the system can drain it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now the next request does not just pay the original 5 ms stall. It pays:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the original stall,&lt;/li&gt;
&lt;li&gt;plus the waiting time of all work already queued ahead of it,&lt;/li&gt;
&lt;li&gt;plus any additional stalls triggered by the larger backlog.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the queueing-theory version of positive feedback: delay creates backlog, backlog increases service delay, and the longer delay creates even more backlog.&lt;/p&gt;
&lt;p&gt;That is why the tail latency explosion is so violent. The P99 does not grow linearly. Once the queue starts feeding on itself, it can blow up by &lt;strong&gt;orders of magnitude&lt;/strong&gt;.&lt;/p&gt;
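&lt;p&gt;Even the simplest queueing model shows the nonlinearity. In an M/M/1 queue the mean time in system is $W = 1/(\mu - \lambda)$, so as stalls erode effective capacity $\mu$ toward the arrival rate $\lambda$, latency diverges rather than degrading linearly. The numbers below are purely illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;mu = 200.0        # nominal service capacity, requests/s (illustrative)
lam = 100.0       # offered load, requests/s

for lost in (0.00, 0.20, 0.40, 0.45, 0.49):
    eff_mu = mu * (1 - lost)          # control-plane stalls eat capacity
    w_ms = 1000.0 / (eff_mu - lam)    # M/M/1 mean time in system
    print(f"capacity lost {lost:4.0%}: mean latency ~{w_ms:6.1f} ms")
# 0%: 10 ms ... 40%: 50 ms ... 49%: 500 ms. Latency diverges near saturation.
&lt;/code&gt;&lt;/pre&gt;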
&lt;h2 id="4-the-intuition-in-one-sentence"&gt;4. The Intuition in One Sentence&lt;/h2&gt;
&lt;p&gt;The system does not fail because heap maintenance is individually expensive. It fails because &lt;strong&gt;a synchronous control-plane pause starves batch scheduling, which idles the GPU, which slows queue draining, which amplifies the next pause&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That feedback loop is the avalanche.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;The phrase &lt;strong&gt;queueing avalanche&lt;/strong&gt; describes a control-plane collapse where a seemingly modest per-eviction cost becomes catastrophic under scale and concurrency. Small-tree benchmarks hide the problem, but once heap maintenance reaches the millisecond range, a single-threaded scheduler can become the bottleneck for the entire cluster. At that point, the dominant cost is no longer the heap operation itself. It is the cascading queueing delay that follows.&lt;/p&gt;</description></item><item><title>Paper Notes: Dynamic Memory Compression</title><link>https://www.jiangnengli.com/post/llm-infra-kv-cache-reduction/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-kv-cache-reduction/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;Dynamic Memory Compression (DMC) optimizes LLM inference by allowing the model to autonomously decide when to merge redundant token representations in the KV cache based on learned contextual importance. To achieve this, the algorithm requires &amp;ldquo;retrofitting&amp;rdquo;—fine-tuning the pre-trained LLM on a fraction of its original data to teach the attention mechanism this dynamic pooling behavior. Consequently, while it drastically reduces memory footprint, it fundamentally alters the original model weights, making it incompatible with &amp;ldquo;training-free,&amp;rdquo; plug-and-play inference engines.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;The core limitation of DMC is its dependency on &lt;strong&gt;2% of the original pre-training data&lt;/strong&gt; to train the merging mechanism. This is a non-trivial cost — unless every model provider commits to this retrofitting step, adoption remains impractical at scale. The training overhead makes it fundamentally different from training-free approaches like quantization or eviction-based methods.&lt;/p&gt;
&lt;p&gt;This concern has been echoed in practice. A similar merging request appeared in the vLLM project, where it was pointed out that the required training data makes it too expensive for general-purpose deployment.&lt;/p&gt;</description></item><item><title>Paper Notes: Speculative Decoding</title><link>https://www.jiangnengli.com/post/llm-infra-speculative-decoding/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-speculative-decoding/</guid><description>&lt;p&gt;&lt;strong&gt;Papers:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Industry Adoption:&lt;/strong&gt; vLLM employs EAGLE as one of its most important speculative decoding features.&lt;/p&gt;
&lt;h2 id="why-speculative-decoding-breaks-pagedattention-and-how-vllm-fixes-it"&gt;Why Speculative Decoding Breaks PagedAttention (and How vLLM Fixes It)&lt;/h2&gt;
&lt;h3 id="1-allocation-frequency-steady-pruning-vs-burst-and-kill"&gt;1. Allocation Frequency: Steady Pruning vs. Burst-and-Kill&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Beam Search&lt;/strong&gt; allocates and frees KV cache blocks at the pace of actual generation. With beam width = 3, each forward pass produces 3 new tokens. When a branch is pruned, the system triggers a single deallocation — the churn rate is synchronized with real generation speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speculative Decoding (Tree Attention)&lt;/strong&gt; operates in explosive bursts. The draft model generates an entire speculation tree (e.g., 15 candidate tokens) in a single, ultra-fast forward pass (a few milliseconds). The verifier then instantly rejects most of them — say 12 out of 15.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The critical gap:&lt;/strong&gt; Beam search prunes gradually. Speculative decoding forces the page table to insert 15 pointers, then execute 12 &lt;code&gt;Free&lt;/code&gt; operations just ~10ms later — every single step. This high-frequency &amp;ldquo;instant garbage collection&amp;rdquo; creates severe &lt;strong&gt;lock contention&lt;/strong&gt; on the CPU-side scheduler, and the management overhead can eat into the speedup gained from speculation.&lt;/p&gt;
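&lt;p&gt;Rough churn arithmetic, using the illustrative numbers above (the batch size is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;tokens_per_tree = 15        # draft tokens inserted per step
rejected = 12               # freed ~10 ms later
step_ms = 10
concurrent_seqs = 256       # assumed batch of clients

ops_per_step = tokens_per_tree + rejected        # inserts plus frees
ops_per_sec = ops_per_step * (1000 / step_ms) * concurrent_seqs
print(f"{ops_per_sec:,.0f} page-table mutations per second")   # 691,200
&lt;/code&gt;&lt;/pre&gt;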
&lt;h3 id="2-copy-on-write-fragmentation-at-micro-scale"&gt;2. Copy-on-Write Fragmentation at Micro Scale&lt;/h3&gt;
&lt;p&gt;CoW is elegant for beam search branching, but becomes a nightmare at speculative decoding&amp;rsquo;s micro-granularity.&lt;/p&gt;
&lt;p&gt;Consider a block of size 16 tokens with 3 empty slots remaining. Beam search fills one token at a time. But when the draft model forks into 3 parallel paths (A, B, C) each producing 3 tokens, CoW forces the system to immediately copy 3 independent physical blocks:&lt;/p&gt;
$$3 \times 16 = 48 \text{ token slots allocated, but only } 3 \times 3 = 9 \text{ draft tokens stored}$$&lt;p&gt;This is extreme &lt;strong&gt;internal fragmentation&lt;/strong&gt;. After verification kills paths B and C, those under-filled blocks must be freed immediately — adding to the allocation churn.&lt;/p&gt;
&lt;h3 id="3-the-engineering-solution-volatile-draft-buffer"&gt;3. The Engineering Solution: Volatile Draft Buffer&lt;/h3&gt;
&lt;p&gt;Because naive PagedAttention integration causes page table thrashing and internal fragmentation, production systems like vLLM and TensorRT-LLM &lt;strong&gt;do not&lt;/strong&gt; let draft tokens enter the global PagedAttention memory pool.&lt;/p&gt;
&lt;p&gt;Instead, they employ an isolation mechanism:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volatile Draft Buffer:&lt;/strong&gt; Each sequence gets a small, contiguous temporary buffer (a simple array, not managed by the block allocator). Draft tokens are written directly here — no block allocation, no CoW, no fragmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-place Overwrite:&lt;/strong&gt; The speculation tree is written into this buffer each step. Rejected tokens are simply overwritten by the next round of drafts — no &lt;code&gt;Free&lt;/code&gt; syscall needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Commit on Verification:&lt;/strong&gt; Only after the verifier confirms tokens as correct are they &amp;ldquo;promoted&amp;rdquo; into the global PagedAttention KV cache as committed tokens, in a single batch write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the real-world engineering answer: use a short-lived, lock-free ring buffer to absorb the high-frequency allocation/deallocation storm, and only touch the global page table for verified, permanent tokens.&lt;/p&gt;
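&lt;p&gt;A minimal sketch of that commit-on-verify pattern (my abstraction of the idea; &lt;code&gt;promote&lt;/code&gt; stands in for the real page-table append and is not an actual vLLM API):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;class DraftBuffer:
    """Per-sequence scratch for speculative tokens: a plain contiguous
    buffer that lives outside the paged block allocator."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity    # stand-in for contiguous KV storage

    def write_drafts(self, draft_kvs):
        # In-place overwrite: last step's rejected drafts are simply
        # clobbered, so no Free operation is ever issued for them.
        for i, kv in enumerate(draft_kvs):
            self.slots[i] = kv

    def commit(self, num_accepted, promote):
        # One batched write into the global paged KV cache per step.
        promote(self.slots[:num_accepted])

buf = DraftBuffer()
buf.write_drafts([f"kv{i}" for i in range(15)])              # burst of 15 drafts
buf.commit(3, promote=lambda kvs: print("committed", kvs))   # only verified tokens
&lt;/code&gt;&lt;/pre&gt;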
&lt;h2 id="confu-can-the-draft-model-see-a-little-further"&gt;ConFu: Can the Draft Model See a Little Further?&lt;/h2&gt;
&lt;p&gt;ConFu is interesting because it attacks speculative decoding from the &lt;strong&gt;draft quality&lt;/strong&gt; side rather than the memory-management side. EAGLE already improves the draft model by using target-model features, but the draft still mostly conditions on the current prefix. If the draft model only sees &amp;ldquo;where we are now&amp;rdquo;, its hidden states can drift from the target model&amp;rsquo;s next-token distribution as the drafted chain gets longer.&lt;/p&gt;
&lt;p&gt;The core ConFu idea is to add a small future-oriented signal:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Contemplate token:&lt;/strong&gt; a synthetic &amp;ldquo;pause&amp;rdquo; token inserted into the target-model context so the target model produces a hidden state that represents where generation may go next.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Soft prompt:&lt;/strong&gt; a learned prompt-like signal used to elicit this future-oriented hidden state from the target model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future token:&lt;/strong&gt; the resulting hidden representation is passed to the draft model as an extra conditioning token, so the draft model predicts not only from the prefix but also from a target-derived hint about the future trajectory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My current interpretation: this is basically improving the draft model&amp;rsquo;s approximation to the target distribution by adding a new hidden-state control channel. The token itself is not a normal vocabulary token that will be emitted to the user. It is closer to a learned internal handle: we can &amp;ldquo;invent&amp;rdquo; a token-like object, use it to trigger useful computation, and then route its hidden state into the draft model.&lt;/p&gt;
&lt;p&gt;This still feels like a neural-network-design paper more than a pure systems trick. You need to decide how the contemplate token is represented, how the soft prompt is trained, and how the draft model consumes the future token. But it is a fun direction: speculative decoding is no longer only about making a smaller model guess faster; it is also about designing better intermediate hidden states so the guess lands closer to the target distribution.&lt;/p&gt;
&lt;p&gt;One part I still want to revisit later is the soft-prompt mechanism. Intuitively, prepending a prompt should bias the target model into producing a useful &amp;ldquo;future thought&amp;rdquo; hidden state, but I do not yet have a clean mental model for why this prompt construction is the best interface.&lt;/p&gt;
&lt;h2 id="why-vllm-adopts-eagle-but-not-dmc"&gt;Why vLLM Adopts EAGLE but Not DMC&lt;/h2&gt;
&lt;p&gt;The key difference from approaches like DMC is &lt;strong&gt;decoupling&lt;/strong&gt;. EAGLE does not modify the base model weights at all — it trains a lightweight external draft head as a plug-in. If you don&amp;rsquo;t want speculative decoding, you simply remove the EAGLE head and the original model remains a standard, unmodified checkpoint.&lt;/p&gt;
&lt;p&gt;In contrast, DMC requires retrofitting the base model itself with 2% of pre-training data, permanently altering its weights. This makes it impractical unless every model provider commits to the training cost.&lt;/p&gt;
&lt;p&gt;With EAGLE, the training cost is absorbed by the open-source community: labs with compute (e.g., Tsinghua, Berkeley, LMSYS) pre-train EAGLE heads for popular models (Llama-3, Qwen, Mistral, etc.) and publish them on HuggingFace. End users simply download the plug-in weights and enjoy ~3x decoding speedup — no training required.&lt;/p&gt;</description></item><item><title>Paper Notes: FlashAttention</title><link>https://www.jiangnengli.com/post/llm-infra-flashattention1/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-flashattention1/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="1-tiling-and-safe-online-softmax-the-forward-pass-math"&gt;1. Tiling and Safe Online Softmax (The Forward Pass Math)&lt;/h2&gt;
&lt;p&gt;The fundamental bottleneck of standard attention is the $\Theta(N^2)$ memory requirement to materialize the attention score matrix in High Bandwidth Memory (HBM). FlashAttention solves this via &lt;strong&gt;tiling&lt;/strong&gt; (computing block by block in SRAM) combined with the &lt;strong&gt;Safe Online Softmax&lt;/strong&gt; mathematical trick.&lt;/p&gt;
&lt;h3 id="the-overflow-problem--safe-softmax"&gt;The Overflow Problem &amp;amp; Safe Softmax&lt;/h3&gt;
&lt;p&gt;Standard softmax operations $e^{x_i} / \sum e^{x_j}$ will trigger numerical overflow (e.g., NaN in FP16) if $x_i$ is large. To prevent this, a local maximum $m(x) = \max_i x_i$ is subtracted from all elements:&lt;/p&gt;
$$\text{softmax}(x_i) = \frac{e^{x_i - m(x)}}{\sum_j e^{x_j - m(x)}}$$&lt;h3 id="the-time-travel-reweighting-trick-online-softmax"&gt;The &amp;ldquo;Time-Travel&amp;rdquo; Reweighting Trick (Online Softmax)&lt;/h3&gt;
&lt;p&gt;Because blocks are processed sequentially and earlier blocks are discarded from SRAM, we cannot retroactively subtract a newly discovered global maximum from old blocks. Instead, FlashAttention leverages the exponential property $e^{a-b} = e^a \cdot e^{-b}$ to dynamically &amp;ldquo;decay&amp;rdquo; historical running states.&lt;/p&gt;
&lt;p&gt;For each new block $j$, the GPU computes the local scores $S_j = Q K_j^T$, the local max $m_{local}$, and the local exponentiated values $\tilde{P}_{local} = \exp(S_j - m_{local})$. The running variables are updated entirely in SRAM:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update Global Max:&lt;/strong&gt;
&lt;/p&gt;
$$m_{new} = \max(m_{old}, m_{local})$$&lt;p&gt;&lt;strong&gt;Update Running Denominator ($l$) via Exponential Decay:&lt;/strong&gt;
&lt;/p&gt;
$$l_{new} = l_{old} \cdot \exp(m_{old} - m_{new}) + \text{rowsum}(\tilde{P}_{local})$$&lt;p&gt;&lt;strong&gt;Update Running Numerator/Output ($O$) via Weighted Sum:&lt;/strong&gt;
&lt;/p&gt;
$$O_{new} = O_{old} \cdot \exp(m_{old} - m_{new}) + \tilde{P}_{local} V_{local}$$&lt;p&gt;By applying the decay factor $\exp(m_{old} - m_{new})$ to the history, the algorithm mathematically aligns all previous calculations to the new maximum without ever reloading old $K$ and $V$ matrices.&lt;/p&gt;
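&lt;p&gt;These update rules are easy to sanity-check numerically. The sketch below (plain NumPy, no real SRAM tiling) streams over K/V blocks exactly as the equations describe and matches a full-matrix softmax reference:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import numpy as np

def online_attention(Q, K, V, block=16):
    """Maintain running (m, l, O) per query row while streaming KV blocks."""
    N = K.shape[0]
    m = np.full(Q.shape[0], -np.inf)           # running row max
    l = np.zeros(Q.shape[0])                   # running denominator
    O = np.zeros((Q.shape[0], V.shape[1]))     # running unnormalized output
    for j in range(0, N, block):
        S = Q @ K[j:j + block].T               # local scores
        m_new = np.maximum(m, S.max(axis=1))   # update global max
        P = np.exp(S - m_new[:, None])         # local exponentials
        decay = np.exp(m - m_new)              # realign history to the new max
        l = l * decay + P.sum(axis=1)
        O = O * decay[:, None] + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]                      # normalize once at the end

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(online_attention(Q, K, V), reference)
&lt;/code&gt;&lt;/pre&gt;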
&lt;h2 id="2-loop-order-optimization-flashattention-1-vs-flashattention-2"&gt;2. Loop Order Optimization: FlashAttention-1 vs. FlashAttention-2&lt;/h2&gt;
&lt;p&gt;The physical execution speed of GPU kernels is heavily bound by HBM write operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FlashAttention-1 (KV Outer Loop, Q Inner Loop):&lt;/strong&gt; FA1 iterates over $K, V$ blocks in the outer loop. For every inner loop step over $Q$, the intermediate, partially accumulated output block $O_i$ must be read from HBM, &amp;ldquo;un-normalized&amp;rdquo; by multiplying the old denominator, updated with the new block&amp;rsquo;s weighted sum, re-normalized, and written back to HBM. This causes a massive $O_i$ read/write overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FlashAttention-2 (Q Outer Loop, KV Inner Loop):&lt;/strong&gt; FA2 pins a $Q_i$ block in the outer loop and iterates through all $K_j, V_j$ blocks in the inner loop. The running variables $O_{run}$, $m_{run}$, and $l_{run}$ stay exclusively inside the SRAM registers. The intermediate $O_i$ is continuously accumulated using the decay formula and is written to HBM exactly once after the entire inner KV loop finishes. This simple loop swap eliminates the repetitive HBM writes, drastically dropping the constant factor in the $O(N^2 d^2 M^{-1})$ complexity.&lt;/p&gt;
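&lt;p&gt;The resulting HBM-write asymmetry can be counted directly (a schematic count that abstracts away the rescaling math shown earlier):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def output_writes(Tq, Tkv):
    """Count HBM writes of the output blocks under each loop order."""
    fa1 = 0
    for _j in range(Tkv):          # FA1: KV blocks outer, Q blocks inner
        for _i in range(Tq):
            fa1 += 1               # partial O_i written back every inner step
    fa2 = 0
    for _i in range(Tq):           # FA2: Q blocks outer, KV blocks inner
        for _j in range(Tkv):
            pass                   # O_i accumulates on-chip, no HBM traffic
        fa2 += 1                   # exactly one write after the KV loop
    return fa1, fa2

print(output_writes(Tq=64, Tkv=64))   # (4096, 64): writes drop by a factor of Tkv
&lt;/code&gt;&lt;/pre&gt;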
&lt;h2 id="3-the-backward-pass-and-gradient-recomputation"&gt;3. The Backward Pass and Gradient Recomputation&lt;/h2&gt;
&lt;p&gt;During model training, the backward pass requires the full $N \times N$ attention probability matrix $P$ to calculate gradients using the Chain Rule. Writing this massive matrix to HBM during the forward pass would negate all memory optimizations.&lt;/p&gt;
&lt;h3 id="checkpointing-global-statistics"&gt;Checkpointing Global Statistics&lt;/h3&gt;
&lt;p&gt;Instead of storing the $N \times N$ matrix, the forward pass only saves the final per-row softmax statistics to HBM: the row-wise maximum ($m^{global}$) and the row-wise denominator ($l^{global}$), one scalar of each per query row.&lt;/p&gt;
&lt;h3 id="on-the-fly-recomputation-and-matrix-calculus"&gt;On-the-Fly Recomputation and Matrix Calculus&lt;/h3&gt;
&lt;p&gt;During the backward pass, the GPU loads $Q_i$, $K_j$, $V_j$, and the upstream gradient $dO_i$ into SRAM. Because the true global maximum is already known, there is no need for dynamic reweighting. The exact local probability block $P_{ij}$ is reconstructed instantly:&lt;/p&gt;
$$P_{ij} = \frac{\exp(Q_i K_j^T - m^{global})}{l^{global}}$$&lt;p&gt;With $P_{ij}$ reconstructed locally, the gradients are computed using the multivariable chain rule, and the results are accumulated (+=):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gradient of V:&lt;/strong&gt;
&lt;/p&gt;
$$dV_j \mathrel{+}= P_{ij}^T \cdot dO_i$$&lt;p&gt;&lt;strong&gt;Gradient of Pre-Softmax Scores ($S$):&lt;/strong&gt;
&lt;/p&gt;
$$dS_{ij} = P_{ij} \circ (dO_i \cdot V_j^T - D_i)$$&lt;p&gt;(where $\circ$ is element-wise multiplication, and $D_i = \text{rowsum}(dO_i \circ O_i)$)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gradients of Q and K:&lt;/strong&gt;
&lt;/p&gt;
$$dQ_i \mathrel{+}= dS_{ij} \cdot K_j$$&lt;p&gt;
&lt;/p&gt;
$$dK_j \mathrel{+}= dS_{ij}^T \cdot Q_i$$&lt;p&gt;The strict accumulation logic (+=) represents the physical manifestation of the mathematical summation over all blocks. Once the local gradients are added to the accumulators in HBM, the massive $P_{ij}$ and $dS_{ij}$ blocks are immediately destroyed from SRAM, ensuring the memory footprint remains constant $\Theta(1)$ regardless of sequence length.&lt;/p&gt;</description></item><item><title>Paper Notes: vLLM PagedAttention</title><link>https://www.jiangnengli.com/post/llm-infra-pagedattention/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-pagedattention/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="executive-summary"&gt;Executive Summary&lt;/h2&gt;
&lt;p&gt;PagedAttention revolutionizes Large Language Model (LLM) inference by applying operating system virtual memory concepts to KV cache management. Instead of allocating contiguous GPU memory for a sequence&amp;rsquo;s maximum potential length (which causes massive external and internal fragmentation), PagedAttention divides the KV cache into fixed-size &lt;strong&gt;Physical Blocks&lt;/strong&gt; (e.g., storing 16 tokens each) and maps them dynamically via a &lt;strong&gt;Block Table&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="core-mechanisms--critical-insights"&gt;Core Mechanisms &amp;amp; Critical Insights&lt;/h2&gt;
&lt;h3 id="1-memory-management--copy-on-write-cow"&gt;1. Memory Management &amp;amp; Copy-on-Write (CoW)&lt;/h3&gt;
&lt;p&gt;To enable highly efficient memory sharing (e.g., multiple generated sequences sharing the same system prompt), PagedAttention implements a strict &lt;strong&gt;Reference Counting&lt;/strong&gt; (&lt;code&gt;ref_count&lt;/code&gt;) mechanism at the physical block level.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shared Pointers&lt;/strong&gt;: Multiple logical blocks from different sequences can map to the exact same physical block. When this happens, the physical block&amp;rsquo;s &lt;code&gt;ref_count&lt;/code&gt; is incremented.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Copy-on-Write (CoW) Execution&lt;/strong&gt;: When a sequence generates a new token and attempts to append it to the current physical block, the system first checks the &lt;code&gt;ref_count&lt;/code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Trigger Condition&lt;/strong&gt;: If &lt;code&gt;ref_count &amp;gt; 1&lt;/code&gt; (meaning the block is shared), the sequence is not allowed to write directly. Instead, a CoW is triggered: the system allocates a brand-new physical block, copies the existing historical tokens into it, decrements the original block&amp;rsquo;s &lt;code&gt;ref_count&lt;/code&gt;, updates its own Block Table, and finally writes the new token into the newly copied block (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
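&lt;p&gt;A toy sketch of this bookkeeping (the class and method names are mine, not vLLM&amp;rsquo;s, and allocation of a fresh block when the current one fills is omitted for brevity):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;class BlockAllocator:
    """Toy paged KV allocator with ref counting and copy-on-write appends."""

    def __init__(self, num_blocks, block_size=16):
        self.free_list = list(range(num_blocks))
        self.ref_count = [0] * num_blocks
        self.tokens = [[] for _ in range(num_blocks)]   # token slots per block
        self.block_size = block_size

    def allocate(self):
        b = self.free_list.pop()
        self.ref_count[b] = 1
        return b

    def fork(self, block_table):
        # A new sequence shares every physical block of its parent.
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)

    def append_token(self, block_table, token):
        b = block_table[-1]
        if self.ref_count[b] &amp;gt; 1:                    # shared: copy-on-write
            new_b = self.allocate()
            self.tokens[new_b] = list(self.tokens[b])   # copy partial history
            self.ref_count[b] -= 1
            block_table[-1] = new_b
            b = new_b
        self.tokens[b].append(token)                    # exclusively owned now

alloc = BlockAllocator(num_blocks=8)
parent = [alloc.allocate()]
child = alloc.fork(parent)          # ref_count becomes 2, zero bytes copied
alloc.append_token(child, "tok")    # first divergent write triggers the copy
&lt;/code&gt;&lt;/pre&gt;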
&lt;h3 id="2-hardware-bottleneck-dichotomy-prefill-vs-decoding"&gt;2. Hardware Bottleneck Dichotomy: Prefill vs. Decoding&lt;/h3&gt;
&lt;p&gt;The system must distinctly separate these two phases because they stress completely different physical hardware units on the GPU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prefill Phase (Strictly Compute-Bound)&lt;/strong&gt;: To process the initial prompt (e.g., 1,000 tokens), the model must compute the Q, K, and V for all tokens. Because of the multi-layer Transformer architecture, every token must perform an Attention calculation with all preceding tokens to generate its distinct output ($O$) before passing through the FFN to the next layer. This results in a massive $O(N^2)$ Dense Matrix-Matrix Multiplication (GEMM). The GPU&amp;rsquo;s memory bandwidth is sufficient, but the Tensor Cores (ALUs) hit their maximum capacity. (This is where FlashAttention steps in to optimize).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Decoding Phase (Strictly Memory-Bound)&lt;/strong&gt;: During autoregressive generation, the model predicts only one token at a time. The arithmetic operation is a tiny Matrix-Vector Multiplication (GEMV). However, to compute this single step, the GPU must fetch the entire historical KV cache from the global HBM into the SRAM. The ALUs sit idle waiting for data to arrive. Thus, decoding speed is strictly bottlenecked by GPU memory bandwidth.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
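&lt;p&gt;A back-of-the-envelope arithmetic-intensity sketch of the dichotomy (illustrative numbers: 7B parameters, FP16, batch size 1, weight traffic only; the KV cache adds further memory traffic):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;params = 7e9
weight_bytes = params * 2          # FP16 weights
flops_per_token = params * 2       # ~one multiply-add per parameter per token

# Decoding: each step produces ONE token but must stream all weights.
decode_intensity = flops_per_token / weight_bytes
print(f"decode:  ~{decode_intensity:.0f} FLOP/byte (memory-bound)")

# Prefill: the same weight traffic is amortized over the whole prompt.
prompt_tokens = 1000
prefill_intensity = flops_per_token * prompt_tokens / weight_bytes
print(f"prefill: ~{prefill_intensity:.0f} FLOP/byte (compute-bound)")
# Modern accelerators need on the order of 100+ FLOP/byte to keep their
# tensor cores busy, so decode at ~1 FLOP/byte is bandwidth-limited.
&lt;/code&gt;&lt;/pre&gt;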
&lt;h3 id="3-fine-grained-branching-in-beam-search"&gt;3. Fine-Grained Branching in Beam Search&lt;/h3&gt;
&lt;p&gt;While architectural diagrams often simplify Beam Search (or parallel decoding) by showing sequences branching perfectly at the boundary of a block, the engineering reality is much more granular.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token-Level Divergence&lt;/strong&gt;: Sequences branch at the exact token level, which almost always happens right in the middle of a physical block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CoW Resilience&lt;/strong&gt;: The Copy-on-Write mechanism seamlessly handles this. If a beam diverges at the 5th token of a 16-token block, the CoW mechanism will copy those 5 tokens into a new physical block, and the new sequence will continue appending its unique 6th token into the new block, leaving the original shared block perfectly intact for the other beams.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="system-impact"&gt;System Impact&lt;/h2&gt;
&lt;p&gt;By combining dynamic block allocation, precise reference counting, and Copy-on-Write, PagedAttention achieves near-zero memory waste (less than 4% internal fragmentation in the final block). This fundamentally shifts the LLM inference paradigm, allowing batch sizes to scale significantly higher and dramatically improving overall system throughput.&lt;/p&gt;</description></item></channel></rss>