System Notes: Queueing Avalanche
A simple explanation of how a small per-request eviction cost can compound into a control-plane stall, starving the GPU and blowing up P99 latency under concurrent tree-search workloads.
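To make the failure mode concrete, here is a toy single-threaded control-plane simulation; every rate and cost below is a made-up illustrative number, not a measurement from any real system. Once the inline eviction cost pushes effective service time past the arrival interval, the backlog feeds back into the eviction cost and latency runs away.

```python
# Hypothetical sketch of a queueing avalanche: eviction runs inline in the
# scheduler, and its cost grows with memory pressure (backlog as a proxy).
def simulate(n_requests=10_000, service_ms=5.0, evict_ms=0.5, arrival_ms=5.2):
    free_at = 0.0                   # when the control plane is next free
    latencies = []
    for i in range(n_requests):
        arrival = i * arrival_ms
        start = max(arrival, free_at)
        # Eviction cost scales with how far behind the scheduler already is:
        backlog = max(0.0, free_at - arrival) / service_ms
        evict = evict_ms * (1.0 + backlog)   # the avalanche: backlog feeds cost
        free_at = start + evict + service_ms
        latencies.append(free_at - arrival)
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]   # P99 latency in ms

print("P99 with inline eviction:", simulate())
print("P99 without eviction:    ", simulate(evict_ms=0.0))
```

With `evict_ms=0`, service (5.0 ms) stays under the arrival interval (5.2 ms) and P99 sits at 5 ms; with the "small" 0.5 ms eviction, per-request work exceeds arrivals, the queue grows without bound, and P99 explodes.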
Notes on why naive LRU eviction is dangerous for tree-structured reasoning workloads, and how SGLang avoids catastrophic prefix recomputation with leaf-first eviction and reference counting.
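A minimal sketch of the leaf-first idea; the `Node`/`evict_one` names are hypothetical, not SGLang's actual API. Only unreferenced leaves are eviction candidates, so a shared prefix is never freed while any descendant or any running request still pins it.

```python
# Leaf-first eviction with reference counting over a prefix tree
# (illustrative sketch in the spirit of SGLang's RadixAttention).
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}        # token id -> Node
        self.ref_count = 0        # running requests pinning this prefix
        self.last_access = 0.0    # timestamp for LRU ordering among leaves

def evictable_leaves(root):
    """Yield leaves with no live references. Internal nodes are never
    candidates, so shared prefixes survive until all descendants are gone."""
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children and node.ref_count == 0 and node.parent is not None:
            yield node
        stack.extend(node.children.values())

def evict_one(root):
    """Evict the least-recently-used *safe* leaf, or report nothing is safe."""
    victims = sorted(evictable_leaves(root), key=lambda n: n.last_access)
    if not victims:
        return False              # everything pinned: nothing safe to free
    leaf = victims[0]
    key = next(t for t, c in leaf.parent.children.items() if c is leaf)
    del leaf.parent.children[key]
    return True
```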
Notes on speculative decoding methods for accelerating LLM inference, including Medusa, EAGLE, and ConFu.
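The skeleton shared by these methods is a draft-and-verify loop. The sketch below uses a greedy acceptance rule as a simplification of the rejection-sampling scheme the papers actually use; `draft_model` and `target_model` are placeholder callables with an assumed interface, not any library's API.

```python
# Generic speculative decoding step: draft k tokens cheaply, verify them
# all with a single target-model forward pass.
def speculative_step(target_model, draft_model, prefix, k=4):
    # 1) Draft k tokens autoregressively with the small model.
    ctx, draft = list(prefix), []
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2) One target pass over prefix + draft yields the target's own
    #    next-token prediction at each of the k + 1 positions (assumed API).
    target_preds = target_model(list(prefix) + draft)   # length k + 1
    accepted = []
    for i, tok in enumerate(draft):
        if target_preds[i] != tok:
            break                 # first disagreement: discard the tail
        accepted.append(tok)
    # 3) Even on total rejection we still gain one token from the target.
    accepted.append(target_preds[len(accepted)])
    return accepted
```

The payoff: up to k + 1 tokens per expensive target pass when the draft agrees, and never fewer than one, so quality is unchanged while latency drops.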
DMC retrofits LLMs to autonomously merge redundant KV cache entries based on learned contextual importance, trading fine-tuning on roughly 2% of the original pre-training data for significant memory savings.
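As a rough illustration of merge-instead-of-append (the merge decision in DMC comes from a learned head; the boolean flag here is a stand-in for it), a running weighted mean keeps the merged slot an average of everything folded into it:

```python
# Toy merge-or-append step for a compressed KV cache (illustrative only).
import numpy as np

def append_or_merge(keys, values, weights, k_new, v_new, merge: bool):
    """keys/values: lists of np.ndarray slots; weights: tokens per slot."""
    if merge and keys:
        w = weights[-1]
        # Running weighted mean: fold the new entry into the last slot.
        keys[-1] = (w * keys[-1] + k_new) / (w + 1)
        values[-1] = (w * values[-1] + v_new) / (w + 1)
        weights[-1] = w + 1
    else:
        keys.append(k_new)
        values.append(v_new)
        weights.append(1)
```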
FlashAttention eliminates the O(N²) memory bottleneck of standard attention by tiling computation in SRAM with an online softmax trick, achieving exact results with no approximation.
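The online-softmax trick fits in a few lines of NumPy. This is a didactic single-query sketch, not the fused CUDA kernel, but it computes the exact same result tile by tile without ever materializing the N×N score matrix:

```python
# Online softmax over K/V tiles: maintain a running max, normalizer, and
# weighted sum, rescaling past work whenever a new maximum appears.
import numpy as np

def tiled_attention(q, K, V, tile=128):
    """q: (d,), K/V: (N, d) -> exact softmax(q K^T / sqrt(d)) @ V."""
    d = q.shape[0]
    m = -np.inf                  # running max of scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(d)            # running numerator (weighted sum of V rows)
    for i in range(0, K.shape[0], tile):
        s = K[i:i + tile] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)            # rescale earlier accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / l
```

Because each correction only rescales the running sums, the output matches the naive softmax exactly; only O(tile × d) memory is live at any point.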
PagedAttention revolutionizes LLM inference by applying OS virtual memory concepts to KV cache management, achieving near-zero memory waste.
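A minimal sketch of the block-table indirection at its core (names are illustrative, not vLLM's real data structures): logical token positions map to fixed-size physical blocks allocated on demand, so waste is bounded by one partial block per sequence instead of a large contiguous over-reservation.

```python
# Virtual-memory-style KV cache bookkeeping: a per-sequence block table
# translates logical token positions into (physical block, offset) slots.
BLOCK_SIZE = 16

class BlockTable:
    def __init__(self, allocator):
        self.allocator = allocator   # hands out free physical block ids
        self.blocks = []             # logical block index -> physical id
        self.num_tokens = 0

    def append_token(self):
        """Reserve a slot for one more token, allocating a block if needed."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.allocator.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

# Usage: a plain free list as the allocator. 40 tokens occupy exactly
# 3 blocks; sequences can also share prefix blocks copy-on-write (not shown).
free = list(range(1024))
table = BlockTable(free)
for _ in range(40):
    table.append_token()
print(table.physical_slot(39))   # -> (third allocated block, offset 7)
```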