A simple explanation of how a small per-eviction cost can compound into a control-plane stall that starves the GPU and inflates P99 latency under concurrent tree-search workloads.
Notes on why naive LRU eviction is dangerous for tree-structured reasoning workloads, and how SGLang avoids catastrophic prefix recomputation with leaf-first eviction and reference protection.
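The core of the leaf-first idea can be sketched in a few lines. This is a toy model under my own assumptions, not SGLang's actual RadixCache API: a prefix tree where interior nodes are never evicted (their descendants would lose the KV prefix they depend on), nodes pinned by running requests are skipped via a reference count, and among the remaining unpinned leaves the least recently used goes first.

```python
import time

class Node:
    """One node of a hypothetical prefix tree; names are illustrative, not SGLang's."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}        # token -> Node
        self.ref_count = 0        # running requests pinning this prefix
        self.last_access = time.monotonic()
        self.num_tokens = 1       # KV entries held by this node

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def _leaves(self):
        stack, out = [self.root], []
        while stack:
            n = stack.pop()
            if n.children:
                stack.extend(n.children.values())
            elif n is not self.root:
                out.append(n)
        return out

    def evict(self, tokens_needed):
        """Leaf-first LRU eviction: never evict an interior node (children would
        lose their prefix), never evict a node pinned by a live request."""
        freed = 0
        while freed < tokens_needed:
            candidates = [n for n in self._leaves() if n.ref_count == 0]
            if not candidates:
                break                                   # everything left is in use
            victim = min(candidates, key=lambda n: n.last_access)
            freed += victim.num_tokens
            # Detach the victim; its parent may now itself become an evictable leaf.
            parent = victim.parent
            parent.children = {t: n for t, n in parent.children.items() if n is not victim}
        return freed
```

Note the invariant this buys: because eviction only peels leaves, every surviving node still has its full prefix cached, so no in-flight branch of the search tree is ever forced into a catastrophic prefix recomputation.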
Notes on speculative decoding methods for accelerating LLM inference, including Medusa, EAGLE, and ConFu.
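The draft-and-verify loop these methods share can be shown generically. This is a greedy-decoding sketch of the common skeleton, not the algorithm of any one paper (Medusa and EAGLE differ in how the draft is produced, and real systems verify all positions in a single batched forward pass rather than one call per token):

```python
def speculative_step(target, draft, prefix, k=4):
    """One greedy draft-and-verify step. `target` and `draft` are stand-ins
    mapping a token sequence to the next token (hypothetical interface)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. The target model verifies: accept the longest agreeing prefix.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3. On mismatch (or full acceptance) the target contributes one token
    #    of its own, so every step emits at least one correct token.
    accepted.append(target(ctx))
    return accepted
```

The speedup comes from step 2: one target-model pass can ratify up to k + 1 tokens, and because accepted tokens are exactly what greedy decoding of the target would have produced, the output distribution is unchanged.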
DMC (Dynamic Memory Compression) retrofits pretrained LLMs with a learned per-head decision to either append a new KV pair or merge it into the previous cache entry, buying substantial KV-cache memory reduction at the cost of continued fine-tuning on roughly 2% of the original pre-training data.
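The append-or-merge mechanic can be sketched as follows. This is a toy illustration of the idea under my own simplified parameterization, not DMC's exact formulation: a gate (learned in the real method, passed in as a boolean here) decides whether the incoming KV pair opens a fresh cache slot or is folded into the last slot as a weighted running mean.

```python
import numpy as np

def dmc_update(cache_k, cache_v, cache_w, k, v, merge, w=1.0):
    """One cache update step (illustrative names). `merge` stands in for the
    learned decision; `cache_w` tracks each slot's accumulated weight so
    repeated merges form a proper weighted mean."""
    if merge and cache_k:
        total = cache_w[-1] + w
        cache_k[-1] = (cache_w[-1] * cache_k[-1] + w * k) / total
        cache_v[-1] = (cache_w[-1] * cache_v[-1] + w * v) / total
        cache_w[-1] = total
    else:
        cache_k.append(k)
        cache_v.append(v)
        cache_w.append(w)
```

Every merged token is one fewer KV slot to store and attend over, which is where the memory (and bandwidth) reduction comes from.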
FlashAttention eliminates the O(N²) memory bottleneck of standard attention by tiling the computation in on-chip SRAM with an online softmax trick, producing exact results with no approximation.
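The online softmax trick is the heart of it and is small enough to show directly. A minimal NumPy sketch for a single query, processing keys tile by tile with a running max, denominator, and accumulator (FlashAttention additionally tiles over queries and keeps all of this in SRAM; the math is the same):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """Exact single-query attention, never materializing the full score row.
    `tile` is the number of keys processed per block (illustrative size)."""
    m = -np.inf                  # running max of scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(V.shape[1])   # running unnormalized output
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q          # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale previous accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + tile]
        m = m_new
    return acc / l
```

Because each tile rescales the running sums by exp(m - m_new), the final result is bit-for-bit the same softmax-weighted average a two-pass computation would give, yet the N×N score matrix is never formed.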
PagedAttention applies OS virtual-memory paging to KV cache management, storing each sequence's cache in fixed-size, non-contiguous blocks so the only waste is the unfilled tail of a sequence's last block.
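A toy block table makes the page-table analogy concrete. This is a sketch in the spirit of the idea, not vLLM's implementation: physical KV blocks are handed out one at a time as a sequence grows, a per-sequence table translates logical token positions to (block, offset) pairs, and freed blocks return to a shared pool for immediate reuse.

```python
BLOCK = 4  # tokens per KV block (illustrative size)

class BlockAllocator:
    """Toy PagedAttention-style block table (hypothetical names)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens written

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK == 0:                    # last block is full: map a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical token position to (block, offset), like a page table."""
        table = self.tables[seq_id]
        return table[pos // BLOCK], pos % BLOCK

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because no sequence needs a contiguous reservation sized for its maximum possible length, internal fragmentation is bounded by one partially filled block per sequence, which is where the near-zero-waste claim comes from.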