Skip to content

SGLang: Efficient Execution of Structured Language Model Programs

ArXiv: 2312.07104

🎯 Pitch

SGLang is a Python-embedded language plus co-designed runtime that accelerates multi-call, control-flow-heavy LM programs by reusing computation across calls (RadixAttention for KV-cache reuse) and by decoding structured outputs faster (compressed finite state machines). This matters because many real-world LLM applications—agents, few-shot evaluation, tree search, JSON/RAG pipelines—perform many dependent LLM calls with shared prefixes or strict output formats, and SGLang delivers up to 6.4× higher throughput and substantial latency reductions by eliminating redundant work and enabling scalable, reliable execution.


1. Executive Summary (2-3 sentences)

SGLang is a Python-embedded programming system plus a co-designed runtime that makes multi-call, control-flow-heavy, structured LLM applications run much faster by exploiting reusable computation across calls. It matters because many modern LLM workloads (agents, few-shot evaluation, tree searches, JSON outputs, RAG pipelines) repeatedly share prompt prefixes and/or enforce output grammars, yet general-purpose inference servers typically recompute work token-by-token and discard reusable intermediate state. The system’s main runtime contributions—RadixAttention for KV cache reuse and a compressed finite state machine for faster regex-constrained decoding—drive the reported speedups of up to 6.4× throughput and up to 3.7× lower latency across diverse workloads (Figures 5–6; Abstract).


2. Context and Motivation

  • What specific problem or gap does this paper address?
  • Many real LLM applications are not a single “prompt → completion” call; they are language model programs (LM programs) with:
    • multiple generation calls that may depend on each other, and
    • explicit control flow (branching, looping, parallel subcalls), plus
    • structured inputs/outputs for composition with software systems (Introduction; Section 2).
  • Existing systems make LM programs:

    • hard to write (manual string concatenation, fragile parsing, ad-hoc parallelism), and
    • inefficient to run (redundant computation and memory use), especially when prompt prefixes repeat across calls (Introduction; Section 2).
  • Why is this problem important (real-world impact, theoretical significance, or both)?

  • Real-world impact:
    • Agent workflows, reasoning strategies (e.g., tree search), few-shot evaluation, and RAG pipelines inherently require many LLM calls, often sharing common instruction templates or conversation histories (Introduction; Section 6.2).
    • Inefficient execution increases latency (worse UX), reduces throughput (higher serving cost), and wastes GPU memory by duplicating reusable intermediate tensors (Introduction; Section 3).
  • System-level significance:

    • The paper treats LM program execution as a structured workload that can be optimized, rather than as isolated independent requests (Figure 1; Introduction).
  • What prior approaches existed, and where do they fall short?

  • Programming interfaces and systems:
    • Guidance and LMQL provide low-level prompting/programming primitives like generation and selection, but lack SGLang’s runtime co-design and/or key execution features like efficient batching/parallelism in the evaluated settings (Table 1; Section 6.2 notes missing functionality and performance issues).
  • Inference engines:
    • Systems like vLLM, TGI, and TensorRT-LLM optimize for general serving but (in the framing here) lack systematic, automatic reuse of shared prompt prefixes across separate calls and program instances, and decode constrained outputs token-by-token (Introduction; Sections 3–4).
  • Constrained decoding:

    • Existing regex/FSM guided decoding works by masking invalid next tokens one token at a time, even when the constraint implies a deterministic multi-token substring (Section 4; Figure 4(c)).
  • How does this paper position itself relative to existing work?

  • It positions SGLang as a low-level language like LMQL/Guidance (Section 2; Table 1), but distinguishes itself via a custom runtime (SRT) that enables new optimizations (Figure 1; Section 2).
  • It explicitly aims to exploit multi-call structure and shared prefixes in LM programs with automatic runtime mechanisms (RadixAttention + cache-aware scheduling) rather than manual configuration (Section 3).
  • It supports both:
    • open-weight models via SRT, and
    • API-only models via API speculative execution (Section 5; Section 6.2 “Results on API models”).

3. Technical Approach

This section explains how SGLang works end-to-end: a frontend language expresses LM programs, and a backend runtime executes them efficiently (Figure 1).

3.1 Frontend: a Python-embedded language for LM programs

  • Core idea: represent a prompt as a mutable “state” s that you incrementally build, while generation calls write named variables back into that state (Section 2; Figure 2).

  • Key primitives (Section 2):

  • extend / +=: append text (and other primitives) to the prompt under construction.
  • gen(name, ...): generate text and store it under s[name]; supports stop=... and regex=... for constrained output (Section 2; Figure 2 shows gen("judgment", stop="END") and later gen("output", regex=schema)).
  • select(name, choices=[...]): choose the highest-probability option among discrete choices (Section 2; Figure 2 uses select("related", choices=["yes","no"])).
  • fork(n) and join: create parallel branches of the current prompt state and later merge them (Section 2; Figure 2).
  • image(path) / video(...): include multimodal inputs (Section 2; Table 1 shows video is specific to SGLang among the compared systems).

  • How this reduces “LM program plumbing” (Figure 2):

  • Figure 2 demonstrates a “branch-solve-merge” workflow:
    1. Build initial prompt with a system message and multimodal input.
    2. Use select to decide if the essay is image-related.
    3. If yes, fork into multiple parallel “dimensions” (clarity/originality/evidence).
    4. Each fork runs a gen to produce a judgment.
    5. Merge fork outputs back in Python, then request a summary + grade.
    6. Enforce a final JSON schema using a regex constraint (Figure 2; Section 2).
  • This example is used to motivate why parallelism and structured decoding are first-class needs, not add-ons (Section 2; Figure 2 annotations explicitly tie to runtime optimizations).

3.2 Execution model: interpreter streams vs compiler graphs

  • Interpreter mode (default in the paper):
  • A prompt is treated as an asynchronous stream: operations like extend, gen, select are submitted non-blockingly; Python continues running until a variable is fetched, which blocks for synchronization (Section 2).
  • Each prompt is managed by a stream executor in a background thread, enabling intra-program parallelism (Section 2).

  • Compiler mode (Appendix D):

  • Programs can be traced into a computational graph IR with nodes like ConstantText, Gen, Select, Fork, Join, etc. (Appendix D.1; Figure 14).
  • This enables graph execution and potential rewrites; one explored optimization is code movement to increase shared prefixes (Appendix D.2).

3.3 Backend runtime: three optimizations tied to LM program structure

Figure 1 summarizes the architecture: frontend primitives → runtime optimizations. The paper’s runtime methods map to three bottlenecks:

  1. Redundant prefix computation across calls/requests → RadixAttention KV cache reuse (Section 3; Figure 3; Appendix A).
  2. Token-by-token constrained decoding → compressed FSM multi-token decoding steps (Section 4; Figures 4 and 11; Appendix B).
  3. Repeated input-token costs in API-only multi-call programs → API speculative execution (Section 5).

Below are the mechanisms in more detail.


3.4 RadixAttention: automatic KV cache reuse across requests and calls (Section 3)

Definitions (paper-specific / essential)

  • KV cache: intermediate tensors produced during transformer inference that depend only on the already-processed prefix tokens; reused during decoding to avoid recomputing attention over the whole prefix (Appendix A.1).
  • prefill vs decoding:
  • prefill processes the input prompt tokens in a forward pass.
  • decoding generates tokens sequentially, reusing the prefix’s KV cache (Appendix A.1).
  • radix tree: a space-efficient prefix tree where edges can represent sequences of tokens, not just one token per edge (Section 3).
  • LRU eviction: “least recently used” eviction policy; here it evicts least-recently-used leaves first to preserve shared ancestors as long as possible (Section 3).

Mechanism: treat the KV cache as a shared, tree-indexed cache

  • Data structure:
  • The runtime maintains a radix tree mapping token sequences → KV cache tensors (Section 3).
  • KV cache is stored in a paged layout where page size equals one token (Section 3).

  • Core operations at runtime (Section 3; Figure 3 illustrates them):

  • Prefix match: when a new request arrives, match the longest prompt prefix already in the radix tree.
  • Reuse: reuse the KV cache for that prefix, avoiding recomputation of those prefix tokens.
  • Insert: after generating new tokens, insert the new extended sequence (prompt + generation) into the tree.
  • Split edges/nodes: if two prompts share only part of an existing edge, split nodes so both can share the common prefix (Figure 3 step (4) describes splitting node “b” so two chat sessions share the system prompt).
  • Evict: when memory is tight, evict least-recently-used leaves (Figure 3 steps (5), (8), (9) show evictions).

  • Why evict leaves first?

  • Evicting leaves preserves shared internal nodes (common prefixes) longer, enabling reuse across multiple descendants until those internal nodes themselves become leaves (Section 3).

  • Continuous batching constraints:

  • The system cannot evict nodes used by currently running batches, so each tree node has a reference counter; only nodes with counter 0 are evictable (Section 3).

  • Memory pool design choice:

  • The cache and running requests share the same memory pool dynamically; when many waiting requests run, cached tokens can be evicted in favor of a larger batch (Section 3).
  • This design aims to avoid rigid partitioning (“fixed cache size”) and instead adapt to load.

Cache-aware scheduling: run requests in an order that increases reuse

  • Problem: if scheduling alternates unrelated requests, the cache thrashes, reducing hit rate (Section 3).
  • Policy: sort waiting requests by matched prefix length and prioritize longer matches (“longest-shared-prefix-first”) (Section 3).
  • Algorithm reference: pseudocode is given as Algorithm 1 in Appendix A.2.

Appendix A.2 (Algorithm 1) shows the scheduler: match prefixes for all waiting requests, sort by matched prefix length, then construct a new batch subject to available memory (evictable cache + free pool), update reference counters, and later insert finished requests back into the tree.

  • Theoretical result (offline case):
  • Theorem 3.1 claims an optimal cache hit rate can be achieved by visiting the radix tree in DFS order when cache size ≥ maximum request length; longest-shared-prefix-first is equivalent to DFS order (Section 3; proof in Appendix A.3).

Section 3 (Theorem 3.1) and Appendix A.3 provide the proof sketch: each edge’s KV cache needs computing at least once, and DFS ensures each edge is computed only once (given sufficient cache), achieving the lower bound on total KV cache computation.

Frontend-runtime co-design: “frontend hints”

  • During fork, the frontend sends the prefix first “as a hint” so the runtime inserts the correct prefix and schedules/matches more effectively (Section 3, paragraph after Figure 3).
  • This is presented as a concrete example where language design improves runtime efficiency (Section 3; also reflected in the RadixAttention ablation in Figure 8(c), “No Frontend Hint”).

3.5 Compressed finite state machine for fast regex-constrained decoding (Section 4; Appendix B)

Definitions (paper-specific / essential)

  • constrained decoding: restricting the model’s output to match a formal constraint (here, a regular expression) by disallowing invalid next tokens (Section 4).
  • finite state machine (FSM): a graph of states and transitions that represent all strings matching a regex; decoding tracks the current state and only allows tokens that correspond to valid outgoing transitions (Section 4; Appendix B).
  • compressed FSM: an FSM transformed by merging sequences of transitions where only a single next character/string is valid, enabling multi-token “jump forward” steps (Section 4; Appendix B.1–B.2).

Why normal FSM decoding is slow

  • Standard guided decoding masks logits to allow only tokens valid for the next step, then decodes one token per forward pass (Section 4).
  • Many constraints contain long deterministic substrings (e.g., JSON keys and punctuation), where there is only one valid continuation for multiple characters/tokens.
  • Figure 4 demonstrates this with a JSON prefix like {"summary": ":
  • In the normal FSM, it takes many steps (Figure 4(c)).
  • With compression, the deterministic substring becomes one compressed transition, so multiple tokens can be decoded in one forward pass (Figure 4(b) and 4(d)).

Mechanism: compress singular transitions

  • Appendix B.1 defines:
  • a singular transition edge as an edge where (1) the source has only one successor and (2) only one acceptable character/string;
  • a compressed edge merges consecutive singular edges by concatenating their texts.
  • The runtime:
  • Builds the original FSM over characters/strings (Appendix B.1).
  • Recursively merges singular transitions until no further compression is possible (Appendix B.1).
  • During decoding, uses “jump forward” when a compressed edge implies a deterministic continuation (Appendix B.2; Figure 11 contrasts normal vs jump-forward decoding).

Handling tokenizer mismatch: retokenization

  • A critical practical issue: constraints are in characters/strings, but models operate in tokens; you cannot arbitrarily split a deterministic string into tokens without respecting the model’s tokenizer (Appendix B.2).
  • The runtime addresses this by retokenizing the previous text plus the compressed-edge text using the original tokenizer to ensure alignment (Appendix B.2).

3.6 API speculative execution for API-only models (Section 5)

  • Motivation: With black-box APIs, multiple sequential gen calls can duplicate input context tokens across calls, increasing cost and latency (Section 5).
  • Mechanism described:
  • Enable speculative execution on an earlier call: let it generate past the stop condition for a few extra tokens.
  • Keep the extra tokens in the interpreter and later match/reuse them with subsequent primitives (Section 5).
  • The paper emphasizes this requires careful prompt engineering to keep the continuation aligned with later templates (Section 5).

4. Key Insights and Innovations

1) Treating KV cache as a shared tree-based LRU cache (RadixAttention)

  • What is novel here (within this paper’s framing)?
  • KV cache is not per-request ephemeral state; it is retained across requests and program calls in a radix tree with LRU eviction of leaves (Section 3).
  • This supports multi-level sharing (e.g., shared system prompt + shared few-shot examples + shared question prefix), which the evaluation highlights for tasks like HellaSwag (“two-level sharing”) (Section 6.2).
  • Why it matters:
  • It reduces redundant prefill computation and memory footprint, enabling larger batch sizes and higher throughput (Section 6.2 explanation for MMLU).
  • What makes it more than incremental:
  • The combination of (a) radix-tree representation, (b) LRU eviction over tree nodes, and (c) cache-aware scheduling is presented as a systematic mechanism that adapts to dynamic, branching program structure (Section 3; Figure 3 shows dynamic splitting/eviction scenarios).

2) Cache-aware scheduling based on longest-shared-prefix-first + theoretical grounding

  • Novel element:
  • Scheduling is explicitly optimized for cache reuse: reorder waiting requests by matched prefix length (Section 3).
  • Theorem 3.1 connects this heuristic to optimal DFS traversal under an offline assumption and sufficient cache (Section 3; Appendix A.3).
  • Why it matters:
  • High cache hit rates depend not only on caching but also on executing requests in an order that preserves locality (Section 3).
  • Incremental vs fundamental:
  • The scheduling rule is conceptually simple, but the paper strengthens it with a formal optimality claim (offline case) and integrates it into continuous batching mechanics (Algorithm 1, Appendix A.2).

3) Compressed FSM enabling multi-token constrained decoding steps

  • What is new:
  • Instead of masking logits for one-token transitions, compress deterministic substrings into single transitions so decoding can advance multiple tokens per forward pass (Section 4; Figure 4).
  • Why it matters:
  • Structured outputs (like JSON) are common in LM programs; speeding this up targets a frequent bottleneck (Section 4; Section 6.2 “JSON decoding”).
  • Key enabling detail:
  • The runtime must manage the string↔token mismatch; the retokenization mechanism is essential to make “jump forward” compatible with tokenization (Appendix B.2).

4) Frontend/runtime co-design as a performance tool (fork hints, asynchronous streams)

  • What is new in the system design:
  • The interpreter’s stream model enables intra-program parallelism (Section 2).
  • The “frontend hint” during fork is specifically designed to improve runtime scheduling/matching in RadixAttention (Section 3).
  • Why it matters:
  • The ablation in Figure 8(c) shows performance drops when disabling frontend parallelism or hints, suggesting language semantics directly influence runtime efficiency (Section 6.3; Figure 8(c)).

5) API speculative execution for black-box endpoints (conditional innovation)

  • What it adds:
  • Extends optimization beyond open-weight runtimes to API-only models by reducing repeated-context calls in multi-call programs (Section 5).
  • Why it matters:
  • Multi-call structured prompting is common even when you cannot alter inference internals; this offers a way to cut cost/latency without model-side changes (Section 5; Section 6.2 “Results on API models”).

5. Experimental Analysis

5.1 Evaluation methodology

  • Workloads (Section 6.1):
  • Few-shot benchmarks: 5-shot MMLU and 20-shot HellaSwag.
  • Agent/reasoning workflows: ReAct agent traces, Generative Agents traces, Tree-of-thought on GSM-8K, Skeleton-of-thought for tip generation.
  • Structured outputs: LLM judge using branch-solve-merge (matches Figure 2), and JSON decoding using regex schemas.
  • Conversation: Multi-turn chat (4 turns; input 256–512 tokens/turn), with short vs long outputs.
  • Pipeline: DSPy RAG pipeline example (Section 6.1).
  • Models (Section 6.1):
  • Open-weight: Llama-2 family (7B to 70B), Mixtral-8x7B.
  • Multimodal: LLaVA-v1.5-7B (image) and LLaVA-NeXT-34B (video).
  • API-only: OpenAI GPT-3.5 for the speculative execution test (Section 6.2).
  • Hardware (Section 6.1; Appendix C):
  • Most experiments: AWS EC2 G5 with NVIDIA A10G (24GB).
  • 7B on 1×A10G; larger models with tensor parallelism across multiple A10Gs.
  • Additional: A100 (80GB) experiments (Section 6.1; Appendix C notes Llama-70B on 4×A100).
  • Baselines (Section 6.1):
  • Guidance (llama.cpp backend), vLLM, LMQL (HF Transformers backend).
  • The evaluation generally avoids toggling optimizations that change computation results, aiming for comparable outputs (Section 6.1).
  • Metrics (Section 6.1):
  • Throughput: program instances per second (p/s) at maximum batching.
  • Latency: single-program execution without batching, averaged over instances.

5.2 Main quantitative results

Open-weight models: normalized throughput and latency

  • Figures 5 and 6 summarize results on Llama-7B (normalized):
  • SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency across the evaluated workloads (Section 6.2; Figures 5–6; Abstract).

Section 6.2 explicitly attributes these gains to “KV cache reuse, the exploitation of parallelism within a single program, and faster constrained decoding,” tying each workload’s speedup to a mechanism (e.g., MMLU → few-shot KV reuse; JSON decoding → compressed FSM; tree/skeleton-of-thought → intra-program parallelism + reuse).

  • The paper also reports cache hit rates:
  • Across benchmarks in Figures 5–6, cache hit rates range from 50% to 99% (Section 6.2).
  • Cache-aware scheduling achieves ~96% of the optimal hit rate on average, with achieved vs optimal shown in Figure 13 (Section 6.2; Figure 13).

Larger models with tensor parallelism

  • Mixtral-8x7B: Figure 7 shows normalized throughput improvements over vLLM across the same benchmark suite (Section 6.2; Figure 7).
  • Llama-2-70B: Figure 12 (Appendix) provides normalized throughput under tensor parallelism (Section 6.2 references it; Figure 12 is in Appendix).

Multimodal throughput (absolute numbers)

  • Table 2 provides explicit throughput numbers:
  • LLaVA-v1.5-7B (image):
    • baseline (author implementation): 0.18 image/s
    • SGLang: 1.15 image/s
  • LLaVA-NeXT-34B (video):
    • baseline: 0.02 frame/s
    • SGLang: 0.10 frame/s

Table 2 shows these throughput comparisons directly; Section 6.2 explains reuse arises because multiple questions share the same image, and the runtime hashes input images as keys for reuse in the radix tree.

API-only models (cost reduction)

  • The API speculative execution experiment:
  • A prompt extracts three fields from a Wikipedia page using GPT-3.5.
  • With few-shot prompting, speculative execution reduces input token cost by ~3× because it avoids paying the repeated context for three separate field extractions (Section 6.2, “Results on API models”).

5.3 Do the experiments support the claims?

  • Support for “KV cache reuse is a major driver”:
  • Section 6.2 gives workload-by-workload causal explanations:
    • MMLU: reuse 5-shot examples; reduces prefill and first-token latency.
    • HellaSwag: reuse few-shot examples + shared question prefix across multiple choices (“two-level sharing”).
    • Multi-turn chat: reuse chat history; speedup larger for short outputs because reuse reduces prefix time, but long outputs are decoding-dominated (Section 6.2).
  • The ablation on cache hit rate vs performance (Figure 8(a)(b)) directly links higher cache hit rate to:

    • larger batch size,
    • higher throughput,
    • lower first-token latency and total latency (Section 6.3; Figure 8(a)(b)).
  • Support for “co-design matters” (frontend + runtime):

  • Figure 8(c) includes ablations:
    • “No Frontend Parallelism” and “No Frontend Hint” degrade throughput relative to “Full Optimization” (Section 6.3; Figure 8(c)).
  • This ties performance not just to caching, but to how programs are executed and how hints help tree insertion/matching (Section 3 + Section 6.3).

  • Support for “compressed FSM speeds constrained decoding”:

  • Section 6.3 reports:

    • compressed FSM increases throughput by 1.6× on the JSON decoding benchmark.
    • preprocessing the FSM and reusing it across a batch is necessary; otherwise, throughput is 2.4× lower if preprocessing is redone per request (Section 6.3).
  • Overhead / “doesn’t hurt when no reuse”:

  • Section 6.3 reports an overhead test on ShareGPT (a benchmark stated to have no KV cache reuse opportunities):
    • 100 requests take 74.3 s total; radix-tree management takes 0.2 s (< 0.3% overhead) (Section 6.3).

5.4 Ablations, failure cases, robustness checks

  • Ablation studies present:
  • Cache hit rate vs latency/throughput (Figure 8(a)(b)).
  • Component ablation for RadixAttention (Figure 8(c)): no cache, no tree structure, FCFS scheduling, random scheduling, no frontend parallelism, no frontend hint, full optimizations.
  • Compressed FSM throughput and preprocessing reuse (Section 6.3).

  • Conditional / mixed results explicitly discussed:

  • Multi-turn chat with long outputs shows “almost no speedup” because decoding dominates and there is little sharing across sessions (Section 6.2).
  • Cache-aware scheduling can cause starvation; integration with fair scheduling is left to future work (Section 3).

6. Limitations and Trade-offs

  • Assumptions and where benefits come from
  • RadixAttention helps most when there are shared prefixes across requests/calls (Section 3; Section 6.2).
  • Workloads with limited cross-request sharing (e.g., multi-turn chat with long, decode-heavy outputs) see limited gains (Section 6.2).

  • Scheduling trade-off: throughput vs fairness

  • Cache-aware scheduling uses longest-shared-prefix-first to reduce cache thrashing, but it can cause starvation for some requests (Section 3).
  • The paper explicitly leaves integration with fairness methods as future work (Section 3).

  • Theoretical result caveat

  • Theorem 3.1’s optimality relies on an offline batch and cache size ≥ maximum request length (Section 3; Appendix A.3).
  • A footnote notes that in practice computation can differ because unpredictable output lengths can cause recomputation of KV cache (footnote near Theorem 3.1 in Section 3).

  • Compressed FSM limitations: tokenization and probability distortion

  • Tokenization mismatch requires retokenization; this introduces some overhead (Appendix B.2) and adds complexity.
  • Appendix B.3 describes an open problem: distorted probability when compressed paths correspond to different token-sequence probabilities; accurately computing choice probabilities may require summing over multiple tokenizations, which is expensive (Appendix B.3).

  • Compiler mode limitations

  • Graph construction uses tracing with abstract arguments and is limited to programs without data-dependent control flow (Appendix D.1).
  • The code-movement optimization is described as “aggressive” and can change semantics; failures occur when reordering constants alters meaning (Appendix D.2).

  • Scope limitations noted as future directions

  • Output modalities beyond current support (text + image/video inputs are supported; additional output modalities are listed as future work) (Section 8).
  • Extending RadixAttention across multiple memory hierarchy levels (DRAM/disk) and “fuzzy semantic matching” are listed but not implemented here (Section 8).

7. Implications and Future Directions

  • How this work changes the landscape
  • It reframes LLM serving not just as isolated completion requests but as execution of structured programs, where:
    • shared prompt prefixes are a cacheable resource, and
    • constrained decoding is a compiler/runtime problem rather than purely a prompting trick (Figure 1; Sections 3–4).
  • It demonstrates that co-designing a programming interface with a runtime can unlock optimizations that are difficult to retrofit into generic API servers (Section 3 frontend hints; Figure 8(c) co-design ablations).

  • Follow-up research directions enabled/suggested (explicitly listed in Section 8 and implied by limitations)

  • Fairness-aware cache scheduling: resolve starvation while maintaining cache locality (Section 3; Section 8 mentions fixing starvation; also references fair scheduling work [42] in Section 3).
  • Memory-hierarchy-aware caching: extend RadixAttention beyond GPU memory to DRAM/disk tiers (Section 8).
  • Semantic/fuzzy prefix matching: reuse beyond exact token prefixes (Section 8).
  • More reliable compiler optimizations: improve static scheduling/memory planning; handle data-dependent control flow in compiler mode (Section 8; Appendix D.1).
  • Constrained decoding correctness: address distorted probability under compression (Appendix B.3).

  • Practical applications / downstream use cases (supported by evaluated workloads)

  • Agents and tool-using workflows: ReAct and generative agents traces run faster due to reuse of templates and prior calls (Section 6.2).
  • Reasoning/search prompting: Tree-of-thought and Skeleton-of-thought benefit from parallel subcalls plus reuse (Section 6.2).
  • Evaluation pipelines: LLM judges with branch-solve-merge map naturally to fork and benefit from parallelism + reuse (Figure 2; Section 6.2).
  • Structured outputs at scale: JSON/regex constrained decoding benefits from compressed FSM throughput improvements (Section 6.3; Figure 4).
  • RAG systems: shared context examples enable reuse in DSPy pipelines (Section 6.2).
  • Multimodal Q/A over the same media: hashing images to reuse image-token KV cache (Section 6.2; Table 2).

  • Evidence of deployment relevance

  • The system is deployed in Chatbot Arena, reporting observed cache hit rates (52.4% for LLaVA-Next-34B, 74.1% for Vicuna-33B) and a 1.7× first-token latency reduction for Vicuna-33B (Section 6.2, “Production deployment”). This supports the claim that prefix reuse occurs in real traffic via common system messages, reused images, and chat histories (Section 6.2).

Section 6.2 “Production deployment” provides these concrete observed cache hit rates and the reported first-token latency reduction, indicating RadixAttention’s reuse opportunities persist outside synthetic benchmarks.