L4 Speculative Decoding — Tool Author's Guide

Purpose

L4 Speculative Decoding is the fifth shipped optimization layer in the
InferenceIQ v2 stack (sitting alongside L0 Hardware Profiling, L1
Kernel Tuning, L2 Quantization, and L5 Batching). It selects the best
combination of **algorithm family, draft source, k / tree-shape, and
engine flags** for a given `(target model, GPU, engine, workload
fingerprint)` tuple, without changing model weights or touching the
target-model checkpoint. The goal is sustained decode-throughput
speedups in the 1.6×–3.5× range while preserving output quality (via
the 7-method quality gate) and staying within a tenant's GPU budget.

L4 ships in three tiers (the same Compose / Tune / Learn shape L1 uses
for kernel tuning), so customers can move from a 5-second cache-hit
lookup all the way to training custom Medusa-3 heads for their
fine-tuned model, all inside the same tool surface.

Three tiers (the product shape)

Tier 1 (Compose). Selects a combination for the target in under five
seconds via a workload-fingerprint-aware acceptance-rate predictor plus
a 30-day winner cache. Production-safe, deterministic, always
available. Target user: the 80% of customers who just want "a faster
endpoint" without running experiments.

Tier 2 (Tune). Bayesian optimization over the combination space
(draft × k × tree-shape), seeded by Tier 1 pareto points, bounded by a
wall-clock budget (default 60 min, max 480 min). Resumable across
worker restarts. N-rep statistical rigor. Target user: performance
engineers squeezing out the last 10–20% on a well-understood model.

Tier 3 (Learn). Trains custom heads for a customer's fine-tuned target.
Governance (lifecycle + approval), quality gate on held-out eval,
registry write. Target user: enterprise teams shipping a proprietary
fine-tune for which the pre-trained heads published for the top-50
open models no longer apply.

Algorithm families

L4 covers seven families in v1, including a stretch "cascade" family
behind a feature flag. Speedup ranges are taken from the 2026
literature and were measured on 70B-class targets under interactive
batch=1 pressure.

| Family | Footprint | Speedup | Prereq | Engine coverage |
|---|---|---|---|---|
| Classic draft model | small draft + target | 1.6–2.3× | same-family draft available | vLLM, TGI, SGLang, TRT-LLM, LMDeploy |
| Medusa-2 / 2.1 / 3 | heads grafted on target | 2.0–3.2× | Medusa heads for target (pre-trained or T3-trained) | SGLang (best), vLLM, TGI, TRT-LLM, LMDeploy |
| EAGLE-1 / 2 / 3 | feature-level draft net | 2.3–3.5× | EAGLE heads (top-50 open models on HF, or T3-trained) | SGLang (best, EAGLE-3), vLLM (EAGLE-2), TRT-LLM, TGI, LMDeploy |
| Lookahead (Jacobi) | zero (target only) | 1.5–2.2× | none | vLLM, SGLang (partial) |
| REST (retrieval) | target + corpus index | 2.5–3.5× on code/RAG | indexed corpus | vLLM, SGLang (partial) |
| MTP (pass-through) | baked into target | native ~4× | target trained with MTP heads (e.g. DeepSeek-V3) | vLLM, SGLang catching up |
| Cascade (stretch) | draft → medium → target | 2.8–3.5× | three compatible tiers | SGLang preview (behind flag) |

The full coverage matrix and per-engine version windows live in
L4-speculative-decoding-RESEARCH.md §4.
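The engine-coverage column above can be transcribed into a small lookup so a tool can ask which families are candidates on a given engine. This is only a sketch: `ENGINE_COVERAGE` and `families_for_engine` are hypothetical names, not part of the L4 API, and qualifiers such as "best", "partial", "preview" and "catching up" are flattened to plain membership.

```python
# Transcribed from the engine-coverage column of the families table.
# Qualifiers ("best", "partial", "preview", "catching up") are dropped here.
ENGINE_COVERAGE = {
    "Classic draft model": ["vLLM", "TGI", "SGLang", "TRT-LLM", "LMDeploy"],
    "Medusa-2 / 2.1 / 3": ["SGLang", "vLLM", "TGI", "TRT-LLM", "LMDeploy"],
    "EAGLE-1 / 2 / 3": ["SGLang", "vLLM", "TRT-LLM", "TGI", "LMDeploy"],
    "Lookahead (Jacobi)": ["vLLM", "SGLang"],
    "REST (retrieval)": ["vLLM", "SGLang"],
    "MTP (pass-through)": ["vLLM", "SGLang"],
    "Cascade (stretch)": ["SGLang"],
}


def families_for_engine(engine):
    """Return the families whose coverage list names the engine (case-insensitive)."""
    wanted = engine.lower()
    return [
        family
        for family, engines in ENGINE_COVERAGE.items()
        if wanted in (name.lower() for name in engines)
    ]
```

For instance, `families_for_engine("TGI")` yields only the classic-draft, Medusa and EAGLE rows, which matches the table.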

Workload fingerprint

L4 makes different choices for different workloads. The **8 fingerprint
values** are:

The fingerprint is supplied via a 3-question wizard (Tier 1 default) or
auto-inferred from deployment telemetry (P4 feature, flag
ENABLE_L4_AUTO_WORKLOAD_FINGERPRINT). **The fingerprint drives the
catalog's pruning rules**: for example, REST is skipped for chat, and
classic drafts are skipped for MTP-native targets.
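The two pruning rules named above can be pictured as a filter over a candidate list. This is a minimal sketch under stated assumptions: the family identifiers and the `prune` function are hypothetical, and only the two rules the text names are implemented; the real catalog presumably carries more.

```python
# Hypothetical family identifiers, chosen for illustration only.
FAMILIES = ["classic_draft", "medusa", "eagle", "lookahead", "rest", "mtp"]


def prune(families, fingerprint, target_is_mtp_native):
    """Drop candidates ruled out by the two pruning rules named in the text."""
    kept = []
    for family in families:
        if family == "rest" and fingerprint == "chat":
            continue  # REST is skipped for chat workloads
        if family == "classic_draft" and target_is_mtp_native:
            continue  # classic drafts are skipped for MTP-native targets
        kept.append(family)
    return kept
```

Keeping each rule as an explicit branch makes it easy to see which fingerprint or target property removed a candidate.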

How to run a T1 experiment

All endpoints live under /api/v1/inferenceiq/speculative/.

curl -X POST https://platform.inwire.ai/api/v1/inferenceiq/speculative/runs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $(uuidgen)" \
  -d '{
    "target": {"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm", "engine_version": "0.8.0"},
    "workload": {"fingerprint": "chat"},
    "mode": "interactive",
    "goal": {"quality_gate": "topk5", "priorities": {"throughput": 0.7, "latency": 0.3}}
  }'

Python SDK equivalent:

from inwire.inferenceiq import SpeculativeClient

client = SpeculativeClient()
run = client.runs.create(
    target={"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm"},
    workload={"fingerprint": "chat"},
    mode="interactive",
    goal={"quality_gate": "topk5"},
)
for event in client.runs.stream(run.id):
    print(event)
winner = client.runs.get(run.id).winner
client.runs.apply(run.id)  # Hands off to spec 13 apply pipeline.

The SSE stream emits queued → running → per-rep → completed events.
Cache-hit runs short-circuit to completed in under a second and
surface a cache_hit=true flag on the response.
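For callers that use neither curl nor the SDK, the same T1 request can be assembled with the Python standard library. Only the URL, the Authorization and Idempotency-Key headers, and the body fields come from the curl example; `build_run_request` is a hypothetical helper, the Content-Type header is an assumption for a JSON body, and the request is built but not sent.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://platform.inwire.ai/api/v1/inferenceiq/speculative"


def build_run_request(token, payload):
    """Build (but do not send) a POST to the /runs endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/runs",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption: JSON request body
            "Idempotency-Key": str(uuid.uuid4()),  # fresh key per logical request
        },
    )


payload = {
    "target": {
        "model_ref": "meta-llama/Llama-3.1-70B-Instruct",
        "gpu": "H100-80GB",
        "engine": "vllm",
        "engine_version": "0.8.0",
    },
    "workload": {"fingerprint": "chat"},
    "mode": "interactive",
    "goal": {"quality_gate": "topk5", "priorities": {"throughput": 0.7, "latency": 0.3}},
}
request = build_run_request("example-token", payload)
```

Passing `request` to `urllib.request.urlopen` would send it. When retrying the same logical request, reuse the same Idempotency-Key value rather than generating a new one.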

How to run a T2 tune session

After a T1 run surfaces a pareto frontier, launch a Bayesian tune:

curl -X POST .../speculative/tune-sessions \
  -d '{"run_id": "<t1-run-id>", "budget_minutes": 60}'

The session seeds priors from T1 pareto points, drops insensitive
dimensions, and runs budgeted BO. The convergence chart is a
best-so-far curve; inspect it mid-run via /events SSE or the frontend
SpecTuneSessionCard. A session that crosses 80% of its budget without
improving the top-3 by more than 1σ is auto-terminated and flagged.
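The budget bounds and the auto-termination rule can be sketched as two small checks. The numbers (default 60 min, max 480, the 80% mark, 1σ) come from this guide; everything else is an assumption made for illustration: the function names are hypothetical, "improving the top-3" is read as the mean of the top-3 scores rising relative to a baseline snapshot, and `sigma` is taken to be the measured run-to-run noise.

```python
from statistics import mean

DEFAULT_BUDGET_MIN = 60  # default from the tier description
MAX_BUDGET_MIN = 480     # maximum from the tier description


def resolve_budget(budget_minutes=None):
    """Return the effective budget; rejecting out-of-range values is an assumption."""
    if budget_minutes is None:
        return DEFAULT_BUDGET_MIN
    if not 0 < budget_minutes <= MAX_BUDGET_MIN:
        raise ValueError(f"budget_minutes must be in (0, {MAX_BUDGET_MIN}]")
    return budget_minutes


def should_auto_terminate(elapsed_min, budget_min, top3_baseline, top3_now, sigma):
    """True once 80% of the budget is spent and the top-3 gained no more than 1 sigma."""
    if elapsed_min < 0.8 * budget_min:
        return False  # still inside the first 80% of the budget
    improvement = mean(top3_now) - mean(top3_baseline)
    return improvement <= sigma
```

With a 60-minute budget the check only becomes active at minute 48, so an early plateau does not stop a session prematurely.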

How to train a T3 head

Tier 3 trains custom Medusa-3 heads (GA) or EAGLE-3 heads (preview
behind ENABLE_L4_EAGLE3_TRAINING) for a customer's fine-tuned target.

Calibration set. Prompts representative of the customer's workload.
JSONL format, {"prompt": "...", "completion": "..."}. More diverse =
more generalizable heads.

Held-out eval. Scored by the quality gate; the trained heads must hit
the gate's thresholds before registry write.

Lifecycle. validated (passed the quality gate) → uat (approved for
staging; requires developer+) → production (approved for prod;
requires admin or head_reviewer, audit log written to
speculative_head_approvals) → deprecated (flagged unsafe).

Approvals. Transitions require a 1-line rationale and write an audit
row. Deprecation is non-destructive; runs that already bound to the
head continue.
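The approval rules can be pictured as a small transition table. This is a sketch, not the governance implementation: `APPROVER_ROLES` and `approve_transition` are hypothetical, the `validated` starting state is taken from the T3 example later in this guide, "developer+" is assumed to mean developer or any more senior role listed here, and the deprecation transition is left out because the guide does not say who may trigger it.

```python
# Forward transitions and the roles allowed to approve them.
# Assumption: "developer+" covers developer, head_reviewer and admin.
APPROVER_ROLES = {
    ("validated", "uat"): {"developer", "head_reviewer", "admin"},
    ("uat", "production"): {"admin", "head_reviewer"},
}


def approve_transition(current, target, role, rationale):
    """Validate a lifecycle transition and return the audit row to persist."""
    allowed = APPROVER_ROLES.get((current, target))
    if allowed is None:
        raise ValueError(f"unsupported transition: {current} -> {target}")
    if role not in allowed:
        raise PermissionError(f"role {role!r} may not approve {current} -> {target}")
    text = rationale.strip()
    if not text or "\n" in text:
        raise ValueError("a 1-line rationale is required")
    return {
        "from_state": current,
        "to_state": target,
        "approved_by_role": role,
        "rationale": text,
    }
```

Returning the audit row from the same function that validates the transition keeps the two from drifting apart: a transition cannot succeed without producing a row.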

Training takes ~30–90 min for Medusa-3 and ~5–30 h for EAGLE-3 on H100.
Status, per-epoch loss, and measured acceptance rate stream via /events.
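Before uploading, the calibration JSONL described in this section can be checked locally. The sketch below assumes only what the guide states, one JSON object per line with string `prompt` and `completion` fields; `load_calibration` is a hypothetical helper, and the server may well enforce further limits.

```python
import json


def load_calibration(lines):
    """Parse JSONL calibration records, requiring string prompt/completion fields."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        for field in ("prompt", "completion"):
            if not isinstance(record.get(field), str):
                raise ValueError(f"line {lineno}: missing or non-string {field!r}")
        records.append({"prompt": record["prompt"], "completion": record["completion"]})
    return records
```

Because it accepts any iterable of strings, the function works on an open file handle (`load_calibration(open(path, encoding="utf-8"))`) as well as on an in-memory list.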

Quality gate

L4 extends L1's 6-method quality gate with a 7th method
(constraint_violation) covering structured-output workloads. The
default is topk5. Override per-run via goal.quality_gate.

| Method | Pass / fail | When to use |
|---|---|---|
| exact | top-1 agreement ≥ threshold | deterministic workloads, tight quality bar |
| topk5 (default) | top-5 agreement ≥ 0.95 | general-purpose; balances strictness + acceptance headroom |
| logit_kl | KL(target ‖ spec) ≤ ε | research / production canary |
| mmlu_slice | ΔMMLU ≤ threshold on sliced eval | regression guard for general models |
| constraint_violation | ≤ N schema violations per 1k tokens | agent / structured-output workloads |
| skipped | gate disabled | explicit opt-out (research only, audit-logged) |
| pending | gate awaiting evaluation | transient status during async measurement |

A gate fail sets quality_preserved=0.0 and
error_stage="quality_regression"; the combo is disqualified by the
scoring layer regardless of throughput.
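A topk5-style check and the gate-fail bookkeeping can be sketched as follows. The 0.95 threshold and the two fields written on failure come from this section. The definition of "top-5 agreement" used here (the fraction of positions where the speculative token appears in the target's top-k set) is an assumption, as are the function names and the result-dict shape.

```python
def topk_agreement(spec_tokens, target_topk):
    """Fraction of positions whose speculative token is in the target's top-k set."""
    if not spec_tokens or len(spec_tokens) != len(target_topk):
        raise ValueError("need equal-length, non-empty sequences")
    hits = sum(1 for token, topk in zip(spec_tokens, target_topk) if token in topk)
    return hits / len(spec_tokens)


def apply_gate(result, agreement, threshold=0.95):
    """Return a copy of result, marked as a quality regression if the gate fails."""
    gated = dict(result)
    if agreement < threshold:
        gated["quality_preserved"] = 0.0
        gated["error_stage"] = "quality_regression"
    return gated
```

A scoring layer can then drop any result whose `error_stage` is `"quality_regression"` before ranking by throughput, which is how a failing combo stays disqualified however fast it is.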

Security

L4 ships with a 3-tier security model per-deployment:

acceptance-rate telemetry. Suitable for single-tenant.

injection on acceptance path (F-S2, configurable ms), tenant-scoped
metrics. Recommended for shared-GPU multi-tenant.

deployment via F-S3. Use for workloads where the acceptance-timing
side-channel is unacceptable (regulated finance, healthcare).

These mitigations close the two side channels described in the 2025 papers:

on shared draft+target. Mitigated by F-S1 (per-tenant cache) +
F-S2 (timing jitter).

Mitigated by F-S1 + F-S3 (opt-out).

F-S4 (audit logging) and F-S5 (security ADR) ship on-by-default. See
the runbook for the security ADR link and residual-risk documentation.
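To make the idea behind timing jitter concrete, the sketch below wraps an acceptance step so that each call is followed by a random delay of up to a configurable number of milliseconds. It illustrates the general technique only and is not the F-S2 implementation: `with_timing_jitter` and its parameters are hypothetical, and uniform jitter is an assumed distribution.

```python
import random
import time


def with_timing_jitter(fn, max_jitter_ms, rng=None, sleep=time.sleep):
    """Wrap fn so each call is followed by a random delay in [0, max_jitter_ms] ms."""
    rng = rng or random.SystemRandom()  # OS-backed randomness by default

    def wrapped(*args, **kwargs):
        result = fn(*args, **kwargs)
        # The delay is drawn independently of the result, so observed latency
        # carries less information about how many draft tokens were accepted.
        sleep(rng.uniform(0.0, max_jitter_ms) / 1000.0)
        return result

    return wrapped
```

The `rng` and `sleep` parameters exist so the wrapper can be exercised in tests without real waiting. Jitter blurs the timing signal rather than removing it, which is presumably why the guide pairs it with a per-tenant cache and offers an opt-out.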

Integration patterns

| Tool | How L4 plays with it |
|---|---|
| L0 Hardware Profiling | L4 reads the L0 profile for feasibility (does the draft fit alongside the target on this GPU class?). Heterogeneous-GPU draft routing queries L0 topology. |
| L1 Kernel Tuning | Medusa / EAGLE have their own kernel hot paths. L4 benchmarks run on L1-optimized kernels by default and do not re-optimize kernels inside L4. |
| L2 Quantization | Draft-model quantization is a subsearch within L4 (FP16 / FP8 / INT4 drafts). L4 delegates quant-config validation to L2's catalog. |
| L3 Parallelism | L4 respects the deployment's current TP configuration; it does NOT search across tensor-parallel configs. |
| L5 Batching | L4's online adaptive k (preview behind ENABLE_L4_ADAPTIVE_K) reads L5's batch metrics to shrink k under batch pressure. |
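The L0 feasibility question ("does the draft fit alongside the target on this GPU class?") reduces to a memory-budget comparison. The sketch below is a deliberately simplified illustration: `draft_fits`, its parameters and the single KV-cache reserve term are assumptions, not the L0 profile schema.

```python
def draft_fits(gpu_mem_gb, target_gb, draft_gb, kv_reserve_gb=0.0):
    """True if target weights, draft weights and a KV-cache reserve fit in GPU memory."""
    for name, value in (
        ("gpu_mem_gb", gpu_mem_gb),
        ("target_gb", target_gb),
        ("draft_gb", draft_gb),
        ("kv_reserve_gb", kv_reserve_gb),
    ):
        if value < 0:
            raise ValueError(f"{name} must be non-negative")
    return target_gb + draft_gb + kv_reserve_gb <= gpu_mem_gb
```

A fuller check would also account for activation memory and, for a quantized draft, the footprint at the chosen precision.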

Examples

1. Chat workload with Llama-3.1-70B + EAGLE-3 heads (T1 Compose). The
customer wants faster chat decode on a single H100-80GB. They POST to
/speculative/runs with fingerprint=chat, mode=interactive. The
predictor surfaces three candidates; the 30-day cache hits on EAGLE-3
heads (already pre-trained for Llama-3.1-70B on HuggingFace). The
response arrives in < 5 s with a SpecWinnerHero inline and a cache-hit
badge. Apply → the spec 13 apply pipeline rolls out the new engine
flags.

2. Code workload with REST corpus (T2 Tune). A Cursor-style code
completion workload. The customer uploads a curated 4 GB code corpus,
runs a T1 probe to confirm REST is a legal candidate, then launches a
T2 tune session with budget_minutes=90. BO searches over (k, tree
shape, retrieval-depth) and surfaces a 3.1× winner under
quality_gate=constraint_violation (code must be parseable).

3. Enterprise fine-tune with custom Medusa-3 heads (T3 Learn). An
enterprise customer has fine-tuned Llama-3.1-8B for internal support
triage. No pre-trained Medusa heads exist for their checkpoint. They
POST to /speculative/heads/train with a 10k-prompt calibration set
and a 1k-prompt held-out eval. Medusa-3 training runs ~45 min; the
topk5 quality gate passes; the head is written to the registry in the
validated state. After developer review the head transitions to uat,
then, after head_reviewer sign-off, to production. From then on, T1
runs for this target pick up the new head from the registry
automatically.

Deviations from spec L4 2026-04-22 LOCKED plan

See the PR §10 "Deviations" list. Notable items will be recorded
there as implementation lands; this doc is the tool-author reference
and follows the plan wherever the implementation diverges from it.