L4 Speculative Decoding — Tool Author's Guide
Purpose
L4 Speculative Decoding is the fifth shipped optimization layer in the
InferenceIQ v2 stack (sitting alongside L0 Hardware Profiling, L1
Kernel Tuning, L2 Quantization, and L5 Batching). It selects the best
combination of **algorithm family, draft source, k / tree-shape, and
engine flags** for a given `(target model, GPU, engine, workload
fingerprint)` tuple — without changing model weights or touching the
target-model checkpoint. The goal is sustained decode-throughput
speedups in the 1.6×–3.5× range while preserving output quality (via
the 7-method quality gate) and staying within a tenant's GPU budget.
L4 ships in three tiers (the same Compose / Tune / Learn shape L1 uses
for kernel tuning), so customers can move from a 5-second cache-hit
lookup all the way to training custom Medusa-3 heads for their
fine-tuned model, all inside the same tool surface.
Three tiers (the product shape)
- Tier 1 — Compose. Pick the best
`(algorithm, draft, config)` for
the target in under five seconds via a workload-fingerprint-aware
acceptance-rate predictor plus a 30-day winner cache. Production-safe,
deterministic, always available. Target user: the 80% of customers
who just want "a faster endpoint" without running experiments.
- Tier 2 — Tune. Bayesian-optimisation search over (algorithm ×
draft × k × tree-shape), seeded by Tier 1 pareto points, bounded by a
wall-clock budget (default 60 min, max 480 min). Resumable across worker
restarts. Each candidate is measured over N repetitions for statistical
rigor. Target user: performance engineers
squeezing out the last 10–20% on a well-understood model.
- Tier 3 — Learn. Train custom Medusa-3 or EAGLE-3 heads for the
customer's fine-tuned target. Governance (lifecycle + approval),
quality gate on held-out eval, registry write. Target user:
enterprise teams shipping a proprietary fine-tune for which the
pre-trained heads published for the top-50 open models no longer apply.
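
The Tier 1 winner cache can be pictured as a lookup keyed by the `(target model, GPU, engine, workload fingerprint)` tuple with a 30-day expiry. The sketch below is illustrative only: the class names, the `Winner` fields, and the in-memory dict are assumptions, not the shipped implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

CACHE_TTL = timedelta(days=30)  # the 30-day winner-cache window


@dataclass(frozen=True)
class CacheKey:
    """Hypothetical cache key mirroring the tuple L4 selects for."""
    model_ref: str
    gpu: str
    engine: str
    fingerprint: str


@dataclass(frozen=True)
class Winner:
    """Hypothetical winner record; field names are assumptions."""
    algorithm: str
    draft: str
    k: int


class WinnerCache:
    def __init__(self) -> None:
        self._entries: Dict[CacheKey, Tuple[Winner, datetime]] = {}

    def put(self, key: CacheKey, winner: Winner, now: datetime) -> None:
        self._entries[key] = (winner, now)

    def get(self, key: CacheKey, now: datetime) -> Optional[Winner]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        winner, stored_at = entry
        if now - stored_at > CACHE_TTL:
            # Expired: drop the entry so the caller falls back to the predictor.
            del self._entries[key]
            return None
        return winner
```

A miss (or an expired entry) would send the request through the acceptance-rate predictor instead of short-circuiting.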
Algorithm families
L4 covers seven families in v1, including a stretch "cascade" family
behind a feature flag. Speedup ranges are taken from the 2026 literature,
measured on 70B-class targets under interactive batch=1 pressure.
| Family | Footprint | Speedup | Prereq | Engine coverage |
|---|---|---|---|---|
| Classic draft model | small draft + target | 1.6–2.3× | same-family draft available | vLLM, TGI, SGLang, TRT-LLM, LMDeploy |
| Medusa-2 / 2.1 / 3 | heads grafted on target | 2.0–3.2× | Medusa heads for target (pre-trained or T3-trained) | SGLang (best), vLLM, TGI, TRT-LLM, LMDeploy |
| EAGLE-1 / 2 / 3 | feature-level draft net | 2.3–3.5× | EAGLE heads (top-50 open models on HF, or T3-trained) | SGLang (best — EAGLE-3), vLLM (EAGLE-2), TRT-LLM, TGI, LMDeploy |
| Lookahead (Jacobi) | zero — target only | 1.5–2.2× | none | vLLM, SGLang (partial) |
| REST (retrieval) | target + corpus index | 2.5–3.5× on code/RAG | indexed corpus | vLLM, SGLang (partial) |
| MTP (pass-through) | baked into target | native ~4× | target trained with MTP heads (e.g. DeepSeek-V3) | vLLM, SGLang catching up |
| Cascade (stretch) | draft → medium → target | 2.8–3.5× | three compatible tiers | SGLang preview (behind flag) |
Full coverage matrix + per-engine version windows live in
L4-speculative-decoding-RESEARCH.md §4.
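
For intuition on how acceptance rate and k interact, the standard result for classic draft-model speculative decoding gives the expected number of tokens emitted per target forward pass as (1 − α^(k+1)) / (1 − α), assuming each drafted token is accepted independently with probability α. The helper below is a sketch of that formula only. It ignores draft-model cost, so it is an upper bound on tokens per step rather than a wall-clock speedup, and it is not L4's acceptance-rate predictor.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per target pass for k drafted tokens.

    Assumes an i.i.d. per-token acceptance probability `alpha`.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the target's own token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

The curve flattens quickly as k grows at a fixed α, which is one reason k is treated as a search dimension rather than simply maximised.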
Workload fingerprint
L4 makes different choices for different workloads. The **8 fingerprint
values** are:
- `chat`: interactive, batch 1–2, latency-sensitive. EAGLE-3 or Medusa-3.
- `code`: code completion, batch 1–4. REST shines here (3×+).
- `rag`: retrieval-augmented, batch 2–8. REST with corpus.
- `agent`: ReAct / tool-use, batch 1–4, constrained output. Medusa-3 + constraint gate.
- `long_context`: summarization / doc QA. Prefill dominates; Lookahead or EAGLE-3 on decode.
- `reasoning`: chain-of-thought. Long decodes favor EAGLE-3.
- `vision`: multimodal prefill; decode speedup is a small fraction.
- `multimodal`: mixed text + vision/audio.
Fingerprint is supplied via a 3-question wizard (Tier 1 default) or
auto-inferred from deployment telemetry (P4 feature, flag
ENABLE_L4_AUTO_WORKLOAD_FINGERPRINT). **The fingerprint drives the
catalog's pruning rules** — e.g. REST is skipped for chat, classic
drafts are skipped for MTP-native targets.
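
The pruning rules can be sketched as a filter over the family list. The two fingerprint rules (REST skipped for `chat`, classic drafts skipped for MTP-native targets) come from the text above, and the corpus and MTP prerequisites come from the family table. The family identifiers, the `Target` type, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers for the non-stretch families in the table.
FAMILIES = ["classic_draft", "medusa", "eagle", "lookahead", "rest", "mtp"]


@dataclass(frozen=True)
class Target:
    model_ref: str
    mtp_native: bool = False  # target was trained with MTP heads


def prune_families(fingerprint: str, target: Target,
                   has_corpus_index: bool = False) -> List[str]:
    """Return the families that survive pruning for this workload."""
    kept = []
    for family in FAMILIES:
        # REST is skipped for chat and needs an indexed corpus.
        if family == "rest" and (fingerprint == "chat" or not has_corpus_index):
            continue
        # Classic drafts are skipped for MTP-native targets.
        if family == "classic_draft" and target.mtp_native:
            continue
        # MTP pass-through only applies when the target ships MTP heads.
        if family == "mtp" and not target.mtp_native:
            continue
        kept.append(family)
    return kept
```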
How to run a T1 experiment
All endpoints live under /api/v1/inferenceiq/speculative/.
```bash
curl -X POST https://platform.inwire.ai/api/v1/inferenceiq/speculative/runs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $(uuidgen)" \
  -d '{
    "target": {"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm", "engine_version": "0.8.0"},
    "workload": {"fingerprint": "chat"},
    "mode": "interactive",
    "goal": {"quality_gate": "topk5", "priorities": {"throughput": 0.7, "latency": 0.3}}
  }'
```
Python SDK equivalent:
```python
from inwire.inferenceiq import SpeculativeClient

client = SpeculativeClient()

# Create a T1 run for the target tuple and workload fingerprint.
run = client.runs.create(
    target={"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm"},
    workload={"fingerprint": "chat"},
    mode="interactive",
    goal={"quality_gate": "topk5"},
)

# Follow the SSE event stream until the run finishes.
for event in client.runs.stream(run.id):
    print(event)

winner = client.runs.get(run.id).winner
client.runs.apply(run.id)  # Hands off to spec 13 apply pipeline.
```
The SSE stream emits queued → running → per-rep → completed events.
Cache-hit runs short-circuit to completed in under a second and
surface a cache_hit=true flag on the response.
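
For callers that consume the stream without the SDK, a minimal sketch of parsing server-sent events and waiting for completion follows. The `data:` framing is standard SSE. The JSON payload shape and the `status` key are assumptions for illustration; only the `cache_hit` flag is named in this guide.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded JSON payload per SSE event (blank-line delimited)."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips a single leading space
            data_parts.append(value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []


def wait_for_completion(events: Iterable[Dict]) -> Dict:
    """Return the first event whose (assumed) status is 'completed'."""
    for event in events:
        if event.get("status") == "completed":
            return event
    raise RuntimeError("stream ended before a completed event")
```

A cache-hit run would be expected to yield the `completed` event almost immediately, so callers should not assume intermediate events always arrive.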
How to run a T2 tune session
After a T1 run surfaces a pareto frontier, launch a Bayesian tune:
```bash
curl -X POST .../speculative/tune-sessions \
  -d '{"run_id": "<t1-run-id>", "budget_minutes": 60}'
```
The session seeds priors from T1 pareto points, drops insensitive
dimensions, and runs budgeted BO. The convergence chart is a
best-so-far curve; inspect it mid-run via /events SSE or the frontend
SpecTuneSessionCard. A session that crosses 80% of its budget without
improving the top-3 by more than 1σ is auto-terminated and flagged.
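
The auto-termination rule can be sketched as follows. This is one reading of the rule, under stated assumptions: "improving the top-3" is taken as the change in the mean score of the three best candidates relative to the session's starting top-3, and σ is a measurement standard deviation supplied by the caller. The function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def should_auto_terminate(elapsed_min: float, budget_min: float,
                          baseline_top3: Sequence[float],
                          current_top3: Sequence[float],
                          sigma: float) -> bool:
    """True when >80% of the budget is spent and the top-3 gain is <= 1 sigma."""
    if elapsed_min <= 0.8 * budget_min:
        return False  # still inside the first 80% of the budget
    improvement = mean(current_top3) - mean(baseline_top3)
    return improvement <= sigma
```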
How to train a T3 head
Tier 3 trains custom Medusa-3 heads (GA) or EAGLE-3 heads (preview
behind ENABLE_L4_EAGLE3_TRAINING) for a customer's fine-tuned target.
- Calibration set — 2k–20k prompts representative of the target
workload. JSONL format, {"prompt": "...", "completion": "..."}.
More diverse = more generalizable heads.
- Eval set — held-out 500–2k prompts. Used by the quality gate;
the trained heads must hit the gate's thresholds before registry
write.
- Lifecycle —
`draft` (training in progress) → `validated` (passed
quality gate) → `uat` (approved for staging; requires developer+) →
`production` (approved for prod; requires admin or head_reviewer,
audit log written to speculative_head_approvals) → `deprecated`
(flagged unsafe).
- Approval workflow — transitions to
`uat` or `production` require
a 1-line rationale and write an audit row. Deprecation is
non-destructive; runs that already bound to the head continue.
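
The lifecycle and approval rules above can be sketched as a small state machine. The states and the `uat` / `production` role requirements follow the list. The reading of "developer+" as developer, head_reviewer, or admin, the roles allowed to deprecate, and the audit-row fields are all assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# (from, to) -> roles allowed to approve; None means no approver
# (draft -> validated is set by the quality gate, not by a person).
TRANSITIONS = {
    ("draft", "validated"): None,
    ("validated", "uat"): {"developer", "head_reviewer", "admin"},  # "developer+" (assumed)
    ("uat", "production"): {"admin", "head_reviewer"},
    ("production", "deprecated"): {"admin", "head_reviewer"},  # assumption
}
NEEDS_RATIONALE = {"uat", "production"}


def transition(current: str, target: str, role: Optional[str] = None,
               rationale: str = "") -> Dict[str, Optional[str]]:
    """Validate a lifecycle transition and return a hypothetical audit row."""
    if (current, target) not in TRANSITIONS:
        raise ValueError(f"illegal transition {current} -> {target}")
    allowed = TRANSITIONS[(current, target)]
    if allowed is not None and role not in allowed:
        raise PermissionError(f"role {role!r} cannot approve {target}")
    if target in NEEDS_RATIONALE and not rationale.strip():
        raise ValueError(f"transition to {target} requires a rationale")
    return {
        "from": current,
        "to": target,
        "role": role,
        "rationale": rationale.strip(),
        "at": datetime.now(timezone.utc).isoformat(),
    }
```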
Training takes ~30–90 min for Medusa-3, ~5–30 h for EAGLE-3 on H100.
Status + per-epoch loss + measured acceptance rate stream via /events.
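
Before uploading a calibration set, a local check of the JSONL format described above can catch malformed rows early. The validator below is a sketch: the function name and error messages are hypothetical, and the row-count bounds default to the 2k–20k calibration range.

```python
import json
from typing import Iterable, List


def validate_calibration_jsonl(lines: Iterable[str], min_rows: int = 2_000,
                               max_rows: int = 20_000) -> List[str]:
    """Return a list of problems; an empty list means the set looks valid."""
    errors: List[str] = []
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        count += 1
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        for field in ("prompt", "completion"):
            if not isinstance(row.get(field), str) or not row[field]:
                errors.append(f"line {lineno}: missing or empty '{field}'")
    if not min_rows <= count <= max_rows:
        errors.append(f"expected {min_rows}-{max_rows} rows, found {count}")
    return errors
```

The same check could be reused for the eval set by passing `min_rows=500, max_rows=2_000`.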
Quality gate
L4 extends L1's 6-method quality gate with a 7th method
(constraint_violation) covering structured-output workloads. The
default is topk5. Override per-run via goal.quality_gate.
| Method | Pass / fail | When to use |
|---|---|---|
| `exact` | top-1 agreement ≥ threshold | deterministic workloads, tight quality bar |
| `topk5` (default) | top-5 agreement ≥ 0.95 | general-purpose; balances strictness + acceptance headroom |
| `logit_kl` | KL(target ‖ spec) ≤ ε | research / production canary |
| `mmlu_slice` | ΔMMLU ≤ threshold on sliced eval | regression guard for general models |
| `constraint_violation` | ≤ N schema violations per 1k tokens | agent / structured-output workloads |
| `skipped` | gate disabled | explicit opt-out (research only, audit-logged) |
| `pending` | gate awaiting evaluation | transient status during async measurement |
A gate fail sets quality_preserved=0.0 and
error_stage="quality_regression"; the combo is disqualified by the
scoring layer regardless of throughput.
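
A minimal sketch of the `topk5` check and the disqualification step follows. It assumes "top-5 agreement" means the fraction of positions where the speculative token appears in the target's top-5 set, and that a passing gate sets `quality_preserved=1.0`; both are assumptions, as are the helper names and the `throughput_x` field.

```python
from typing import Dict, List, Sequence, Set


def topk_agreement(target_topk: Sequence[Set[int]],
                   spec_tokens: Sequence[int]) -> float:
    """Fraction of positions whose speculative token is in the target's top-k."""
    if not spec_tokens or len(target_topk) != len(spec_tokens):
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(1 for topk, tok in zip(target_topk, spec_tokens) if tok in topk)
    return hits / len(spec_tokens)


def apply_gate(candidate: Dict, agreement: float,
               threshold: float = 0.95) -> Dict:
    """Annotate a candidate with the gate outcome described above."""
    result = dict(candidate)
    if agreement >= threshold:
        result["quality_preserved"] = 1.0  # assumed pass value
    else:
        result["quality_preserved"] = 0.0
        result["error_stage"] = "quality_regression"
    return result


def rank(candidates: List[Dict]) -> List[Dict]:
    """Drop gate failures first, then order survivors by throughput."""
    survivors = [c for c in candidates if c.get("quality_preserved") != 0.0]
    return sorted(survivors, key=lambda c: c["throughput_x"], reverse=True)
```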
Security
L4 ships with a 3-tier security model per-deployment:
- `default`: per-org draft cache, audit logging on, standard
acceptance-rate telemetry. Suitable for single-tenant.
- `hardened`: per-tenant draft cache (F-S1), timing-jitter
injection on acceptance path (F-S2, configurable ms), tenant-scoped
metrics. Recommended for shared-GPU multi-tenant.
- `sensitive_opt_out`: spec-dec fully disabled for this
deployment via F-S3. Use for workloads where the acceptance-timing
side-channel is unacceptable (regulated finance, healthcare).
These mitigations address the two side channels described in the 2025 papers:
- SpecLeak — prompt extraction via acceptance-timing fingerprints
on shared draft+target. Mitigated by F-S1 (per-tenant cache) +
F-S2 (timing jitter).
- DraftEcho — model inversion via draft-rejection patterns.
Mitigated by F-S1 + F-S3 (opt-out).
F-S4 (audit logging) and F-S5 (security ADR) ship on-by-default. See
the runbook for the security ADR link and residual-risk documentation.
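
How the three tiers change behaviour can be sketched as two small helpers: one that picks the draft-cache scope and one that picks a jitter delay. This is illustrative only. The tuple-shaped scope, the uniform jitter distribution, and the function names are assumptions, and the sketch is not a vetted side-channel mitigation.

```python
import random
from typing import Optional, Tuple


def draft_cache_scope(tier: str, org_id: str,
                      tenant_id: str) -> Optional[Tuple[str, str]]:
    """Return the cache partition for a deployment, or None if spec-dec is off."""
    if tier == "sensitive_opt_out":
        return None  # F-S3: speculative decoding disabled entirely
    if tier == "hardened":
        return ("tenant", tenant_id)  # F-S1: per-tenant draft cache
    if tier == "default":
        return ("org", org_id)
    raise ValueError(f"unknown security tier: {tier}")


def acceptance_jitter_ms(tier: str, max_jitter_ms: float,
                         rng: Optional[random.Random] = None) -> float:
    """Delay to add on the acceptance path; non-zero only for 'hardened' (F-S2)."""
    if tier != "hardened":
        return 0.0
    rng = rng or random.Random()
    return rng.uniform(0.0, max_jitter_ms)
```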
Integration patterns
| Tool | How L4 plays with it |
|---|---|
| L0 Hardware Profiling | L4 reads the L0 profile for feasibility (does the draft fit alongside the target on this GPU class?). Heterogeneous-GPU draft routing queries L0 topology. |
| L1 Kernel Tuning | Medusa / EAGLE have their own kernel hot paths. L4 benchmarks run on L1-optimized kernels by default and do not re-optimize kernels inside L4. |
| L2 Quantization | Draft-model quantization is a subsearch within L4 (FP16 / FP8 / INT4 drafts). L4 delegates quant-config validation to L2's catalog. |
| L3 Parallelism | L4 respects the deployment's current TP configuration; it does NOT search across tensor-parallel configs. |
| L5 Batching | L4's online adaptive k (preview behind ENABLE_L4_ADAPTIVE_K) reads L5's batch metrics to shrink k under batch pressure. |
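
As an illustration of the adaptive-k idea in the L5 row, the sketch below shrinks k linearly as batch size rises. The thresholds, the linear ramp, and the function name are all hypothetical; the actual policy behind ENABLE_L4_ADAPTIVE_K is not described in this guide.

```python
def adaptive_k(base_k: int, batch_size: int, pressure_start: int = 4,
               pressure_full: int = 32, min_k: int = 1) -> int:
    """Shrink the speculation depth k as batch pressure grows.

    Below `pressure_start` the tuned k is used unchanged; at or above
    `pressure_full` k is clamped to `min_k`; in between it ramps linearly.
    """
    if batch_size <= pressure_start:
        return base_k
    if batch_size >= pressure_full:
        return min_k
    frac = (batch_size - pressure_start) / (pressure_full - pressure_start)
    return max(min_k, round(base_k - frac * (base_k - min_k)))
```

The intuition is that speculation spends spare compute, and larger batches leave less of it, so a deep draft tree stops paying for itself.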
Examples
1. Chat workload with Llama-3.1-70B + EAGLE-3 heads (T1 Compose).
Customer wants faster chat decode on a single H100-80GB. They POST to
/speculative/runs with fingerprint=chat, mode=interactive. The
predictor surfaces three candidates; the 30-day cache hits on EAGLE-3
heads (already pre-trained for Llama-3.1-70B on HuggingFace). Response
arrives in < 5 s with a SpecWinnerHero inline and a cache-hit badge.
Apply → spec 13 apply pipeline rolls out new engine flags.
2. Code workload with REST corpus (T2 Tune). Cursor-style code
completion workload. Customer uploads a curated 4 GB code corpus, runs
a T1 probe to confirm REST is a legal candidate, then launches a T2
tune session with budget_minutes=90. BO searches over (k, tree
shape, retrieval depth) and surfaces a 3.1× winner that passes
quality_gate=constraint_violation (code must be parseable).
3. Enterprise fine-tune with custom Medusa-3 heads (T3 Learn). An
enterprise customer has fine-tuned Llama-3.1-8B for internal support
triage. No pre-trained Medusa heads exist for their checkpoint. They
POST to /speculative/heads/train with a 10k-prompt calibration set
and a 1k-prompt held-out eval. Medusa-3 training runs ~45 min; quality
gate topk5 passes; head is written to registry in validated state.
After developer review the head transitions to uat, then after
head_reviewer sign-off to production. Now T1 runs for this target
pick up the new head from the registry automatically.
Deviations from spec L4 2026-04-22 LOCKED plan
See the PR §10 "Deviations" list. Notable items will be recorded
there as implementation lands. This doc is the tool-author reference;
where the implementation and the plan diverge, it mirrors the plan.