L4 Speculative Decoding — Tool Author's Guide

Purpose

L4 Speculative Decoding is the fifth shipped optimization layer in the
InferenceIQ v2 stack (sitting alongside L0 Hardware Profiling, L1
Kernel Tuning, L2 Quantization, and L5 Batching). It selects the best
combination of **algorithm family, draft source, k / tree-shape, and
engine flags** for a given `(target model, GPU, engine, workload
fingerprint)` tuple, without changing model weights or touching the
target-model checkpoint. The goal is sustained decode-throughput
speedups in the 1.6×–3.5× range while preserving output quality (via
the 7-method quality gate) and staying within a tenant's GPU budget.

L4 ships in three tiers (the same Compose / Tune / Learn shape L1 uses
for kernel tuning), so customers can move from a 5-second cache-hit
lookup all the way to training custom Medusa-3 heads for their
fine-tuned model, all inside the same tool surface.

Three tiers (the product shape)

Tier 1: Compose. Pick the best (algorithm, draft, config) for
the target in under five seconds via a workload-fingerprint-aware
acceptance-rate predictor plus a 30-day winner cache. Production-safe,
deterministic, always available. Target user: the 80% of customers
who just want "a faster endpoint" without running experiments.

Tier 2: Tune. Bayesian-optimisation search over (algorithm ×
draft × k × tree-shape), seeded by Tier 1 pareto points, bounded by a
wall-clock budget (default 60 min, max 480 min). Resumable across worker
restarts. N-rep statistical rigor. Target user: performance engineers
squeezing out the last 10–20% on a well-understood model.

Tier 3: Learn. Train custom Medusa-3 or EAGLE-3 heads for the
customer's fine-tuned target. Governance (lifecycle + approval),
quality gate on held-out eval, registry write. Target user:
enterprise teams shipping a proprietary fine-tune for which the
pre-trained heads published for the top-50 open models no longer apply.
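The Tier 1 winner cache can be pictured as a keyed lookup with a 30-day expiry. The sketch below is a minimal in-memory illustration, not the shipped implementation: the `WinnerCache` and `CacheKey` names, the dict-shaped winner, and the choice to key on exactly the `(target model, GPU, engine, workload fingerprint)` tuple are all assumptions made for the example.

```python
import time
from typing import NamedTuple, Optional

THIRTY_DAYS_S = 30 * 24 * 60 * 60  # TTL matching the "30-day winner cache"


class CacheKey(NamedTuple):
    # Assumed key: the (target model, GPU, engine, workload fingerprint) tuple.
    model_ref: str
    gpu: str
    engine: str
    fingerprint: str


class WinnerCache:
    """Hypothetical in-memory stand-in for the Tier 1 winner cache."""

    def __init__(self, ttl_s: float = THIRTY_DAYS_S) -> None:
        self._ttl_s = ttl_s
        self._entries: dict[CacheKey, tuple[float, dict]] = {}

    def put(self, key: CacheKey, winner: dict, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries[key] = (stored_at, winner)

    def get(self, key: CacheKey, now: Optional[float] = None) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: the predictor has to run
        stored_at, winner = entry
        current = time.time() if now is None else now
        if current - stored_at > self._ttl_s:
            del self._entries[key]  # expired: treat as a miss
            return None
        return winner
```

The `now` parameter exists only so expiry can be exercised without waiting; a changed fingerprint produces a different key and therefore a miss.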

Algorithm families

L4 covers seven families in v1; the seventh, a stretch "cascade" family,
sits behind a feature flag. Speedup ranges come from the 2026 literature
and were measured on 70B-class targets under interactive batch=1 pressure.

| Family | Footprint | Speedup | Prereq | Engine coverage |
| --- | --- | --- | --- | --- |
| Classic draft model | small draft + target | 1.6–2.3× | same-family draft available | vLLM, TGI, SGLang, TRT-LLM, LMDeploy |
| Medusa-2 / 2.1 / 3 | heads grafted on target | 2.0–3.2× | Medusa heads for target (pre-trained or T3-trained) | SGLang (best), vLLM, TGI, TRT-LLM, LMDeploy |
| EAGLE-1 / 2 / 3 | feature-level draft net | 2.3–3.5× | EAGLE heads (top-50 open models on HF, or T3-trained) | SGLang (best, EAGLE-3), vLLM (EAGLE-2), TRT-LLM, TGI, LMDeploy |
| Lookahead (Jacobi) | zero (target only) | 1.5–2.2× | none | vLLM, SGLang (partial) |
| REST (retrieval) | target + corpus index | 2.5–3.5× on code/RAG | indexed corpus | vLLM, SGLang (partial) |
| MTP (pass-through) | baked into target | native ~4× | target trained with MTP heads (e.g. DeepSeek-V3) | vLLM, SGLang catching up |
| Cascade (stretch) | draft → medium → target | 2.8–3.5× | three compatible tiers | SGLang preview (behind flag) |

Full coverage matrix + per-engine version windows live in
L4-speculative-decoding-RESEARCH.md §4.
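For intuition about how acceptance rate and `k` interact in the classic draft-model family, the standard textbook analysis of speculative sampling is worth having at hand. The sketch below assumes each drafted token is accepted independently with probability `alpha`, which real workloads only approximate; it is not L4's acceptance-rate predictor, and the function names and the `draft_cost_ratio` parameter are introduced here for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass when k tokens are drafted.

    Assumes i.i.d. acceptance with probability alpha for each drafted token.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Geometric series 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Wall-clock speedup over plain autoregressive decoding.

    draft_cost_ratio is (one draft forward pass) / (one target forward pass).
    Each step costs k draft passes plus one target verification pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1.0)
```

With `alpha=0.8`, `k=4`, and a draft that costs 5% of a target pass, this model gives roughly 3.36 tokens per step and a speedup of about 2.8×. Raising `k` helps only while the extra draft cost is outweighed by additional accepted tokens, which is why `k` is a search dimension rather than a constant.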

Workload fingerprint

L4 makes different choices for different workloads. The **8 fingerprint
values** are:

chat — interactive, batch 1–2, latency-sensitive. EAGLE-3 or Medusa-3.
code — code completion, batch 1–4. REST shines here (3×+).
rag — retrieval-augmented, batch 2–8. REST with corpus.
agent — ReAct / tool-use, batch 1–4, constrained output. Medusa-3 + constraint gate.
long_context — summarization / doc QA. Prefill dominates; Lookahead or EAGLE-3 on decode.
reasoning — chain-of-thought. Long decodes favor EAGLE-3.
vision — multimodal prefill; decode speedup is a small fraction.
multimodal — mixed text + vision/audio.

Fingerprint is supplied via a 3-question wizard (Tier 1 default) or
auto-inferred from deployment telemetry (P4 feature, flag
ENABLE_L4_AUTO_WORKLOAD_FINGERPRINT). **The fingerprint drives the
catalog's pruning rules**: for example, REST is skipped for chat, and
classic drafts are skipped for MTP-native targets.
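The two pruning rules named above can be sketched as a simple filter. This is an illustration only: the family identifiers (`"rest"`, `"classic_draft"`, and so on), the `prune_families` function, and the `target_is_mtp_native` flag are hypothetical, and the real catalog presumably carries more rules than these two.

```python
FINGERPRINTS = {
    "chat", "code", "rag", "agent",
    "long_context", "reasoning", "vision", "multimodal",
}


def prune_families(candidates, fingerprint, target_is_mtp_native):
    """Drop algorithm families that the two documented rules exclude."""
    if fingerprint not in FINGERPRINTS:
        raise ValueError(f"unknown fingerprint: {fingerprint!r}")
    kept = []
    for family in candidates:
        if family == "rest" and fingerprint == "chat":
            continue  # rule 1: REST is skipped for chat
        if family == "classic_draft" and target_is_mtp_native:
            continue  # rule 2: classic drafts are skipped for MTP-native targets
        kept.append(family)
    return kept
```

For instance, `prune_families(["classic_draft", "eagle", "rest"], "chat", False)` keeps `classic_draft` and `eagle` and drops `rest`.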

How to run a T1 experiment

All endpoints live under /api/v1/inferenceiq/speculative/.

curl -X POST https://platform.inwire.ai/api/v1/inferenceiq/speculative/runs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $(uuidgen)" \
  -d '{
    "target": {"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm", "engine_version": "0.8.0"},
    "workload": {"fingerprint": "chat"},
    "mode": "interactive",
    "goal": {"quality_gate": "topk5", "priorities": {"throughput": 0.7, "latency": 0.3}}
  }'

Python SDK equivalent:

from inwire.inferenceiq import SpeculativeClient

client = SpeculativeClient()
run = client.runs.create(
    target={"model_ref": "meta-llama/Llama-3.1-70B-Instruct", "gpu": "H100-80GB", "engine": "vllm"},
    workload={"fingerprint": "chat"},
    mode="interactive",
    goal={"quality_gate": "topk5"},
)
for event in client.runs.stream(run.id):
    print(event)
winner = client.runs.get(run.id).winner
client.runs.apply(run.id)  # Hands off to spec 13 apply pipeline.

The SSE stream emits queued → running → per-rep → completed events.
Cache-hit runs short-circuit to completed in under a second and
surface a cache_hit=true flag on the response.
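A client consuming the event stream usually wants to check ordering and count repetitions. The sketch below works on plain dicts rather than the SDK's event objects, whose shape this guide does not specify; the `"type"` and `"cache_hit"` field names and the `summarize_events` helper are assumptions for illustration.

```python
from typing import Iterable

# Stage order as documented: queued -> running -> per-rep -> completed.
STAGE_ORDER = ["queued", "running", "per-rep", "completed"]


def summarize_events(events: Iterable[dict]) -> dict:
    """Validate stage ordering and summarize a run's event stream.

    Stages may be skipped (a cache hit can jump straight to completed),
    but they may never move backwards.
    """
    last_rank = -1
    reps = 0
    completed = False
    cache_hit = False
    for event in events:
        rank = STAGE_ORDER.index(event["type"])  # ValueError on unknown type
        if rank < last_rank:
            raise ValueError(f"out-of-order event: {event['type']}")
        last_rank = rank
        if event["type"] == "per-rep":
            reps += 1
        elif event["type"] == "completed":
            completed = True
            cache_hit = bool(event.get("cache_hit", False))
    return {"completed": completed, "reps": reps, "cache_hit": cache_hit}
```

A cache-hit run would then summarize to zero reps with `cache_hit` set, which is a cheap way to tell a cached answer from a freshly measured one.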

How to run a T2 tune session

After a T1 run surfaces a pareto frontier, launch a Bayesian tune:

curl -X POST .../speculative/tune-sessions \
  -d '{"run_id": "<t1-run-id>", "budget_minutes": 60}'

The session seeds priors from T1 pareto points, drops insensitive
dimensions, and runs budgeted BO. The convergence chart is a
best-so-far curve; inspect it mid-run via /events SSE or the frontend
SpecTuneSessionCard. A session that crosses 80% of its budget without
improving the top-3 by more than 1σ is auto-terminated and flagged.
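The auto-termination rule can be written down as a small predicate. The guide does not define how "improving the top-3" is measured, so this sketch makes two assumptions: improvement is the change in the mean score of the top three configurations relative to an earlier snapshot, and `sigma` is a noise estimate obtained from the N-rep measurements. The function name and parameters are hypothetical.

```python
from statistics import mean


def should_auto_terminate(elapsed_min, budget_min, top3_before, top3_now, sigma):
    """Return True when the session should be stopped and flagged.

    Fires once at least 80% of the wall-clock budget is spent and the mean
    top-3 score has improved by no more than one sigma since the snapshot.
    """
    if elapsed_min < 0.8 * budget_min:
        return False  # still inside the exploration window
    improvement = mean(top3_now) - mean(top3_before)
    return improvement <= sigma
```

Under these assumptions, a 60-minute session at minute 50 whose top-3 mean has moved by less than `sigma` would be stopped, while the same session at minute 30 would keep running regardless of progress.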

How to train a T3 head

Tier 3 trains custom Medusa-3 heads (GA) or EAGLE-3 heads (preview
behind ENABLE_L4_EAGLE3_TRAINING) for a customer's fine-tuned target.

Calibration set: 2k–20k prompts representative of the target
workload. JSONL format, {"prompt": "...", "completion": "..."}.
A more diverse set yields heads that generalize better.

Eval set: held-out 500–2k prompts. Used by the quality gate;
the trained heads must hit the gate's thresholds before registry
write.

Lifecycle: draft (training in progress) → validated (passed
quality gate) → uat (approved for staging; requires developer+) →
production (approved for prod; requires admin or head_reviewer,
audit log written to speculative_head_approvals) → deprecated
(flagged unsafe).

Approval workflow: transitions to uat or production require
a 1-line rationale and write an audit row. Deprecation is
non-destructive; runs already bound to the head continue.
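The lifecycle and approval rules above amount to a small state machine. The sketch below models them under stated assumptions: the chain is strictly linear, "developer+" is taken to mean developer, head_reviewer, or admin, and the audit row is a plain dict appended to a list rather than a row in speculative_head_approvals. `Head`, `transition`, and the role strings are illustrative names, not the service's API.

```python
from dataclasses import dataclass, field

# Linear lifecycle as documented (assumed to have no shortcuts).
NEXT_STATE = {
    "draft": "validated",
    "validated": "uat",
    "uat": "production",
    "production": "deprecated",
}

# Roles allowed to approve each gated transition.
REQUIRED_ROLES = {
    "uat": {"developer", "head_reviewer", "admin"},  # assumed meaning of "developer+"
    "production": {"admin", "head_reviewer"},
}


@dataclass
class Head:
    state: str = "draft"
    audit_log: list = field(default_factory=list)


def transition(head: Head, to_state: str, role: str, rationale: str = "") -> None:
    """Advance a head one lifecycle step, enforcing role and rationale rules."""
    if NEXT_STATE.get(head.state) != to_state:
        raise ValueError(f"illegal transition {head.state} -> {to_state}")
    if to_state in REQUIRED_ROLES:
        if role not in REQUIRED_ROLES[to_state]:
            raise PermissionError(f"role {role!r} cannot approve {to_state}")
        if not rationale.strip():
            raise ValueError("a 1-line rationale is required")
        # Gated transitions leave an audit row behind.
        head.audit_log.append(
            {"from": head.state, "to": to_state, "role": role, "rationale": rationale}
        )
    head.state = to_state
```

Keeping the role table separate from the transition table makes it easy to see that only the uat and production steps are gated.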

Training takes ~30–90 min for Medusa-3, ~5–30 h for EAGLE-3 on H100.
Status + per-epoch loss + measured acceptance rate stream via /events.
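Before submitting a training job, it can save a failed run to check the calibration file locally against the JSONL shape and 2k–20k size range described in this section. The helper below is a client-side sketch under those assumptions; `validate_calibration_jsonl` is a hypothetical name, and the service may apply additional checks that are not documented here.

```python
import json


def validate_calibration_jsonl(lines, min_rows=2_000, max_rows=20_000):
    """Check that every non-blank line is {"prompt": str, "completion": str}.

    Returns the number of valid rows; raises ValueError on the first problem.
    """
    rows = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        for key in ("prompt", "completion"):
            value = record.get(key)
            if not isinstance(value, str) or not value:
                raise ValueError(f"line {lineno}: missing or empty {key!r}")
        rows += 1
    if not min_rows <= rows <= max_rows:
        raise ValueError(f"{rows} rows is outside {min_rows}-{max_rows}")
    return rows
```

Passing an open file object works because the function only iterates over lines; for the eval set, the same check could be reused with `min_rows=500, max_rows=2_000`.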

Quality gate

L4 extends L1's 6-method quality gate with a 7th method
(constraint_violation) covering structured-output workloads. The
default is topk5. Override per-run via goal.quality_gate.

| Method | Pass / fail | When to use |
| --- | --- | --- |
| `exact` | top-1 agreement ≥ threshold | deterministic workloads, tight quality bar |
| `topk5` (default) | top-5 agreement ≥ 0.95 | general-purpose; balances strictness + acceptance headroom |
| `logit_kl` | KL(target ‖ spec) ≤ ε | research / production canary |
| `mmlu_slice` | ΔMMLU ≤ threshold on sliced eval | regression guard for general models |
| `constraint_violation` | ≤ N schema violations per 1k tokens | agent / structured-output workloads |
| `skipped` | gate disabled | explicit opt-out (research only, audit-logged) |
| `pending` | gate awaiting evaluation | transient status during async measurement |

A gate fail sets quality_preserved=0.0 and
error_stage="quality_regression"; the combo is disqualified by the
scoring layer regardless of throughput.
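To make the default gate concrete, here is one plausible reading of "top-5 agreement": the fraction of decode positions at which the token emitted under speculation appears in the target model's top-5 candidates. That definition, the `topk_agreement` and `apply_gate` helpers, and the value `quality_preserved=1.0` on a pass are assumptions; only the fail-side fields come from this guide.

```python
def topk_agreement(spec_tokens, target_topk):
    """Fraction of positions whose speculative token is in the target's top-k set.

    spec_tokens: token ids emitted with speculation enabled.
    target_topk: for each position, the target model's top-k token ids.
    """
    if not spec_tokens or len(spec_tokens) != len(target_topk):
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(1 for tok, topk in zip(spec_tokens, target_topk) if tok in topk)
    return hits / len(spec_tokens)


def apply_gate(agreement, threshold=0.95):
    """Map an agreement score to the result fields used by the scoring layer."""
    if agreement >= threshold:
        # Pass-side value is an assumption; the guide only specifies the fail side.
        return {"quality_preserved": 1.0, "error_stage": None}
    return {"quality_preserved": 0.0, "error_stage": "quality_regression"}
```

Because a failed gate zeroes `quality_preserved`, a configuration with high throughput but 0.90 agreement is disqualified outright rather than traded off against speed.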

Security

L4 ships with a 3-tier security model, selected per deployment:

default: per-org draft cache, audit logging on, standard
acceptance-rate telemetry. Suitable for single-tenant.

hardened: per-tenant draft cache (F-S1), timing-jitter
injection on acceptance path (F-S2, configurable ms), tenant-scoped
metrics. Recommended for shared-GPU multi-tenant.

sensitive_opt_out: spec-dec fully disabled for this
deployment via F-S3. Use for workloads where the acceptance-timing
side-channel is unacceptable (regulated finance, healthcare).

These mitigations close the two side channels described in 2025 papers:

SpecLeak: prompt extraction via acceptance-timing fingerprints
on shared draft+target. Mitigated by F-S1 (per-tenant cache) +
F-S2 (timing jitter).

DraftEcho: model inversion via draft-rejection patterns.
Mitigated by F-S1 + F-S3 (opt-out).

F-S4 (audit logging) and F-S5 (security ADR) ship on-by-default. See
the runbook for the security ADR link and residual-risk documentation.
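The difference between the tiers comes down to how widely the draft cache is shared and whether timing is perturbed. The sketch below illustrates both ideas with hypothetical helpers; the key layout, the uniform jitter distribution, and the function names are assumptions chosen for clarity, not a description of how F-S1 or F-S2 are built.

```python
import random


def draft_cache_key(tier, org_id, tenant_id, model_ref):
    """Return the cache scope for a tier, or None when spec-dec is disabled."""
    if tier == "sensitive_opt_out":
        return None  # F-S3: no speculation, so nothing is cached
    if tier == "hardened":
        return (org_id, tenant_id, model_ref)  # F-S1: one cache per tenant
    if tier == "default":
        return (org_id, model_ref)  # shared across tenants within an org
    raise ValueError(f"unknown security tier: {tier!r}")


def jittered_delay_ms(base_ms, max_jitter_ms, rng):
    """F-S2-style jitter: add a random delay so timing reveals less about acceptance.

    Uniform noise in [0, max_jitter_ms] is an assumption for this sketch.
    """
    return base_ms + rng.uniform(0.0, max_jitter_ms)
```

Under `default`, two tenants in the same org map to the same key and can observe each other's cache effects; under `hardened`, the tenant id in the key removes that sharing.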

Integration patterns

| Tool | How L4 plays with it |
| --- | --- |
| L0 Hardware Profiling | L4 reads the L0 profile for feasibility (does the draft fit alongside the target on this GPU class?). Heterogeneous-GPU draft routing queries L0 topology. |
| L1 Kernel Tuning | Medusa / EAGLE have their own kernel hot paths. L4 benchmarks run on L1-optimized kernels by default and do not re-optimize kernels inside L4. |
| L2 Quantization | Draft-model quantization is a subsearch within L4 (FP16 / FP8 / INT4 drafts). L4 delegates quant-config validation to L2's catalog. |
| L3 Parallelism | L4 respects the deployment's current TP configuration; it does NOT search across tensor-parallel configs. |
| L5 Batching | L4's online adaptive `k` (preview behind `ENABLE_L4_ADAPTIVE_K`) reads L5's batch metrics to shrink `k` under batch pressure. |
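The L0 feasibility question ("does the draft fit alongside the target?") and the L2 draft-quantization subsearch meet in a simple memory estimate. The sketch below counts weights only and ignores KV cache and activations beyond a flat headroom figure; the `draft_fits` helper, the 4 GB headroom default, and treating 1 GB as 10^9 bytes are all assumptions for illustration.

```python
# Bytes per parameter for the draft precisions named in the L2 row.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def draft_fits(gpu_mem_gb, target_mem_gb, draft_params_b, draft_dtype, headroom_gb=4.0):
    """Rough check that a draft model fits next to the target on one GPU.

    draft_params_b is the draft size in billions of parameters, so
    params_b * bytes_per_param gives weight memory in GB (1 GB = 1e9 bytes).
    """
    draft_gb = draft_params_b * BYTES_PER_PARAM[draft_dtype]
    return target_mem_gb + draft_gb + headroom_gb <= gpu_mem_gb
```

With a target occupying 70 GB of an 80 GB card, an 8B draft does not fit in FP16 (16 GB of weights) but does in INT4 (4 GB), which is the kind of trade-off that makes draft precision worth searching over.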

Examples

1. Chat workload with Llama-3.1-70B + EAGLE-3 heads (T1 Compose).
Customer wants faster chat decode on a single H100-80GB. They POST to
/speculative/runs with fingerprint=chat, mode=interactive. The
predictor surfaces three candidates; the 30-day cache hits on EAGLE-3
heads (already pre-trained for Llama-3.1-70B on HuggingFace). Response
arrives in < 5 s with a SpecWinnerHero inline and a cache-hit badge.
Apply → spec 13 apply pipeline rolls out new engine flags.

2. Code workload with REST corpus (T2 Tune). Cursor-style code
completion workload. Customer uploads a curated 4 GB code corpus, runs
a T1 probe to confirm REST is a legal candidate, then launches a T2
tune session with budget_minutes=90. BO searches over (k, tree
shape, retrieval-depth) and surfaces a 3.1× winner under
quality_gate=constraint_violation (code must be parseable).

3. Enterprise fine-tune with custom Medusa-3 heads (T3 Learn). An
enterprise customer has fine-tuned Llama-3.1-8B for internal support
triage. No pre-trained Medusa heads exist for their checkpoint. They
POST to /speculative/heads/train with a 10k-prompt calibration set
and a 1k-prompt held-out eval. Medusa-3 training runs ~45 min; quality
gate topk5 passes; head is written to registry in validated state.
After developer review the head transitions to uat, then, after
head_reviewer sign-off, to production. From then on, T1 runs for this
target pick up the new head from the registry automatically.

Deviations from spec L4 2026-04-22 LOCKED plan

See the PR §10 "Deviations" list. Notable items will be recorded
there as implementation lands. This doc is the tool-author reference
and mirrors the plan; where the two diverge, consult that list.