L1 Kernel Tuning — Tool Author's Guide
Overview
L1 Kernel Tuning is the second optimization layer in the InferenceIQ stack
(between L0 hardware profiling and L2 quantization). It explores the
engine-level kernel configuration space — attention backend, KV layout,
matmul path, activation dtype, fusion strategy, CUDA graph capture — and
selects combinations that maximise throughput or minimise latency on a
given hardware + model + workload target, without changing model weights.
L1 ships as three tiers:
- Tier 1 — Compose. Validated kernel catalog + legality predicates +
exhaustive or heuristic sweep over pruned combinations. Production-safe,
deterministic, always available.
- Tier 2 — Tune. Bayesian-optimisation driven search over a narrower
space, seeded by Tier 1 pareto points, bounded by a time / cost budget.
Resumable across worker restarts.
- Tier 3 — Generate. Sandboxed code-generation pipeline that proposes
new kernels; results flow into a review / approval registry and only
become selectable after human sign-off. The Tier 3 sandbox + registry
foundation ships with L1; the generator itself is a follow-up spec.
Pick Tier 1 for routine profiling, Tier 2 when you have a budget and want
the best point on the pareto frontier, Tier 3 only for research / novel
workloads that the catalog cannot express.
API surface
All endpoints are mounted under /api/v1/inferenceiq/kernels/.
Run lifecycle (Tier 1):
- `POST /runs`: create a run (model + hardware + workload + mode).
- `GET /runs`: list runs, paginated.
- `GET /runs/{id}`: detail + summary stats.
- `GET /runs/{id}/events`: Server-Sent Events stream for progress.
- `GET /runs/{id}/combos`: enumerated combos + per-combo metrics.
- `POST /runs/{id}/cancel`: cancel an in-flight run.
- `POST /runs/{id}/apply`: hand the winning combo to spec 13's apply
  pipeline.
- `DELETE /runs/{id}`: soft-delete (admin).
Tune sessions (Tier 2):
- `POST /tune`: start a Bayesian session, seeded from a Tier 1 run.
- `GET /tune/{id}`: session status + best-so-far.
- `GET /tune/{id}/events`: SSE stream of trial results.
- `POST /tune/{id}/cancel`: halt and persist state.
- `POST /tune/{id}/resume`: pick up a paused / crashed session.
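Both `/events` endpoints stream Server-Sent Events. The following is a minimal sketch of a client-side parser for the SSE wire format, assuming each event carries a JSON `data:` payload. The event name `trial` and the payload fields are hypothetical; this guide does not define the actual event schema.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines.

    SSE wire format: `field: value` lines, events separated by a blank
    line, and lines starting with ':' are comments (often keep-alives).
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line dispatches the buffered event
            if data:
                yield {"event": event, "data": json.loads("\n".join(data))}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: the event name and payload fields are made up.
sample = [
    ": keep-alive\n",
    "event: trial\n",
    'data: {"trial": 3, "throughput": 1412.5}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In practice the `lines` iterable would come from a streaming HTTP response; the parser itself does not care where the lines originate.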
Registry (Tier 3 foundation):
- `GET /registry`: list registered kernels (status filter).
- `POST /registry`: submit a new kernel (admin).
- `POST /registry/{id}/approve`: approver workflow.
- `POST /registry/{id}/deprecate`: flag a kernel as unsafe.
Data model
Tables are in the modelops schema:
- `kernel_catalog_snapshots`: frozen catalog versions. Each run binds to
  a snapshot so results are reproducible after catalog updates.
- `kernel_runs`: one row per Tier 1 run. Holds mode, target, status,
  stats JSON, cache hit flags.
- `kernel_run_combos`: enumerated combos per run, with per-combo
  metrics (throughput, p50/p95/p99 latency, memory, kernel-time breakdown).
- `kernel_tune_sessions`: Tier 2 sessions. Holds `bayesian_state` JSON
  for resumability, budget, best-trial pointer.
- `l1_kernel_registry`: Tier 3 proposed/approved kernels.
- `l1_kernel_registry_approvals`: audit log of approve/deprecate events.
See services/modelops/alembic/versions/*_l1_kernel_tuning*.py for the
canonical DDL.
Catalog
Source of truth: services/modelops/app/services/inferenceiq/kernels/catalog.py.
The catalog defines 6 kernel dimensions:
- attention_backend — flash_attn_v2/v3, paged_attention, xformers,
native.
- kv_layout — contiguous, paged, chunked.
- matmul_path — cublas, cutlass, triton, machete.
- activation_dtype — fp16, bf16, fp8_e4m3, fp8_e5m2.
- fusion_strategy — none, conservative, aggressive.
- cuda_graph — off, per-shape, dynamic.
Legality predicates live alongside each dimension; pruning rules remove
combinations that are known-bad (e.g. fp8_e5m2 + flash_attn_v2 on
pre-Hopper). A nightly CI job refreshes the catalog against vendor
release notes; the catalog owner rotates weekly — see the runbook.
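The interaction between dimensions, legality predicates, and pruning can be sketched as follows. This is an illustration only, not the contents of `catalog.py`: the dimension values are copied from the list above, the single predicate encodes the fp8_e5m2 + flash_attn_v2 example, and the function names and the use of an SM number to detect pre-Hopper hardware are assumptions.

```python
from itertools import product

# Dimension values copied from the list above; "flash_attn_v2/v3" is
# expanded into two entries.
DIMENSIONS = {
    "attention_backend": ["flash_attn_v2", "flash_attn_v3",
                          "paged_attention", "xformers", "native"],
    "kv_layout": ["contiguous", "paged", "chunked"],
    "matmul_path": ["cublas", "cutlass", "triton", "machete"],
    "activation_dtype": ["fp16", "bf16", "fp8_e4m3", "fp8_e5m2"],
    "fusion_strategy": ["none", "conservative", "aggressive"],
    "cuda_graph": ["off", "per-shape", "dynamic"],
}

HOPPER_SM = 90  # compute capability 9.0


def fp8_e5m2_fa2_needs_hopper(combo, sm):
    """Example rule from the text: prune fp8_e5m2 + flash_attn_v2 pre-Hopper."""
    known_bad = (combo["activation_dtype"] == "fp8_e5m2"
                 and combo["attention_backend"] == "flash_attn_v2")
    return not (known_bad and sm < HOPPER_SM)


PREDICATES = [fp8_e5m2_fa2_needs_hopper]  # hypothetical registry of rules


def legal_combos(sm):
    """Yield every combination that passes all legality predicates."""
    names = list(DIMENSIONS)
    for values in product(*DIMENSIONS.values()):
        combo = dict(zip(names, values))
        if all(pred(combo, sm) for pred in PREDICATES):
            yield combo


ampere = sum(1 for _ in legal_combos(sm=80))  # 2052: 108 combos pruned
hopper = sum(1 for _ in legal_combos(sm=90))  # 2160: full cross product
```

With only this one rule, the unpruned space is 5 x 3 x 4 x 4 x 3 x 3 = 2160 combinations, which is why the real catalog leans on pruning before any sweep.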
To update the catalog:
- Edit `catalog.py`.
- Add / update legality predicates.
- Bump `CATALOG_VERSION`.
- Ship a migration that records the new snapshot.
- Existing runs continue to reference the old snapshot.
Per-engine plumbing
Each of the 5 supported engines (vLLM, TGI, TRT-LLM, Triton, ONNX
Runtime) has a translator module under
services/modelops/app/services/inferenceiq/kernels/engines/. Every
translator implements:
- `translate(combo, version) -> EngineArgs`: maps a catalog combo to the
  CLI flags / env vars / Python kwargs that engine expects.
- `supported_versions() -> list[str]`: canonical list of engine versions
  this translator was validated against.
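A rough sketch of what a translator might look like, assuming a class-based shape (the real translators may well be plain modules with two functions). The `EngineArgs` fields, the `example` engine, and its flags are all invented for illustration; only the two method signatures come from this guide.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EngineArgs:
    """Hypothetical container; the real EngineArgs fields are not shown here."""
    cli_flags: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)
    kwargs: dict[str, object] = field(default_factory=dict)


class Translator(Protocol):
    def translate(self, combo: dict[str, str], version: str) -> EngineArgs: ...
    def supported_versions(self) -> list[str]: ...


class ExampleEngineTranslator:
    """Toy translator for a made-up engine with made-up flags."""

    def supported_versions(self) -> list[str]:
        return ["1.0", "1.1"]

    def translate(self, combo: dict[str, str], version: str) -> EngineArgs:
        if version not in self.supported_versions():
            raise ValueError(f"unvalidated engine version: {version}")
        args = EngineArgs()
        args.cli_flags.append(f"--example-attention={combo['attention_backend']}")
        if combo.get("cuda_graph") == "off":
            args.env["EXAMPLE_DISABLE_GRAPHS"] = "1"
        return args


# Hypothetical dispatch table keyed by engine name.
DISPATCH: dict[str, Translator] = {"example": ExampleEngineTranslator()}
args = DISPATCH["example"].translate(
    {"attention_backend": "native", "cuda_graph": "off"}, "1.1"
)
```

Rejecting versions outside `supported_versions()` keeps a combo from being translated for an engine release it was never validated against.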
To add a new engine:
- Create `kernels/engines/<engine>.py`.
- Implement the two methods above.
- Register the module in the `l1_handler.py::_translate_via_engine`
  dispatch table.
- Add unit tests in `tests/unit/test_kernel_engines_<engine>.py` covering
  all dimensions.
- Add a fixture to the catalog refresh CI job.
Quality gate
Four gate methods, defaulting to topk5:
- `topk5`: top-5 combos by target metric must beat baseline by ≥ margin.
- `pareto`: winning combo must be on the pareto frontier (throughput vs
  latency).
- `regression`: no metric regresses by more than ε vs baseline.
- `manual`: skip automatic gating; surface all combos for human review.
Override via run request body (quality_gate field) when a target demands
it — e.g. manual for research runs, regression for production
canaries.
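To make the `pareto` and `regression` gates concrete, here is a small sketch over two metrics (throughput, p95 latency). The `Combo` type, the relative interpretation of ε, and the default `eps=0.02` are assumptions; the guide does not state the actual threshold or how it is applied.

```python
from typing import NamedTuple


class Combo(NamedTuple):
    name: str
    throughput: float  # higher is better
    p95_ms: float      # lower is better


def dominates(a: Combo, b: Combo) -> bool:
    """True if a is at least as good as b on both metrics and better on one."""
    no_worse = a.throughput >= b.throughput and a.p95_ms <= b.p95_ms
    better = a.throughput > b.throughput or a.p95_ms < b.p95_ms
    return no_worse and better


def pareto_frontier(combos: list[Combo]) -> list[Combo]:
    """Keep combos that no other combo dominates."""
    return [c for c in combos if not any(dominates(o, c) for o in combos)]


def regression_gate(candidate: Combo, baseline: Combo, eps: float = 0.02) -> bool:
    """Pass if neither metric regresses by more than eps (relative)."""
    tput_ok = candidate.throughput >= baseline.throughput * (1 - eps)
    lat_ok = candidate.p95_ms <= baseline.p95_ms * (1 + eps)
    return tput_ok and lat_ok


baseline = Combo("baseline", 1000.0, 80.0)
combos = [
    Combo("a", 1200.0, 85.0),  # fastest, but latency regresses vs baseline
    Combo("b", 1100.0, 70.0),  # improves both metrics
    Combo("c", 1050.0, 90.0),  # dominated by both a and b
]
frontier = [c.name for c in pareto_frontier(combos)]  # ['a', 'b']
```

Combo `a` shows why the gates differ: it sits on the frontier, so it passes `pareto`, yet it fails `regression` because its p95 latency is worse than baseline by more than ε.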
Statistical rigor
Every combo is measured for N reps (default 7). The top and bottom 10% are
trimmed (Gaussian trim) to handle outliers. A Welch's t-test compares
against baseline; effect size (Cohen's d) is reported alongside
significance so callers can distinguish "statistically significant, but
tiny" from "big win". Early termination kicks in when the confidence
interval tightens below the noise floor and the winner is unambiguous.
The noise floor is hardware-specific — see the operator runbook for
tuning guidance.
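The statistics named above can be sketched with the standard library alone. This is an illustration under assumptions, not the service's implementation: the trim count is floored, Cohen's d uses the pooled standard deviation, and only the t statistic and Welch-Satterthwaite degrees of freedom are computed (turning them into a p-value needs a t-distribution CDF, e.g. from SciPy).

```python
import math
from statistics import mean, variance


def trimmed(samples, proportion=0.10):
    """Drop the lowest and highest `proportion` of samples (count floored)."""
    k = int(len(samples) * proportion)
    ordered = sorted(samples)
    return ordered[k:len(ordered) - k] if k else ordered


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


# Made-up throughput samples; 10 reps so the 10% trim removes one per side.
candidate = trimmed([104, 105, 103, 106, 104, 105, 150, 103, 104, 105])
baseline = trimmed([100, 101, 99, 100, 102, 100, 60, 101, 100, 99])
t_stat, dof = welch_t(candidate, baseline)
effect = cohens_d(candidate, baseline)
```

Under the flooring assumption used here, the default of 7 reps trims nothing (10% of 7 floors to 0), so the real implementation may round or interpolate differently.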
Cache
Tuning is expensive, so results are cached for 30 days in Redis (hot) + DB
(cold). Cache key: `(catalog_version, engine, engine_version, model_hash,
hardware_fingerprint, workload_fingerprint, combo_hash)`. Invalidation
triggers: catalog version bump, engine version bump, explicit admin flush
(see runbook), TTL expiry.
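One way the cache key tuple could be turned into a string key is sketched below. The `iiq:l1:` prefix, the truncated SHA-256 `combo_hash`, and the example values are assumptions for illustration; only the seven field names and their order come from the text above.

```python
import hashlib
import json

# Field names and order as listed in the cache key above.
KEY_FIELDS = (
    "catalog_version", "engine", "engine_version", "model_hash",
    "hardware_fingerprint", "workload_fingerprint", "combo_hash",
)


def combo_hash(combo: dict) -> str:
    """Hash a combo so that dict ordering does not change the result."""
    canonical = json.dumps(combo, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def cache_key(**parts: str) -> str:
    """Join all seven fields in a fixed order; refuse partial keys."""
    missing = [f for f in KEY_FIELDS if f not in parts]
    if missing:
        raise KeyError(f"missing cache key fields: {missing}")
    return "iiq:l1:" + ":".join(str(parts[f]) for f in KEY_FIELDS)


key = cache_key(
    catalog_version="v7",            # illustrative values throughout
    engine="vllm",
    engine_version="0.6.3",
    model_hash="m-abc123",
    hardware_fingerprint="hw-1",
    workload_fingerprint="wl-1",
    combo_hash=combo_hash({"kv_layout": "paged", "cuda_graph": "off"}),
)
```

Because `catalog_version` and `engine_version` are part of the key, bumping either one naturally misses the old entries, which matches the invalidation triggers listed above.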
Tier 2 autotune
Bayesian optimisation over a narrower space, using Tier 1 pareto points
as priors. The narrower space is produced by ConfigNarrower, which drops
dimensions that Tier 1 found insensitive. Budget is enforced as wall-clock
minutes or GPU-hours; when the budget is exhausted the session emits its
best-so-far. State is persisted as bayesian_state JSON so a crashed
Celery worker can resume rather than restart.
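The budget and resume behaviour can be illustrated with a deliberately simplified loop. Note the substitution: this sketch uses plain random search as a stand-in for the Bayesian optimiser, and the state layout (`seed`, `trials`, `spent`) is hypothetical rather than the real `bayesian_state` schema. What it does show is a wall-clock budget that accumulates across restarts and state that round-trips through JSON.

```python
import json
import random
import time


def objective(x: float) -> float:
    """Stand-in for a benchmark trial; best score at x = 0.3."""
    return -(x - 0.3) ** 2


def run_session(state_json=None, budget_seconds=0.05, max_trials=50,
                clock=time.monotonic):
    """Run trials until the budget is spent; return (state JSON, best trial)."""
    # Resume from persisted state if given, otherwise start fresh.
    state = (json.loads(state_json) if state_json
             else {"seed": 0, "trials": [], "spent": 0.0})
    # Reseed from the trial count so a resumed session proposes new points.
    rng = random.Random(state["seed"] + len(state["trials"]))
    start = clock()
    while len(state["trials"]) < max_trials:
        if state["spent"] + (clock() - start) >= budget_seconds:
            break  # budget exhausted: stop and report best-so-far
        x = rng.random()
        state["trials"].append({"x": x, "score": objective(x)})
    state["spent"] += clock() - start
    best = max(state["trials"], key=lambda t: t["score"], default=None)
    return json.dumps(state), best


state_json, best = run_session(max_trials=20)
# A restarted worker would pass the persisted JSON back in:
state_json, best = run_session(state_json, max_trials=40)
```

A real Bayesian optimiser would also need to persist whatever its surrogate model requires; in many designs the list of past observations is enough to refit it on resume.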
Tier 3 foundation
Three pieces ship with L1:
- Sandbox pod — isolated namespace with no network egress except to
the registry and telemetry sinks. New kernels run here before ever
touching a real tenant.
- Provider abstraction — pluggable generator interface so we can
swap LLM-based, template-based, or human-authored kernel sources.
- Registry: the `l1_kernel_registry` table + approval workflow.
The generator itself is deliberately out of scope for spec L1 — it ships
as its own spec so we can land the machinery without promising
generation quality.
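The approval workflow implies a small state machine. The sketch below assumes three statuses (`proposed`, `approved`, `deprecated`) and an in-memory audit log; the real status names, columns, and transition rules live in the registry tables and may differ.

```python
from dataclasses import dataclass, field

# Assumed transitions: (current status, action) -> new status.
ALLOWED = {
    ("proposed", "approve"): "approved",
    ("proposed", "deprecate"): "deprecated",
    ("approved", "deprecate"): "deprecated",
}


@dataclass
class RegistryEntry:
    kernel_id: str
    status: str = "proposed"
    audit_log: list = field(default_factory=list)

    def apply(self, action: str, actor: str) -> None:
        """Apply an approve/deprecate action and record it for audit."""
        new_status = ALLOWED.get((self.status, action))
        if new_status is None:
            raise ValueError(f"cannot {action} a {self.status} kernel")
        self.audit_log.append((actor, action, self.status, new_status))
        self.status = new_status

    @property
    def selectable(self) -> bool:
        """Only kernels with human sign-off may be picked by a run."""
        return self.status == "approved"


entry = RegistryEntry("kernel-001")      # hypothetical id
before = entry.selectable                # False: not yet approved
entry.apply("approve", actor="alice")    # hypothetical approver
after = entry.selectable                 # True
```

Keeping `deprecated` terminal in this sketch reflects the idea that a kernel flagged as unsafe should not silently return to the selectable pool.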
Observability
Prometheus metrics (all prefixed inferenceiq_l1_):
- `_runs_total{mode,status}` counter.
- `_run_duration_seconds` histogram.
- `_combo_evaluations_total` counter.
- `_cache_hits_total` / `_cache_misses_total`.
- `_tune_trials_total{outcome}`.
- `_quality_gate_rejections_total{reason}`.
- `_registry_entries{status}` gauge.
Grafana dashboard: UID iiq-l1-kernel-tuning. Panels for run
throughput, cache hit rate, quality-gate rejection rate, tune-session
budget burn, registry queue depth.
Alerts: see the operator runbook.
Running locally
L1 lives in the ModelOps service and uses Celery for async work.
    # Start infra + ModelOps + frontend
    cd infra/compose
    ./dev.sh start
    # Apply migrations
    docker exec compose-modelops-1 alembic upgrade head
    # Unit tests
    docker exec compose-modelops-1 pytest tests/unit/test_kernel_l1_handler.py
    # Playwright (post-merge on platform.inwire.ai only, never locally)
Deviations from spec L1 2026-04-19 draft
See PR §10 "Deviations" for the canonical list. Notable items:
pruning-predicate reordering, Tier 3 generator deferred to a follow-up
spec, tune-session resume added after draft review.
Landing v2 (post-2026-04-24)
The landing page at /modelops/inferenceiq/kernels was rewritten
end-to-end per the LOCKED plan at
analysis-docs/inferenceiq/v2-specs/demo-readiness-docs/inferenceiq-v2/L1-kernel-tuning-PRODUCTION-FIX-PLAN.md.
User-visible changes
- Deployment picker above the tabs (combobox + free-text UUID fallback, URL-persisted).
- Current-combo card above the tabs with drift badge.
- Results → Run History: real data from `/api/v1/kernels/runs`.
- Default tab = Runner (plan §3 LOCKED).
- URL state canonical: `?deployment=<uuid>&tab=<name>&run=<uuid>`.
- Quick-mode sweep preview before Start.
- Pre-submission validate() gates the Start button.
- Tier 3 registry moved to `/kernels/registry` (read-only v1).
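The canonical URL state can be modelled as a small parse/build pair. The landing page itself is frontend code; the sketch below is written in Python only to show the intended round trip, and the lowercase tab names, the `runner` default, and the rule that malformed UUIDs are dropped are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from uuid import UUID


def parse_state(url: str, default_tab: str = "runner") -> dict:
    """Extract deployment / tab / run from a landing-page URL."""
    query = parse_qs(urlsplit(url).query)

    def uuid_or_none(key):
        try:
            return str(UUID(query[key][0]))
        except (KeyError, ValueError):
            return None  # absent or malformed: treat as unset

    return {
        "deployment": uuid_or_none("deployment"),
        "tab": query.get("tab", [default_tab])[0],
        "run": uuid_or_none("run"),
    }


def build_url(path: str, state: dict) -> str:
    """Serialise state in the canonical order, skipping unset values."""
    params = {k: state[k] for k in ("deployment", "tab", "run") if state.get(k)}
    return f"{path}?{urlencode(params)}" if params else path


url = ("/modelops/inferenceiq/kernels"
       "?deployment=123e4567-e89b-12d3-a456-426614174000&tab=results")
state = parse_state(url)
round_trip = build_url("/modelops/inferenceiq/kernels", state)
```

Emitting parameters in one fixed order keeps shared links stable, so the same state always produces the same URL.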
Feature flag
Removed in a follow-up. v2 is the only landing now; the legacy
page.legacy.tsx was deleted alongside the
NEXT_PUBLIC_ENABLE_L1_LANDING_V2 flag because NEXT_PUBLIC_* is
inlined at next build time and thus not actually runtime-flippable
from a ConfigMap. Rollback is git revert of the merging PR.
The tool-level kill switch (ENABLE_L1_KERNEL_TUNING) still works.
New API surface (plan §6)
| Endpoint | Purpose |
|---|---|
| `GET /kernels/runs` | List runs (filters + pagination) |
| `GET /kernels/deployments-recent` | Deployments with L1 activity in last 30d |
| `GET /kernels/deployments/{id}/current-combo` | Read-back |
| `GET /kernels/deployments/{id}/drift` | From calibration_runs (minor 5-15%, significant >15%) |
| `POST /kernels/deployments/{id}/validate` | Pre-submission existence + RLS check |
| `GET /kernels/catalog/preview` | Vidur-only combo preview per mode |
| `POST /metrics/client` | TTFS telemetry Histogram `l1_landing_ttfs_seconds` |
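The drift bands quoted for the `/drift` endpoint (minor 5-15%, significant >15%) can be expressed as a tiny classifier. How drift is actually computed from `calibration_runs` is not described here, so the relative-difference formula, the `none` label below 5%, and treating exactly 15% as minor are assumptions.

```python
def classify_drift(expected: float, observed: float) -> str:
    """Bucket relative drift into none / minor / significant."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    drift = abs(observed - expected) / expected
    if drift > 0.15:
        return "significant"
    if drift >= 0.05:
        return "minor"
    return "none"


# Made-up throughput figures (tokens/s) for illustration.
labels = [
    classify_drift(100.0, 103.0),  # 3% off
    classify_drift(100.0, 110.0),  # 10% off
    classify_drift(100.0, 80.0),   # 20% off
]
```

Using the absolute difference means a deployment that got faster than its calibration is flagged just like one that got slower; a one-sided check would be an equally plausible reading of the thresholds.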
Observability (plan §9)
- 3 Prometheus Counters in `metrics.py`.
- `l1_landing_ttfs_seconds` Histogram, SLO 10s p95.
- 3 Grafana panels on the Kernel Tuning dashboard.
- Alerts: `L1LandingApiErrorRateHigh`, `L1LandingTTFSP95Slow`.
Test guardrail
Nightly Playwright e2e at
frontend/e2e/inferenceiq-L1-kernels-landing.spec.ts against
platform.inwire.ai: 12 scenarios (8 functional + 4 a11y), no
test.skip(), no mocked API. A smoke curl also runs every 6 hours.