L1 Kernel Tuning — Tool Author's Guide

Overview

L1 Kernel Tuning is the second optimisation layer in the InferenceIQ stack

(between L0 hardware profiling and L2 quantization). It explores the

engine-level kernel configuration space — attention backend, KV layout,

matmul path, activation dtype, fusion strategy, CUDA graph capture — and

selects combinations that maximise throughput or minimise latency on a

given hardware + model + workload target, without changing model weights.

L1 ships as three tiers:

Tier 1 — Compose. Validated kernel catalog + legality predicates +

exhaustive or heuristic sweep over pruned combinations. Production-safe,

deterministic, always available.

Tier 2 — Tune. Bayesian-optimisation driven search over a narrower

space, seeded by Tier 1 pareto points, bounded by a time / cost budget.

Resumable across worker restarts.

Tier 3 — Generate. Sandboxed code-generation pipeline that proposes

new kernels; results flow into a review / approval registry and only

become selectable after human sign-off. The Tier 3 sandbox + registry

foundation ships with L1; the generator itself is a follow-up spec.

Pick Tier 1 for routine profiling, Tier 2 when you have a budget and want

the best point on the pareto frontier, Tier 3 only for research / novel

workloads that the catalog cannot express.

API surface

All endpoints are mounted under /api/v1/inferenceiq/kernels/.

Run lifecycle (Tier 1):

POST /runs — create a run (model + hardware + workload + mode).
GET /runs — list runs, paginated.
GET /runs/{id} — detail + summary stats.
GET /runs/{id}/events — Server-Sent Events stream for progress.
GET /runs/{id}/combos — enumerated combos + per-combo metrics.
POST /runs/{id}/cancel — cancel an in-flight run.
POST /runs/{id}/apply — hand the winning combo to spec 13's apply

pipeline.

DELETE /runs/{id} — soft-delete (admin).
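As a minimal sketch of creating a run, the request below is assembled with the standard library only and is not sent. The mount point and the quality_gate field come from this guide. The host, the other body field names (model, hardware, workload, mode) and all values are assumptions for illustration, so check the request schema before relying on them.

```python
import json
import urllib.request

BASE_PATH = "/api/v1/inferenceiq/kernels"  # mount point stated in this guide


def build_create_run_request(host: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /runs request."""
    return urllib.request.Request(
        url=f"{host}{BASE_PATH}/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: field names other than quality_gate are assumed.
request = build_create_run_request(
    "http://localhost:8000",
    {
        "model": "example-model",
        "hardware": "example-gpu",
        "workload": "example-workload",
        "mode": "quick",
        "quality_gate": "topk5",
    },
)
# Sending would be: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes the payload easy to unit test without a running ModelOps service.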

Tune sessions (Tier 2):

POST /tune — start a Bayesian session, seeded from a Tier 1 run.
GET /tune/{id} — session status + best-so-far.
GET /tune/{id}/events — SSE stream of trial results.
POST /tune/{id}/cancel — halt and persist state.
POST /tune/{id}/resume — pick up a paused / crashed session.
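Both /runs/{id}/events and /tune/{id}/events are Server-Sent Events streams. The sketch below parses the standard SSE wire format (data: lines, blank line as event terminator) from any iterable of text lines. The JSON payload shape in the sample is hypothetical; this guide does not define the event schema.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a line stream."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Hypothetical stream content, for illustration only.
sample = ['data: {"trial": 1}\n', "\n", ": ping\n", 'data: {"trial": 2}\n', "\n"]
events = list(iter_sse_data(sample))
```

An event that is still buffered when the stream ends is dropped, which matches how SSE treats an unterminated event.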

Registry (Tier 3 foundation):

GET /registry — list registered kernels (status filter).
POST /registry — submit a new kernel (admin).
POST /registry/{id}/approve — approver workflow.
POST /registry/{id}/deprecate — flag a kernel as unsafe.

Data model

Tables are in the modelops schema:

kernel_catalog_snapshots — frozen catalog versions. Each run binds to

a snapshot so results are reproducible after catalog updates.

kernel_runs — one row per Tier 1 run. Holds mode, target, status,

stats JSON, cache hit flags.

kernel_run_combos — enumerated combos per run, with per-combo

metrics (throughput, p50/p95/p99 latency, memory, kernel-time breakdown).

kernel_tune_sessions — Tier 2 sessions. Holds bayesian_state JSON

for resumability, budget, best-trial pointer.

l1_kernel_registry — Tier 3 proposed/approved kernels.
l1_kernel_registry_approvals — audit log of approve/deprecate events.

See services/modelops/alembic/versions/*_l1_kernel_tuning*.py for the

canonical DDL.

Catalog

Source of truth: services/modelops/app/services/inferenceiq/kernels/catalog.py.

The catalog defines 6 kernel dimensions:

attention_backend — flash_attn_v2/v3, paged_attention, xformers,

native.

kv_layout — contiguous, paged, chunked.
matmul_path — cublas, cutlass, triton, machete.
activation_dtype — fp16, bf16, fp8_e4m3, fp8_e5m2.
fusion_strategy — none, conservative, aggressive.
cuda_graph — off, per-shape, dynamic.

Legality predicates live alongside each dimension; pruning rules remove

combinations that are known-bad (e.g. fp8_e5m2 + flash_attn_v2 on

pre-Hopper). A nightly CI job refreshes the catalog against vendor

release notes; the catalog owner rotates weekly — see the runbook.
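To make the enumerate-then-prune idea concrete, here is a sketch that builds the full cross product of the six dimensions and applies the one pruning rule named above. The dimension values are taken from this guide, reading "flash_attn_v2/v3" as two separate values (an assumption). The function names and the pre_hopper flag are hypothetical and do not reflect the real catalog.py layout.

```python
from itertools import product

# Dimension values as listed in this guide.
DIMENSIONS = {
    "attention_backend": ["flash_attn_v2", "flash_attn_v3", "paged_attention",
                          "xformers", "native"],
    "kv_layout": ["contiguous", "paged", "chunked"],
    "matmul_path": ["cublas", "cutlass", "triton", "machete"],
    "activation_dtype": ["fp16", "bf16", "fp8_e4m3", "fp8_e5m2"],
    "fusion_strategy": ["none", "conservative", "aggressive"],
    "cuda_graph": ["off", "per-shape", "dynamic"],
}


def fp8_e5m2_is_legal(combo: dict, pre_hopper: bool) -> bool:
    """Legality predicate for the known-bad pair named in this guide."""
    bad_pair = (combo["activation_dtype"] == "fp8_e5m2"
                and combo["attention_backend"] == "flash_attn_v2")
    return not (bad_pair and pre_hopper)


def enumerate_legal(pre_hopper: bool) -> list[dict]:
    """Enumerate every combo, then drop the ones the predicate rejects."""
    names = list(DIMENSIONS)
    combos = (dict(zip(names, values)) for values in product(*DIMENSIONS.values()))
    return [c for c in combos if fp8_e5m2_is_legal(c, pre_hopper)]


all_combos = enumerate_legal(pre_hopper=False)
legal_pre_hopper = enumerate_legal(pre_hopper=True)
print(len(all_combos), len(legal_pre_hopper))  # 2160 2052
```

Even this single rule removes 108 combos on pre-Hopper hardware, which is why pruning runs before any measurement.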

To update the catalog:

Edit catalog.py.
Add / update legality predicates.
Bump CATALOG_VERSION.
Ship a migration that records the new snapshot.
Existing runs continue to reference the old snapshot.

Per-engine plumbing

Each of the 5 supported engines (vLLM, TGI, TRT-LLM, Triton, ONNX

Runtime) has a translator module under

services/modelops/app/services/inferenceiq/kernels/engines/. Every

translator implements:

translate(combo, version) -> EngineArgs — maps a catalog combo to the

CLI flags / env vars / Python kwargs that engine expects.

supported_versions() -> list[str] — canonical list of engine versions

this translator was validated against.
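The two-method contract can be sketched as a typing Protocol. The method names and signatures follow this guide. The EngineArgs fields, the ExampleEngineTranslator class, and every flag and env var name below are invented for illustration and do not correspond to any real engine.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EngineArgs:
    """Assumed shape: the guide only says CLI flags, env vars, and kwargs."""
    cli_flags: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)
    kwargs: dict[str, object] = field(default_factory=dict)


class EngineTranslator(Protocol):
    def translate(self, combo: dict, version: str) -> EngineArgs: ...
    def supported_versions(self) -> list[str]: ...


class ExampleEngineTranslator:
    """Hypothetical engine; flag and env names are made up."""

    def supported_versions(self) -> list[str]:
        return ["1.0", "1.1"]

    def translate(self, combo: dict, version: str) -> EngineArgs:
        # Refuse versions this translator was never validated against.
        if version not in self.supported_versions():
            raise ValueError(f"unvalidated engine version: {version}")
        args = EngineArgs()
        args.cli_flags.append(f"--example-attention={combo['attention_backend']}")
        args.env["EXAMPLE_MATMUL_PATH"] = combo["matmul_path"]
        args.kwargs["dtype"] = combo["activation_dtype"]
        return args


example_args = ExampleEngineTranslator().translate(
    {"attention_backend": "native", "matmul_path": "cublas",
     "activation_dtype": "bf16"},
    "1.0",
)
```

Rejecting unknown versions inside translate keeps supported_versions() the single place that states what has been validated.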

To add a new engine:

Create kernels/engines/<engine>.py.
Implement the two methods above.
Register the module in the l1_handler.py::_translate_via_engine dispatch table.
Add unit tests in tests/unit/test_kernel_engines_<engine>.py covering all dimensions.

Add a fixture to the catalog refresh CI job.

Quality gate

Four gate methods, defaulting to topk5:

topk5 — top-5 combos by target metric must beat baseline by ≥ margin.
pareto — winning combo must be on the pareto frontier (throughput vs

latency).

regression — no metric regresses by more than ε vs baseline.
manual — skip automatic gating; surface all combos for human review.

Override via run request body (quality_gate field) when a target demands

it — e.g. manual for research runs, regression for production

canaries.
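A sketch of two of the gates may help when reasoning about why a combo was rejected. It assumes two metrics only (throughput, higher is better; latency, lower is better) and treats ε as a relative tolerance. Both are assumptions, as are the ComboMetrics type and the sample numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComboMetrics:
    name: str
    throughput: float  # higher is better
    latency_ms: float  # lower is better


def dominates(a: ComboMetrics, b: ComboMetrics) -> bool:
    """a dominates b if it is no worse on both axes and better on one."""
    no_worse = a.throughput >= b.throughput and a.latency_ms <= b.latency_ms
    better = a.throughput > b.throughput or a.latency_ms < b.latency_ms
    return no_worse and better


def pareto_frontier(combos: list[ComboMetrics]) -> list[ComboMetrics]:
    """Keep only combos that no other combo dominates."""
    return [c for c in combos if not any(dominates(o, c) for o in combos)]


def passes_regression(candidate: ComboMetrics, baseline: ComboMetrics,
                      epsilon: float) -> bool:
    """No metric may regress by more than epsilon (relative) vs baseline."""
    throughput_ok = candidate.throughput >= baseline.throughput * (1 - epsilon)
    latency_ok = candidate.latency_ms <= baseline.latency_ms * (1 + epsilon)
    return throughput_ok and latency_ok


# Made-up measurements for illustration.
combos = [
    ComboMetrics("A", 1000.0, 20.0),
    ComboMetrics("B", 1200.0, 25.0),
    ComboMetrics("C", 900.0, 22.0),
    ComboMetrics("D", 1200.0, 30.0),
]
frontier = [c.name for c in pareto_frontier(combos)]  # ["A", "B"]
```

Note that B sits on the frontier yet would fail the regression gate against A as baseline at ε = 0.05, because its latency is 25% worse. The two gates answer different questions.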

Statistical rigor

Every combo is measured for N repetitions (default 7). The top and bottom 10% are

trimmed (Gaussian trim) to handle outliers. Welch's t-test compares each combo

against baseline; effect size (Cohen's d) is reported alongside

significance so callers can distinguish "statistically significant, but

tiny" from "big win". Early termination kicks in when the confidence

interval tightens below the noise floor and the winner is unambiguous.

The noise floor is hardware-specific — see the operator runbook for

tuning guidance.
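The statistics above can be sketched with the standard library alone. This is an illustration, not the production code: the function names are hypothetical, trimming rounds the count down (an assumption; with 7 reps and 10% that removes nothing), and the p-value step is omitted because it needs a t-distribution CDF that the standard library does not provide.

```python
import math
from statistics import mean, variance


def trim(samples: list[float], fraction: float = 0.10) -> list[float]:
    """Drop the lowest and highest `fraction` of samples (count rounded down)."""
    k = int(len(samples) * fraction)
    ordered = sorted(samples)
    return ordered[k:len(ordered) - k] if k else ordered


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


# Made-up throughput samples for illustration.
baseline = [1.0, 2.0, 3.0, 4.0]
candidate = [2.0, 4.0, 6.0, 8.0]
t_stat, dof = welch_t(candidate, baseline)
effect = cohens_d(candidate, baseline)
```

Reporting effect alongside t_stat is what lets a caller separate a large but noisy difference from a tiny but consistent one.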

Cache

Tuning is expensive, so results are cached for 30 days in Redis (hot) + DB

(cold). Cache key: `(catalog_version, engine, engine_version, model_hash,

hardware_fingerprint, workload_fingerprint, combo_hash)`. Invalidation

triggers: catalog version bump, engine version bump, explicit admin flush

(see runbook), TTL expiry.
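One way to picture the key is a fixed-order join of the seven fields. The field names come from this guide. The "l1:" prefix, the use of SHA-256 for combo_hash, and the 16-character truncation are assumptions made for this sketch.

```python
import hashlib
import json

# Field order as given in the cache key tuple above.
KEY_FIELDS = (
    "catalog_version", "engine", "engine_version", "model_hash",
    "hardware_fingerprint", "workload_fingerprint", "combo_hash",
)


def combo_hash(combo: dict) -> str:
    """Stable hash of a combo; sorted keys make it order-independent."""
    canonical = json.dumps(combo, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def cache_key(**parts: str) -> str:
    """Join the seven key fields in a fixed order; reject missing fields."""
    missing = [name for name in KEY_FIELDS if name not in parts]
    if missing:
        raise KeyError(f"missing cache key fields: {missing}")
    return "l1:" + ":".join(str(parts[name]) for name in KEY_FIELDS)


# Hypothetical values for illustration.
fields = {
    "catalog_version": "v12", "engine": "vllm", "engine_version": "0.0.0",
    "model_hash": "abc123", "hardware_fingerprint": "hw1",
    "workload_fingerprint": "wl1",
    "combo_hash": combo_hash({"kv_layout": "paged", "cuda_graph": "off"}),
}
key_before = cache_key(**fields)
key_after = cache_key(**{**fields, "catalog_version": "v13"})
```

Because the catalog and engine versions are part of the key, a version bump changes every key, so old entries stop matching even before any explicit flush.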

Tier 2 autotune

Bayesian optimisation over a narrower space, using Tier 1 pareto points

as priors. The narrower space is produced by ConfigNarrower, which drops

dimensions that Tier 1 found insensitive. Budget is enforced as wall-clock

minutes or GPU-hours; when the budget is exhausted the session emits its

best-so-far. State is persisted as bayesian_state JSON so a crashed

Celery worker can resume rather than restart.
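The budget-and-resume mechanics can be sketched independently of the optimiser. In the sketch below, random search stands in for Bayesian optimisation, the budget is expressed in seconds rather than minutes or GPU-hours, and the state layout, function name, and checkpoint callback are all hypothetical. It does not reflect the real bayesian_state schema.

```python
import json
import random
import time


def run_session(state_json, objective, space, budget_seconds,
                max_trials=1000, checkpoint=None):
    """Budget-bounded search loop that resumes from serialized state.

    Random search is a stand-in for the Bayesian optimiser.
    """
    state = json.loads(state_json) if state_json else {"trials": [], "best": None}
    rng = random.Random(len(state["trials"]))  # deterministic per resume point
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline and len(state["trials"]) < max_trials:
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        state["trials"].append({"config": config, "score": score})
        if state["best"] is None or score > state["best"]["score"]:
            state["best"] = {"config": config, "score": score}
        if checkpoint is not None:
            checkpoint(json.dumps(state))  # persist after every trial
    return json.dumps(state)  # best-so-far is always included


# Toy search space and objective for illustration.
space = {
    "fusion_strategy": ["none", "conservative", "aggressive"],
    "cuda_graph": ["off", "per-shape", "dynamic"],
}


def objective(config):
    return int(config["fusion_strategy"] == "aggressive") + int(
        config["cuda_graph"] == "dynamic"
    )


saved = []
first = run_session(None, objective, space, budget_seconds=5,
                    max_trials=5, checkpoint=saved.append)
resumed = run_session(first, objective, space, budget_seconds=5, max_trials=10)
```

Checkpointing after every trial means a crash loses at most the trial in flight, and the resumed call continues from the stored trial list instead of starting over.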

Tier 3 foundation

Three pieces ship with L1:

Sandbox pod — isolated namespace with no network egress except to

the registry and telemetry sinks. New kernels run here before ever

touching a real tenant.

Provider abstraction — pluggable generator interface so we can

swap LLM-based, template-based, or human-authored kernel sources.

Registry — the l1_kernel_registry table + approval workflow.

The generator itself is deliberately out of scope for spec L1 — it ships

as its own spec so we can land the machinery without promising

generation quality.

Observability

Prometheus metrics (all prefixed inferenceiq_l1_):

_runs_total{mode,status} counter.
_run_duration_seconds histogram.
_combo_evaluations_total counter.
_cache_hits_total / _cache_misses_total.
_tune_trials_total{outcome}.
_quality_gate_rejections_total{reason}.
_registry_entries{status} gauge.

Grafana dashboard: UID iiq-l1-kernel-tuning. Panels for run

throughput, cache hit rate, quality-gate rejection rate, tune-session

budget burn, registry queue depth.

Alerts: see the operator runbook.

Running locally

L1 lives in the ModelOps service and uses Celery for async work.

# Start infra + ModelOps + frontend
cd infra/compose
./dev.sh start

# Apply migrations
docker exec compose-modelops-1 alembic upgrade head

# Unit tests
docker exec compose-modelops-1 pytest tests/unit/test_kernel_l1_handler.py

# Playwright (post-merge on platform.inwire.ai only — never locally)

Deviations from spec L1 2026-04-19 draft

See PR §10 "Deviations" for the canonical list. Notable items:

pruning-predicate reordering, Tier 3 generator deferred to a follow-up

spec, tune-session resume added after draft review.

Landing v2 (post-2026-04-24)

The landing page at /modelops/inferenceiq/kernels was rewritten

end-to-end per the LOCKED plan at

analysis-docs/inferenceiq/v2-specs/demo-readiness-docs/inferenceiq-v2/L1-kernel-tuning-PRODUCTION-FIX-PLAN.md.

User-visible changes

Deployment picker above the tabs (combobox + free-text UUID fallback, URL-persisted).
Current-combo card above the tabs with drift badge.
Results → Run History — real data from /api/v1/kernels/runs.
Default tab = Runner (plan §3 LOCKED).
URL state canonical: ?deployment=<uuid>&tab=<name>&run=<uuid>.
Quick-mode sweep preview before Start.
Pre-submission validate() gates the Start button.
Tier 3 registry moved to /kernels/registry (read-only v1).

Feature flag

Removed in a follow-up. v2 is the only landing now; the legacy

page.legacy.tsx was deleted alongside the

NEXT_PUBLIC_ENABLE_L1_LANDING_V2 flag because NEXT_PUBLIC_* is

inlined at next build time and thus not actually runtime-flippable

from a ConfigMap. Rollback is git revert of the merging PR.

The tool-level kill switch (ENABLE_L1_KERNEL_TUNING) still works.

New API surface (plan §6)

Endpoint	Purpose
`GET /kernels/runs`	List runs (filters + pagination)
`GET /kernels/deployments-recent`	Deployments with L1 activity in last 30d
`GET /kernels/deployments/{id}/current-combo`	Read back the combo currently applied to the deployment
`GET /kernels/deployments/{id}/drift`	Drift level derived from `calibration_runs` (minor 5-15%, significant >15%)
`POST /kernels/deployments/{id}/validate`	Pre-submission existence + RLS check
`GET /kernels/catalog/preview`	Vidur-only combo preview per mode
`POST /metrics/client`	Client TTFS telemetry, recorded in the `l1_landing_ttfs_seconds` Histogram
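The drift thresholds in the table can be read as a simple classifier. The two bands come from the table. Everything else is an assumption made for this sketch: the function name, the "none" level below 5%, the use of the absolute value, and treating exactly 15% as minor.

```python
def classify_drift(drift_pct: float) -> str:
    """Map a drift percentage to the levels named in the endpoint table."""
    magnitude = abs(drift_pct)
    if magnitude > 15.0:
        return "significant"
    if magnitude >= 5.0:
        return "minor"
    return "none"  # assumed label for drift under 5%


levels = [classify_drift(p) for p in (3.0, 5.0, 15.0, 15.1, -20.0)]
```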

Observability (plan §9)

3 Prometheus Counters in metrics.py.
l1_landing_ttfs_seconds Histogram, SLO 10s p95.
3 Grafana panels on the Kernel Tuning dashboard.
Alerts: L1LandingApiErrorRateHigh, L1LandingTTFSP95Slow.

Test guardrail

Nightly Playwright e2e at

frontend/e2e/inferenceiq-L1-kernels-landing.spec.ts against

platform.inwire.ai — 12 scenarios (8 functional + 4 a11y), no

test.skip(), no mocked API. A smoke curl also runs every 6 hours.