L1 Kernel Tuning — Tool Author's Guide

Overview

L1 Kernel Tuning is the second optimization layer in the InferenceIQ
stack (between L0 hardware profiling and L2 quantization). It explores
the engine-level kernel configuration space (attention backend, KV
layout, matmul path, activation dtype, fusion strategy, CUDA graph
capture) and selects combinations that maximise throughput or minimise
latency on a given hardware + model + workload target, without changing
model weights.

L1 ships as three tiers:

  1. Tier 1: exhaustive or heuristic sweep over pruned combinations.
     Production-safe, deterministic, always available.
  2. Tier 2: Bayesian optimisation over a narrower space, seeded by
     Tier 1 Pareto points, bounded by a time / cost budget. Resumable
     across worker restarts.
  3. Tier 3: generation of new kernels; results flow into a review /
     approval registry and only become selectable after human sign-off.
     The Tier 3 sandbox + registry foundation ships with L1; the
     generator itself is a follow-up spec.

Pick Tier 1 for routine profiling, Tier 2 when you have a budget and
want the best point on the Pareto frontier, and Tier 3 only for
research / novel workloads that the catalog cannot express.

API surface

All endpoints are mounted under /api/v1/inferenceiq/kernels/.

Run lifecycle (Tier 1):

pipeline.

Tune sessions (Tier 2):

Registry (Tier 3 foundation):

Data model

Tables are in the modelops schema:

a snapshot so results are reproducible after catalog updates.

stats JSON, cache hit flags.

metrics (throughput, p50/p95/p99 latency, memory, kernel-time breakdown).

for resumability, budget, best-trial pointer.

See services/modelops/alembic/versions/*_l1_kernel_tuning*.py for the

canonical DDL.

Catalog

Source of truth: services/modelops/app/services/inferenceiq/kernels/catalog.py.

The catalog defines 6 kernel dimensions:

  1. attention_backend: flash_attn_v2/v3, paged_attention, xformers,
     native.
  2. kv_layout: contiguous, paged, chunked.
  3. matmul_path: cublas, cutlass, triton, machete.
  4. activation_dtype: fp16, bf16, fp8_e4m3, fp8_e5m2.
  5. fusion_strategy: none, conservative, aggressive.
  6. cuda_graph: off, per-shape, dynamic.

Legality predicates live alongside each dimension; pruning rules remove
combinations that are known-bad (e.g. fp8_e5m2 + flash_attn_v2 on
pre-Hopper). A nightly CI job refreshes the catalog against vendor
release notes; the catalog owner rotates weekly (see the runbook).
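As a rough illustration of how legality predicates prune the combination space, here is a minimal sketch. It is not the contents of catalog.py: the `DIMENSIONS` dict, the predicate signature, and the use of a CUDA compute-capability tuple to detect "pre-Hopper" (anything below 9.0) are assumptions made for the example, and `flash_attn_v2/v3` is read as two separate values.

```python
from itertools import product

# Dimension values as listed above; the dict layout itself is hypothetical.
DIMENSIONS = {
    "attention_backend": ["flash_attn_v2", "flash_attn_v3", "paged_attention",
                          "xformers", "native"],
    "kv_layout": ["contiguous", "paged", "chunked"],
    "matmul_path": ["cublas", "cutlass", "triton", "machete"],
    "activation_dtype": ["fp16", "bf16", "fp8_e4m3", "fp8_e5m2"],
    "fusion_strategy": ["none", "conservative", "aggressive"],
    "cuda_graph": ["off", "per-shape", "dynamic"],
}

HOPPER_CC = (9, 0)  # Hopper GPUs report compute capability 9.0


def fp8_e5m2_flash_v2_needs_hopper(combo, compute_capability):
    """Hypothetical predicate: reject fp8_e5m2 + flash_attn_v2 on pre-Hopper."""
    known_bad = (combo["activation_dtype"] == "fp8_e5m2"
                 and combo["attention_backend"] == "flash_attn_v2")
    return not (known_bad and compute_capability < HOPPER_CC)


PREDICATES = [fp8_e5m2_flash_v2_needs_hopper]


def legal_combos(compute_capability):
    """Yield every combination that passes all legality predicates."""
    names = list(DIMENSIONS)
    for values in product(*DIMENSIONS.values()):
        combo = dict(zip(names, values))
        if all(pred(combo, compute_capability) for pred in PREDICATES):
            yield combo
```

Under these assumptions the unpruned space holds 5 x 3 x 4 x 4 x 3 x 3 = 2160 combinations, and the single rule above removes 108 of them on an Ampere-class (8.0) target.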

To update the catalog:

  1. Edit catalog.py.
  2. Add / update legality predicates.
  3. Bump CATALOG_VERSION.
  4. Ship a migration that records the new snapshot.
  5. Existing runs continue to reference the old snapshot.

Per-engine plumbing

Each of the 5 supported engines (vLLM, TGI, TRT-LLM, Triton, ONNX
Runtime) has a translator module under
services/modelops/app/services/inferenceiq/kernels/engines/. Every
translator implements:

CLI flags / env vars / Python kwargs that engine expects.

this translator was validated against.

To add a new engine:

  1. Create kernels/engines/<engine>.py.
  2. Implement the two methods above.
  3. Register the module in l1_handler.py::_translate_via_engine dispatch
     table.
  4. Add unit tests in tests/unit/test_kernel_engines_<engine>.py covering
     all dimensions.
  5. Add a fixture to the catalog refresh CI job.
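The steps above might look roughly like the following sketch. The method names `translate` and `validated_versions`, the `TRANSLATORS` table, and the `toyengine` flags are all invented for illustration; they are not the real translator interface or any real engine's options.

```python
from typing import Protocol


class EngineTranslator(Protocol):
    """Hypothetical shape of a translator; the real method names may differ."""

    def translate(self, combo: dict) -> dict: ...

    def validated_versions(self) -> list[str]: ...


class ToyEngineTranslator:
    """Translator for a made-up engine; every flag below is invented."""

    def translate(self, combo: dict) -> dict:
        # Map catalog dimensions onto whatever the engine expects:
        # CLI flags, environment variables, or Python kwargs.
        return {
            "cli_flags": [
                f"--attention={combo['attention_backend']}",
                f"--kv-layout={combo['kv_layout']}",
            ],
            "env": {"TOY_CUDA_GRAPH": combo["cuda_graph"]},
            "kwargs": {"dtype": combo["activation_dtype"]},
        }

    def validated_versions(self) -> list[str]:
        return ["0.1.0"]


# Stand-in for the dispatch table mentioned in step 3.
TRANSLATORS: dict[str, EngineTranslator] = {"toyengine": ToyEngineTranslator()}


def translate_via_engine(engine: str, combo: dict) -> dict:
    """Look up the translator for `engine` and translate the combo."""
    try:
        translator = TRANSLATORS[engine]
    except KeyError:
        raise ValueError(f"no translator registered for engine {engine!r}") from None
    return translator.translate(combo)
```

Keeping the dispatch table as a plain dict means an unregistered engine fails loudly at lookup time rather than silently falling back to defaults.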

Quality gate

Four gate methods, defaulting to topk5:

latency).

Override via run request body (quality_gate field) when a target demands
it, e.g. manual for research runs, regression for production canaries.

Statistical rigor

Every combo is measured for N repetitions (default 7). The top and
bottom 10% of samples are trimmed (a symmetric trim) to limit the
influence of outliers. Welch's t-test compares each combo against the
baseline; effect size (Cohen's d) is reported alongside significance so
callers can distinguish "statistically significant, but tiny" from "big
win". Early termination kicks in when the confidence interval tightens
below the noise floor and the winner is unambiguous. The noise floor is
hardware-specific; see the operator runbook for tuning guidance.
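The statistics described above can be sketched with the standard library alone. This is an illustrative outline, not the production code: the function names are hypothetical, the trim rounds down (so with the default 7 reps a 10% trim removes nothing, and the real implementation may round differently), and the p-value lookup against the t distribution is left out.

```python
import math
from statistics import mean, variance


def trim(samples, fraction=0.10):
    """Drop the lowest and highest `fraction` of samples (symmetric trim)."""
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)  # samples dropped from each tail
    return ordered[k:len(ordered) - k] if k else ordered


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled
```

For example, candidate throughputs `[10, 11, 12, 13, 14]` against baseline `[8, 9, 10, 11, 12]` give t = 2.0 with 8 degrees of freedom and d of about 1.26. Reporting d next to the test result is what lets a caller see whether a significant difference is also a large one.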

Cache

Tuning is expensive, so results cache for 30 days in Redis (hot) + DB
(cold). Cache key: `(catalog_version, engine, engine_version, model_hash,
hardware_fingerprint, workload_fingerprint, combo_hash)`. Invalidation
triggers: catalog version bump, engine version bump, explicit admin flush
(see runbook), TTL expiry.
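One way to turn that tuple into a storage key is sketched below. The field order follows the tuple above; the `inferenceiq:l1:` prefix, the SHA-256 digest, and the `cache_key` helper are assumptions for illustration, not the actual key format.

```python
import hashlib
import json

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days

KEY_FIELDS = (
    "catalog_version",
    "engine",
    "engine_version",
    "model_hash",
    "hardware_fingerprint",
    "workload_fingerprint",
    "combo_hash",
)


def cache_key(**parts: str) -> str:
    """Build a deterministic key from the seven cache-key components."""
    missing = [field for field in KEY_FIELDS if field not in parts]
    if missing:
        raise KeyError(f"missing cache key fields: {missing}")
    # Serialise in a fixed field order so the same inputs always hash equally.
    payload = json.dumps([parts[field] for field in KEY_FIELDS])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"inferenceiq:l1:{digest}"  # prefix is hypothetical
```

Because catalog_version and engine_version are part of the key, a version bump produces a new key and older entries simply stop being hit; whether the service also deletes them eagerly is not something this sketch covers.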

Tier 2 autotune

Bayesian optimisation over a narrower space, using Tier 1 Pareto points
as priors. The narrower space is produced by ConfigNarrower, which drops
dimensions that Tier 1 found insensitive. Budget is enforced as
wall-clock minutes or GPU-hours; when the budget is exhausted, the
session emits its best-so-far result. State is persisted as
bayesian_state JSON so that a crashed Celery worker can resume rather
than restart.
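The budget and resume behaviour can be illustrated with a small loop. Note the substitutions: random sampling stands in for the Bayesian optimiser, and the state layout, `run_session`, `measure`, and `checkpoint` are hypothetical names rather than the real bayesian_state schema or session API.

```python
import json
import random


def run_session(saved_state, space, measure, budget_s, checkpoint, seed=0):
    """Run trials until the budget is spent; return the final state as JSON.

    saved_state: JSON string from an earlier checkpoint, or None to start fresh.
    measure:     callable(combo) -> (score, cost_seconds); higher score is better.
    checkpoint:  callable(json_str), invoked after every trial so a crashed
                 worker can pick up from the last completed trial.
    """
    if saved_state:
        state = json.loads(saved_state)
    else:
        state = {"spent_s": 0.0, "trials": [], "best": None}

    # Random search is a placeholder for the real optimiser.
    rng = random.Random(seed + len(state["trials"]))

    while state["spent_s"] < budget_s:
        combo = {dim: rng.choice(values) for dim, values in space.items()}
        score, cost_s = measure(combo)
        state["spent_s"] += cost_s
        state["trials"].append({"combo": combo, "score": score})
        if state["best"] is None or score > state["best"]["score"]:
            state["best"] = {"combo": combo, "score": score}
        checkpoint(json.dumps(state))

    return json.dumps(state)  # best-so-far is in state["best"]
```

Passing the last checkpoint back in as `saved_state` continues from the recorded spend and trial history, so only the trial that was in flight at the time of the crash is lost.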

Tier 3 foundation

Three pieces ship with L1:

  1. Sandbox pod: isolated namespace with no network egress except to
     the registry and telemetry sinks. New kernels run here before ever
     touching a real tenant.
  2. Provider abstraction: pluggable generator interface so we can
     swap LLM-based, template-based, or human-authored kernel sources.
  3. Registry: the l1_kernel_registry table + approval workflow.

The generator itself is deliberately out of scope for spec L1; it ships
as its own spec so we can land the machinery without promising
generation quality.

Observability

Prometheus metrics (all prefixed inferenceiq_l1_):

Grafana dashboard: UID iiq-l1-kernel-tuning. Panels for run throughput,
cache hit rate, quality-gate rejection rate, tune-session budget burn,
registry queue depth.

Alerts: see the operator runbook.

Running locally

L1 lives in the ModelOps service and uses Celery for async work.

# Start infra + ModelOps + frontend
cd infra/compose
./dev.sh start

# Apply migrations
docker exec compose-modelops-1 alembic upgrade head

# Unit tests
docker exec compose-modelops-1 pytest tests/unit/test_kernel_l1_handler.py

# Playwright (post-merge on platform.inwire.ai only — never locally)

Deviations from spec L1 2026-04-19 draft

See PR §10 "Deviations" for the canonical list. Notable items:

pruning-predicate reordering, Tier 3 generator deferred to a follow-up

spec, tune-session resume added after draft review.

Landing v2 (post-2026-04-24)

The landing page at /modelops/inferenceiq/kernels was rewritten
end-to-end per the LOCKED plan at
analysis-docs/inferenceiq/v2-specs/demo-readiness-docs/inferenceiq-v2/L1-kernel-tuning-PRODUCTION-FIX-PLAN.md.

User-visible changes

Feature flag

Removed in a follow-up. v2 is the only landing now; the legacy
page.legacy.tsx was deleted alongside the
NEXT_PUBLIC_ENABLE_L1_LANDING_V2 flag because NEXT_PUBLIC_* is inlined
at next build time and thus not actually runtime-flippable from a
ConfigMap. Rollback is git revert of the merging PR.

The tool-level kill switch (ENABLE_L1_KERNEL_TUNING) still works.

New API surface (plan §6)

| Endpoint | Purpose |
| --- | --- |
| GET /kernels/runs | List runs (filters + pagination) |
| GET /kernels/deployments-recent | Deployments with L1 activity in last 30d |
| GET /kernels/deployments/{id}/current-combo | Read-back |
| GET /kernels/deployments/{id}/drift | From calibration_runs (minor 5-15%, significant >15%) |
| POST /kernels/deployments/{id}/validate | Pre-submission existence + RLS check |
| GET /kernels/catalog/preview | Vidur-only combo preview per mode |
| POST /metrics/client | TTFS telemetry. Histogram l1_landing_ttfs_seconds |
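The drift thresholds quoted for the drift endpoint can be read as the classification below. This is a sketch of the bands only: the `classify_drift` name, the `"none"` label for values under 5%, the use of the absolute value, and the inclusive handling of exactly 5% and 15% are assumptions.

```python
def classify_drift(drift_pct: float) -> str:
    """Bucket a drift percentage into the bands quoted for the drift endpoint."""
    magnitude = abs(drift_pct)  # assumption: direction of drift is ignored
    if magnitude > 15.0:
        return "significant"
    if magnitude >= 5.0:
        return "minor"
    return "none"  # hypothetical label for drift below the minor band
```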

Observability (plan §9)

Test guardrail

Nightly Playwright e2e at
frontend/e2e/inferenceiq-L1-kernels-landing.spec.ts against
platform.inwire.ai: 12 scenarios (8 functional + 4 a11y), no
test.skip(), no mocked API. Plus a smoke curl every 6 hours.