InferenceIQ User Guide

Use InferenceIQ to optimize inference cost, latency, throughput, hardware fit, and serving engine settings, and to feed production measurements back into later recommendations.

Who This Guide Is For

Inference engineers
ML performance engineers
MLOps operators

Where To Go

Page	Use It For
`/modelops/inferenceiq`	InferenceIQ dashboard.
`/modelops/inferenceiq/planner`	AI-assisted optimization planner.
`/modelops/inferenceiq/experiments`	Experiment lifecycle and history.
`/modelops/inferenceiq/hardware`	L0 hardware profiling.
`/modelops/inferenceiq/kernels`	L1 kernel tuning.
`/modelops/inferenceiq/quantization`	L2 quantization.
`/modelops/inferenceiq/parallelism`	L3 parallelism.
`/modelops/inferenceiq/speculation`	L4 speculative decoding.
`/modelops/inferenceiq/batching`	L5 batching and scheduling.
`/modelops/inferenceiq/benchmark`	Benchmark and validation runs.
`/modelops/inferenceiq/registry`	Validated configuration registry.
`/modelops/inferenceiq/monitor`	Production inference feedback.

Core Concepts

Concept	Meaning
Objective	The optimization target: throughput, latency, or cost.
Experiment	A measured optimization run over a model, engine, and hardware context.
Validated configuration	A tested set of serving parameters that can be applied to deployments.
Tool level	One of the optimization areas L0 through L8, which run from hardware profiling (L0) through distillation.
Apply flow	The controlled process for moving a validated config into deployment settings.
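
The concepts above can be pictured as a small data model. This is a minimal sketch for orientation only: the class names, field names, and validation rule are assumptions made for illustration and are not InferenceIQ's actual schema or API.

```python
from dataclasses import dataclass
from enum import Enum


class Objective(Enum):
    """The three optimization targets named in this guide."""
    THROUGHPUT = "throughput"
    LATENCY = "latency"
    COST = "cost"


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record of one measured run over a model, engine, and hardware context."""
    model: str
    engine: str
    hardware: str
    tool_level: int          # 0..8, corresponding to L0 through L8
    objective: Objective

    def __post_init__(self):
        # Reject levels outside the L0-L8 range described in the table.
        if not 0 <= self.tool_level <= 8:
            raise ValueError(f"tool level must be L0-L8, got L{self.tool_level}")

    @property
    def level_label(self) -> str:
        """Render the level the way the guide writes it, e.g. 'L2'."""
        return f"L{self.tool_level}"
```

Keeping the objective and tool level on the experiment record makes it easy to filter experiment history by what was being optimized and at which level.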

Common Workflows

Run an optimization loop

Choose an objective and the model or deployment context.
Run hardware profiling if the available capacity is unknown.
Run the recommended experiments from tool levels L0-L8.
Benchmark the candidates against quality and SLO targets.
Save the winning candidate as a validated configuration.
Apply the configuration to an existing deployment or through the new deployment wizard.
Monitor production feedback.
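
The benchmarking step in the loop above amounts to a gate followed by a selection: discard candidates that miss the quality floor or the SLO, then pick the best survivor for the chosen objective. The sketch below illustrates that idea under stated assumptions. The `Candidate` fields, the use of p95 latency as the SLO metric, and the function names are all hypothetical and do not reflect how InferenceIQ stores or scores results.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical benchmark result for one candidate configuration."""
    name: str
    quality: float             # score on an evaluation set, higher is better
    p95_latency_ms: float      # assumed SLO metric
    tokens_per_sec: float
    cost_per_1k_tokens: float


def passes_gate(c, quality_floor, p95_slo_ms):
    """A candidate is eligible only if it meets both the quality floor and the SLO."""
    return c.quality >= quality_floor and c.p95_latency_ms <= p95_slo_ms


def pick_best(candidates, objective, quality_floor, p95_slo_ms):
    """Return the best eligible candidate for the objective, or None if none pass."""
    eligible = [c for c in candidates if passes_gate(c, quality_floor, p95_slo_ms)]
    if not eligible:
        return None
    if objective == "throughput":
        return max(eligible, key=lambda c: c.tokens_per_sec)
    if objective == "latency":
        return min(eligible, key=lambda c: c.p95_latency_ms)
    if objective == "cost":
        return min(eligible, key=lambda c: c.cost_per_1k_tokens)
    raise ValueError(f"unknown objective: {objective!r}")
```

Applying the gate before the ranking means a fast or cheap candidate that degrades quality is never selected, whatever the objective is.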

Use the planner

Open Planner.
Select the model, hardware, workload, and target objective.
Review the recommended sequence and its confidence.
Run or save the plan.
Open the plan detail to track actions and evidence.

Best Practices

Benchmark every applied change against quality floors and SLOs.
Treat cold changes like quantization, pruning, and distillation as release candidates, not live tweaks.
Keep validated configs linked to the model, engine, GPU, and workload they were measured against.
Use production feedback to recalibrate future recommendations.
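
One way to keep validated configs linked to their measurement context is to key them on the full (model, engine, GPU, workload) tuple, so a lookup for a different context returns nothing rather than a config measured elsewhere. The following is a minimal in-memory sketch of that idea. `Context`, `ConfigRegistry`, and the parameter name `max_batch_size` are hypothetical and are not the InferenceIQ registry's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """The measurement context a validated config is tied to (hypothetical shape)."""
    model: str
    engine: str
    gpu: str
    workload: str


class ConfigRegistry:
    """Illustrative registry that only returns configs on an exact context match."""

    def __init__(self):
        self._configs = {}

    def save(self, context, params):
        # Store a copy so later mutation of the caller's dict does not alter the record.
        self._configs[context] = dict(params)

    def lookup(self, context):
        """Return a copy of the saved params, or None if this exact context was never validated."""
        params = self._configs.get(context)
        return dict(params) if params is not None else None


registry = ConfigRegistry()
measured = Context("example-model", "example-engine", "gpu-a", "chat")
registry.save(measured, {"max_batch_size": 32})

# Same model and engine on a different GPU: no validated config is returned.
other_gpu = Context("example-model", "example-engine", "gpu-b", "chat")
```

Here `registry.lookup(measured)` returns the saved parameters, while `registry.lookup(other_gpu)` returns `None`, which signals that the config should be re-benchmarked before being applied to that hardware.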