InferenceIQ User Guide
Use InferenceIQ to optimize inference cost, latency, and throughput, tune hardware fit and serving engine settings, and feed production results back into future optimization.
Who This Guide Is For
- Inference engineers
- ML performance engineers
- MLOps operators
Where To Go
| Page | Use It For |
| --- | --- |
| /modelops/inferenceiq | InferenceIQ dashboard. |
| /modelops/inferenceiq/planner | AI-assisted optimization planner. |
| /modelops/inferenceiq/experiments | Experiment lifecycle and history. |
| /modelops/inferenceiq/hardware | L0 hardware profiling. |
| /modelops/inferenceiq/kernels | L1 kernel tuning. |
| /modelops/inferenceiq/quantization | L2 quantization. |
| /modelops/inferenceiq/parallelism | L3 parallelism. |
| /modelops/inferenceiq/speculation | L4 speculative decoding. |
| /modelops/inferenceiq/batching | L5 batching and scheduling. |
| /modelops/inferenceiq/benchmark | Benchmark and validation runs. |
| /modelops/inferenceiq/registry | Validated configuration registry. |
| /modelops/inferenceiq/monitor | Production inference feedback. |
Core Concepts
| Concept | Meaning |
| --- | --- |
| Objective | The optimization target: throughput, latency, or cost. |
| Experiment | A measured optimization run over a model, engine, and hardware context. |
| Validated configuration | A tested set of serving parameters that can be applied to deployments. |
| Tool level | Optimization areas L0-L8, from hardware profiling through distillation. |
| Apply flow | The controlled process for moving a validated config into deployment settings. |
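
To make the relationships between these concepts concrete, here is a minimal sketch that models them as plain Python data types. Every class and field name below (`Objective`, `Context`, `Experiment`, `ValidatedConfig`, `serving_params`, and so on) is hypothetical and chosen for illustration; it is not InferenceIQ's actual schema or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Objective(Enum):
    """The three optimization targets named in the concepts table."""
    THROUGHPUT = "throughput"
    LATENCY = "latency"
    COST = "cost"


@dataclass(frozen=True)
class Context:
    """Hypothetical record of what an experiment was measured against."""
    model: str
    engine: str
    hardware: str


@dataclass
class Experiment:
    """A measured optimization run at one tool level (L0-L8)."""
    tool_level: int
    objective: Objective
    context: Context
    metrics: dict = field(default_factory=dict)

    def __post_init__(self):
        # Tool levels run from L0 through L8 per the concepts table.
        if not 0 <= self.tool_level <= 8:
            raise ValueError("tool_level must be between 0 and 8")


@dataclass
class ValidatedConfig:
    """A tested set of serving parameters plus the evidence behind it."""
    context: Context
    serving_params: dict
    source_experiment: Experiment
```

Keeping the `Context` on both the experiment and the validated configuration is one way to make the later apply step checkable: a configuration always carries the model, engine, and hardware it was measured on.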
Common Workflows
Run an optimization loop
- Choose objective and model/deployment context.
- Run hardware profiling if capacity is unknown.
- Run recommended L0-L8 experiments.
- Benchmark candidates against quality and SLO targets.
- Save validated configuration.
- Apply the configuration to an existing deployment, or use it in the new deployment wizard.
- Monitor production feedback.
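
The benchmark step in the loop above can be pictured as a gate followed by a ranking: candidates that miss the quality floor or the latency SLO are dropped, and the rest are ranked by the chosen objective. The sketch below assumes a throughput objective; the `Candidate` fields, the `pick_validated` helper, and all numbers are hypothetical illustrations, not InferenceIQ output or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    quality: float          # evaluation score, higher is better
    p95_latency_ms: float   # lower is better
    throughput_tps: float   # tokens per second, higher is better


def pick_validated(
    candidates: List[Candidate],
    quality_floor: float,
    latency_slo_ms: float,
) -> Optional[Candidate]:
    """Return the highest-throughput candidate that passes both gates."""
    passing = [
        c for c in candidates
        if c.quality >= quality_floor and c.p95_latency_ms <= latency_slo_ms
    ]
    # default=None covers the case where every candidate fails a gate.
    return max(passing, key=lambda c: c.throughput_tps, default=None)


# Made-up benchmark results for three candidate configurations.
candidates = [
    Candidate("fp16-baseline", quality=0.910, p95_latency_ms=180.0, throughput_tps=1200.0),
    Candidate("int8", quality=0.905, p95_latency_ms=140.0, throughput_tps=1900.0),
    Candidate("int4", quality=0.870, p95_latency_ms=110.0, throughput_tps=2600.0),
]

best = pick_validated(candidates, quality_floor=0.90, latency_slo_ms=200.0)
print(best.name if best else "no candidate passed")  # prints "int8"
```

In this example the fastest candidate is rejected because it falls below the quality floor, which is the situation the benchmark step is meant to catch before a configuration is saved as validated.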
Use the planner
- Open Planner.
- Select model, hardware, workload, and target objective.
- Review recommended sequence and confidence.
- Run or save the plan.
- Open plan detail to track actions and evidence.
Best Practices
- Benchmark every applied change against quality floors and SLOs.
- Treat cold changes such as quantization, pruning, and distillation as release candidates, not live tweaks.
- Keep validated configs linked to the model, engine, GPU, and workload they were measured against.
- Use production feedback to recalibrate future recommendations.
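
One way to act on the practice of keeping validated configs linked to their measurement context is to compare that context with the target deployment before applying anything. The sketch below is a hypothetical pre-apply check; the field names and the `context_mismatches` helper are assumptions for illustration and do not describe how InferenceIQ's apply flow is implemented.

```python
# Fields that a validated config should stay linked to, per the list above.
CONTEXT_FIELDS = ("model", "engine", "gpu", "workload")


def context_mismatches(measured: dict, target: dict) -> list:
    """Return the context fields where the target differs from what was measured."""
    return [f for f in CONTEXT_FIELDS if measured.get(f) != target.get(f)]


# Placeholder values: the config was measured on one GPU type,
# but the target deployment runs on another.
measured = {
    "model": "example-7b",
    "engine": "example-engine",
    "gpu": "gpu-type-a",
    "workload": "chat",
}
target = dict(measured, gpu="gpu-type-b")

print(context_mismatches(measured, target))  # prints ['gpu']
```

A non-empty result suggests the configuration should be re-benchmarked in the new context rather than applied as-is.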