Overview
InferenceIQ is the AI-powered inference optimization advisor embedded in every model deployment workflow on the inwire platform. Before a model goes to production, InferenceIQ analyzes its architecture, estimates resource requirements, and recommends the optimal configuration across five dimensions: latency, throughput, cost, reliability, and sustainability. It supports every major inference engine (vLLM, TGI, TensorRT-LLM, ONNX Runtime, Triton, llama.cpp) and every major GPU family from NVIDIA, AMD, and Intel. InferenceIQ is the starting point; IGT is the deep dive. Together, they ensure no GPU cycle is wasted.
Capabilities
Multi-Objective Scoring Across Five Dimensions
Every recommendation is scored across latency, throughput, cost, reliability, and sustainability. See the trade-offs visually. Choose the configuration that matches your priorities, not a one-size-fits-all default.
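The scoring idea above can be sketched as a weighted average over the five dimensions. This is a minimal illustration, not InferenceIQ's actual scoring model; the function name, weight profile, and example values are assumptions.

```python
# Hypothetical sketch: weighted multi-objective scoring.
# Per-dimension scores are normalized to 0-1; weights express user priorities.

DIMENSIONS = ("latency", "throughput", "cost", "reliability", "sustainability")

def score_config(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores (all values in 0-1)."""
    total_weight = sum(weights.get(d, 0.0) for d in DIMENSIONS)
    return sum(scores[d] * weights.get(d, 0.0) for d in DIMENSIONS) / total_weight

# A latency-focused profile favors a faster but pricier configuration.
weights = {"latency": 0.5, "throughput": 0.2, "cost": 0.1,
           "reliability": 0.1, "sustainability": 0.1}
fast = {"latency": 0.9, "throughput": 0.8, "cost": 0.3,
        "reliability": 0.9, "sustainability": 0.5}
print(round(score_config(fast, weights), 2))  # -> 0.78
```

Shifting weight toward cost would rank a cheaper, slower configuration higher with the same per-dimension scores.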
Inference Engine Recommendation
Automatically recommends the best inference engine based on model type and size: vLLM for large language models, TGI for smaller LLMs, TensorRT-LLM for NVIDIA-optimized deployments, ONNX Runtime for cross-platform NLP, Triton for vision models, llama.cpp for edge deployments.
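The mapping above can be expressed as a small rule table. The size thresholds and the NVIDIA-optimization flag below are illustrative assumptions, not InferenceIQ's actual cutoffs.

```python
# Hypothetical sketch of the engine-selection rules described above.
def recommend_engine(model_type: str, params_b: float,
                     nvidia_optimized: bool = False) -> str:
    """Pick an inference engine from model type and size (billions of params)."""
    if model_type == "vision":
        return "Triton"
    if model_type == "nlp":
        return "ONNX Runtime"       # cross-platform NLP
    if model_type == "llm":
        if nvidia_optimized:
            return "TensorRT-LLM"   # NVIDIA-optimized deployments
        if params_b < 1:
            return "llama.cpp"      # edge-sized models
        if params_b < 7:
            return "TGI"            # smaller LLMs
        return "vLLM"               # large language models
    return "Triton"                 # generic fallback

print(recommend_engine("llm", 70))  # -> vLLM
```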
GPU Selection and Memory Fit Analysis
Analyzes model parameter count, dtype, and KV cache requirements to recommend the right GPU type and count. Supports 13+ GPU profiles: B200, H200, H100, A100, L40S, L4, T4, MI300X, MI325X, Gaudi 3, and more.
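The memory-fit analysis can be approximated with a standard formula: weight bytes plus KV cache bytes for a full-context batch. The sketch below is a simplification with assumed per-GPU memory figures and a hypothetical headroom factor; the real analysis also models KV cache growth across concurrent requests.

```python
# Hypothetical sketch: estimate memory footprint and pick the smallest fitting GPU.

DTYPE_BYTES = {"fp16": 2, "fp8": 1, "int8": 1, "int4": 0.5}
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A100": 80, "H100": 80,
                 "H200": 141, "MI300X": 192}

def estimate_memory_gb(params_b, dtype, layers, kv_heads, head_dim,
                       context_len, batch=1):
    weights = params_b * 1e9 * DTYPE_BYTES[dtype]
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * batch * bytes
    kv = 2 * layers * kv_heads * head_dim * context_len * batch * DTYPE_BYTES[dtype]
    return (weights + kv) / 1e9

def smallest_fitting_gpu(needed_gb, headroom=0.9):
    for gpu, mem in sorted(GPU_MEMORY_GB.items(), key=lambda item: item[1]):
        if needed_gb <= mem * headroom:
            return gpu
    return None  # needs multi-GPU sharding

# An 8B GQA model in FP16 at 8k context needs roughly 17 GB:
need = estimate_memory_gb(8, "fp16", layers=32, kv_heads=8,
                          head_dim=128, context_len=8192)
print(round(need, 1), smallest_fitting_gpu(need))  # -> 17.1 L4
```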
Quantization Strategy Advisor
Recommends optimal quantization (FP16, FP8, INT8, INT4, AWQ, GPTQ, GGUF) based on model size, architecture compatibility, and quality sensitivity. Models over 70B parameters get AWQ/GPTQ recommendations. Smaller models get precision-preserving strategies.
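A rule-of-thumb version of this advisor might look like the sketch below. Only the 70B cutoff comes from the text above; the remaining thresholds, the edge/GGUF rule, and the AWQ-vs-GPTQ tie-break are illustrative assumptions.

```python
# Hypothetical sketch of size- and sensitivity-based quantization rules.
def recommend_quantization(params_b: float, target: str = "server",
                           quality_sensitive: bool = False) -> str:
    if target == "edge":
        return "GGUF"                 # llama.cpp-compatible format for edge
    if params_b > 70:
        # Large models get weight-only 4-bit quantization (per the text above).
        return "AWQ" if quality_sensitive else "GPTQ"
    if quality_sensitive:
        return "FP16"                 # precision-preserving for smaller models
    return "FP8" if params_b > 13 else "INT8"

print(recommend_quantization(175))                          # -> GPTQ
print(recommend_quantization(7, quality_sensitive=True))    # -> FP16
```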
Cloud Cost Comparison
Real-time pricing across 19 cloud configurations spanning AWS, GCP, Azure, Lambda Labs, CoreWeave, RunPod, and Together AI. See cost-per-hour, cost-per-million-tokens, and monthly projections for every GPU option.
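The per-token and monthly figures derive from two inputs: an hourly GPU price and sustained throughput. The arithmetic is sketched below; the $2.50/hr price and 1,000 tokens/s throughput are illustrative numbers, not live quotes.

```python
# Hypothetical sketch: cost-per-million-tokens and monthly projection.

def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1e6

def monthly_cost(hourly_usd: float, hours: int = 730) -> float:
    """Projection at ~730 hours per month of continuous operation."""
    return hourly_usd * hours

# A $2.50/hr GPU sustaining 1,000 tokens/s:
print(round(cost_per_million_tokens(2.50, 1000), 3))  # -> 0.694 (USD per 1M tokens)
print(monthly_cost(2.50))                             # -> 1825.0 (USD per month)
```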
HuggingFace Deep Integration
Point InferenceIQ at any HuggingFace model URL. It automatically extracts architecture, parameter count, quantization compatibility, attention type, context length, and vocabulary size. Supports gated models with license acceptance and token authentication.
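The extraction step works on the model's config metadata. The sketch below shows the kind of fields involved, operating on an in-memory dict rather than a live HuggingFace download; the field names follow transformers `config.json` conventions, and the GQA/MHA heuristic is an assumption about how attention type might be inferred.

```python
# Hypothetical sketch: extract deployment-relevant metadata from a config dict.

def extract_metadata(config: dict) -> dict:
    return {
        "architecture": (config.get("architectures") or ["unknown"])[0],
        "context_length": config.get("max_position_embeddings"),
        "vocab_size": config.get("vocab_size"),
        "num_layers": config.get("num_hidden_layers"),
        # Fewer KV heads than attention heads implies grouped-query attention.
        "attention_type": ("GQA" if config.get("num_key_value_heads", 0)
                           < config.get("num_attention_heads", 0) else "MHA"),
    }

config = {  # abbreviated Llama-style config.json
    "architectures": ["LlamaForCausalLM"],
    "max_position_embeddings": 8192, "vocab_size": 128256,
    "num_hidden_layers": 32,
    "num_attention_heads": 32, "num_key_value_heads": 8,
}
print(extract_metadata(config)["attention_type"])  # -> GQA
```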
Confidence Scores and Explanations
Every recommendation includes a confidence score and a plain-English explanation of why it was chosen, what trade-offs were made, and what alternatives were considered. No black-box decisions.
Five Reasoning Backends
Uses OpenAI, Anthropic, Kimi/Moonshot, or self-hosted Ollama as its reasoning engine, and falls back to a deterministic rule-based engine when no LLM is available. The LLM analyzes your model; it never touches your data.
Scaling Policy Generation
Recommends Horizontal Pod Autoscaler (HPA) configuration: min and max replicas, autoscaling thresholds, scale-up and scale-down behavior. Optimized for your workload pattern: real-time, batch, or bursty.
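A generated policy of this shape can be sketched as a function from workload pattern to an `autoscaling/v2` HPA spec. The replica counts, utilization targets, and stabilization windows below are illustrative assumptions, not InferenceIQ's actual policy values.

```python
# Hypothetical sketch: per-workload-pattern HPA generation.

PROFILES = {
    "real-time": {"min": 2, "max": 20, "target_util": 50, "scale_down_wait_s": 300},
    "batch":     {"min": 1, "max": 10, "target_util": 80, "scale_down_wait_s": 60},
    "bursty":    {"min": 1, "max": 50, "target_util": 60, "scale_down_wait_s": 120},
}

def hpa_spec(deployment: str, pattern: str) -> dict:
    p = PROFILES[pattern]
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": deployment},
            "minReplicas": p["min"],
            "maxReplicas": p["max"],
            "metrics": [{"type": "Resource", "resource": {
                "name": "cpu",
                "target": {"type": "Utilization",
                           "averageUtilization": p["target_util"]}}}],
            # Slow scale-down avoids replica thrash on spiky traffic.
            "behavior": {"scaleDown": {
                "stabilizationWindowSeconds": p["scale_down_wait_s"]}},
        },
    }

print(hpa_spec("llm-server", "real-time")["spec"]["minReplicas"])  # -> 2
```

Serializing this dict to YAML yields a manifest ready for `kubectl apply`.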
Deployment-Ready Configuration Export
Every recommendation exports as a production-ready configuration: Kubernetes manifests, Helm values, or direct API payload. One click from recommendation to deployment. No manual translation.