Serving and Inference Gateway User Guide
Manage inference endpoints, keys, requests, observability, incidents, policies, rollouts, snapshots, routing, and failover.
Who This Guide Is For
- Inference platform teams
- Application developers
- SREs
Where To Go
| Page | Use It For |
| --- | --- |
| /serving | Serving overview. |
| /serving/endpoints | Endpoint catalog. |
| /serving/keys | Serving API keys. |
| /serving/requests | Request inspection. |
| /serving/observability | Serving metrics and traces. |
| /serving/policies | Routing and access policies. |
| /serving/rollouts | Canary, blue/green, and rollout controls. |
| /serving/incidents | Serving incidents. |
| /serving/snapshots | Endpoint and routing snapshots. |
| /serving/rim | Runtime intelligence and routing signals. |
Core Concepts
| Concept | Meaning |
| --- | --- |
| Endpoint | A stable serving interface for applications. |
| Inference Gateway | The request entry point that authenticates, routes, and observes inference calls. |
| Neural Router | The routing decision engine for deployment selection, traffic split, failover, and policy enforcement. |
| Rollout | A controlled release pattern such as canary, blue/green, or shadow. |
| Snapshot | A saved view of endpoint or routing configuration for review and rollback. |
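
To make the routing concepts concrete, here is a minimal Python sketch of how a routing engine might combine a weighted traffic split with failover. It is an illustration only, not the Neural Router's actual implementation: the `Deployment` type, the `pick_deployment` function, the percentage weights, and the hash-based bucketing are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical backend deployment behind an endpoint."""
    name: str
    weight: int          # relative share of traffic (for example, percent)
    healthy: bool = True


def pick_deployment(request_id: str, deployments: list[Deployment]) -> Deployment:
    """Choose a deployment by weighted hash, skipping unhealthy ones (failover)."""
    candidates = [d for d in deployments if d.healthy and d.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy deployment available")

    total = sum(d.weight for d in candidates)
    # Map the request ID to a stable bucket in [0, total), so the same
    # request ID is always routed the same way for a given set of candidates.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the candidates until the bucket falls inside one weight range.
    for deployment in candidates:
        if bucket < deployment.weight:
            return deployment
        bucket -= deployment.weight
    raise AssertionError("unreachable: bucket is always below the total weight")
```

With a `stable` deployment at weight 90 and a `canary` at weight 10, roughly one request in ten goes to the canary. If the canary is marked unhealthy, its share falls back to the remaining healthy deployments.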
Common Workflows
Create a serving key
- Open Serving -> Keys.
- Create a key for a specific application or environment.
- Store it in a secret manager.
- Set usage limits if available.
- Rotate keys on a schedule, and revoke keys that are no longer needed.
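The steps above leave the key in a secret manager rather than in source code. As a minimal sketch of the application side, the following Python builds an authenticated request from a key supplied through the environment. The variable name `SERVING_API_KEY`, the `Bearer` authorization scheme, and the URL layout are assumptions for illustration; this guide does not specify them, so check your gateway's API reference for the real values.

```python
import json
import os
import urllib.request


def build_inference_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the serving key."""
    # Assumption: the secret manager injects the key as SERVING_API_KEY.
    api_key = os.environ.get("SERVING_API_KEY")
    if not api_key:
        raise RuntimeError("SERVING_API_KEY is not set; load it from your secret manager")

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the gateway accepts a bearer token header.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the key is read at call time, rotating it only requires updating the secret and restarting or reloading the application; no code change is needed.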
Investigate serving latency
- Open Observability.
- Filter by endpoint and time range.
- Inspect latency percentiles and error rate.
- Open traces for slow requests.
- Check routing policy and backend deployment health.
- Apply a scaling, rollout, or routing change if needed.
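If you export request data for offline analysis, the percentile and error-rate step can be reproduced with a short script. This is a sketch under stated assumptions: the `RequestRecord` fields, the nearest-rank percentile method, and the choice to count only 5xx statuses as errors are illustrative and may not match how the Observability page computes its figures.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical shape of one exported request record."""
    endpoint: str
    latency_ms: float
    status: int


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord], endpoint: str) -> dict:
    """Summarize latency percentiles and error rate for one endpoint."""
    selected = [r for r in records if r.endpoint == endpoint]
    if not selected:
        raise ValueError(f"no records for endpoint {endpoint!r}")
    latencies = [r.latency_ms for r in selected]
    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for r in selected if r.status >= 500)
    return {
        "count": len(selected),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(selected),
    }
```

A p99 that is far above the p50 while the error rate stays flat usually points at a slow subset of requests, which is where the trace view is most useful.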
Best Practices
- Use separate keys per application and environment.
- Keep rollout changes observable and reversible.
- Monitor request, token, latency, and cost metrics together.
- Use policy controls for failover, admission, and routing safety.
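One way to keep a rollout observable and reversible is to gate promotion on several metrics at once. The Python sketch below compares a canary window against a baseline window and returns a decision. The `WindowMetrics` fields, the `canary_decision` function, and every threshold are hypothetical values chosen for illustration, not defaults of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    """Hypothetical metrics for one deployment over one time window."""
    requests: int
    errors: int
    p95_latency_ms: float
    tokens: int
    cost_usd: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def cost_per_1k_tokens(self) -> float:
        return 1000 * self.cost_usd / self.tokens if self.tokens else 0.0


def canary_decision(
    baseline: WindowMetrics,
    canary: WindowMetrics,
    *,
    min_requests: int = 500,             # assumed minimum sample size
    max_error_rate_delta: float = 0.01,  # assumed absolute error-rate budget
    max_latency_ratio: float = 1.2,      # assumed p95 latency budget
    max_cost_ratio: float = 1.2,         # assumed cost-per-token budget
) -> str:
    """Return "hold", "rollback", or "promote" for the canary."""
    if canary.requests < min_requests:
        return "hold"  # not enough traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if canary.cost_per_1k_tokens > baseline.cost_per_1k_tokens * max_cost_ratio:
        return "rollback"
    return "promote"
```

Checking request volume, errors, latency, and cost in one place avoids promoting a release that looks healthy on one metric while regressing on another.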