Evaluation User Guide
Create datasets, benchmarks, safety checks, model comparisons, leaderboards, and evaluation jobs for models and prompts.
Who This Guide Is For
- Evaluation engineers
- ML engineers
- QA teams
- Governance reviewers
Where To Go
| Page | Use It For |
| --- | --- |
| /evaluation | Evaluation overview. |
| /evaluation/datasets | Evaluation datasets. |
| /evaluation/golden-datasets | Golden test sets. |
| /evaluation/benchmarks | Benchmark definitions. |
| /evaluation/jobs | Evaluation job status. |
| /evaluation/compare | Compare model or prompt outputs. |
| /evaluation/safety | Safety evaluation workflows. |
| /evaluation/leaderboard | Compare systems across metrics. |
| /evaluation/settings | Evaluation settings. |
Core Concepts
| Concept | Meaning |
| --- | --- |
| Golden dataset | A trusted set of examples and expected behavior. |
| Benchmark | A repeatable evaluation procedure and metric suite. |
| Job | A running or completed evaluation execution. |
| Judge | An automatic, human, or LLM-assisted scoring mechanism. |
| Leaderboard | A comparative view across models, prompts, or versions. |
Common Workflows
Run a pre-deployment evaluation
- Select a target model or deployment.
- Choose a dataset or benchmark.
- Set metrics and judge configuration.
- Run the job.
- Review pass/fail results.
- Attach the results as evidence for ModelOps promotion or gates.
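The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the product's API: `Example`, `EvalResult`, `exact_match_judge`, `run_evaluation`, the `threshold` parameter, and the toy model are all hypothetical names introduced here to show how a judge, a dataset, and a pass/fail gate fit together.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; not part of any documented API.
Model = Callable[[str], str]
Judge = Callable[[str, str], float]  # (output, expected) -> score in [0, 1]


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str


@dataclass
class EvalResult:
    score: float    # mean judge score across the dataset
    passed: bool    # True when the score meets the threshold
    failures: list  # examples the judge scored below 1.0


def exact_match_judge(output: str, expected: str) -> float:
    """An automatic judge: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_evaluation(model: Model, dataset: list, judge: Judge, threshold: float) -> EvalResult:
    """Score every example, then apply a single pass/fail threshold."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = [judge(model(ex.prompt), ex.expected) for ex in dataset]
    failures = [ex for ex, s in zip(dataset, scores) if s < 1.0]
    mean = sum(scores) / len(scores)
    return EvalResult(score=mean, passed=mean >= threshold, failures=failures)


# Toy target and golden dataset, for demonstration only.
def toy_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


golden = [
    Example("2+2", "4"),
    Example("capital of France", "Paris"),
    Example("3*3", "9"),
]
result = run_evaluation(toy_model, golden, exact_match_judge, threshold=0.9)
print(result.passed, [ex.prompt for ex in result.failures])
```

Returning the failing examples alongside the aggregate score keeps the evidence for a promotion decision in one object, so a reviewer can inspect what failed rather than only whether the gate passed.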
Compare model versions
- Open Compare.
- Select two or more targets.
- Choose the same benchmark and dataset for every target.
- Run the comparison.
- Review metric deltas and failure cases.
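As a rough sketch of what reviewing metric deltas and failure cases can mean in practice, the Python below compares two runs that used the same benchmark and dataset. The function names, metric names, example IDs, and numbers are hypothetical and chosen only for illustration; they do not describe the Compare page's actual output format.

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric both runs report."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


def regressed_cases(baseline_pass: dict, candidate_pass: dict) -> list:
    """Return example IDs that passed on the baseline but fail on the candidate."""
    return sorted(
        ex_id
        for ex_id, ok in baseline_pass.items()
        if ok and not candidate_pass.get(ex_id, False)
    )


# Made-up aggregate metrics for two model versions.
baseline_metrics = {"accuracy": 0.75, "latency_ms": 120.0}
candidate_metrics = {"accuracy": 0.875, "latency_ms": 150.0}

# Made-up per-example pass/fail results, keyed by example ID.
baseline_pass = {"ex-1": True, "ex-2": True, "ex-3": False}
candidate_pass = {"ex-1": True, "ex-2": False, "ex-3": True}

deltas = metric_deltas(baseline_metrics, candidate_metrics)
regressions = regressed_cases(baseline_pass, candidate_pass)
print(deltas)
print(regressions)
```

Looking at per-example regressions as well as aggregate deltas matters because a candidate can improve an average while breaking cases the baseline handled correctly.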
Best Practices
- Keep golden datasets small, trusted, and versioned.
- Separate quality, safety, latency, and cost metrics.
- Use evaluation results as promotion evidence, not just dashboards.
- Review failure examples before approving production changes.
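One way to make a golden dataset "versioned" in a checkable sense is to record a content fingerprint with every evaluation result. The sketch below is an assumption about how that could be done, not a description of how the product versions datasets; `dataset_fingerprint` and the example records are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(examples: list) -> str:
    """Return a SHA-256 hex digest of the examples in a canonical JSON form."""
    # sort_keys makes the digest independent of key order within each example.
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


golden_v1 = [
    {"id": "ex-1", "prompt": "2+2", "expected": "4"},
    {"id": "ex-2", "prompt": "capital of France", "expected": "Paris"},
]
# Same content with keys in a different order: the fingerprint is unchanged.
golden_v1_reordered = [
    {"expected": "4", "prompt": "2+2", "id": "ex-1"},
    {"prompt": "capital of France", "id": "ex-2", "expected": "Paris"},
]
# One expected answer edited: the fingerprint changes.
golden_v2 = [dict(golden_v1[0]), {**golden_v1[1], "expected": "paris"}]

fp_v1 = dataset_fingerprint(golden_v1)
print(fp_v1 == dataset_fingerprint(golden_v1_reordered))
print(fp_v1 == dataset_fingerprint(golden_v2))
```

Storing the fingerprint next to each job's results lets a reviewer confirm that two runs were scored against exactly the same examples. Note that this sketch treats the order of examples in the list as significant.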