Evaluation User Guide

Create datasets, benchmarks, safety checks, model comparisons, leaderboards, and evaluation jobs for models and prompts.

Who This Guide Is For

Page	Use It For
`/evaluation`	Evaluation overview.
`/evaluation/datasets`	Evaluation datasets.
`/evaluation/golden-datasets`	Golden datasets (trusted test sets).
`/evaluation/benchmarks`	Benchmark definitions.
`/evaluation/jobs`	Evaluation job status.
`/evaluation/compare`	Compare model or prompt outputs.
`/evaluation/safety`	Safety evaluation workflows.
`/evaluation/leaderboard`	Compare systems across metrics.
`/evaluation/settings`	Evaluation settings.

Concept	Meaning
Golden dataset	A trusted set of examples and expected behavior.
Benchmark	A repeatable evaluation procedure and metric suite.
Job	A single evaluation run, either in progress or completed.
Judge	An automatic, human, or LLM-assisted scoring mechanism.
Leaderboard	A comparative view across models, prompts, or versions.
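
To show how the concepts in the table above relate to each other, here is a minimal Python sketch. It is not the product's API. Every name in it (`GoldenExample`, `Benchmark`, `exact_match_judge`, `leaderboard`, and the toy systems) is hypothetical and chosen only for illustration. It assumes a judge returns a score between 0.0 and 1.0, and that a job's result is the mean score over the golden dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not part of any real API.
System = Callable[[str], str]        # a model or prompt under evaluation
Judge = Callable[[str, str], float]  # scores (output, expected) from 0.0 to 1.0


@dataclass(frozen=True)
class GoldenExample:
    """One trusted example: an input and its expected behavior."""
    prompt: str
    expected: str


def exact_match_judge(output: str, expected: str) -> float:
    """Automatic judge: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


@dataclass
class Benchmark:
    """A repeatable procedure: a golden dataset plus a judge."""
    name: str
    dataset: List[GoldenExample]
    judge: Judge

    def run(self, system: System) -> float:
        """Execute one evaluation job and return the mean score."""
        scores = [self.judge(system(ex.prompt), ex.expected) for ex in self.dataset]
        return sum(scores) / len(scores) if scores else 0.0


def leaderboard(benchmark: Benchmark, systems: Dict[str, System]) -> List[Tuple[str, float]]:
    """Rank systems on one benchmark, highest score first."""
    results = [(name, benchmark.run(fn)) for name, fn in systems.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy usage: two stand-in "systems" scored on a two-example golden dataset.
golden = [GoldenExample("2+2", "4"), GoldenExample("capital of France", "Paris")]
bench = Benchmark("toy-qa", golden, exact_match_judge)
systems = {
    "always-4": lambda prompt: "4",
    "lookup": lambda prompt: {"2+2": "4", "capital of France": "paris"}.get(prompt, ""),
}
board = leaderboard(bench, systems)
print(board)  # [('lookup', 1.0), ('always-4', 0.5)]
```

In this sketch the judge is a plain function, so a human or LLM-assisted judge could be swapped in without changing `Benchmark` or `leaderboard`.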