Evaluation User Guide

Create datasets, benchmarks, safety checks, model comparisons, leaderboards, and evaluation jobs for models and prompts.

Who This Guide Is For

Where To Go

Page                          Use It For
/evaluation                   Evaluation overview.
/evaluation/datasets          Evaluation datasets.
/evaluation/golden-datasets   Golden test sets.
/evaluation/benchmarks        Benchmark definitions.
/evaluation/jobs              Evaluation job status.
/evaluation/compare           Compare model or prompt outputs.
/evaluation/safety            Safety evaluation workflows.
/evaluation/leaderboard       Compare systems across metrics.
/evaluation/settings          Evaluation settings.

Core Concepts

Concept          Meaning
Golden dataset   A trusted set of examples and expected behavior.
Benchmark        A repeatable evaluation procedure and metric suite.
Job              A running or completed evaluation execution.
Judge            An automatic, human, or LLM-assisted scoring mechanism.
Leaderboard      A comparative view across models, prompts, or versions.
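
One way to picture how these concepts relate is the Python sketch below. It is illustrative only: the names GoldenExample, Benchmark, Job, and leaderboard, along with their fields and the "running"/"completed" status values, are assumptions made for this example and are not taken from the product's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenExample:
    """One trusted example: an input and the behavior expected for it."""
    prompt: str
    expected: str


@dataclass(frozen=True)
class Benchmark:
    """A repeatable procedure: a dataset plus the metrics to report."""
    name: str
    dataset: tuple[GoldenExample, ...]
    metrics: tuple[str, ...]


@dataclass
class Job:
    """One execution of a benchmark against a target (model, prompt, or version)."""
    target: str
    benchmark: Benchmark
    status: str = "running"  # hypothetical states: "running" or "completed"
    scores: dict[str, float] = field(default_factory=dict)


def leaderboard(jobs: list[Job], metric: str) -> list[tuple[str, float]]:
    """Rank completed jobs by a single metric, highest score first."""
    rows = [
        (job.target, job.scores[metric])
        for job in jobs
        if job.status == "completed" and metric in job.scores
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

In this sketch a leaderboard is just a sorted view over completed jobs, which is why jobs that are still running do not appear in it.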

Common Workflows

Run a pre-deployment evaluation

  1. Select a target model or deployment.
  2. Choose a dataset or benchmark.
  3. Set metrics and judge configuration.
  4. Run the job.
  5. Review pass/fail results.
  6. Attach evidence to ModelOps promotion or gates.
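
The steps above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: run_evaluation, exact_match_judge, toy_model, and the min_accuracy threshold are all hypothetical names, and the toy model stands in for a real model or deployment.

```python
from typing import Callable, NamedTuple


class Example(NamedTuple):
    prompt: str
    expected: str


def exact_match_judge(output: str, expected: str) -> float:
    """Automatic judge: 1.0 when the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_evaluation(
    target: Callable[[str], str],
    dataset: list[Example],
    judge: Callable[[str, str], float],
    min_accuracy: float,
) -> dict:
    """Score every example, then compare the mean score to the pass threshold."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    failures = []
    total = 0.0
    for example in dataset:
        output = target(example.prompt)
        score = judge(output, example.expected)
        total += score
        if score < 1.0:
            # Keep failing cases so they can be reviewed or attached as evidence.
            failures.append(
                {"prompt": example.prompt, "output": output, "expected": example.expected}
            )
    accuracy = total / len(dataset)
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy, "failures": failures}


# Stand-in target for illustration; a real run would call the model or deployment.
def toy_model(prompt: str) -> str:
    answers = {"Capital of France?": "Paris", "2 + 2 = ?": "5"}
    return answers.get(prompt, "")


golden = [Example("Capital of France?", "paris"), Example("2 + 2 = ?", "4")]
report = run_evaluation(toy_model, golden, exact_match_judge, min_accuracy=0.9)
print(report["passed"], report["accuracy"])  # prints: False 0.5
```

The returned dictionary carries both the pass/fail verdict and the failing cases, which is the kind of evidence a promotion gate would typically need.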

Compare model versions

  1. Open Compare.
  2. Select two or more targets.
  3. Choose the same benchmark and dataset for every target.
  4. Run the comparison.
  5. Review metric deltas and failure cases.
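
As a rough illustration of the last step, the Python sketch below computes metric deltas and lists regressed cases for two targets. The function names, metric names, and sample values are hypothetical and chosen only for this example; it assumes both targets were scored on the same benchmark and dataset.

```python
def metric_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Candidate minus baseline for every metric that both runs report."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: round(candidate[name] - baseline[name], 6) for name in shared}


def regressed_cases(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed on the baseline but failed on the candidate."""
    return sorted(
        case for case, passed in baseline.items()
        if passed and candidate.get(case) is False
    )


# Made-up results for two versions of the same system.
v1_metrics = {"accuracy": 0.82, "toxicity_rate": 0.04}
v2_metrics = {"accuracy": 0.87, "toxicity_rate": 0.05}
deltas = metric_deltas(v1_metrics, v2_metrics)

v1_cases = {"case-1": True, "case-2": True, "case-3": False}
v2_cases = {"case-1": True, "case-2": False, "case-3": True}
regressions = regressed_cases(v1_cases, v2_cases)
```

Whether a positive delta is an improvement depends on the metric: higher accuracy is better, while a higher toxicity rate is worse, so deltas are best read alongside the individual failure cases.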

Best Practices