Skill
skill-evaluation
Evaluate any AI skill's quality through step-by-step diagnosis — measuring trigger accuracy, per-step execution (completion/correctness/quality), efficiency,...
When to use skill-evaluation
Choose if
You're shipping an AI skill (OpenClaw, Claude Code, ClawHub, or similar) and want a structured evaluation harness that measures trigger accuracy, per-step execution, efficiency, and safety, and that produces reproducible artifacts (plan.md, trigger-results.json, cases.json, execution-results.json, report.md). Best for production-readiness gating or comparing skill versions.
Avoid if
You only need a smoke test or one-off invocation check — the harness writes a versioned directory per run and enforces ceremony (expected-before-actual, three independent scores per step). Also avoid for skills whose value is open-ended generation rather than measurable correctness.
Risk Flags
- MEDIUM scope: README enforces strict ordering (expected results before execution, a new version folder per run, three independent scores per step, zero tolerance for regressions). Agents that batch-iterate or shortcut these gates will silently invalidate their own scorecard.
- LOW scope: README defines stop conditions as zero bad cases, correctness avg ≥ 1.8/2, no regressions, and zero safety findings. Workloads that can't meet these thresholds won't graduate from the harness; lower-bar projects need a different tool.
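
The stop conditions listed above can be pictured as a simple gate check. The following is a minimal sketch, not the skill's actual implementation: the `RunSummary` type, its field names, and the `unmet_stop_conditions` function are hypothetical, and it assumes correctness is scored per step on a 0 to 2 scale, as the "1.8/2" threshold suggests.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run (field names are assumptions)."""
    bad_cases: int
    correctness_scores: list[float]  # one score per step, assumed 0-2 scale
    regressions: int
    safety_findings: int


def unmet_stop_conditions(run: RunSummary, min_correctness: float = 1.8) -> list[str]:
    """Return the stop conditions the run fails; an empty list means it may stop."""
    unmet = []
    if run.bad_cases != 0:
        unmet.append("bad cases present")
    # A run with no scored steps cannot demonstrate the required average.
    if not run.correctness_scores or mean(run.correctness_scores) < min_correctness:
        unmet.append("correctness average below threshold")
    if run.regressions != 0:
        unmet.append("regressions present")
    if run.safety_findings != 0:
        unmet.append("safety findings present")
    return unmet
```

Returning the list of unmet conditions, rather than a single boolean, makes it easier to see which gate blocked a run; a caller can treat an empty list as "all stop conditions met".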
Cost
Type: Free
Distribution
- ClawHub: skill-evaluation
License
- MIT-0