Skill

skill-evaluation

Evaluate any AI skill's quality through step-by-step diagnosis — measuring trigger accuracy, per-step execution (completion/correctness/quality), efficiency,...

Verified: 2026-05-15 (clawhub-ingest-2026-05-15+enrich-capability-skill)

When to use skill-evaluation

Choose if

You're shipping an AI skill (OpenClaw, Claude Code, ClawHub, similar) and want a structured evaluation harness measuring trigger accuracy, per-step execution, efficiency, and safety, with reproducible artifacts (plan.md, trigger-results.json, cases.json, execution-results.json, report.md). Best for production-readiness gating or comparing skill versions.
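
As a rough illustration of what "reproducible artifacts" could mean in practice, the sketch below checks that a run directory contains the five files named above. The flat directory layout and the function name `missing_artifacts` are assumptions made for this example; the harness's real folder structure is not described here.

```python
from pathlib import Path

# Artifact names as listed in the description above.
EXPECTED_ARTIFACTS = (
    "plan.md",
    "trigger-results.json",
    "cases.json",
    "execution-results.json",
    "report.md",
)


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir.

    Assumes (hypothetically) that all artifacts sit directly in run_dir.
    """
    root = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list would mean the run produced every artifact; anything else names the files still to be written.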

Avoid if

You only need a smoke test or a one-off invocation check. The harness writes a versioned directory per run and enforces ceremony (expected results recorded before actual results, three independent scores per step). Also avoid it for skills whose value lies in open-ended generation rather than measurable correctness.

Risk Flags

  • MEDIUM (scope): The README enforces strict ordering: expected results written before execution, a new version folder per run, three independent scores per step, and zero tolerance for regressions. Agents that batch-iterate or shortcut these gates will silently invalidate their own scorecard.
  • LOW (scope): The README defines the stop conditions as zero bad cases, an average correctness score of at least 1.8 out of 2, no regressions, and zero safety findings. Workloads that cannot meet these thresholds will not pass the harness; projects with a lower bar need a different tool.
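
The stop conditions in the second flag can be sketched as a simple gate. This is a minimal illustration, not the harness's own code: the `RunSummary` class, its field names, and the choice to treat a run with no correctness scores as failing are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run (field names are assumed)."""
    bad_cases: int
    regressions: int
    safety_findings: int
    # Per-step correctness scores on the 0..2 scale.
    correctness_scores: list = field(default_factory=list)


def unmet_stop_conditions(summary, min_correctness_avg=1.8):
    """Return a list of stop conditions the run has not yet satisfied."""
    unmet = []
    if summary.bad_cases != 0:
        unmet.append("bad cases remain")
    scores = summary.correctness_scores
    # Assumption: a run with no scores cannot demonstrate correctness.
    avg = sum(scores) / len(scores) if scores else 0.0
    if avg < min_correctness_avg:
        unmet.append("correctness average below threshold")
    if summary.regressions != 0:
        unmet.append("regressions present")
    if summary.safety_findings != 0:
        unmet.append("safety findings present")
    return unmet
```

A run would be considered done only when the returned list is empty, for example `unmet_stop_conditions(RunSummary(0, 0, 0, [2, 2, 2, 1]))` reports the correctness average (1.75) as the one remaining blocker.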

Cost

Type: Free

Distribution

ClawHub: skill-evaluation

License

MIT-0