Skill

skill-evaluation

Evaluate any AI skill's quality through step-by-step diagnosis — measuring trigger accuracy, per-step execution (completion/correctness/quality), efficiency,...

Verified: 2026-05-15 (clawhub-ingest-2026-05-15+enrich-capability-skill)

When to use skill-evaluation

Choose if

You're shipping an AI skill (OpenClaw, Claude Code, ClawHub, similar) and want a structured evaluation harness measuring trigger accuracy, per-step execution, efficiency, and safety, with reproducible artifacts (plan.md, trigger-results.json, cases.json, execution-results.json, report.md). Best for production-readiness gating or comparing skill versions.
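
As a rough illustration of the reproducible artifacts listed above, the sketch below reports which of the five files are absent from a run directory. The file names come from the description; the helper name `missing_artifacts` and the assumption that all artifacts sit at the top level of one run folder are hypothetical and not taken from the skill's README.

```python
from pathlib import Path

# Artifact names as listed in the description above.
EXPECTED_ARTIFACTS = [
    "plan.md",
    "trigger-results.json",
    "cases.json",
    "execution-results.json",
    "report.md",
]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are not files in run_dir.

    Assumption: artifacts live directly inside the run directory.
    The real harness may lay them out differently.
    """
    root = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list would mean every expected file is present; anything else names what a run still has to produce before its results can be compared with another version.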

Avoid if

You only need a smoke test or one-off invocation check. The harness writes a versioned directory per run and enforces ceremony (expected results recorded before actual results, three independent scores per step). Also avoid it for skills whose value is open-ended generation rather than measurable correctness.

Risk Flags

MEDIUM scope: The README enforces strict ordering: expected results before execution, a new version folder per run, three independent scores per step, and zero tolerance for regressions. Agents that batch-iterate or shortcut these gates will silently invalidate their own scorecard.
LOW scope: The README defines the stop conditions as zero bad cases, a correctness average ≥ 1.8/2, no regressions, and zero safety findings. Workloads that can't meet these thresholds won't pass the harness; lower-bar projects need a different tool.
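
The stop conditions above can be sketched as a single gate check. This is a minimal illustration, not the harness's own code: the `StepScore` type, the `meets_stop_conditions` function, and the choice to pass bad cases, regressions, and safety findings as plain counts are all assumptions, and the 0 to 2 scoring scale is inferred from the "1.8/2" threshold.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepScore:
    # Three independent scores per step (hypothetical field names).
    # The 0-2 scale is inferred from the "1.8/2" threshold.
    completion: int
    correctness: int
    quality: int


def meets_stop_conditions(steps, bad_cases, regressions, safety_findings,
                          min_correctness=1.8):
    """Return True only when all four stop conditions hold."""
    if not steps:
        # No scored steps means there is nothing to graduate.
        return False
    correctness_avg = mean(s.correctness for s in steps)
    return (
        bad_cases == 0
        and correctness_avg >= min_correctness
        and regressions == 0
        and safety_findings == 0
    )
```

Keeping the four conditions in one conjunction mirrors the all-or-nothing wording of the thresholds: a single safety finding or regression fails the run regardless of how high the correctness average is.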

Cost

Type: Free

Distribution

ClawHub: skill-evaluation
License: MIT-0

Use skill-evaluation from your agent

claude mcp add auxiliar -- npx auxiliar-mcp
# Then in your agent:
get_capability(id="clawhub-skill-evaluation")