Skill
Ai Agent Evaluator
AI-powered agent evaluation and benchmarking assistant — design evaluation suites, run structured assessments (task completion rate, latency, safety, reasoni...
When to use Ai Agent Evaluator
Choose if
You're standing up an evaluation discipline for an AI agent and want methodology guidance: how to design eval suites, which benchmarks map to which use case, how to read failure modes from logs, and how to plan red-team adversarial tests. Bilingual (English / Chinese). Pair it with an execution platform (DeepEval, PromptFoo, Braintrust, LangSmith); the skill references these but does not embed them.
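To make the metrics named above concrete, here is a minimal Python sketch of a structured assessment that reports task completion rate and mean latency. It is not part of the skill, which provides methodology only, and it does not use any of the listed platforms. The names `EvalCase`, `run_suite`, and the toy agent are hypothetical, and exact string match is an assumed scoring rule chosen for brevity.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical test case: an input prompt and the expected answer."""
    prompt: str
    expected: str


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Run every case through the agent and summarize two metrics.

    Scoring is exact string match after stripping whitespace, which is an
    assumption for illustration; real suites often need fuzzier graders.
    """
    if not cases:
        raise ValueError("cases must not be empty")

    passed = 0
    latencies = []
    for case in cases:
        start = time.perf_counter()
        output = agent(case.prompt)
        latencies.append(time.perf_counter() - start)
        if output.strip() == case.expected:
            passed += 1

    return {
        "task_completion_rate": passed / len(cases),
        "mean_latency_s": mean(latencies),
    }


if __name__ == "__main__":
    # Toy stand-in for an agent: it just upper-cases the prompt.
    def toy_agent(prompt: str) -> str:
        return prompt.upper()

    suite = [EvalCase("hi", "HI"), EvalCase("a", "b")]
    report = run_suite(toy_agent, suite)
    print(report["task_completion_rate"])  # 0.5: one of the two cases matches
```

In practice an execution platform would replace `run_suite`; the sketch only shows the shape of the data that an evaluation methodology has to define up front (cases, a scoring rule, and the metrics to report).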
Avoid if
You want a runnable eval harness rather than methodology: SKILL.md states that the skill provides "evaluation methodology and guidance, not direct code execution". Also avoid relying on it alone for production safety sign-off: the skill notes that safety evaluations require human security team involvement and that results must be "reviewed by qualified ML engineers before deployment decisions".
Risk Flags
- LOW (scope): Methodology-only skill. SKILL.md states it provides "evaluation methodology and guidance, not direct code execution", so agents needing an actual eval runner must use DeepEval, PromptFoo, Braintrust, LangSmith, or equivalents.
- LOW (data_quality): SKILL.md notes benchmark scores are "time-sensitive" and recommends "always check latest published leaderboards"; safety evaluations require human security team involvement, and results must be reviewed by qualified ML engineers before deployment decisions.
Cost
Type: Unknown
Distribution
- ClawHub: ai-agent-evaluator
License
- MIT-0