Files
everything-claude-code-zh/skills/eval-harness/SKILL.md

228 lines
5.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
name: eval-harness
description: 为 Claude Code 会话提供的正式评测框架实现了评测驱动开发Eval-Driven DevelopmentEDD原则
tools: Read, Write, Edit, Bash, Grep, Glob
---
# 评测套件技能Eval Harness Skill
一个为 Claude Code 会话提供的正式评测框架实现了评测驱动开发Eval-Driven DevelopmentEDD原则。
## 核心理念Philosophy
评测驱动开发EDD将评测Evals视为“AI 开发的单元测试”:
- 在实现代码之“前”定义预期行为
- 在开发过程中持续运行评测
- 跟踪每次变更带来的回归Regressions
- 使用 pass@k 指标来衡量可靠性
## 评测类型
### 能力评测Capability Evals
测试 Claude 是否能够完成之前无法完成的任务:
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```
### 回归评测Regression Evals
确保变更不会破坏现有功能:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
## 评分器Grader类型
### 1. 基于代码的评分器Code-Based Grader
使用代码进行确定性检查:
```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
### 2. 基于模型的评分器Model-Based Grader
使用 Claude 评估开放式输出:
```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```
### 3. 人工评分器Human Grader
标记以供人工审查:
```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```
## 指标Metrics
### pass@k
“k 次尝试中至少成功一次”
- pass@1:首次尝试成功率
- pass@33 次尝试内成功
- 典型目标pass@3 > 90%
### pass^k
“k 次试验全部成功”
- 更高的可靠性门槛
- pass^3连续 3 次成功
- 用于关键路径Critical Paths
## 评测工作流
### 1. 定义(编码前)
```markdown
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```
### 2. 实现
编写代码以通过定义的评测。
### 3. 评估
```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
```
### 4. 报告
```markdown
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
```
## 集成模式
### 实现前Pre-Implementation
```
/eval define feature-name
```
`.claude/evals/feature-name.md` 创建评测定义文件。
### 实现中During Implementation
```
/eval check feature-name
```
运行当前评测并报告状态。
### 实现后Post-Implementation
```
/eval report feature-name
```
生成完整的评测报告。
## 评测存储
在项目中存储评测:
```
.claude/
evals/
feature-xyz.md # 评测定义
feature-xyz.log # 评测运行历史
baseline.json # 回归基线
```
## 最佳实践
1. **在编码之“前”定义评测** —— 强制对成功准则进行清晰思考。
2. **频繁运行评测** —— 尽早发现回归。
3. **随着时间推移跟踪 pass@k** —— 监控可靠性趋势。
4. **尽可能使用代码评分器** —— 确定性Deterministic优于概率性Probabilistic
5. **安全相关的由人工审查** —— 永远不要完全自动化安全检查。
6. **保持评测快速** —— 缓慢的评测往往不会被运行。
7. **将评测与代码一同进行版本控制** —— 评测是一等公民产物First-class Artifacts
## 示例:添加身份验证
```markdown
## EVAL: add-authentication
### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session
Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible
### Phase 2: Implement (varies)
[Write code]
### Phase 3: Evaluate
Run: /eval check add-authentication
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```