mirror of
https://github.com/sweetwisdom/everything-claude-code-zh.git
synced 2026-03-21 22:10:09 +00:00
docs: 完成所有文档的中文翻译并应用到项目
This commit is contained in:
@@ -1,25 +1,25 @@
|
||||
---
|
||||
name: eval-harness
|
||||
description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
|
||||
description: 为 Claude Code 会话提供的正式评测框架,实现了评测驱动开发(Eval-Driven Development,EDD)原则
|
||||
tools: Read, Write, Edit, Bash, Grep, Glob
|
||||
---
|
||||
|
||||
# Eval Harness Skill
|
||||
# 评测套件技能(Eval Harness Skill)
|
||||
|
||||
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
|
||||
一个为 Claude Code 会话提供的正式评测框架,实现了评测驱动开发(Eval-Driven Development,EDD)原则。
|
||||
|
||||
## Philosophy
|
||||
## 核心理念(Philosophy)
|
||||
|
||||
Eval-Driven Development treats evals as the "unit tests of AI development":
|
||||
- Define expected behavior BEFORE implementation
|
||||
- Run evals continuously during development
|
||||
- Track regressions with each change
|
||||
- Use pass@k metrics for reliability measurement
|
||||
评测驱动开发(EDD)将评测(Evals)视为“AI 开发的单元测试”:
|
||||
- 在实现代码之“前”定义预期行为
|
||||
- 在开发过程中持续运行评测
|
||||
- 跟踪每次变更带来的回归(Regressions)
|
||||
- 使用 pass@k 指标来衡量可靠性
|
||||
|
||||
## Eval Types
|
||||
## 评测类型
|
||||
|
||||
### Capability Evals
|
||||
Test if Claude can do something it couldn't before:
|
||||
### 能力评测(Capability Evals)
|
||||
测试 Claude 是否能够完成之前无法完成的任务:
|
||||
```markdown
|
||||
[CAPABILITY EVAL: feature-name]
|
||||
Task: Description of what Claude should accomplish
|
||||
@@ -30,8 +30,8 @@ Success Criteria:
|
||||
Expected Output: Description of expected result
|
||||
```
|
||||
|
||||
### Regression Evals
|
||||
Ensure changes don't break existing functionality:
|
||||
### 回归评测(Regression Evals)
|
||||
确保变更不会破坏现有功能:
|
||||
```markdown
|
||||
[REGRESSION EVAL: feature-name]
|
||||
Baseline: SHA or checkpoint name
|
||||
@@ -42,10 +42,10 @@ Tests:
|
||||
Result: X/Y passed (previously Y/Y)
|
||||
```
|
||||
|
||||
## Grader Types
|
||||
## 评分器(Grader)类型
|
||||
|
||||
### 1. Code-Based Grader
|
||||
Deterministic checks using code:
|
||||
### 1. 基于代码的评分器(Code-Based Grader)
|
||||
使用代码进行确定性检查:
|
||||
```bash
|
||||
# Check if file contains expected pattern
|
||||
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
|
||||
@@ -57,8 +57,8 @@ npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
|
||||
npm run build && echo "PASS" || echo "FAIL"
|
||||
```
|
||||
|
||||
### 2. Model-Based Grader
|
||||
Use Claude to evaluate open-ended outputs:
|
||||
### 2. 基于模型的评分器(Model-Based Grader)
|
||||
使用 Claude 评估开放式输出:
|
||||
```markdown
|
||||
[MODEL GRADER PROMPT]
|
||||
Evaluate the following code change:
|
||||
@@ -71,8 +71,8 @@ Score: 1-5 (1=poor, 5=excellent)
|
||||
Reasoning: [explanation]
|
||||
```
|
||||
|
||||
### 3. Human Grader
|
||||
Flag for manual review:
|
||||
### 3. 人工评分器(Human Grader)
|
||||
标记以供人工审查:
|
||||
```markdown
|
||||
[HUMAN REVIEW REQUIRED]
|
||||
Change: Description of what changed
|
||||
@@ -80,23 +80,23 @@ Reason: Why human review is needed
|
||||
Risk Level: LOW/MEDIUM/HIGH
|
||||
```
|
||||
|
||||
## Metrics
|
||||
## 指标(Metrics)
|
||||
|
||||
### pass@k
|
||||
"At least one success in k attempts"
|
||||
- pass@1: First attempt success rate
|
||||
- pass@3: Success within 3 attempts
|
||||
- Typical target: pass@3 > 90%
|
||||
“k 次尝试中至少成功一次”
|
||||
- pass@1:首次尝试成功率
|
||||
- pass@3:3 次尝试内成功
|
||||
- 典型目标:pass@3 > 90%
|
||||
|
||||
### pass^k
|
||||
"All k trials succeed"
|
||||
- Higher bar for reliability
|
||||
- pass^3: 3 consecutive successes
|
||||
- Use for critical paths
|
||||
“k 次试验全部成功”
|
||||
- 更高的可靠性门槛
|
||||
- pass^3:连续 3 次成功
|
||||
- 用于关键路径(Critical Paths)
|
||||
|
||||
## Eval Workflow
|
||||
## 评测工作流
|
||||
|
||||
### 1. Define (Before Coding)
|
||||
### 1. 定义(编码前)
|
||||
```markdown
|
||||
## EVAL DEFINITION: feature-xyz
|
||||
|
||||
@@ -115,10 +115,10 @@ Risk Level: LOW/MEDIUM/HIGH
|
||||
- pass^3 = 100% for regression evals
|
||||
```
|
||||
|
||||
### 2. Implement
|
||||
Write code to pass the defined evals.
|
||||
### 2. 实现
|
||||
编写代码以通过定义的评测。
|
||||
|
||||
### 3. Evaluate
|
||||
### 3. 评估
|
||||
```bash
|
||||
# Run capability evals
|
||||
[Run each capability eval, record PASS/FAIL]
|
||||
@@ -129,7 +129,7 @@ npm test -- --testPathPattern="existing"
|
||||
# Generate report
|
||||
```
|
||||
|
||||
### 4. Report
|
||||
### 4. 报告
|
||||
```markdown
|
||||
EVAL REPORT: feature-xyz
|
||||
========================
|
||||
@@ -153,48 +153,48 @@ Metrics:
|
||||
Status: READY FOR REVIEW
|
||||
```
|
||||
|
||||
## Integration Patterns
|
||||
## 集成模式
|
||||
|
||||
### Pre-Implementation
|
||||
### 实现前(Pre-Implementation)
|
||||
```
|
||||
/eval define feature-name
|
||||
```
|
||||
Creates eval definition file at `.claude/evals/feature-name.md`
|
||||
在 `.claude/evals/feature-name.md` 创建评测定义文件。
|
||||
|
||||
### During Implementation
|
||||
### 实现中(During Implementation)
|
||||
```
|
||||
/eval check feature-name
|
||||
```
|
||||
Runs current evals and reports status
|
||||
运行当前评测并报告状态。
|
||||
|
||||
### Post-Implementation
|
||||
### 实现后(Post-Implementation)
|
||||
```
|
||||
/eval report feature-name
|
||||
```
|
||||
Generates full eval report
|
||||
生成完整的评测报告。
|
||||
|
||||
## Eval Storage
|
||||
## 评测存储
|
||||
|
||||
Store evals in project:
|
||||
在项目中存储评测:
|
||||
```
|
||||
.claude/
|
||||
evals/
|
||||
feature-xyz.md # Eval definition
|
||||
feature-xyz.log # Eval run history
|
||||
baseline.json # Regression baselines
|
||||
feature-xyz.md # 评测定义
|
||||
feature-xyz.log # 评测运行历史
|
||||
baseline.json # 回归基线
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
## 最佳实践
|
||||
|
||||
1. **Define evals BEFORE coding** - Forces clear thinking about success criteria
|
||||
2. **Run evals frequently** - Catch regressions early
|
||||
3. **Track pass@k over time** - Monitor reliability trends
|
||||
4. **Use code graders when possible** - Deterministic > probabilistic
|
||||
5. **Human review for security** - Never fully automate security checks
|
||||
6. **Keep evals fast** - Slow evals don't get run
|
||||
7. **Version evals with code** - Evals are first-class artifacts
|
||||
1. **在编码之“前”定义评测** —— 强制对成功准则进行清晰思考。
|
||||
2. **频繁运行评测** —— 尽早发现回归。
|
||||
3. **随着时间推移跟踪 pass@k** —— 监控可靠性趋势。
|
||||
4. **尽可能使用代码评分器** —— 确定性(Deterministic)优于概率性(Probabilistic)。
|
||||
5. **安全相关的由人工审查** —— 永远不要完全自动化安全检查。
|
||||
6. **保持评测快速** —— 缓慢的评测往往不会被运行。
|
||||
7. **将评测与代码一同进行版本控制** —— 评测是一等公民产物(First-class Artifacts)。
|
||||
|
||||
## Example: Adding Authentication
|
||||
## 示例:添加身份验证
|
||||
|
||||
```markdown
|
||||
## EVAL: add-authentication
|
||||
|
||||
Reference in New Issue
Block a user