docs: 完成所有文档的中文翻译并应用到项目

2026-05-19 15:12:48 +00:00 · 2026-01-28 00:12:54 +08:00
parent 0ced59a26b
commit e133f58e1c
76 changed files with 6808 additions and 6170 deletions
--- a/skills/eval-harness/SKILL.md
+++ b/skills/eval-harness/SKILL.md
@@ -1,25 +1,25 @@
 ---
 name: eval-harness
-description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
+description: 为 Claude Code 会话提供的正式评测框架，实现了评测驱动开发（Eval-Driven Development，EDD）原则
 tools: Read, Write, Edit, Bash, Grep, Glob
 ---

-# Eval Harness Skill
+# 评测套件技能（Eval Harness Skill）

-A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
+一个为 Claude Code 会话提供的正式评测框架，实现了评测驱动开发（Eval-Driven Development，EDD）原则。

-## Philosophy
+## 核心理念（Philosophy）

-Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
+评测驱动开发（EDD）将评测（Evals）视为“AI 开发的单元测试”：
+- 在实现代码之“前”定义预期行为
+- 在开发过程中持续运行评测
+- 跟踪每次变更带来的回归（Regressions）
+- 使用 pass@k 指标来衡量可靠性

-## Eval Types
+## 评测类型

-### Capability Evals
-Test if Claude can do something it couldn't before:
+### 能力评测（Capability Evals）
+测试 Claude 是否能够完成之前无法完成的任务：
 ```markdown
 [CAPABILITY EVAL: feature-name]
 Task: Description of what Claude should accomplish
@@ -30,8 +30,8 @@ Success Criteria:
 Expected Output: Description of expected result
 ```

-### Regression Evals
-Ensure changes don't break existing functionality:
+### 回归评测（Regression Evals）
+确保变更不会破坏现有功能：
 ```markdown
 [REGRESSION EVAL: feature-name]
 Baseline: SHA or checkpoint name
@@ -42,10 +42,10 @@ Tests:
 Result: X/Y passed (previously Y/Y)
 ```

-## Grader Types
+## 评分器（Grader）类型

-### 1. Code-Based Grader
-Deterministic checks using code:
+### 1. 基于代码的评分器（Code-Based Grader）
+使用代码进行确定性检查：
 ```bash
 # Check if file contains expected pattern
 grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
@@ -57,8 +57,8 @@ npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
 npm run build && echo "PASS" || echo "FAIL"
 ```

-### 2. Model-Based Grader
-Use Claude to evaluate open-ended outputs:
+### 2. 基于模型的评分器（Model-Based Grader）
+使用 Claude 评估开放式输出：
 ```markdown
 [MODEL GRADER PROMPT]
 Evaluate the following code change:
@@ -71,8 +71,8 @@ Score: 1-5 (1=poor, 5=excellent)
 Reasoning: [explanation]
 ```

-### 3. Human Grader
-Flag for manual review:
+### 3. 人工评分器（Human Grader）
+标记以供人工审查：
 ```markdown
 [HUMAN REVIEW REQUIRED]
 Change: Description of what changed
@@ -80,23 +80,23 @@ Reason: Why human review is needed
 Risk Level: LOW/MEDIUM/HIGH
 ```

-## Metrics
+## 指标（Metrics）

 ### pass@k
-"At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
+“k 次尝试中至少成功一次”
+- pass@1：首次尝试成功率
+- pass@3：3 次尝试内成功
+- 典型目标：pass@3 > 90%

 ### pass^k
-"All k trials succeed"
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
+“k 次试验全部成功”
+- 更高的可靠性门槛
+- pass^3：连续 3 次成功
+- 用于关键路径（Critical Paths）

-## Eval Workflow
+## 评测工作流

-### 1. Define (Before Coding)
+### 1. 定义（编码前）
 ```markdown
 ## EVAL DEFINITION: feature-xyz

@@ -115,10 +115,10 @@ Risk Level: LOW/MEDIUM/HIGH
 - pass^3 = 100% for regression evals
 ```

-### 2. Implement
-Write code to pass the defined evals.
+### 2. 实现
+编写代码以通过定义的评测。

-### 3. Evaluate
+### 3. 评估
 ```bash
 # Run capability evals
 [Run each capability eval, record PASS/FAIL]
@@ -129,7 +129,7 @@ npm test -- --testPathPattern="existing"
 # Generate report
 ```

-### 4. Report
+### 4. 报告
 ```markdown
 EVAL REPORT: feature-xyz
 ========================
@@ -153,48 +153,48 @@ Metrics:
 Status: READY FOR REVIEW
 ```

-## Integration Patterns
+## 集成模式

-### Pre-Implementation
+### 实现前（Pre-Implementation）
 ```
 /eval define feature-name
 ```
-Creates eval definition file at `.claude/evals/feature-name.md`
+在 `.claude/evals/feature-name.md` 创建评测定义文件。

-### During Implementation
+### 实现中（During Implementation）
 ```
 /eval check feature-name
 ```
-Runs current evals and reports status
+运行当前评测并报告状态。

-### Post-Implementation
+### 实现后（Post-Implementation）
 ```
 /eval report feature-name
 ```
-Generates full eval report
+生成完整的评测报告。

-## Eval Storage
+## 评测存储

-Store evals in project:
+在项目中存储评测：
 ```
 .claude/
  evals/
-    feature-xyz.md      # Eval definition
-    feature-xyz.log     # Eval run history
-    baseline.json       # Regression baselines
+    feature-xyz.md      # 评测定义
+    feature-xyz.log     # 评测运行历史
+    baseline.json       # 回归基线
 ```

-## Best Practices
+## 最佳实践

-1. **Define evals BEFORE coding** - Forces clear thinking about success criteria
-2. **Run evals frequently** - Catch regressions early
-3. **Track pass@k over time** - Monitor reliability trends
-4. **Use code graders when possible** - Deterministic > probabilistic
-5. **Human review for security** - Never fully automate security checks
-6. **Keep evals fast** - Slow evals don't get run
-7. **Version evals with code** - Evals are first-class artifacts
+1. **在编码之“前”定义评测** —— 强制对成功准则进行清晰思考。
+2. **频繁运行评测** —— 尽早发现回归。
+3. **随着时间推移跟踪 pass@k** —— 监控可靠性趋势。
+4. **尽可能使用代码评分器** —— 确定性（Deterministic）优于概率性（Probabilistic）。
+5. **安全相关的由人工审查** —— 永远不要完全自动化安全检查。
+6. **保持评测快速** —— 缓慢的评测往往不会被运行。
+7. **将评测与代码一同进行版本控制** —— 评测是一等公民产物（First-class Artifacts）。

-## Example: Adding Authentication
+## 示例：添加身份验证

 ```markdown
 ## EVAL: add-authentication