docs: complete the Chinese translation of all documentation and apply it to the project

2026-03-22 06:20:10 +00:00 · 2026-01-28 00:12:54 +08:00
parent 0ced59a26b
commit e133f58e1c
76 changed files with 6808 additions and 6170 deletions
--- a/skills/clickhouse-io/SKILL.md
+++ b/skills/clickhouse-io/SKILL.md
@@ -1,26 +1,26 @@
 ---
 name: clickhouse-io
-description: ClickHouse database patterns, query optimization, analytics, and data engineering best practices for high-performance analytical workloads.
+description: ClickHouse 数据库模式、查询优化、分析以及针对高性能分析工作负载的数据工程最佳实践。
 ---

-# ClickHouse Analytics Patterns
+# ClickHouse 分析模式

-ClickHouse-specific patterns for high-performance analytics and data engineering.
+针对高性能分析和数据工程的 ClickHouse 特定模式。

-## Overview
+## 概览

-ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It's optimized for fast analytical queries on large datasets.
+ClickHouse 是一款用于联机分析处理（OLAP）的列式数据库管理系统（DBMS）。它针对大规模数据集上的快速分析查询进行了优化。

-**Key Features:**
- Column-oriented storage
- Data compression
- Parallel query execution
- Distributed queries
- Real-time analytics
+**核心特性：**
+- 列式存储
+- 数据压缩
+- 并行查询执行
+- 分布式查询
+- 实时分析

-## Table Design Patterns
+## 表设计模式

-### MergeTree Engine (Most Common)
+### MergeTree 引擎（最常用）

 ```sql
 CREATE TABLE markets_analytics (
@@ -38,10 +38,10 @@ ORDER BY (date, market_id)
 SETTINGS index_granularity = 8192;
 ```

-### ReplacingMergeTree (Deduplication)
+### ReplacingMergeTree（去重）

 ```sql
-- For data that may have duplicates (e.g., from multiple sources)
+-- 针对可能存在重复的数据（例如来自多个源）
 CREATE TABLE user_events (
    event_id String,
    user_id String,
@@ -54,10 +54,10 @@ ORDER BY (user_id, event_id, timestamp)
 PRIMARY KEY (user_id, event_id);
 ```

-### AggregatingMergeTree (Pre-aggregation)
+### AggregatingMergeTree（预聚合）

 ```sql
-- For maintaining aggregated metrics
+-- 用于维护聚合指标
 CREATE TABLE market_stats_hourly (
    hour DateTime,
    market_id String,
@@ -68,7 +68,7 @@ CREATE TABLE market_stats_hourly (
 PARTITION BY toYYYYMM(hour)
 ORDER BY (hour, market_id);

-- Query aggregated data
+-- 查询聚合数据
 SELECT
    hour,
    market_id,
@@ -81,12 +81,12 @@ GROUP BY hour, market_id
 ORDER BY hour DESC;
 ```

-## Query Optimization Patterns
+## 查询优化模式

-### Efficient Filtering
+### 高效过滤

 ```sql
-- ✅ GOOD: Use indexed columns first
+-- ✅ 推荐：优先使用索引列
 SELECT *
 FROM markets_analytics
 WHERE date >= '2025-01-01'
@@ -95,7 +95,7 @@ WHERE date >= '2025-01-01'
 ORDER BY date DESC
 LIMIT 100;

-- ❌ BAD: Filter on non-indexed columns first
+-- ❌ 不推荐：优先过滤非索引列
 SELECT *
 FROM markets_analytics
 WHERE volume > 1000
@@ -103,10 +103,10 @@ WHERE volume > 1000
  AND date >= '2025-01-01';
 ```

-### Aggregations
+### 聚合

 ```sql
-- ✅ GOOD: Use ClickHouse-specific aggregation functions
+-- ✅ 推荐：使用 ClickHouse 特有的聚合函数
 SELECT
    toStartOfDay(created_at) AS day,
    market_id,
@@ -119,7 +119,7 @@ WHERE created_at >= today() - INTERVAL 7 DAY
 GROUP BY day, market_id
 ORDER BY day DESC, total_volume DESC;

-- ✅ Use quantile for percentiles (more efficient than percentile)
+-- ✅ 使用 quantile 计算分位数（比 percentile 更高效）
 SELECT
    quantile(0.50)(trade_size) AS median,
    quantile(0.95)(trade_size) AS p95,
@@ -128,10 +128,10 @@ FROM trades
 WHERE created_at >= now() - INTERVAL 1 HOUR;
 ```

-### Window Functions
+### 窗口函数

 ```sql
-- Calculate running totals
+-- 计算累计总量
 SELECT
    date,
    market_id,
@@ -146,9 +146,9 @@ WHERE date >= today() - INTERVAL 30 DAY
 ORDER BY market_id, date;
 ```

-## Data Insertion Patterns
+## 数据插入模式

-### Bulk Insert (Recommended)
+### 批量插入（推荐）

 ```typescript
 import { ClickHouse } from 'clickhouse'
@@ -162,7 +162,7 @@ const clickhouse = new ClickHouse({
  }
 })

-// ✅ Batch insert (efficient)
+// ✅ 批量插入（高效）
 async function bulkInsertTrades(trades: Trade[]) {
  const values = trades.map(trade => `(
    '${trade.id}',
@@ -178,19 +178,19 @@ async function bulkInsertTrades(trades: Trade[]) {
  `).toPromise()
 }

-// ❌ Individual inserts (slow)
+// ❌ 逐条插入（缓慢）
 async function insertTrade(trade: Trade) {
-  // Don't do this in a loop!
+  // 不要循环执行此操作！
  await clickhouse.query(`
    INSERT INTO trades VALUES ('${trade.id}', ...)
  `).toPromise()
 }
 ```

-### Streaming Insert
+### 流式插入

 ```typescript
-// For continuous data ingestion
+// 用于持续的数据摄取
 import { createWriteStream } from 'fs'
 import { pipeline } from 'stream/promises'

@@ -205,12 +205,12 @@ async function streamInserts() {
 }
 ```

-## Materialized Views
+## 物化视图（Materialized Views）

-### Real-time Aggregations
+### 实时聚合

 ```sql
-- Create materialized view for hourly stats
+-- 为每小时统计创建物化视图
 CREATE MATERIALIZED VIEW market_stats_hourly_mv
 TO market_stats_hourly
 AS SELECT
@@ -222,7 +222,7 @@ AS SELECT
 FROM trades
 GROUP BY hour, market_id;

-- Query the materialized view
+-- 查询物化视图
 SELECT
    hour,
    market_id,
@@ -234,12 +234,12 @@ WHERE hour >= now() - INTERVAL 24 HOUR
 GROUP BY hour, market_id;
 ```

-## Performance Monitoring
+## 性能监控

-### Query Performance
+### 查询性能

 ```sql
-- Check slow queries
+-- 检查慢查询
 SELECT
    query_id,
    user,
@@ -256,10 +256,10 @@ ORDER BY query_duration_ms DESC
 LIMIT 10;
 ```

-### Table Statistics
+### 表统计信息

 ```sql
-- Check table sizes
+-- 检查表大小
 SELECT
    database,
    table,
@@ -272,12 +272,12 @@ GROUP BY database, table
 ORDER BY sum(bytes) DESC;
 ```

-## Common Analytics Queries
+## 常用分析查询

-### Time Series Analysis
+### 时间序列分析

 ```sql
-- Daily active users
+-- 日活跃用户数
 SELECT
    toDate(timestamp) AS date,
    uniq(user_id) AS daily_active_users
@@ -286,7 +286,7 @@ WHERE timestamp >= today() - INTERVAL 30 DAY
 GROUP BY date
 ORDER BY date;

-- Retention analysis
+-- 留存分析
 SELECT
    signup_date,
    countIf(days_since_signup = 0) AS day_0,
@@ -306,10 +306,10 @@ GROUP BY signup_date
 ORDER BY signup_date DESC;
 ```

-### Funnel Analysis
+### 漏斗分析

 ```sql
-- Conversion funnel
+-- 转化漏斗
 SELECT
    countIf(step = 'viewed_market') AS viewed,
    countIf(step = 'clicked_trade') AS clicked,
@@ -327,10 +327,10 @@ FROM (
 GROUP BY session_id;
 ```

-### Cohort Analysis
+### 队列分析（Cohort Analysis）

 ```sql
-- User cohorts by signup month
+-- 按注册月份划分的用户队列
 SELECT
    toStartOfMonth(signup_date) AS cohort,
    toStartOfMonth(activity_date) AS month,
@@ -347,17 +347,17 @@ GROUP BY cohort, month, months_since_signup
 ORDER BY cohort, months_since_signup;
 ```

-## Data Pipeline Patterns
+## 数据流水线（Data Pipeline）模式

-### ETL Pattern
+### ETL 模式

 ```typescript
-// Extract, Transform, Load
+// 抽取（Extract）、转换（Transform）、加载（Load）
 async function etlPipeline() {
-  // 1. Extract from source
+  // 1. 从源端抽取
  const rawData = await extractFromPostgres()

-  // 2. Transform
+  // 2. 转换
  const transformed = rawData.map(row => ({
    date: new Date(row.created_at).toISOString().split('T')[0],
    market_id: row.market_slug,
@@ -365,18 +365,18 @@ async function etlPipeline() {
    trades: parseInt(row.trade_count)
  }))

-  // 3. Load to ClickHouse
+  // 3. 加载到 ClickHouse
  await bulkInsertToClickHouse(transformed)
 }

-// Run periodically
-setInterval(etlPipeline, 60 * 60 * 1000)  // Every hour
+// 定期运行
+setInterval(etlPipeline, 60 * 60 * 1000)  // 每小时
 ```

-### Change Data Capture (CDC)
+### 变更数据捕获（CDC）

 ```typescript
-// Listen to PostgreSQL changes and sync to ClickHouse
+// 监听 PostgreSQL 变更并同步到 ClickHouse
 import { Client } from 'pg'

 const pgClient = new Client({ connectionString: process.env.DATABASE_URL })
@@ -397,33 +397,33 @@ pgClient.on('notification', async (msg) => {
 })
 ```

-## Best Practices
+## 最佳实践

-### 1. Partitioning Strategy
- Partition by time (usually month or day)
- Avoid too many partitions (performance impact)
- Use DATE type for partition key
+### 1. 分区策略
+- 按时间分区（通常是按月或按天）
+- 避免分区过多（会影响性能）
+- 分区键使用 DATE 类型

-### 2. Ordering Key
- Put most frequently filtered columns first
- Consider cardinality (high cardinality first)
- Order impacts compression
+### 2. 排序键（Ordering Key）
+- 将最常过滤的列放在前面
+- 考虑基数（高基数列放在前面）
+- 排序会影响压缩效果

-### 3. Data Types
- Use smallest appropriate type (UInt32 vs UInt64)
- Use LowCardinality for repeated strings
- Use Enum for categorical data
+### 3. 数据类型
+- 使用最合适的最小类型（如 UInt32 而非 UInt64）
+- 对重复字符串使用 LowCardinality
+- 对类别数据使用 Enum

-### 4. Avoid
- SELECT * (specify columns)
- FINAL (merge data before query instead)
- Too many JOINs (denormalize for analytics)
- Small frequent inserts (batch instead)
+### 4. 避免事项
+- SELECT *（应指定具体列）
+- FINAL（应改为在查询前合并数据）
+- 过多的 JOIN 操作（针对分析场景应进行反规范化）
+- 小额频繁插入（应改为批量插入）

-### 5. Monitoring
- Track query performance
- Monitor disk usage
- Check merge operations
- Review slow query log
+### 5. 监控
+- 追踪查询性能
+- 监控磁盘使用情况
+- 检查合并（merge）操作
+- 审查慢查询日志

-**Remember**: ClickHouse excels at analytical workloads. Design tables for your query patterns, batch inserts, and leverage materialized views for real-time aggregations.
+**记住**：ClickHouse 擅长处理分析型工作负载。请根据查询模式设计表结构，采用批量插入，并利用物化视图进行实时聚合。
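
The batch-insert advice in this skill can be sketched without splicing values into SQL text, which the `bulkInsertTrades` example in the diff does through template strings. The sketch below is only an illustration under assumptions: `TradeRow`, `chunk`, and `toJsonEachRowInsert` are hypothetical names that do not appear in the commit, and the row shape is invented because the real `Trade` type is not shown. It relies on ClickHouse's `JSONEachRow` input format (one JSON object per line) and leaves sending the statement to whichever client the project uses.

```typescript
// Hypothetical row shape; the real Trade type is not shown in the diff.
interface TradeRow {
  id: string
  market_id: string
  amount: number
}

// Split rows into fixed-size batches so each INSERT carries many rows.
function chunk<T>(rows: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size))
  }
  return batches
}

// Build an INSERT ... FORMAT JSONEachRow statement: one JSON object per line.
// JSON.stringify handles quoting of values; the table name is still
// interpolated, so it must come from trusted code, never from user input.
function toJsonEachRowInsert(table: string, rows: TradeRow[]): string {
  const body = rows.map(row => JSON.stringify(row)).join('\n')
  return `INSERT INTO ${table} FORMAT JSONEachRow\n${body}`
}

// Example data, including an id with a single quote that would break
// naive '${value}' interpolation.
const rows: TradeRow[] = [
  { id: 't1', market_id: 'm1', amount: 10 },
  { id: "t'2", market_id: 'm1', amount: 5 },
  { id: 't3', market_id: 'm2', amount: 7 },
]

// Two statements: the first carries two rows, the second carries one.
const statements = chunk(rows, 2).map(batch => toJsonEachRowInsert('trades', batch))
```

A batch size of 2 is used only to keep the example readable; batches for real ingestion would normally be far larger, in line with the skill's advice to avoid small, frequent inserts.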