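I. Chapter Purpose and Scope
This chapter fixes the pipeline's specification for fault tolerance, recovery, and disaster recovery (DR): failure semantics and compensation, retries and timeouts, idempotency and deduplication, checkpoints and snapshots, backup and replay, RTO/RPO and drills, cross-AZ and cross-region failover and failback, and exported artifacts and audit. It must stay consistent with the contract, scheduling, monitoring, and metrology chapters.
II. Terms and Dependencies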
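- Terms: idempotency, compensation, checkpoint, snapshot, rollback, failover/fallback, active-active|active-passive, RTO, RPO, quorum, split-brain, chaos test.
- Dependencies: contracts and export (Core.DataSpec v1.0); units and dimensions (Core.Metrology v1.0); orchestration, scheduling, and resources (Chapter 10); monitoring and alerting (Chapter 12); versioning and lineage (Chapter 11).
- Math and notation: inline symbols are written in backticks (e.g. RTO, RPO, QPS, T_inf, ρ); expressions containing division, integrals, or compound operators must be parenthesized; when the path quantity T_arr is involved, register gamma(ell) and d ell; Chinese is prohibited in formulas, symbols, and definitions.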
III. Fields and Structure (Normative)
fault_tolerance:
  semantics:
    on_fail: "retry|skip|quarantine|block"
    error_classes: ["retryable","non_retryable","escalate"]
  retry:
    policy: {max: 3, backoff: "expo", jitter_ms: 200}
    timeout_s: 1800
  idempotency:
    enabled: true
    dedupe_key: ["<pk>","<offset|ts>"]
    sink_mode: "idempotent-insert|upsert"
  compensation:
    enabled: true
    handlers:
      - {stage: "transform.normalize", action: "reverse_op", spec: "comp/normalize.reverse.yaml"}
      - {stage: "feature.map", action: "delete_artifact", spec: "comp/delete.manifest.yaml"}
recovery:
  checkpoint:
    mode: "exactly-once|at-least-once"
    store: "s3://.../chk/<stage>"
    cadence: "PT5M"
    contents: ["offset","cursor","watermark","sink_commit"]
  snapshot:
    enabled: true
    store: "s3://.../snap/<dataset>"
    cadence: "P1D"
    retention: "P30D"
  replay:
    enabled: true
    inputs_lock: "locks/inputs.manifest.json"
    policy: "strict|lenient"
  rollbacks:
    guardrail: {max_depth: 2, require_approval: true}
dr:
  strategy: "active-active|active-passive"
  topology:
    primary: {region: "eu-west-1", azs: ["a","b"], quorum: 3}
    standby: {region: "eu-central-1", azs: ["a","b"], quorum: 3}
  rto: "PT30M"
  rpo: "PT5M"
  failover:
    trigger: "manual|auto"
    health_checks: ["latency_ms.p99","error_rate","heartbeat"]
    dns_ttl_s: 60
  fallback:
    criteria: ["primary_healthy_24h","replication_lag<PT1M"]
testing:
  chaos:
    enabled: true
    experiments:
      - {name: "kill-worker", scope: "stage", percent: 10}
      - {name: "net-partition", scope: "cluster", duration_s: 300}
      - {name: "disk-throttle", scope: "node", mbps: 50}
  drills:
    schedule: "quarterly"
    playbooks: ["dr/runbook.md","rollback/runbook.md"]
    success_criteria: ["rto_met","rpo_met","no_data_loss","alerting_ok"]
backups:
  datasets: ["feat_rows","train_pkg"]
  cadence: "P1D"
  store: "s3://.../backup"
  encryption: "SSE-KMS"
  integrity: {hash: "sha256", manifest: "backup/manifest.json"}
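IV. Failure Semantics, Retries, and Idempotency
- Failure semantics: errors are classified by error_classes; retry applies only to retryable errors; skip|quarantine|block correspond to discarding, isolating, and halting, respectively.
- Retries and timeouts: exponential backoff with jitter and an explicit timeout_s; the per-attempt limit and the overall limit are distinguished and both recorded.
- Idempotency and deduplication: the source side uses dedupe_key and the sink side uses idempotent-insert|upsert; the idempotency hash and the output must be byte-for-byte consistent across repeated runs.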
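The interplay between retries and an idempotent sink can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the exception classes, `base_s`, `UpsertSink`, and the reading of `timeout_s` as the overall budget are all assumptions introduced for the example.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical stand-in for the 'retryable' error class."""


class NonRetryableError(Exception):
    """Hypothetical stand-in for the 'non_retryable' error class."""


def backoff_delays(max_retries=3, base_s=1.0, jitter_ms=200, rng=random.random):
    """Yield one delay per retry: base_s * 2**attempt plus up to jitter_ms of jitter."""
    for attempt in range(max_retries):
        yield base_s * (2 ** attempt) + rng() * jitter_ms / 1000.0


def run_with_retry(op, max_retries=3, base_s=1.0, jitter_ms=200,
                   timeout_s=1800, sleep=time.sleep, clock=time.monotonic):
    """Call op(); retry only RetryableError, within an overall time budget.

    Returns (result, attempts). Any other exception propagates immediately.
    """
    deadline = clock() + timeout_s  # assumption: timeout_s is the total limit
    delays = backoff_delays(max_retries, base_s, jitter_ms)
    attempts = 0
    while True:
        attempts += 1
        try:
            return op(), attempts
        except RetryableError:
            delay = next(delays, None)
            if delay is None or clock() + delay > deadline:
                raise  # retries or time budget exhausted
            sleep(delay)


class UpsertSink:
    """In-memory sink keyed by the dedupe key: re-delivery overwrites, never duplicates."""

    def __init__(self, dedupe_key):
        self.dedupe_key = list(dedupe_key)
        self.rows = {}

    def write(self, record):
        key = tuple(record[k] for k in self.dedupe_key)
        self.rows[key] = record
```

Because the sink is keyed by the dedupe key, an operation that writes and then fails can be retried without producing duplicate rows, which is what makes the retry policy safe to apply.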
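V. Compensation, Rollback, and Replay
- Compensation: define a reverse, delete, or rebuild action for every mutating operator; compensating transactions must be idempotent and serializable.
- Rollback: rollback depth is limited and approval is required; a rollback replays from the upstream checkpoint or snapshot and rebuilds the downstream artifacts.
- Replay: the strict policy requires byte-identical results; the lenient policy allows small deviations, subject to a defined tolerance band and an audit note.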
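The rollback guardrail and the two replay policies might be checked as in the sketch below. It assumes, purely for illustration, that `max_depth` is inclusive and that lenient comparison applies to outputs made of whitespace-separated numbers with a relative tolerance `rel_tol`; neither detail is specified in this chapter.

```python
import hashlib
import math


def rollback_allowed(depth, approved, max_depth=2, require_approval=True):
    """Guardrail: depth within the limit (assumed inclusive) and approval if required."""
    return depth <= max_depth and (approved or not require_approval)


def replay_matches(original: bytes, replayed: bytes,
                   policy: str = "strict", rel_tol: float = 1e-9) -> bool:
    """Compare a replayed output with the original under the given policy.

    strict:  the sha256 digests must be equal (byte-identical output).
    lenient: outputs are parsed as whitespace-separated numbers (an
             assumption of this sketch) and compared within rel_tol.
    """
    if policy == "strict":
        return hashlib.sha256(original).digest() == hashlib.sha256(replayed).digest()
    if policy != "lenient":
        raise ValueError(f"unknown replay policy: {policy!r}")
    a = original.decode().split()
    b = replayed.decode().split()
    return len(a) == len(b) and all(
        math.isclose(float(x), float(y), rel_tol=rel_tol) for x, y in zip(a, b)
    )
```

Under the lenient policy, the tolerance actually used would be recorded alongside the comparison result so the audit note can reference it.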
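VI. Checkpoints, Snapshots, and Backups
- Checkpoints: persist the offset, cursor, watermark, and sink commit flag; cadence controls the frequency; the store must support atomic writes.
- Snapshots: periodically freeze rebuildable datasets; manage the retention period and cleanup policy to keep costs from accumulating.
- Backups: distinguish logical from physical backups; verify sha256 against the reconciliation manifest; a verification failure is blocking.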
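Two of the requirements above, atomic checkpoint writes and blocking integrity checks, can be illustrated on a local filesystem. This is a sketch under stated assumptions: the real stores are object-store URIs, so the temp-file-plus-rename pattern here is only a local stand-in, and the `{path: sha256}` manifest shape is hypothetical.

```python
import hashlib
import json
import os
import tempfile


def write_checkpoint_atomic(path, state):
    """Write state as JSON so readers see either the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp, path)     # atomic swap within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def verify_backup(root, manifest):
    """Return the manifest paths that are missing or whose sha256 differs."""
    bad = []
    for rel_path, expected in manifest.items():
        try:
            with open(os.path.join(root, rel_path), "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except FileNotFoundError:
            actual = None
        if actual != expected:
            bad.append(rel_path)
    return bad
```

A non-empty result from `verify_backup` would be treated as a blocking failure, matching the rule that integrity mismatches stop the flow.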
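VII. Disaster Recovery Strategy and Failover
- Topology: active-active provides parallel serving with multi-site arbitration; active-passive relies mainly on replication and cold standby.
- RTO/RPO: expressed in SI time units; monitor replication_lag and the health signals; both failover and failback require a runbook and approval.
- Consistency: prevent split-brain by using quorum and write fencing; after a switch, trigger a lineage comparison and an audit.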
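Quorum and write fencing, the two split-brain defenses named above, can be sketched in a few lines. `FencedStore` and the monotonically increasing token are illustrative assumptions; the chapter does not prescribe a particular fencing mechanism.

```python
def has_quorum(acks: int, quorum: int) -> bool:
    """A decision is valid only with at least `quorum` acknowledgements."""
    return acks >= quorum


def quorum_is_safe(members: int, quorum: int) -> bool:
    """Two disjoint partitions cannot both reach quorum iff quorum is a strict majority."""
    return 2 * quorum > members


class FencedStore:
    """Rejects writes whose fencing token is older than the newest token seen.

    Each failover hands the new primary a larger token, so a former primary
    that is still running (a split-brain writer) is turned away.
    """

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.highest_token:
            return False  # stale writer: fenced off
        self.highest_token = token
        self.data[key] = value
        return True
```

For instance, once the standby has written with token 2, a delayed write from the old primary carrying token 1 returns False and leaves the stored value untouched.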
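VIII. Chaos Experiments, Drills, and Success Criteria
- Chaos injection: node, network, disk, and dependency fault simulation; record the blast radius and the recovery time for every experiment.
- Drills: trigger fault simulations and the rollback and failover procedures on schedule; the success criteria and any unmet items go into the remediation list.
- Regression: after each drill, run a regression assessment of SLA, DQ, cost, and alert quality.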
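Scoring a drill against the four success criteria could look like the sketch below. The function name, its arguments, and the way each criterion is measured are assumptions for illustration; the default targets mirror the PT30M and PT5M values used in this chapter.

```python
from datetime import timedelta


def evaluate_drill(measured_rto, measured_rpo, lost_records, alerts_fired,
                   rto=timedelta(minutes=30), rpo=timedelta(minutes=5)):
    """Score a drill and return (results, remediation).

    results maps each success criterion to True/False; remediation lists
    the criteria that were not met, in declaration order.
    """
    results = {
        "rto_met": measured_rto <= rto,
        "rpo_met": measured_rpo <= rpo,
        "no_data_loss": lost_records == 0,
        "alerting_ok": bool(alerts_fired),
    }
    remediation = [name for name, ok in results.items() if not ok]
    return results, remediation
```

A drill that recovered in 42 minutes with 3 minutes of data lag, no lost records, and working alerts would return a remediation list containing only `rto_met`.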
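IX. Metrology and Units (SI)
- Performance and targets: RTO, RPO, T_inf (ms), QPS (1/s), ρ (dimensionless); bandwidth net_mbps, volume size_bytes.
- Mandatory: metrology: {units: "SI", check_dim: true}; normalize units before any composition or conversion.
- Path quantities: if a fault-tolerance or recovery flow handles T_arr, register delta_form, path="gamma(ell)", and measure="d ell", and use one of:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ), verified with check_dim.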
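Written out in LaTeX, the two registered forms and the dimensional check read as follows. The dimensional assignment is an assumption of this sketch: n_eff is taken as dimensionless, c_ref as a speed in m/s, and the path element d ell as a length in m. Under that reading both forms yield seconds, and they coincide when c_ref is constant along gamma(ell).

```latex
\[
T_{\mathrm{arr}} = \left( \frac{1}{c_{\mathrm{ref}}} \right)
                   \left( \int_{\gamma(\ell)} n_{\mathrm{eff}} \, d\ell \right)
\qquad \text{or} \qquad
T_{\mathrm{arr}} = \left( \int_{\gamma(\ell)}
                   \left( \frac{n_{\mathrm{eff}}}{c_{\mathrm{ref}}} \right) d\ell \right)
\]
\[
[\,T_{\mathrm{arr}}\,]
  = \frac{[\,n_{\mathrm{eff}}\,]\,[\,d\ell\,]}{[\,c_{\mathrm{ref}}\,]}
  = \frac{1 \cdot \mathrm{m}}{\mathrm{m}/\mathrm{s}}
  = \mathrm{s}
\]
```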
X. Machine-Readable Snippet (Ready to Embed)
fault_tolerance:
  semantics: {on_fail: "retry", error_classes: ["retryable","non_retryable","escalate"]}
  retry: {policy: {max: 3, backoff: "expo", jitter_ms: 200}, timeout_s: 1800}
  idempotency: {enabled: true, dedupe_key: ["id","updated_at"], sink_mode: "upsert"}
  compensation:
    enabled: true
    handlers:
      - {stage: "feature.map", action: "delete_artifact", spec: "comp/delete.manifest.yaml"}
recovery:
  checkpoint: {mode: "exactly-once", store: "s3://meta/chk/feat.map", cadence: "PT5M",
               contents: ["offset","cursor","watermark","sink_commit"]}
  snapshot: {enabled: true, store: "s3://snap/feat_rows", cadence: "P1D", retention: "P30D"}
  replay: {enabled: true, inputs_lock: "locks/inputs.manifest.json", policy: "strict"}
  rollbacks: {guardrail: {max_depth: 2, require_approval: true}}
dr:
  strategy: "active-passive"
  topology:
    primary: {region: "eu-west-1", azs: ["a","b"], quorum: 3}
    standby: {region: "eu-central-1", azs: ["a","b"], quorum: 3}
  rto: "PT30M"
  rpo: "PT5M"
  failover: {trigger: "auto", health_checks: ["latency_ms.p99","error_rate","heartbeat"], dns_ttl_s: 60}
  fallback: {criteria: ["primary_healthy_24h","replication_lag<PT1M"]}
testing:
  chaos: {enabled: true, experiments: [{name: "kill-worker", scope: "stage", percent: 10}]}
  drills: {schedule: "quarterly", playbooks: ["dr/runbook.md"], success_criteria: ["rto_met","rpo_met","no_data_loss"]}
backups:
  datasets: ["feat_rows","train_pkg"]
  cadence: "P1D"
  store: "s3://backup"
  encryption: "SSE-KMS"
  integrity: {hash: "sha256", manifest: "backup/manifest.json"}
metrology: {units: "SI", check_dim: true}
XI. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: FT.IDEMPOTENCY_REQUIRED
    when: "$.fault_tolerance.idempotency.enabled"
    assert: "value == true"
    level: error
  - id: RC.CHECKPOINT_DEFINED
    when: "$.recovery.checkpoint"
    assert: "has_keys(mode, store, cadence)"
    level: error
  - id: DR.RTO_RPO_DEFINED
    when: "$.dr"
    assert: "has_keys(rto, rpo) and duration_valid(rto) and duration_valid(rpo)"
    level: error
  - id: DR.STRATEGY_ALLOWED
    when: "$.dr.strategy"
    assert: "value in ['active-active','active-passive']"
    level: error
  - id: TEST.DRILLS_SCHEDULED
    when: "$.testing.drills.schedule"
    assert: "matches('^(monthly|quarterly|biannual|annual)$') or duration_valid(value)"
    level: error
  - id: BKP.INTEGRITY_MANIFEST
    when: "$.backups"
    assert: "has_keys(store, cadence, integrity)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
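A few of these rules can be exercised with a small checker. This is a sketch only: the rule expression language is not defined in this chapter, so the assertions are hand-translated into Python callables, `when` is read as "apply the rule only if the path resolves", and `duration_valid` is assumed to mean a well-formed ISO 8601 duration such as PT30M or P1D.

```python
import re

# ISO 8601 duration with at least one component, e.g. "PT30M", "P1D", "P1DT12H".
_DURATION = re.compile(
    r"^P(?=\d|T\d)(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"
    r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?$"
)


def duration_valid(value) -> bool:
    return isinstance(value, str) and _DURATION.match(value) is not None


def has_keys(node, *keys) -> bool:
    return isinstance(node, dict) and all(k in node for k in keys)


def resolve(config, path):
    """Resolve a '$.a.b' style path; return (found, value)."""
    node = config
    for part in path[2:].split("."):
        if not isinstance(node, dict) or part not in node:
            return False, None
        node = node[part]
    return True, node


# (rule id, 'when' path, hand-translated assertion)
RULES = [
    ("FT.IDEMPOTENCY_REQUIRED", "$.fault_tolerance.idempotency.enabled",
     lambda v: v is True),
    ("RC.CHECKPOINT_DEFINED", "$.recovery.checkpoint",
     lambda v: has_keys(v, "mode", "store", "cadence")),
    ("DR.RTO_RPO_DEFINED", "$.dr",
     lambda v: has_keys(v, "rto", "rpo")
     and duration_valid(v["rto"]) and duration_valid(v["rpo"])),
    ("DR.STRATEGY_ALLOWED", "$.dr.strategy",
     lambda v: v in ("active-active", "active-passive")),
]


def lint(config):
    """Return the ids of rules whose path resolves and whose assertion fails."""
    failed = []
    for rule_id, path, check in RULES:
        found, value = resolve(config, path)
        if found and not check(value):
            failed.append(rule_id)
    return failed
```

Calling `lint` on a parsed configuration returns an empty list when every applicable rule passes; since all rules carry level error, any returned id would block.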
XII. Export Manifest and Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "chk/catalog.json", sha256: "..."}
    - {path: "snap/retention.policy", sha256: "..."}
    - {path: "dr/runbook.md", sha256: "..."}
    - {path: "dr/drill_reports/2025Q3.md", sha256: "..."}
    - {path: "backup/manifest.json", sha256: "..."}
    - {path: "comp/normalize.reverse.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.Pipeline v1.0:Ch.11"
    - "EFT.WP.Data.Pipeline v1.0:Ch.12"
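The sha256 entries in such a manifest would normally be generated rather than typed by hand. A minimal sketch, assuming the artifacts sit under a local `root` directory and that the manifest is assembled as a plain dictionary (both assumptions of this example):

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_export_manifest(root, rel_paths, references, version="v1.0"):
    """Assemble an export_manifest dictionary with one sha256 per artifact."""
    artifacts = [
        {"path": rel, "sha256": sha256_file(os.path.join(root, rel))}
        for rel in rel_paths
    ]
    return {
        "export_manifest": {
            "version": version,
            "artifacts": artifacts,
            "references": list(references),
        }
    }
```

A missing artifact raises FileNotFoundError instead of producing a manifest with a gap, which keeps an incomplete export from passing the release gate unnoticed.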
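XIII. Chapter Compliance Self-Check
- Failure semantics, retry/timeout, and idempotency/deduplication policies are explicit; compensation and rollback procedures are reproducible and subject to approval and limits.
- Checkpoints, snapshots, and backups are in place with integrity verification; the replay policy and inputs_lock are complete.
- The DR topology, RTO/RPO, and the failover and failback runbooks are fixed; chaos experiments and drills run on schedule and meet the success criteria.
- SI metrology and check_dim=true are in effect; where T_arr is involved, delta_form/path/measure are registered and pass verification.
- The export manifest lists the checkpoint, snapshot, backup, drill, and runbook artifacts together with their reference anchors, each with a sha256, and meets the release gate.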
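Copyright and License (CC BY 4.0)
Copyright notice: unless otherwise stated, the copyright in 《能量丝理论》 (Energy Filament Theory), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. “屠广林”).
License: this work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: “屠广林”; Work: 《能量丝理论》; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/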