Technical White Paper 45: EFT.WP.Data.Pipeline v1.0

Chapter 5 Schema and Contract Management


I. Purpose and Scope

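This chapter fixes the versioning, compatibility, evolution policy, and validation flow of schemas and data contracts in the pipeline; defines reserved keys such as schema_ref, compat_mode, and evolution_policy; specifies contract registration, shadow comparison, and release gates; and ensures consistency with data cards/model cards, the metrology chapter, and reference anchors.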

II. Terminology and Dependencies


III. Fields and Structure (Normative)

contract:
  schema_ref: "contracts/<name>@vX.Y"   # versioned schema reference (required)
  compat_mode: "forward|backward|both|break"
  evolution_policy:
    add_field: "optional-by-default|feature-flag"
    remove_field: "forbid|deprecate-then-remove"
    change_type: "coercible|forbid"
    change_sematic: "requires-shadow-and-signoff"
  constraints:
    primary_key: ["<col1>", "<col2?>"]
    partition_by: ["<pcol?>"]
    unique: [["<colA>", "<colB>"]]
    not_null: ["<colX>", "<colY>"]
    range:
      - {col: "<metric>", rule: "[lo,hi]"}
    enum:
      - {col: "<status>", values: ["A", "B", "C"]}
    units: {"<col>": "<SI-unit>"}   # consistent with the metrology chapter
  validation:
    mode: "strict|lenient"
    sample: {rows: 10000, strategy: "head|random|stratified"}
    significance: {alpha: 0.05}
  shadow:
    enabled: true
    route: "percent:5"   # shadow percentage or selector
    compare_metrics: ["dq.pass_rate", "error_rate", "latency_ms.p95"]
  lineage_bind:
    produce: ["<artifact_path>"]
    consume: ["<upstream_schema_ref>"]
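As an illustration of how the constraints block might be enforced, the following is a minimal Python sketch, not the pipeline's actual validator. It assumes the contract has already been loaded into a plain dict, that the range rule "[lo,hi]" denotes a closed numeric interval, and that rows are dicts. The function names check_rows and parse_range and the bound [0,5000] for power_w are hypothetical and are not defined by this chapter.

```python
from typing import Any


def parse_range(rule: str) -> tuple[float, float]:
    """Parse a closed interval written as "[lo,hi]" (assumed syntax)."""
    lo, hi = rule.strip()[1:-1].split(",")
    return float(lo), float(hi)


def check_rows(rows: list[dict[str, Any]], constraints: dict[str, Any]) -> list[str]:
    """Return human-readable violations; an empty list means the rows pass."""
    errors: list[str] = []
    seen_keys: set[tuple] = set()
    pk = constraints.get("primary_key", [])
    for i, row in enumerate(rows):
        # not_null: the listed columns must be present and not None
        for col in constraints.get("not_null", []):
            if row.get(col) is None:
                errors.append(f"row {i}: {col} is null")
        # primary_key: the tuple of key columns must be unique across rows
        if pk:
            key = tuple(row.get(c) for c in pk)
            if key in seen_keys:
                errors.append(f"row {i}: duplicate primary key {key}")
            seen_keys.add(key)
        # range: non-null values must fall inside the closed interval
        for r in constraints.get("range", []):
            lo, hi = parse_range(r["rule"])
            v = row.get(r["col"])
            if v is not None and not (lo <= v <= hi):
                errors.append(f"row {i}: {r['col']}={v} outside {r['rule']}")
        # enum: non-null values must be one of the allowed values
        for e in constraints.get("enum", []):
            v = row.get(e["col"])
            if v is not None and v not in e["values"]:
                errors.append(f"row {i}: {e['col']}={v!r} not in {e['values']}")
    return errors


# Hypothetical constraints and rows, for illustration only.
constraints = {
    "primary_key": ["id"],
    "not_null": ["id", "ts"],
    "range": [{"col": "power_w", "rule": "[0,5000]"}],
    "enum": [{"col": "status", "values": ["ok", "warn", "err"]}],
}
rows = [
    {"id": 1, "ts": 100, "power_w": 12.5, "status": "ok"},
    {"id": 1, "ts": None, "power_w": -3.0, "status": "bad"},
]
violations = check_rows(rows, constraints)
for line in violations:
    print(line)
```

In this sketch the second row violates all four constraint kinds at once. The unique, partition_by, and units keys are left out for brevity.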


IV. Contract Registration and Release Process

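  1. Registration: each schema_ref is registered in the schema repository together with its hash and a change summary; a first release must also include a minimal example and a DQ baseline.
  2. Compatibility matrix
    • forward: downstream can accept optional fields newly added by upstream;
    • backward: upstream can still output the subset of legacy fields that downstream expects;
    • both: compatible in both directions;
    • break: a breaking change, which requires shadow comparison and sign-off.
  3. Evolution policy: new fields are optional by default; removal follows a two-phase "deprecate, then remove" process; type changes are allowed only when coercible and must ship with the conversion rule.
  4. Release gate: schema check = pass, DQ = pass, shadow-comparison differences within threshold, metrology.check_dim=true, and all reference anchors present.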
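The evolution policy and release gate above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the registry's implementation: schemas are reduced to {field: type} maps, the set of coercible type pairs and the set of already-deprecated fields are supplied by the caller, and the gate-check names (schema_check, dq, shadow_within_threshold, metrology_check_dim, anchors_complete) are hypothetical labels for the five conditions in item 4.

```python
def policy_violations(
    old: dict[str, str],
    new: dict[str, str],
    deprecated: set[str],
    coercible: set[tuple[str, str]],
) -> list[str]:
    """List changes between two {field: type} maps that break the evolution policy."""
    problems: list[str] = []
    # Removal is only allowed for fields that went through the deprecation phase.
    for name in old.keys() - new.keys():
        if name not in deprecated:
            problems.append(f"removed without deprecation: {name}")
    # Type changes are only allowed for (old_type, new_type) pairs declared coercible.
    for name in old.keys() & new.keys():
        if old[name] != new[name] and (old[name], new[name]) not in coercible:
            problems.append(f"non-coercible type change: {name} {old[name]}->{new[name]}")
    return sorted(problems)


# Hypothetical names for the five release-gate conditions.
REQUIRED_CHECKS = [
    "schema_check",
    "dq",
    "shadow_within_threshold",
    "metrology_check_dim",
    "anchors_complete",
]


def release_gate(checks: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (passed, failed_check_names); a missing check counts as failed."""
    failed = [name for name in REQUIRED_CHECKS if not checks.get(name, False)]
    return (not failed, failed)


# Hypothetical schema versions, for illustration only.
old = {"id": "int32", "ts": "int64", "note": "string"}
new = {"id": "int64", "ts": "string", "region": "string"}
problems = policy_violations(old, new, deprecated=set(), coercible={("int32", "int64")})
passed, failed = release_gate({
    "schema_check": True,
    "dq": True,
    "shadow_within_threshold": False,
    "metrology_check_dim": True,
    "anchors_complete": True,
})
print(problems)
print(passed, failed)
```

Here the added field region raises no violation because new fields are optional by default, while the undeclared removal of note and the int64-to-string change of ts are both reported.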

V. Schema Design Constraints


VI. Contract Shadow Comparison and Rollback


VII. Machine-Readable (Normative Fragment)

layers:
  - name: "validate"
    stages:
      - name: "schema.check"
        type: "validate.schema"
        impl: "I16-2.schema_check"
        inputs: ["raw_rows"]
        outputs: ["clean_rows"]
        contract:
          schema_ref: "contracts/raw_rows@v1.2"
          compat_mode: "both"
          evolution_policy:
            add_field: "optional-by-default"
            remove_field: "deprecate-then-remove"
            change_type: "coercible"
            change_sematic: "requires-shadow-and-signoff"
          constraints:
            primary_key: ["id"]
            not_null: ["id", "ts"]
            enum: [{col: "status", values: ["ok", "warn", "err"]}]
            units: {"lat": "deg", "lon": "deg", "power_w": "W"}
          validation:
            mode: "strict"
            sample: {rows: 50000, strategy: "stratified"}
            significance: {alpha: 0.05}
          shadow:
            enabled: true
            route: "percent:5"
            compare_metrics: ["dq.pass_rate", "error_rate", "latency_ms.p95"]
          lineage_bind:
            produce: ["lake/clean/2025/09/"]
            consume: ["contracts/raw_json@v1.2"]
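One way the shadow route "percent:5" could be applied is deterministic hash bucketing, sketched below in Python. This chapter does not specify the routing mechanism, so the bucketing scheme, the function names parse_route and in_shadow, and the record keys are all assumptions made for illustration. Only the percent form is handled; selector-style routes are out of scope.

```python
import hashlib


def parse_route(route: str) -> float:
    """Parse "percent:<n>" into a percentage in [0, 100]."""
    kind, _, value = route.partition(":")
    if kind != "percent":
        raise ValueError(f"unsupported route kind: {kind!r}")
    pct = float(value)
    if not 0 <= pct <= 100:
        raise ValueError(f"percentage out of range: {pct}")
    return pct


def in_shadow(record_key: str, route: str) -> bool:
    """Decide whether a record is mirrored to the shadow path.

    The key is hashed into one of 10,000 buckets, so the same key always
    gets the same decision (assumed design, not taken from the source).
    """
    pct = parse_route(route)
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < pct * 100


# Hypothetical record keys, for illustration only.
keys = [f"order-{i}" for i in range(20_000)]
share = sum(in_shadow(k, "percent:5") for k in keys) / len(keys)
print(f"shadow share: {share:.3f}")
```

Hashing the key rather than sampling at random keeps a given record on the same side of the split across reruns, which makes shadow diffs reproducible.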


VIII. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SCHEMA.REF_FORMAT
    when: "$..schema_ref"
    assert: "matches('^contracts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: SCHEMA.COMPAT_ALLOWED
    when: "$..compat_mode"
    assert: "value in ['forward','backward','both','break']"
    level: error
  - id: SCHEMA.UNITS_DECLARED
    when: "$..constraints.units"
    assert: "all_units_in_SI(value)"
    level: error
  - id: SCHEMA.PK_NOT_NULL
    when: "$..constraints"
    assert: "primary_key != null and all_not_null(primary_key, not_null)"
    level: error
  - id: SCHEMA.SHADOW_REQUIRED_ON_BREAK
    when: "$..compat_mode"
    assert: "value != 'break' or $.shadow.enabled == true"
    level: error
  - id: SCHEMA.METROLOGY_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
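Four of the rules above can be sketched as plain Python checks over a contract dict. This is a hedged illustration, not the linter this chapter refers to: the function name lint_contract is hypothetical, all_not_null(primary_key, not_null) is read here as "every primary-key column also appears in not_null", and the shadow block is assumed to sit inside the same contract object. The unit and metrology rules are omitted because they depend on helpers defined elsewhere.

```python
import re

# Same pattern as SCHEMA.REF_FORMAT, written as a Python raw string.
SCHEMA_REF_RE = re.compile(r"^contracts/[a-z0-9_\-]+@v\d+\.\d+$")
COMPAT_MODES = {"forward", "backward", "both", "break"}


def lint_contract(contract: dict) -> list[str]:
    """Return the ids of the rules that the contract fails."""
    failed: list[str] = []
    if not SCHEMA_REF_RE.match(str(contract.get("schema_ref", ""))):
        failed.append("SCHEMA.REF_FORMAT")
    mode = contract.get("compat_mode")
    if mode not in COMPAT_MODES:
        failed.append("SCHEMA.COMPAT_ALLOWED")
    constraints = contract.get("constraints", {})
    pk = constraints.get("primary_key")
    not_null = set(constraints.get("not_null", []))
    # Assumed reading: a primary key must exist and be covered by not_null.
    if pk is None or not set(pk) <= not_null:
        failed.append("SCHEMA.PK_NOT_NULL")
    shadow_on = contract.get("shadow", {}).get("enabled") is True
    if mode == "break" and not shadow_on:
        failed.append("SCHEMA.SHADOW_REQUIRED_ON_BREAK")
    return failed


good = {
    "schema_ref": "contracts/raw_rows@v1.2",
    "compat_mode": "both",
    "constraints": {"primary_key": ["id"], "not_null": ["id", "ts"]},
    "shadow": {"enabled": True},
}
# Hypothetical broken contract, for illustration only.
bad = {
    "schema_ref": "contracts/Raw_Rows@1",
    "compat_mode": "break",
    "constraints": {"primary_key": ["id"], "not_null": ["ts"]},
    "shadow": {"enabled": False},
}
print(lint_contract(good))
print(lint_contract(bad))
```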


IX. Contract Evolution and Notification


X. Export Manifest and Audit Trail

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "contracts/raw_rows.schema.json", sha256: "..."}
    - {path: "contracts/changelog.md", sha256: "..."}
    - {path: "validate/dq.report.jsonl", sha256: "..."}
    - {path: "validate/shadow.diff.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
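The sha256 entries in the manifest could be produced and re-checked along the lines of the Python sketch below. It is a sketch under assumptions rather than the project's export tooling: the helper names sha256_file, build_manifest, and verify_manifest are hypothetical, artifact paths are taken as relative to an export root directory, and the demo writes throwaway files into a temporary directory instead of touching real artifacts.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path, rel_paths: list[str], version: str = "v1.0") -> dict:
    """Build an export_manifest dict with one {path, sha256} entry per artifact."""
    artifacts = [{"path": p, "sha256": sha256_file(root / p)} for p in rel_paths]
    return {"export_manifest": {"version": version, "artifacts": artifacts}}


def verify_manifest(root: Path, manifest: dict) -> list[str]:
    """Return the paths whose current hash no longer matches the manifest."""
    return [
        a["path"]
        for a in manifest["export_manifest"]["artifacts"]
        if sha256_file(root / a["path"]) != a["sha256"]
    ]


# Demo with throwaway files; the file contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "contracts").mkdir()
    (root / "contracts" / "changelog.md").write_bytes(b"abc")
    (root / "contracts" / "raw_rows.schema.json").write_bytes(b"{}")
    paths = ["contracts/raw_rows.schema.json", "contracts/changelog.md"]
    manifest = build_manifest(root, paths)
    clean = verify_manifest(root, manifest)
    # Simulate tampering after export, then verify again.
    (root / "contracts" / "changelog.md").write_bytes(b"abc, edited")
    tampered = verify_manifest(root, manifest)

print(clean, tampered)
```

Re-running verify_manifest during an audit gives a cheap check that the exported artifacts are still the ones that passed the release gate.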


XI. Chapter Compliance Self-Check


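Copyright and License (CC BY 4.0)

Copyright notice: unless otherwise stated, the copyright in "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, is held by the author, 屠广林.
License: this work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Suggested attribution: Author: 屠广林; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/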