Technical White Paper 45: EFT.WP.Data.Pipeline v1.0

Chapter 6 Data Validation and Quality Gates


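I. Purpose and Scope

This chapter fixes the pipeline's specification for data validation and quality gates (DQ gates): rule types, sampling and significance, block and warn levels, exception handling, and audit and export. It also ensures consistency with the Σ_in/Σ_out contracts, splits/coverage, metrology, and reference anchors.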

II. Terms and Dependencies


III. Fields and Structure (Normative)

stage:
  name: "schema.check|dq.scan|leakage.audit"
  type: "validate.schema|validate.dq|validate.leakage"
  impl: "I16-2.schema_check|I16-7.dq_scan|I16-8.leakage_audit"
  inputs: ["<upstream_artifact>"]
  outputs: ["<clean_rows>|<dq_report>|<leakage_report>"]
  schema_ref: "contracts/<name>@vX.Y"    # bound to the contract
  dq:
    sample: {rows: 50000, strategy: "head|random|stratified"}
    significance: {alpha: 0.05}           # statistical test threshold
    gates:                                # list of quality gates
      - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
      - {id: "DQ_002", kind: "unique", cols: [["id", "ts"]], level: "block"}
      - {id: "DQ_003", kind: "range", col: "value", rule: "[0,1e6]", unit: "<SI>", level: "block"}
      - {id: "DQ_004", kind: "enum", col: "status", values: ["ok", "warn", "err"], level: "block"}
      - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=200", level: "warn"}
      - {id: "DQ_006", kind: "freshness", col: "updated_at", max_lag: "PT30M", level: "warn"}
      - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
      - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
    on_fail: "quarantine|skip|block"      # action on failure
  retries: {max: 2, backoff: "expo"}
  timeout_s: 1800
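As an illustration of how the block-level gates above might be evaluated, the following is a minimal Python sketch, not the I16-7.dq_scan implementation. The function names check_gate and run_gates are hypothetical, the range rule "[lo,hi]" is assumed to denote a closed interval, and rows are assumed to be plain dictionaries.

```python
from typing import Any

def check_gate(gate: dict[str, Any], rows: list[dict[str, Any]]) -> bool:
    """Return True if every row passes the gate (assumed semantics)."""
    kind = gate["kind"]
    if kind == "not_null":
        return all(r.get(c) is not None for r in rows for c in gate["cols"])
    if kind == "unique":
        # Each entry of cols is treated as one composite key, e.g. ["id", "ts"].
        for key_cols in gate["cols"]:
            keys = [tuple(r.get(c) for c in key_cols) for r in rows]
            if len(keys) != len(set(keys)):
                return False
        return True
    if kind == "range":
        # Assumption: "[lo,hi]" is a closed interval.
        lo, hi = (float(x) for x in gate["rule"].strip("[]").split(","))
        return all(lo <= r[gate["col"]] <= hi for r in rows)
    if kind == "enum":
        return all(r[gate["col"]] in gate["values"] for r in rows)
    raise ValueError(f"unsupported gate kind: {kind}")

def run_gates(gates, rows):
    """Collect the ids of failed gates, split by level."""
    blocked, warned = [], []
    for gate in gates:
        if not check_gate(gate, rows):
            (blocked if gate["level"] == "block" else warned).append(gate["id"])
    return blocked, warned

gates = [
    {"id": "DQ_001", "kind": "not_null", "cols": ["id", "ts"], "level": "block"},
    {"id": "DQ_002", "kind": "unique", "cols": [["id", "ts"]], "level": "block"},
    {"id": "DQ_003", "kind": "range", "col": "value", "rule": "[0,1e6]", "level": "block"},
    {"id": "DQ_004", "kind": "enum", "col": "status",
     "values": ["ok", "warn", "err"], "level": "block"},
]
rows = [
    {"id": 1, "ts": 10, "value": 5.0, "status": "ok"},
    {"id": 1, "ts": 10, "value": 2e6, "status": "ok"},  # duplicate key, value out of range
]
blocked, warned = run_gates(gates, rows)
print(blocked, warned)
```

A non-empty blocked list would then trigger whichever on_fail action the stage configures; how quarantine, skip, and block differ in effect is left to the stage implementation.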


IV. Rule Types and Evaluation Criteria


V. Sampling, Significance, and Levels


VI. Exception Handling and Audit Export


VII. Metrology and Units (SI)

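  1. Performance and time metrics: QPS (1/s), T_inf (ms, reported as {p50, p95, p99}), ρ (dimensionless); bandwidth net_mbps, volume size_bytes.
  2. metrology: {units: "SI", check_dim: true} is mandatory; units that appear in range/unit/distribution rules are checked against SI.
  3. If a rule involves a path quantity (such as T_arr), the rule or stage configuration must register delta_form, path="gamma(ell)", and measure="d ell", and the consistency check must use one of the two equivalent expressions:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ).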
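The two expressions for T_arr agree whenever c_ref is constant along the path, since a constant factor can be moved across the integral. The Python sketch below checks this numerically with a trapezoidal rule. The value of c_ref, the path samples, and the n_eff profile are all hypothetical and chosen only for illustration.

```python
import math

def trapezoid(ys, xs):
    """Trapezoidal approximation of the integral of y over x."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

c_ref = 3.0e8                                    # m/s, hypothetical constant reference speed
ell = [i * 10.0 for i in range(101)]             # path samples in metres, 0 to 1000
n_eff = [1.0 + 1e-4 * (x / 1000.0) for x in ell]  # hypothetical dimensionless profile

# Form 1: T_arr = (1 / c_ref) * (integral of n_eff d ell)
t_form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)
# Form 2: T_arr = integral of (n_eff / c_ref) d ell
t_form2 = trapezoid([n / c_ref for n in n_eff], ell)

consistent = math.isclose(t_form1, t_form2, rel_tol=1e-9)
print(t_form1, t_form2, consistent)
```

Both results carry the unit of seconds (metres divided by metres per second), which is the kind of dimensional agreement that check_dim is meant to enforce.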

VIII. Machine-Readable Fragment (Directly Embeddable)

layers:
  - name: "validate"
    stages:
      - name: "dq.scan"
        type: "validate.dq"
        impl: "I16-7.dq_scan"
        inputs: ["clean_rows"]
        outputs: ["dq_report"]
        schema_ref: "contracts/clean_rows@v1.3"
        dq:
          sample: {rows: 100000, strategy: "stratified"}
          significance: {alpha: 0.05}
          gates:
            - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
            - {id: "DQ_003", kind: "range", col: "power_w", rule: "[0,2e3]", unit: "W", level: "block"}
            - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=150", level: "warn"}
            - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
            - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
          on_fail: "quarantine"
        retries: {max: 2, backoff: "expo"}
        timeout_s: 1800
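The drift gate DQ_007 uses the metric "psi<=0.2". The sketch below shows one common way to compute a population stability index (PSI) over fixed bins and compare it with that threshold. It is an assumption that the pipeline uses this particular binning and formula; the function name psi, the bin edges, and the sample values are hypothetical, and every value is assumed to lie within the edges.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), using bin proportions."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open [lo, hi), except the last one, which includes hi.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

threshold = float("psi<=0.2".split("<=")[1])   # parsed from the gate's metric string

baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # hypothetical reference sample
current = [0.1, 0.1, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]   # hypothetical shifted sample
edges = [0.0, 0.5, 1.0]

score_same = psi(baseline, baseline, edges)
score_shift = psi(baseline, current, edges)
print(score_same, score_shift, score_shift <= threshold)
```

Because DQ_007 is declared at level "warn", a score above the threshold would be reported without blocking the stage.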


IX. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: DQ.SCHEMA_REF_REQUIRED
    when: "$.layers[*].stages[?(@.type=='validate.dq')]"
    assert: "has_key('schema_ref')"
    level: error
  - id: DQ.SAMPLE_DEFINED
    when: "$.layers[*].stages[?(@.type=='validate.dq')].dq.sample"
    assert: "value.rows > 0 and value.strategy in ['head','random','stratified']"
    level: error
  - id: DQ.LEVEL_ALLOWED
    when: "$.layers[*].stages[*].dq.gates[*].level"
    assert: "value in ['block','warn']"
    level: error
  - id: DQ.RANGE_UNIT_SI
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='range')]"
    assert: "is_SI_unit($.unit)"
    level: error
  - id: DQ.DRIFT_THRESHOLDS
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='drift')]"
    assert: "psi_threshold_ok($.metric)"
    level: warn
  - id: DQ.LEAKAGE_POLICY
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='leakage')]"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
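To make the intent of the first three rules concrete, here is a minimal Python sketch that walks an already-parsed configuration (nested dicts and lists) instead of evaluating the JSONPath expressions. The function name lint and the returned tuple shape are hypothetical; the rule ids and allowed values come from the list above. Mirroring the `when` selectors, DQ.SAMPLE_DEFINED is only evaluated when a dq.sample node exists.

```python
ALLOWED_STRATEGIES = ("head", "random", "stratified")
ALLOWED_LEVELS = ("block", "warn")

def lint(config):
    """Return a list of (rule_id, subject) findings for three of the rules."""
    findings = []
    for layer in config.get("layers", []):
        for stage in layer.get("stages", []):
            dq = stage.get("dq", {})
            if stage.get("type") == "validate.dq":
                # DQ.SCHEMA_REF_REQUIRED: the stage must carry a schema_ref key.
                if "schema_ref" not in stage:
                    findings.append(("DQ.SCHEMA_REF_REQUIRED", stage.get("name")))
                # DQ.SAMPLE_DEFINED: rows > 0 and a known strategy.
                sample = dq.get("sample")
                if sample is not None:
                    ok = (sample.get("rows", 0) > 0
                          and sample.get("strategy") in ALLOWED_STRATEGIES)
                    if not ok:
                        findings.append(("DQ.SAMPLE_DEFINED", stage.get("name")))
            # DQ.LEVEL_ALLOWED: applies to gates of every stage type.
            for gate in dq.get("gates", []):
                if gate.get("level") not in ALLOWED_LEVELS:
                    findings.append(("DQ.LEVEL_ALLOWED", gate.get("id")))
    return findings

# Hypothetical configuration that violates all three rules.
bad_config = {
    "layers": [{
        "name": "validate",
        "stages": [{
            "name": "dq.scan",
            "type": "validate.dq",
            "dq": {
                "sample": {"rows": 0, "strategy": "stratified"},
                "gates": [{"id": "DQ_001", "kind": "not_null",
                           "cols": ["id", "ts"], "level": "info"}],
            },
        }],
    }],
}
findings = lint(bad_config)
print(findings)
```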


X. Export Manifest and Reports

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "dq/report.jsonl", sha256: "..."}
    - {path: "dq/summary.csv", sha256: "..."}
    - {path: "dq/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
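The sha256 fields in the manifest can be filled by hashing each artifact file. The sketch below uses Python's standard hashlib for this. The helper names sha256_of and build_manifest, the temporary directory, and the CSV column names are hypothetical and serve only to make the example self-contained.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, rel_paths, version="v1.0"):
    """Build an export_manifest mapping with one artifact entry per path."""
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [
                {"path": p, "sha256": sha256_of(Path(root) / p)} for p in rel_paths
            ],
        }
    }

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dq").mkdir()
    # Hypothetical summary content; the real column layout is not specified here.
    (root / "dq" / "summary.csv").write_text(
        "gate_id,level,passed\nDQ_001,block,true\n", encoding="utf-8"
    )
    manifest = build_manifest(root, ["dq/summary.csv"])

print(manifest)
```

A consumer can recompute the digest of each listed path and compare it with the recorded value to detect artifacts that changed after export.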


XI. Chapter Compliance Self-Check


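Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Provided that the author and source are credited, it may be copied, reposted, excerpted, adapted, and redistributed for commercial or non-commercial purposes.
Attribution format (suggested): Author: "屠广林"; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/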