06-EFT.WP.Core.DataSpec v1.0
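I. Goals and Scope
- Specify the object model, field list, and required-field constraints of the Schema Registry so that register_schema and export_schema interoperate consistently across all volumes.
- Provide a complete registration example for DS.TARR.PathIntegral v1, covering units and dimensions, keys and indexes, contracts, provenance, and governance.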
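II. Object Model Overview
- SchemaRegistryRecord (SRef for short)
  - Semantics: a single publishable registration record for a data schema.
  - Relations: an SRef references a number of FieldSpec, ConstraintSpec, IndexSpec, GovernanceSpec, and PrivacySpec objects.
  - Unified reference: SRef.id serves as the cross-volume reference key; register_schema(...) -> SRef.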
III. SRef Top-Level Fields (Required Items and Constraints)
- name : str (required)
  Rule: ^DS\.[A-Z0-9]+(\.[A-Za-z0-9]+)+$; example: DS.TARR.PathIntegral.
- version : str (required)
  Rule: semantic version MAJOR.MINOR[.PATCH]; example: 1.0.
- title : str (required)
  Rule: human-readable title; example: Arrival-time along path integrals.
- description : str (required)
  Rule: describes the schema's purpose, sources, and outputs.
- fields : list[FieldSpec] (required, len >= 1)
  Rule: field names are unique and match ^[a-z][a-z0-9_]*$.
- pk : list[str] (required, len >= 1)
  Constraint: pk ⊆ { f.name }; the key must be unique.
- idx : list[IndexSpec] (optional)
  The set of secondary indexes.
- constraints : list[ConstraintSpec] (required, len >= 1)
  Must include at least primary-key uniqueness and the key physical constraints (e.g., monotonicity, dimensional consistency).
- units : dict[str, str] (optional)
  units[field] = unit(field), e.g. "c_ref_value":"m/s".
- dims : dict[str, str] (optional)
  dims[field] = dim(field), e.g. "T_arr_const":"T".
- equations : list[str] (optional)
  References to minimal equation or postulate numbers, e.g. ["S610-1","S610-2"].
- parameters : list[str] (optional)
  Bound parameter references, e.g. ["c_ref_ref","n_eff_model_ref"].
- governance : GovernanceSpec (required)
  Data ownership, retention, SLA, and release policy.
- privacy : PrivacySpec (required)
  Field classification, anonymization/masking policies, and the exception list.
- provenance : ProvenanceSpec (required)
  Trace = [source -> method -> artifact], plus checksum and signature configuration.
- quality_gates : QualityGateSpec (required)
  Release thresholds, e.g. q_score_min, delta_form_max.
- manifests : list[ManifestHook] (optional)
  Extension hooks and template names for the export manifest.
- see : list[str] (optional)
  Cross-volume references, e.g. ["Core.Equations §S610","Core.Parameters §P3x"].
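The name, version, fields, pk, and constraints rules above can be checked mechanically before a record is registered. The following is a minimal Python sketch under stated assumptions: the function name validate_top_level and the digits-only version pattern are illustrative and not part of this specification; only the name regex and the pk ⊆ fields rule are taken from the list above.

```python
import re

# Naming rule copied from the "name" entry above.
NAME_RE = re.compile(r"^DS\.[A-Z0-9]+(\.[A-Za-z0-9]+)+$")
# MAJOR.MINOR[.PATCH] with digits only (an assumption about the numeric form).
VERSION_RE = re.compile(r"^\d+\.\d+(\.\d+)?$")


def validate_top_level(record: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not NAME_RE.match(record.get("name", "")):
        problems.append("name does not match the DS.* naming rule")
    if not VERSION_RE.match(record.get("version", "")):
        problems.append("version is not MAJOR.MINOR[.PATCH]")
    names = [f["name"] for f in record.get("fields", [])]
    if not names:
        problems.append("fields must contain at least one entry")
    if len(names) != len(set(names)):
        problems.append("field names must be unique")
    pk = record.get("pk", [])
    if not pk:
        problems.append("pk must contain at least one field")
    missing = [k for k in pk if k not in names]
    if missing:
        problems.append(f"pk references unknown fields: {missing}")
    if not record.get("constraints"):
        problems.append("constraints must contain at least one entry")
    return problems


good = {
    "name": "DS.TARR.PathIntegral",
    "version": "1.0",
    "fields": [{"name": "pid"}, {"name": "seg_id"}],
    "pk": ["pid", "seg_id"],
    "constraints": [{"kind": "unique", "expr": "unique(pid,seg_id)"}],
}
# Same record with an invalid name and a pk column that is not a field.
bad = dict(good, name="ds.tarr", pk=["pid", "missing_col"])

print(validate_top_level(good))
print(validate_top_level(bad))
```

Returning a list of problems rather than raising on the first one lets a publisher see every violation in a single pass.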
IV. FieldSpec (Field Dictionary)
- name : str (required)
- type : str (required)
  Allowed set: {"int32","int64","float32","float64","decimal(p,s)","bool","string","bytes","timestamp(UTC)","date","struct","list<T>","map<K,V>","categorical","geometry"}.
- unit : str|None (optional)
  SI or derived unit text, e.g. "m", "s", "m/s".
- dim : str|None (optional)
  Dimension label, e.g. "L", "T", "L T^-1", "1".
- nullable : bool (required)
- default : any|None (optional)
- pii_level : str (required)
  Allowed set: {"none","low","moderate","high"}.
- desc : str (required)
- aliases : list[str]|None (optional)
- enum : list[any]|None (optional)
- tags : list[str]|None (optional)
- quality_weight : float|None (optional, in [0,1])
- Constraint rules:
  - If unit is present, dim must also be given and must agree with check_dim(expr).
  - timestamp(UTC) fields must declare their time zone as UTC.
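The rule "if unit is present, dim must be given and consistent" can be illustrated with a small Python dataclass. This is a sketch only: the UNIT_TO_DIM table is a hypothetical stand-in for whatever check_dim actually does, and the problems() method is an illustrative name, not part of this specification.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_PII = {"none", "low", "moderate", "high"}

# Hypothetical unit -> dimension table; a real check_dim would derive this.
UNIT_TO_DIM = {"m": "L", "s": "T", "m/s": "L T^-1", "1": "1"}


@dataclass
class FieldSpec:
    name: str
    type: str
    nullable: bool
    pii_level: str
    desc: str
    unit: Optional[str] = None
    dim: Optional[str] = None

    def problems(self) -> list[str]:
        """Check pii_level and the unit/dim rule for this field."""
        found = []
        if self.pii_level not in ALLOWED_PII:
            found.append(f"{self.name}: pii_level {self.pii_level!r} not allowed")
        if self.unit is not None:
            expected = UNIT_TO_DIM.get(self.unit)
            if self.dim is None:
                found.append(f"{self.name}: unit given without dim")
            elif expected is None:
                found.append(f"{self.name}: unit {self.unit!r} not in the table")
            elif expected != self.dim:
                found.append(f"{self.name}: dim {self.dim!r} does not match unit {self.unit!r}")
        return found


ok = FieldSpec("c_ref_value", "float64", False, "none", "resolved c_ref",
               unit="m/s", dim="L T^-1")
# A length unit paired with a time dimension should be rejected.
bad = FieldSpec("ell_start", "float64", False, "none", "path coord start",
                unit="m", dim="T")

print(ok.problems())
print(bad.problems())
```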
V. IndexSpec (Secondary Indexes)
- keys : list[str] (required, len >= 1)
- kind : str (required)
  Allowed set: {"btree","hash","geo","inverted","composite"}.
- unique : bool (required)
- desc : str (optional)
VI. ConstraintSpec (Contract Template)
- kind : str (required)
  Allowed set: {"unique","not_null","range","regex","enum_set","cross_field","referential","monotonic","dim_check","arrivaltime_dualform","custom"}.
- expr : str (required)
  Examples: "ell_end >= ell_start", "delta_form <= tol_Tarr", "check_dim(T_arr_const)=='T'".
- params : dict (optional)
  Example: {"tol_Tarr":"1e-9 s","fields":["T_arr_const","T_arr_integrand"]}.
- severity : str (required)
  Allowed set: {"ERROR","WARN","INFO"}.
- message : str (required)
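To show how a ConstraintSpec might be applied to data, the sketch below evaluates the simplest kind of expr, a comparison between two fields of one row, and reports the result using severity and message. The function check_row and the restriction to the "<field> <op> <field>" form are assumptions of this example; the specification does not define an expression evaluator.

```python
import operator
import re

ALLOWED_SEVERITY = {"ERROR", "WARN", "INFO"}
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}
# Only the form "<field> <op> <field>" is handled in this sketch.
EXPR_RE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(\w+)\s*$")


def check_row(constraint: dict, row: dict) -> tuple[bool, str]:
    """Evaluate one constraint on one row; return (passed, report line)."""
    if constraint["severity"] not in ALLOWED_SEVERITY:
        raise ValueError(f"unknown severity: {constraint['severity']}")
    m = EXPR_RE.match(constraint["expr"])
    if m is None:
        raise ValueError(f"unsupported expr: {constraint['expr']}")
    left, op, right = m.groups()
    passed = OPS[op](row[left], row[right])
    report = "ok" if passed else f"{constraint['severity']}: {constraint['message']}"
    return passed, report


monotonic = {"kind": "monotonic", "expr": "ell_end >= ell_start",
             "severity": "ERROR", "message": "ell non-decreasing"}

print(check_row(monotonic, {"ell_start": 0.0, "ell_end": 12.5}))
print(check_row(monotonic, {"ell_start": 5.0, "ell_end": 1.0}))
```

Parsing the expression with a fixed pattern, instead of passing it to eval, keeps untrusted contract text from executing arbitrary code.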
VII. GovernanceSpec (Governance and Release)
- owner : str (required)
- steward : str (required)
- retention_days : int (required)
- sla : dict (required)
  Example: {"freshness_max":"P1D","availability_target":"99.9%"}.
- release : dict (required)
  Example: {"freeze_policy":"immutable","signing_key":"key://k1"}.
VIII. PrivacySpec (Privacy Classification and Policies)
- classification : dict[str,str] (required)
  classification[field] = pii_level.
- anonymization : dict (optional)
  Example: {"gamma_path":"geohash_r6","ts":"bucket_P1M"}.
- masking : dict[str,str] (optional)
  Example: {"uid":"hash","sid":"salted_hash"}.
- exceptions : list[str] (optional)
  Fields exempted for legal or research reasons (a legal basis must be provided).
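As an illustration of the classification and masking entries above, the sketch below lists fields that lack a classification and applies the two masking strategy names from the example. The helper names and the use of SHA-256 for both strategies are assumptions; the specification names the strategies but does not define their algorithms.

```python
import hashlib


def unclassified_fields(field_names: list[str], classification: dict[str, str]) -> list[str]:
    """Fields that have no entry in privacy.classification."""
    return [name for name in field_names if name not in classification]


def mask_value(value: str, strategy: str, salt: str = "") -> str:
    """Apply a masking strategy; SHA-256 is an assumption of this sketch."""
    if strategy == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if strategy == "salted_hash":
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    raise ValueError(f"unknown masking strategy: {strategy}")


# Hypothetical field list and partial classification.
fields = ["uid", "sid", "ts"]
classification = {"uid": "high", "sid": "moderate"}

print(unclassified_fields(fields, classification))
print(mask_value("user-42", "hash"))
```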
IX. ProvenanceSpec (Provenance and Fingerprints)
- trace : list[str] (required)
  Example: ["sensor.S1","method.integrate_path","artifact.T_arr_v1.parquet"].
- checksum : dict (required)
  Example: {"algo":"sha256","field":"hash_sha256"}.
- signature : dict (required)
  Example: {"keyref":"key://k1","field":"signature"}.
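One way to populate and verify the checksum field is sketched below. The canonicalization (sorted-key compact JSON, with the hash_sha256 and signature columns excluded from the digest) is an assumption of this example; the specification fixes only the algorithm name and the target field.

```python
import hashlib
import json


def row_checksum(row: dict, exclude=("hash_sha256", "signature")) -> str:
    """SHA-256 over a canonical JSON form of the row (canonical form is assumed)."""
    payload = {k: v for k, v in row.items() if k not in exclude}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_row(row: dict) -> bool:
    """True when the stored checksum matches the recomputed one."""
    return row.get("hash_sha256") == row_checksum(row)


row = {"pid": "p1", "seg_id": 0, "T_arr_const": 1.0e-6}
row["hash_sha256"] = row_checksum(row)
print(verify_row(row))

# Any later change to a data column invalidates the stored checksum.
tampered = dict(row, seg_id=1)
print(verify_row(tampered))
```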
X. QualityGateSpec (Quality Gates)
- q_score_min : float (required, in [0,1])
- delta_form_max : str (required; time text with unit, e.g. "1e-9 s")
- completeness_min : float (required)
- drift_method : str (required, e.g. "KL")
- drift_max : float (required)
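The gate fields above translate directly into threshold comparisons. The following sketch assumes a dataset-level metrics dictionary with the hypothetical keys q_score, delta_form (in seconds), completeness, and drift; how those metrics are aggregated is outside this example, and only the unit "s" is parsed.

```python
def parse_seconds(text: str) -> float:
    """Parse time text such as "1e-9 s"; only the unit "s" is accepted here."""
    value, unit = text.split()
    if unit != "s":
        raise ValueError(f"unsupported time unit: {unit}")
    return float(value)


def failed_gates(metrics: dict, gates: dict) -> list[str]:
    """Return the names of the gates that fail; an empty list means publishable."""
    failed = []
    if metrics["q_score"] < gates["q_score_min"]:
        failed.append("q_score_min")
    if metrics["delta_form"] > parse_seconds(gates["delta_form_max"]):
        failed.append("delta_form_max")
    if metrics["completeness"] < gates["completeness_min"]:
        failed.append("completeness_min")
    if metrics["drift"] > gates["drift_max"]:
        failed.append("drift_max")
    return failed


gates = {"q_score_min": 0.80, "delta_form_max": "1e-9 s",
         "completeness_min": 0.98, "drift_method": "KL", "drift_max": 0.02}
healthy = {"q_score": 0.91, "delta_form": 2e-10, "completeness": 0.99, "drift": 0.01}
degraded = dict(healthy, q_score=0.5, delta_form=5e-9)

print(failed_gates(healthy, gates))
print(failed_gates(degraded, gates))
```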
XI. Unit and Dimension Mapping Rules
- If equations involve T_arr, both forms must be declared and validated:
  - T_arr_const = ( 1 / c_ref_value ) * ( ∫_gamma n_eff d ell ).
  - T_arr_integrand = ( ∫_gamma ( n_eff / c_ref_value ) d ell ).
- dim(n_eff) = 1, dim(c_ref) = L/T, dim( ∫_gamma · d ell ) = L, hence dim(T_arr_*) = T.
- delta_form = | T_arr_const - T_arr_integrand |, with unit "s".
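The two forms and delta_form can be compared numerically. In the sketch below, the path samples, the linear n_eff profile, and the numeric value used for c_ref_value are all hypothetical inputs chosen for illustration, and the integral is approximated with the composite trapezoidal rule. When c_ref_value is constant along the path the two forms are mathematically identical, so delta_form here reflects floating-point rounding only.

```python
def trapezoid(ys, xs):
    """Composite trapezoidal rule for samples ys taken at positions xs."""
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total


c_ref_value = 299_792_458.0                        # m/s, illustrative value
ell = [i * 10.0 for i in range(101)]               # path coordinate, 0..1000 m
n_eff = [1.0 + 1e-4 * (x / 1000.0) for x in ell]   # hypothetical index profile

# Form 1: constant pulled out of the integral.
T_arr_const = (1.0 / c_ref_value) * trapezoid(n_eff, ell)
# Form 2: constant kept inside the integrand.
T_arr_integrand = trapezoid([n / c_ref_value for n in n_eff], ell)
delta_form = abs(T_arr_const - T_arr_integrand)

print(T_arr_const, T_arr_integrand, delta_form)
```

A nonzero delta_form well above rounding level would point to an implementation difference between the two code paths, for example c_ref_value varying per segment in one form but not the other.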
XII. Registration Example (YAML, Condensed but Usable)
name: DS.TARR.PathIntegral
version: "1.0"
title: Arrival-time along path integrals
description: Arrival time T_arr computed along gamma(ell) with dual-form check.
fields:
- { name: pid, type: string, unit: null, dim: null, nullable: false, pii_level: "none", desc: "path id" }
- { name: seg_id, type: int32, unit: null, dim: null, nullable: false, pii_level: "none", desc: "segment id" }
- { name: ts, type: timestamp(UTC), unit: "s", dim: "T", nullable: false, pii_level: "none", desc: "UTC time" }
- { name: CRS, type: string, unit: null, dim: null, nullable: false, pii_level: "none", desc: "coord ref sys" }
- { name: ell_start, type: float64, unit: "m", dim: "L", nullable: false, pii_level: "none", desc: "path coord start" }
- { name: ell_end, type: float64, unit: "m", dim: "L", nullable: false, pii_level: "none", desc: "path coord end" }
- { name: n_eff_mean, type: float64, unit: "1", dim: "1", nullable: false, pii_level: "none", desc: "mean effective index" }
- { name: c_ref_ref, type: string, unit: null, dim: null, nullable: false, pii_level: "none", desc: "parameter ref" }
- { name: c_ref_value, type: float64, unit: "m/s", dim: "L T^-1", nullable: false, pii_level: "none", desc: "resolved c_ref" }
- { name: T_arr_const, type: float64, unit: "s", dim: "T", nullable: false, pii_level: "none", desc: "const-pulled form" }
- { name: T_arr_integrand, type: float64, unit: "s", dim: "T", nullable: false, pii_level: "none", desc: "general integrand form" }
- { name: delta_form, type: float64, unit: "s", dim: "T", nullable: false, pii_level: "none", desc: "dual-form gap" }
- { name: q_score, type: float64, unit: "1", dim: "1", nullable: false, pii_level: "none", desc: "quality score" }
- { name: hash_sha256, type: string, unit: null, dim: null, nullable: false, pii_level: "none", desc: "checksum" }
- { name: signature, type: string, unit: null, dim: null, nullable: true, pii_level: "none", desc: "signature" }
pk: ["pid","seg_id"]
idx:
- { keys: ["ts"], kind: "btree", unique: false, desc: "time scan" }
- { keys: ["pid","seg_id"], kind: "btree", unique: true, desc: "segment lookup" }
constraints:
- { kind: "unique", expr: "unique(pid,seg_id)", severity: "ERROR", message: "pk must be unique" }
- { kind: "monotonic", expr: "ell_end >= ell_start", severity: "ERROR", message: "ell non-decreasing" }
- { kind: "dim_check", expr: "check_dim(T_arr_const)=='T'", severity: "ERROR", message: "dim(T_arr_const)=T" }
- { kind: "dim_check", expr: "check_dim(T_arr_integrand)=='T'", severity: "ERROR", message: "dim(T_arr_integrand)=T" }
- { kind: "arrivaltime_dualform", expr: "delta_form <= tol_Tarr", params: { tol_Tarr: "1e-9 s" }, severity: "WARN", message: "dual form mismatch" }
equations: ["S610-1","S610-2"]
parameters: ["c_ref_ref","n_eff_model_ref"]
governance:
  owner: "team.eft-data"
  steward: "user:alice"
  retention_days: 3650
  sla: { freshness_max: "P1D", availability_target: "99.9%" }
  release: { freeze_policy: "immutable", signing_key: "key://k1" }
privacy:
  classification: { pid: "none", seg_id: "none", ts: "none", CRS: "none" }
  anonymization: { }
  masking: { }
  exceptions: [ ]
provenance:
  trace: ["sensor.S1","method.integrate_path","artifact.T_arr_v1.parquet"]
  checksum: { algo: "sha256", field: "hash_sha256" }
  signature: { keyref: "key://k1", field: "signature" }
quality_gates:
  q_score_min: 0.80
  delta_form_max: "1e-9 s"
  completeness_min: 0.98
  drift_method: "KL"
  drift_max: 0.02
see: ["Core.Equations §S610","Core.Parameters §P3x","Core.Metrology §Mx-?","Core.Errors §I50"]
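XIII. Registration and Export (I60 Integration)
- register_schema(name:str, version:str, fields:list[dict], constraints:list[str], units:dict, pk:list[str], idx:list[list[str]], see:list[str]) -> SRef
  Key points: fields must include pii_level, unit, dim, and nullable; constraints must include unique(pk) and the key physical constraints.
- export_schema(SRef, format:str="yaml") -> str
  The output is equivalent to the YAML in Section XII, with a lossless round trip guaranteed.
- register_field(...) -> FRef
  Prefer reusing entries from the field dictionary to keep fields consistent across schemas.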
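XIV. Validation Checklist (Before Release)
- Name and version: name conforms to the naming rule; version is semantic and not already taken.
- Keys and indexes: pk ⊆ fields, and unique(pk) can be verified by validate_dataset.
- Units and dimensions: units/dims agree with equations, and every check_dim(expr) passes.
- Contract coverage: constraints cover uniqueness, non-null, range, regex, cross-field, monotonicity, dimension, and dual-form consistency.
- Privacy and governance: pii_level is assigned for every field; retention_days and release.freeze_policy are in place.
- Provenance: the checksum, signature, and trace fields are all present and can be verified on a sample set.
- Interoperability: parameters and equations resolve correctly through bind_to_parameters and bind_to_equations.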
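XV. Common Errors and Remedies (Linked to Core.Errors)
- E.SCHEMA.NAME.INVALID: name does not match the regex → correct the name and retry.
- E.SCHEMA.VERSION.CONFLICT: duplicate version → register after bump_version.
- E.SCHEMA.FIELD.DIM.MISMATCH: unit/dim disagrees with check_dim → fix the mapping or the equation reference.
- E.SCHEMA.CONSTRAINT.UNCOVERED: a key contract is missing (e.g., dual-form consistency) → add the ConstraintSpec.
- E.SCHEMA.PRIVACY.UNCLASSIFIED: unclassified fields exist → fill in pii_level and review.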
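The error codes and remedies above map naturally onto a single exception type. The class below is an illustrative sketch; SchemaError and the REMEDIES table are hypothetical names, and Core.Errors may define a different structure.

```python
# Remedy text paraphrased from the list above.
REMEDIES = {
    "E.SCHEMA.NAME.INVALID": "correct the name and retry",
    "E.SCHEMA.VERSION.CONFLICT": "run bump_version, then register",
    "E.SCHEMA.FIELD.DIM.MISMATCH": "fix the unit/dim mapping or the equation reference",
    "E.SCHEMA.CONSTRAINT.UNCOVERED": "add the missing ConstraintSpec",
    "E.SCHEMA.PRIVACY.UNCLASSIFIED": "fill in pii_level and review",
}


class SchemaError(Exception):
    """Carries one of the error codes above together with its suggested remedy."""

    def __init__(self, code: str, detail: str = ""):
        if code not in REMEDIES:
            raise KeyError(f"unknown error code: {code}")
        self.code = code
        self.remedy = REMEDIES[code]
        super().__init__(f"{code}: {detail} (remedy: {self.remedy})")


try:
    raise SchemaError("E.SCHEMA.VERSION.CONFLICT", "1.0 already registered")
except SchemaError as err:
    message = str(err)

print(message)
```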
XVI. Dedicated Constraints for the Two Arrival-Time Forms
- Definitions:
  - T_arr_const = ( 1 / c_ref_value ) * ( ∫_gamma n_eff d ell ).
  - T_arr_integrand = ( ∫_gamma ( n_eff / c_ref_value ) d ell ).
  - delta_form = | T_arr_const - T_arr_integrand |.
- Contract:
  - kind="arrivaltime_dualform", expr="delta_form <= tol_Tarr", params={"tol_Tarr":"<time>"}.
- Release gate: delta_form_max is written into quality_gates and enforced in assert_contract.
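XVII. Compatibility and Change-Log Fragment (for Release Records)
- change_log : list[ChangeSpec] (optional)
  ChangeSpec = { since:"1.0", type:"add|modify|deprecate|remove", path:"fields.T_arr_const", note:"..." }.
- Breaking changes (such as a pk change or a field removal) require major+1 and must point to a migration guide in see.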
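The major+1 rule for breaking changes can be derived from a change_log mechanically. In the sketch below, treating every "remove" entry and every change whose path starts with pk as breaking follows the rule above; bumping MINOR for all other changes and dropping PATCH are assumptions of this example, and next_version is a hypothetical helper name.

```python
def is_breaking(change: dict) -> bool:
    """A removal, or any change touching pk, counts as breaking in this sketch."""
    return change["type"] == "remove" or change["path"].split(".")[0] == "pk"


def next_version(current: str, change_log: list[dict]) -> str:
    """Return MAJOR+1.0 for breaking changes, otherwise MAJOR.MINOR+1 (assumed)."""
    parts = [int(p) for p in current.split(".")]
    major, minor = parts[0], parts[1]
    if any(is_breaking(c) for c in change_log):
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


additive = [{"since": "1.0", "type": "add", "path": "fields.note", "note": "..."}]
removal = [{"since": "1.3", "type": "remove", "path": "fields.T_arr_const", "note": "..."}]
pk_change = [{"since": "1.0", "type": "modify", "path": "pk", "note": "..."}]

print(next_version("1.0", additive))
print(next_version("1.3", removal))
print(next_version("1.0", pk_change))
```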
XVIII. Minimal Viable Template (YAML, Placeholders)
name: DS.<DOMAIN>.<Subject>
version: "X.Y"
title: <human-readable title>
description: <what this dataset is for>
fields: [ { name: <f>, type: <t>, unit: <u|null>, dim: <d|null>, nullable: <bool>, pii_level: <level>, desc: <text> }, ... ]
pk: [ <field1>, <field2> ]
idx: [ { keys: [<f1>,<f2>], kind: "btree", unique: false } ]
constraints: [ { kind: "unique", expr: "unique(<k1>,<k2>)", severity: "ERROR", message: "pk unique" } ]
equations: [ ]
parameters: [ ]
units: { }
dims: { }
governance: { owner: "<team>", steward: "<user>", retention_days: <int>, sla: { freshness_max: "P?D", availability_target: "99.9%" }, release: { freeze_policy: "immutable" } }
privacy: { classification: { }, anonymization: { }, masking: { }, exceptions: [ ] }
provenance: { trace: [ ], checksum: { algo: "sha256", field: "hash_sha256" }, signature: { keyref: "key://...", field: "signature" } }
quality_gates: { q_score_min: 0.8, delta_form_max: "1e-9 s", completeness_min: 0.98, drift_method: "KL", drift_max: 0.02 }
see: [ ]
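XIX. Summary
- The schema registry is built around SRef and mandates a five-part set: primary key, units/dimensions, contracts, privacy, and governance.
- For cross-volume quantities such as T_arr, it builds in the dual-form consistency constraint and the release gate, ensuring end-to-end consistency and traceability across metrology, equations, and data implementation.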
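Copyright and License (CC BY 4.0)
Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author, Mr. 屠广林.
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). With attribution to the author and source, it may be copied, reposted, excerpted, adapted, and redistributed for commercial or non-commercial purposes.
Attribution format (suggested): Author: 屠广林; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/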