Contents / Docs: Technical White Papers / 43-EFT.WP.Data.DatasetCards v1.0
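I. Chapter Purpose and Scope
This chapter provides the normative JSON Schema and Lint rule set for dataset cards, covering structure, types, regular expressions, dependencies, reference anchors, metrology checks, leakage prevention, and compliance modules. The resulting artifacts can be used directly for pre-release blocking checks and automated portal validation. All key names use snake_case; cross-volume references take the form "volume name + version + anchor".
II. Terminology and Dependencies
- Terminology source: follows EFT.WP.Core.Terms v1.0; this chapter only adds incremental definitions for Schema- and Lint-related fields.
- Dependent volumes: data contract and export: Core.DataSpec v1.0; metrology, dimensions, and uncertainty: Core.Metrology v1.0; arrival-time path quantities: Core.Equations v1.1; references and version carrying: Citation and Cross-Reference Specification v0.1 (《引用与交叉引用规范 v0.1》).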
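III. Normative Artifacts (Required for Release)
artifacts:
  - path: "schema/dataset_card.schema.json"  # normative JSON Schema
  - path: "schema/lint_rules.yaml"           # normative Lint rules (extensible)
  - path: "schema/examples/minimal.yaml"     # minimal working example
  - path: "schema/examples/full.yaml"        # example with every field
(All artifacts must be registered in export_manifest.artifacts[] with a sha256 digest; reference anchors are written the same way as in the preceding volumes.)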
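IV. Normative JSON Schema (Core Excerpt)
(In the Schema, references[] is constrained by a regular expression for the form "volume name vX.Y:anchor", so that cross-volume references are machine-readable. Metrology and unit conventions follow Core.Metrology v1.0.)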
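The reference constraint described above can be sketched in Python as follows. This is only an illustration, not the normative Schema: the names REFERENCE_PATTERN and check_references are hypothetical, and the pattern is an assumption that accepts anchors beginning with either letter case, so that references used in this chapter such as "EFT.WP.Core.Metrology v1.0:check_dim" pass.

```python
import re

# Assumed pattern for "volume name vX.Y:anchor" (illustrative, not the normative regex).
REFERENCE_PATTERN = re.compile(r"^[^:]+ v\d+\.\d+:[A-Za-z].+$")


def check_references(references):
    """Return the entries that do not match the assumed reference pattern."""
    return [ref for ref in references if not REFERENCE_PATTERN.match(ref)]


refs = [
    "EFT.WP.Core.DataSpec v1.0:EXPORT",
    "EFT.WP.Core.Metrology v1.0:check_dim",
    "EFT.WP.Core.DataSpec:EXPORT",  # version is missing, so this entry is rejected
]
bad = check_references(refs)
print(bad)
```

Only the entry without a version is reported; the other two carry a volume name, a vX.Y version, and an anchor.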
V. Lint Rules (Normative)
version: "v1.0"
rules:
  # Structure and types
  - id: STRUCT.REQUIRED
    when: "$"
    assert: "has_keys(dataset_id,title,version,summary,modality,sources,license,access,provenance,splits,checksums,metrology,quality,export_manifest)"
    level: error
  - id: VERSION.SEMVER
    when: "$.version"
    assert: "matches('^v\\d+\\.\\d+(\\.\\d+)?$')"
    level: error
  # Split ratios and leakage
  - id: SPLIT.RATIO_SUM
    when: "$.splits"
    assert: "abs((train.ratio + validation.ratio + test.ratio) - 1) <= 1e-6"
    level: error
  - id: SPLIT.LEAKAGE_FORBID
    when: "$.splits.policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow'])"
    level: error
  # Metrology and units
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
    see: ["EFT.WP.Core.Metrology v1.0:check_dim"]
  # References and anchors
  - id: REFERENCES.FORMAT
    when: "$.export_manifest.references[*]"
    assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Za-z].+$')"  # anchors such as check_dim start in lowercase
    level: error
    see: ["EFT 引用与交叉引用规范 v0.1:P/S/M/I-*"]
  # Arrival-time path dependence (if present)
  - id: PATH.TARR_FIELDS
    when: "$.path_dependence"
    assert: "has_keys(applies_to,delta_form,path,measure)"
    level: error
    see: ["EFT.WP.Core.Equations v1.1:S20-1"]
  # Text and notation guardrails
  - id: MATH.NO_CHINESE
    when: "$"
    assert: "no_chinese_in_math()"
    level: warn
  - id: SYMBOLS.CONFLICT
    when: "$"
    assert: "not_mixed(['T_fil','T_trans']) and not_mixed(['n','n_eff'])"
    level: error
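(The rule set above blocks structural errors, reference-format violations, and metrology violations before release; the text-guardrail warnings can be escalated to blocking.)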
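As a rough illustration of how four of these rules might be evaluated, the sketch below hand-codes STRUCT.REQUIRED, VERSION.SEMVER, SPLIT.RATIO_SUM, and METROLOGY.SI_AND_CHECKDIM in Python. It does not interpret the assert expressions of lint_rules.yaml. The function name lint_subset and the (rule_id, level) tuple output are assumptions made for this example, as is the choice to count a missing split ratio as 0.

```python
import re

REQUIRED_KEYS = (
    "dataset_id", "title", "version", "summary", "modality", "sources",
    "license", "access", "provenance", "splits", "checksums",
    "metrology", "quality", "export_manifest",
)


def lint_subset(card):
    """Check four of the rules above; return a list of (rule_id, level) violations."""
    findings = []
    # STRUCT.REQUIRED: every required top-level key must be present.
    if any(key not in card for key in REQUIRED_KEYS):
        findings.append(("STRUCT.REQUIRED", "error"))
    # VERSION.SEMVER: evaluated only when $.version exists.
    if "version" in card and not re.match(r"^v\d+\.\d+(\.\d+)?$", str(card["version"])):
        findings.append(("VERSION.SEMVER", "error"))
    # SPLIT.RATIO_SUM: the three ratios must sum to 1 within 1e-6.
    # Assumption: a split without a ratio contributes 0.
    if "splits" in card:
        splits = card["splits"]
        total = sum(splits.get(name, {}).get("ratio", 0.0)
                    for name in ("train", "validation", "test"))
        if abs(total - 1) > 1e-6:
            findings.append(("SPLIT.RATIO_SUM", "error"))
    # METROLOGY.SI_AND_CHECKDIM: units must be "SI" and check_dim must be true.
    if "metrology" in card:
        metrology = card["metrology"]
        if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
            findings.append(("METROLOGY.SI_AND_CHECKDIM", "error"))
    return findings
```

The 1e-6 tolerance matters in practice because ratios such as 0.8, 0.1, and 0.1 are not guaranteed to sum to exactly 1 in floating-point arithmetic.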
VI. Validation and Execution Interface (implementation binding Ixx-?)
# I15-2 (interface prototype)
def validate_card(card: dict) -> dict: ...
def lint_card(card: dict, rules: dict) -> dict: ...
def check_units(card: dict) -> dict: ...  # uses Core.Metrology v1.0:check_dim
def verify_references(card: dict) -> dict: ...  # regex + anchor reachability
def export_manifest(card: dict) -> dict: ...  # includes version & references[]
(Every interface returns the same {"ok": bool, "errors":[...], "warnings":[...], "metrics":{...}} structure, which simplifies portal and CI integration.)
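One way a caller might build and combine results of this shape is sketched below. The helpers make_result and merge_results are hypothetical and not part of the I15-2 prototype. Setting "ok" to true exactly when the error list is empty is an assumption, since the text does not define it.

```python
def make_result(errors=None, warnings=None, metrics=None):
    """Build the unified result dict; 'ok' is assumed to mean 'no errors'."""
    errors = list(errors or [])
    return {
        "ok": not errors,
        "errors": errors,
        "warnings": list(warnings or []),
        "metrics": dict(metrics or {}),
    }


def merge_results(*results):
    """Combine several results into one, e.g. for a single CI verdict."""
    errors = [item for result in results for item in result["errors"]]
    warnings = [item for result in results for item in result["warnings"]]
    metrics = {}
    for result in results:
        metrics.update(result["metrics"])  # later results win on key clashes
    return make_result(errors, warnings, metrics)
```

With a uniform shape, a CI job only needs to inspect the merged "ok" flag to decide whether to block a release.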
VII. Typical Failure Examples and Diagnostics (Excerpt)
fail_examples:
  - case: "missing references version"
    input: {export_manifest: {references: ["EFT.WP.Core.DataSpec:EXPORT"]}}
    expect: {rule: "REFERENCES.FORMAT", level: "error"}
  - case: "ratio sum not 1"
    input: {splits: {train: {ratio: 0.7}, validation: {ratio: 0.2}, test: {ratio: 0.2}}}
    expect: {rule: "SPLIT.RATIO_SUM", level: "error"}
  - case: "metrology not SI"
    input: {metrology: {units: "CGS", check_dim: false}}
    expect: {rule: "METROLOGY.SI_AND_CHECKDIM", level: "error"}
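(Every error must include a location path and a suggested fix; reference errors must show the non-compliant string alongside a compliant example.)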
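A diagnostic record that meets this requirement for a reference error could look like the sketch below. The field names (path, found, suggestion, example) and the functions reference_diagnostic and render are assumptions for illustration; the chapter does not fix a diagnostic format.

```python
def reference_diagnostic(index, offending):
    """Build a diagnostic for a references[] entry that failed REFERENCES.FORMAT."""
    return {
        "rule": "REFERENCES.FORMAT",
        "level": "error",
        "path": f"$.export_manifest.references[{index}]",  # location path
        "found": offending,                                 # non-compliant string
        "suggestion": "use the form '<volume name> vX.Y:<anchor>'",
        "example": "EFT.WP.Core.DataSpec v1.0:EXPORT",      # compliant example
    }


def render(diag):
    """Format a diagnostic as one line for CI logs."""
    return (f"{diag['level'].upper()} {diag['rule']} at {diag['path']}: "
            f"found {diag['found']!r}; {diag['suggestion']}, "
            f"e.g. {diag['example']!r}")


diag = reference_diagnostic(0, "EFT.WP.Core.DataSpec:EXPORT")
print(render(diag))
```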
VIII. Coupling with the Export Manifest (Normative)
export_manifest:
  artifacts:
    - {path: "schema/dataset_card.schema.json", sha256: "..."}
    - {path: "schema/lint_rules.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
(As release-blocking artifacts, the Schema and Lint rules must be listed in the export manifest and be verifiable.)
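Verifying the listed artifacts could be done along the lines of the following sketch, which recomputes each sha256 digest with Python's standard hashlib module and compares it to the manifest entry. The function names sha256_of and verify_artifacts, and the root parameter, are hypothetical. Treating a missing file as a failure is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(manifest, root="."):
    """Return the paths of artifacts that are missing or whose digest differs."""
    failed = []
    for entry in manifest.get("artifacts", []):
        target = Path(root) / entry["path"]
        if not target.is_file() or sha256_of(target) != entry.get("sha256"):
            failed.append(entry["path"])
    return failed
```

An empty return value means every registered artifact exists under root and matches its recorded digest.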
IX. Minimal Working Example (Ready to Save to Disk)
dataset_id: "eift.obs.demo"
title: "EIFT Demo Dataset"
version: "v1.0"
summary: "Demo card with minimal required fields for validation..."
modality: ["time_series"]
sources: ["doi:10.1234/demo"]
license: "CC-BY-4.0"
access: "open"
provenance:
  collection_method: "simulation"
  time_coverage: "2024-01-01..2024-12-31"
splits:
  train: {count: 800, ratio: 0.8}
  validation: {count: 100, ratio: 0.1}
  test: {count: 100, ratio: 0.1}
checksums: {}
metrology: {units: "SI", c_ref: 299792458, check_dim: true, angle_unit: "deg"}
quality:
  gates:
    - {name: "leakage", metric: "leakage_rate", threshold: 0.0}
export_manifest:
  version: "v1.0"
  artifacts: []
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
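(A minimal card that passes this chapter's Schema and Lint rules.)
X. Chapter Compliance Self-Check
- dataset_card.schema.json and lint_rules.yaml have been generated, registered, and given sha256 digests; portal/CI validation is reproducible.
- The Schema enforces the "volume name vX.Y:anchor" form for export_manifest.references[]; Lint blocks references that use short codes, lack a version, or lack an anchor.
- Metrology fields have units="SI" and check_dim=true; cards involving T_arr include delta_form/path/measure and pass the equivalent-expression consistency check.
- Split ratios and leakage rules are strictly enforced by Lint; conflict detection guards against mixing T_fil/T_trans or n/n_eff.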
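Copyright and License (CC BY 4.0)
Copyright notice: Unless otherwise stated, the copyright in 《能量丝理论》 (Energy Filament Theory), including its text, figures and tables, illustrations, symbols, and formulas, belongs to the author (Mr. 屠广林).
License: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). With attribution to the author and source, it may be copied, reposted, excerpted, adapted, and redistributed for commercial or non-commercial purposes.
Suggested attribution: Author: 屠广林; Work: 《能量丝理论》 (Energy Filament Theory); Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/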