Catalog Documents - Technical White Paper 43 - EFT.WP.Data.DatasetCards v1.0

Chapter 15: Machine-Readable Schema and Lint


I. Purpose and Scope

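This chapter provides the normative JSON Schema and Lint rule set for dataset cards, covering structure, types, regular expressions, dependencies, reference anchors, metrology checks, leakage prevention, and compliance modules. The artifacts can be used directly for pre-release blocking checks and for automated portal validation. All key names use snake_case; cross-volume references take the form "volume name + version + anchor".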
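The snake_case convention can be checked mechanically. The following is a minimal sketch, assuming a card has already been loaded as nested dicts and lists; the helper name `non_snake_case_keys` and the exact regular expression are illustrative assumptions, not part of the normative schema.

```python
import re

# Assumed definition of snake_case: lowercase words of letters/digits
# joined by single underscores, starting with a letter.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_keys(node, path="$"):
    """Return JSONPath-like locations of all keys that are not snake_case."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if not SNAKE_CASE.match(str(key)):
                bad.append(child)
            # Recurse so nested mappings are covered as well.
            bad.extend(non_snake_case_keys(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            bad.extend(non_snake_case_keys(item, f"{path}[{index}]"))
    return bad
```

An empty return value means every key in the card follows the convention.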

II. Terms and Dependencies


III. Normative Artifacts (Required for Release)

artifacts:
  - path: "schema/dataset_card.schema.json"   # normative JSON Schema
  - path: "schema/lint_rules.yaml"            # normative Lint rules (extensible)
  - path: "schema/examples/minimal.yaml"      # minimal working example
  - path: "schema/examples/full.yaml"         # full-field example

(All artifacts must be registered in export_manifest.artifacts[] with a sha256 attached; reference anchors are written as in the preceding volumes.)
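Registering artifacts with their sha256 can be automated. The sketch below assumes the artifacts live under a local root directory and that each entry only needs `path` and `sha256`; the function names `artifact_entry` and `build_artifacts` are hypothetical helpers introduced for illustration.

```python
import hashlib
from pathlib import Path


def artifact_entry(root: Path, rel_path: str) -> dict:
    """Build one export_manifest.artifacts[] entry for a file under root."""
    digest = hashlib.sha256()
    with open(root / rel_path, "rb") as handle:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {"path": rel_path, "sha256": digest.hexdigest()}


def build_artifacts(root: Path, rel_paths: list[str]) -> list[dict]:
    """Build entries for every listed artifact, in the order given."""
    return [artifact_entry(root, rel_path) for rel_path in rel_paths]
```

A call such as `build_artifacts(Path("."), ["schema/lint_rules.yaml"])` would yield a list that can be written into the manifest as is.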


IV. Normative JSON Schema (Core Excerpt)

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://eift.org/schema/dataset_card.schema.json",
  "title": "EFT Dataset Card",
  "type": "object",
  "required": [
    "dataset_id",
    "title",
    "version",
    "summary",
    "modality",
    "sources",
    "license",
    "access",
    "provenance",
    "splits",
    "checksums",
    "metrology",
    "quality",
    "export_manifest"
  ],
  "properties": {
    "dataset_id": { "type": "string", "pattern": "^[a-z0-9_\\-\\.]+$" },
    "title": { "type": "string", "minLength": 3 },
    "version": { "type": "string", "pattern": "^v\\d+\\.\\d+(\\.\\d+)?$" },
    "summary": { "type": "string", "minLength": 100, "maxLength": 600 },
    "modality": {
      "type": "array",
      "minItems": 1,
      "items": { "type": "string", "enum": [ "radio", "optical", "image", "time_series", "text", "tabular" ] }
    },
    "sources": { "type": "array", "minItems": 1, "items": { "type": "string" } },
    "license": { "type": "string" },
    "access": { "type": "string", "enum": [ "open", "restricted", "closed" ] },
    "provenance": {
      "type": "object",
      "required": [ "collection_method", "time_coverage" ],
      "properties": {
        "collection_method": { "type": "string" },
        "instruments": { "type": "array", "items": { "type": "object" } },
        "time_coverage": { "type": "string" },
        "spatial_coverage": { "type": "string" },
        "selection_bias": { "type": "string" }
      }
    },
    "splits": {
      "type": "object",
      "required": [ "train", "validation", "test" ],
      "properties": {
        "train": { "type": "object", "required": [ "count", "ratio" ] },
        "validation": { "type": "object", "required": [ "count", "ratio" ] },
        "test": { "type": "object", "required": [ "count", "ratio" ] },
        "policy": { "type": "object" },
        "audit": { "type": "object" }
      }
    },
    "checksums": { "type": "object" },
    "metrology": {
      "type": "object",
      "required": [ "units", "c_ref", "check_dim" ],
      "properties": {
        "units": { "type": "string", "const": "SI" },
        "c_ref": { "type": "number" },
        "check_dim": { "type": "boolean", "const": true },
        "time_standard": { "type": "string" },
        "angle_unit": { "type": "string", "enum": [ "deg", "rad" ] }
      }
    },
    "quality": { "type": "object" },
    "export_manifest": {
      "type": "object",
      "required": [ "version", "artifacts", "references" ],
      "properties": {
        "version": { "type": "string" },
        "artifacts": { "type": "array", "items": { "type": "object" } },
        "references": {
          "type": "array",
          "minItems": 1,
          "items": { "type": "string", "pattern": "^[^:]+ v\\d+\\.\\d+:[A-Z].+$" }
        }
      }
    },
    "labels": { "type": "object" },
    "uncertainty": { "type": "object" },
    "privacy": { "type": "object" },
    "ethics": { "type": "object" }
  },
  "additionalProperties": false
}

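(The Schema applies a "volume name vX.Y:anchor" regular expression to references[] so that cross-volume references stay machine-readable. Metrology and unit conventions follow Core.Metrology v1.0.)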
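To show how the top-level constraints behave, here is a minimal standard-library sketch that checks only `required`, top-level string `pattern` values, and `additionalProperties: false`. It is an assumption-laden illustration, not a substitute for a full JSON Schema validator; the function name `check_top_level` is hypothetical, and Python's `re` is used as a stand-in for the ECMA-262 dialect that JSON Schema specifies.

```python
import re


def check_top_level(card: dict, schema: dict) -> list[str]:
    """Check required keys, unexpected keys, and top-level string patterns only."""
    problems = [
        f"missing required key: {key}"
        for key in schema.get("required", [])
        if key not in card
    ]
    properties = schema.get("properties", {})
    if schema.get("additionalProperties") is False:
        problems += [f"unexpected key: {key}" for key in card if key not in properties]
    for key, spec in properties.items():
        value = card.get(key)
        # JSON Schema patterns are unanchored, hence re.search rather than re.fullmatch.
        if isinstance(value, str) and "pattern" in spec:
            if not re.search(spec["pattern"], value):
                problems.append(f"pattern mismatch at $.{key}: {value!r}")
    return problems
```

Nested objects such as `export_manifest.references[]` are deliberately out of scope here; a complete validator would walk the schema recursively.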


V. Lint Rules (Normative)

version: "v1.0"
rules:
  # Structure and types
  - id: STRUCT.REQUIRED
    when: "$"
    assert: "has_keys(dataset_id,title,version,summary,modality,sources,license,access,provenance,splits,checksums,metrology,quality,export_manifest)"
    level: error
  - id: VERSION.SEMVER
    when: "$.version"
    assert: "matches('^v\\d+\\.\\d+(\\.\\d+)?$')"
    level: error

  # Ratios and leakage
  - id: SPLIT.RATIO_SUM
    when: "$.splits"
    assert: "abs((train.ratio + validation.ratio + test.ratio) - 1) <= 1e-6"
    level: error
  - id: SPLIT.LEAKAGE_FORBID
    when: "$.splits.policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow'])"
    level: error

  # Metrology and units
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
    see: ["EFT.WP.Core.Metrology v1.0:check_dim"]

  # References and anchors
  - id: REFERENCES.FORMAT
    when: "$.export_manifest.references[*]"
    assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Z].+$')"
    level: error
    see: ["EFT 引用与交叉引用规范 v0.1:P/S/M/I-*"]

  # Arrival-time path dependence (if present)
  - id: PATH.TARR_FIELDS
    when: "$.path_dependence"
    assert: "has_keys(applies_to,delta_form,path,measure)"
    level: error
    see: ["EFT.WP.Core.Equations v1.1:S20-1"]

  # Text and notation guardrails
  - id: MATH.NO_CHINESE
    when: "$"
    assert: "no_chinese_in_math()"
    level: warn
  - id: SYMBOLS.CONFLICT
    when: "$"
    assert: "not_mixed(['T_fil','T_trans']) and not_mixed(['n','n_eff'])"
    level: error

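(This rule set blocks structural errors, reference-format violations, and metrology violations before release; text-guardrail warnings may be escalated to blocking items.)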
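Three of the rules above can be sketched as plain Python checks. The rule expression language itself (has_keys, contains_any, and so on) is not defined in this chapter, so the functions below hand-code the assertions rather than interpret them. Assumptions: `leakage_guard` may be a string or a list and is matched by exact membership, and each finding is a dict with `rule`, `level`, `path`, and `message`. All function names are hypothetical.

```python
def _finding(rule: str, path: str, message: str, level: str = "error") -> dict:
    return {"rule": rule, "level": level, "path": path, "message": message}


def check_split_ratio_sum(card: dict, tol: float = 1e-6) -> list[dict]:
    splits = card.get("splits")
    if not isinstance(splits, dict):
        return []  # the rule only fires when $.splits exists
    try:
        total = sum(splits[name]["ratio"] for name in ("train", "validation", "test"))
    except (KeyError, TypeError):
        return [_finding("SPLIT.RATIO_SUM", "$.splits", "train/validation/test ratio missing")]
    if abs(total - 1) <= tol:
        return []
    return [_finding("SPLIT.RATIO_SUM", "$.splits", f"ratios sum to {total:.6g}, expected 1")]


def check_leakage_guard(card: dict) -> list[dict]:
    splits = card.get("splits")
    policy = splits.get("policy") if isinstance(splits, dict) else None
    if not isinstance(policy, dict) or "leakage_guard" not in policy:
        return []  # the rule only fires when the guard is declared
    guard = policy["leakage_guard"]
    values = guard if isinstance(guard, (list, tuple)) else [guard]
    if any(value in ("per-object", "per-timewindow") for value in values):
        return []
    return [_finding("SPLIT.LEAKAGE_FORBID", "$.splits.policy.leakage_guard",
                     "expected 'per-object' or 'per-timewindow'")]


def check_metrology(card: dict) -> list[dict]:
    metrology = card.get("metrology")
    if not isinstance(metrology, dict):
        return []
    if metrology.get("units") == "SI" and metrology.get("check_dim") is True:
        return []
    return [_finding("METROLOGY.SI_AND_CHECKDIM", "$.metrology",
                     "units must be 'SI' and check_dim must be true")]


def run_checks(card: dict) -> list[dict]:
    """Run the three sketched checks in a fixed order and collect findings."""
    findings = []
    for check in (check_split_ratio_sum, check_leakage_guard, check_metrology):
        findings.extend(check(card))
    return findings
```

Hand-coding keeps the sketch short; a production linter would more likely parse `when` and `assert` from lint_rules.yaml so that new rules need no code changes.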


VI. Validation and Execution Interfaces (Implementation Binding Ixx-?)

# I15-2 (interface prototypes)
def validate_card(card: dict) -> dict: ...
def lint_card(card: dict, rules: dict) -> dict: ...
def check_units(card: dict) -> dict: ...        # uses Core.Metrology v1.0:check_dim
def verify_references(card: dict) -> dict: ...  # regex + anchor reachability
def export_manifest(card: dict) -> dict: ...    # includes version & references[]

(Every interface returns the same {"ok": bool, "errors": [...], "warnings": [...], "metrics": {...}} structure, which simplifies portal and CI integration.)
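The uniform return structure makes it easy to combine results from several interfaces. A minimal sketch follows, assuming that `ok` is false exactly when `errors` is non-empty, so warnings alone do not block; `make_result` and `merge_results` are hypothetical helper names.

```python
def make_result(errors=None, warnings=None, metrics=None) -> dict:
    """Build one result in the uniform shape; ok is derived from errors."""
    errors = list(errors or [])
    return {
        "ok": not errors,
        "errors": errors,
        "warnings": list(warnings or []),
        "metrics": dict(metrics or {}),
    }


def merge_results(*results: dict) -> dict:
    """Combine several results into one, e.g. for a single CI verdict."""
    merged = make_result()
    for result in results:
        merged["errors"].extend(result.get("errors", []))
        merged["warnings"].extend(result.get("warnings", []))
        merged["metrics"].update(result.get("metrics", {}))
    merged["ok"] = not merged["errors"]
    return merged
```

A CI job could then fail the build on `merge_results(...)["ok"] is False` and still surface the warnings in its log.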


VII. Typical Failure Examples and Diagnostics (Excerpt)

fail_examples:
  - case: "missing references version"
    input: {export_manifest: {references: ["EFT.WP.Core.DataSpec:EXPORT"]}}
    expect: {rule: "REFERENCES.FORMAT", level: "error"}
  - case: "ratio sum not 1"
    input: {splits: {train: {ratio: 0.7}, validation: {ratio: 0.2}, test: {ratio: 0.2}}}
    expect: {rule: "SPLIT.RATIO_SUM", level: "error"}
  - case: "metrology not SI"
    input: {metrology: {units: "CGS", check_dim: false}}
    expect: {rule: "METROLOGY.SI_AND_CHECKDIM", level: "error"}

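(Every error must carry a location path and a suggested fix; reference errors must show both the non-compliant string and a compliant example.)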
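A reference diagnostic that meets these requirements could look like the sketch below. The regular expression is the one used by REFERENCES.FORMAT; the field names `path`, `found`, and `suggestion` and the function name `diagnose_references` are assumptions made for illustration.

```python
import re

# Pattern from REFERENCES.FORMAT, with the YAML/JSON escaping removed.
REFERENCE_RE = re.compile(r"^[^:]+ v\d+\.\d+:[A-Z].+$")
COMPLIANT_EXAMPLE = "EFT.WP.Core.DataSpec v1.0:EXPORT"


def diagnose_references(card: dict) -> list[dict]:
    """Report each non-compliant reference with its location and a fix hint."""
    references = card.get("export_manifest", {}).get("references", [])
    diagnostics = []
    for index, ref in enumerate(references):
        if isinstance(ref, str) and REFERENCE_RE.match(ref):
            continue
        diagnostics.append({
            "rule": "REFERENCES.FORMAT",
            "level": "error",
            "path": f"$.export_manifest.references[{index}]",
            "found": ref,  # the non-compliant string, shown verbatim
            "suggestion": f'use "<volume> vX.Y:<anchor>", e.g. "{COMPLIANT_EXAMPLE}"',
        })
    return diagnostics
```

Applied to the first failure case above, this reports `$.export_manifest.references[0]` with the offending string. Note that the pattern, as written, requires the anchor to begin with an uppercase letter, so a reference whose anchor starts in lowercase is reported as well.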


VIII. Coupling with the Export Manifest (Normative)

export_manifest:
  artifacts:
    - {path: "schema/dataset_card.schema.json", sha256: "..."}
    - {path: "schema/lint_rules.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

(As release-blocking artifacts, the Schema and the Lint rules must be listed in the export manifest and be verifiable.)
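Whether a manifest lists both blocking artifacts can be checked before release. The sketch below assumes a sha256 is written as 64 lowercase hexadecimal characters, which this chapter does not state explicitly; `missing_blocking_artifacts` is a hypothetical helper name.

```python
import re

REQUIRED_ARTIFACTS = (
    "schema/dataset_card.schema.json",
    "schema/lint_rules.yaml",
)
# Assumed digest format: 64 lowercase hex characters.
SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


def missing_blocking_artifacts(card: dict) -> list[str]:
    """List blocking artifacts that are absent or lack a well-formed sha256."""
    artifacts = card.get("export_manifest", {}).get("artifacts", [])
    listed = {
        entry.get("path"): entry.get("sha256", "")
        for entry in artifacts
        if isinstance(entry, dict)
    }
    problems = []
    for path in REQUIRED_ARTIFACTS:
        if path not in listed:
            problems.append(f"{path}: not listed")
        elif not SHA256_HEX.match(str(listed[path])):
            problems.append(f"{path}: sha256 missing or malformed")
    return problems
```

This only checks that a digest is present and well formed; confirming that it matches the file contents would additionally require recomputing the hash.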


IX. Minimal Working Example (Ready to Save to Disk)

dataset_id: "eift.obs.demo"
title: "EIFT Demo Dataset"
version: "v1.0"
summary: "Demo card with minimal required fields for validation..."
modality: ["time_series"]
sources: ["doi:10.1234/demo"]
license: "CC-BY-4.0"
access: "open"
provenance:
  collection_method: "simulation"
  time_coverage: "2024-01-01..2024-12-31"
splits:
  train: {count: 800, ratio: 0.8}
  validation: {count: 100, ratio: 0.1}
  test: {count: 100, ratio: 0.1}
checksums: {}
metrology: {units: "SI", c_ref: 299792458, check_dim: true, angle_unit: "deg"}
quality:
  gates:
    - {name: "leakage", metric: "leakage_rate", threshold: 0.0}
export_manifest:
  version: "v1.0"
  artifacts: []
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

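(A minimal card intended to pass this chapter's Schema and Lint. The summary shown is abbreviated; a real card needs at least 100 characters there to satisfy the schema's minLength.)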


X. Chapter Compliance Self-Check


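Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (including its text, charts, illustrations, symbols, and formulas) belongs to the author, Mr. "屠广林".
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Provided the author and source are credited, it may be copied, reposted, excerpted, adapted, and redistributed for commercial or non-commercial purposes.
Suggested attribution: Author: "屠广林"; Work: Energy Filament Theory; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/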