Technical White Paper (V5.05) 44: EFT.WP.Data.ModelCards v1.0

Chapter 8 Training Data and Sampling Binding


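I. Purpose and Scope

This chapter fixes how training_data in the model card binds to the dataset card: citation conventions, frozen split mapping, sampling strategy alignment, contamination/leakage prevention, and representativeness and bias records. It also ensures consistency with the evaluation protocol, the quality gates, and the metrology chapter.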

II. Terms and Dependencies


III. Fields and Structure (Normative)

training_data:
  refs:                                          # data sources (cite only, do not copy)
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"       # provenance & sampling
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"      # splits & distribution
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"      # quality & baselines
  splits_ref: "<dataset_id@vX.Y>"                # frozen split reference (pinned to an exact version)
  mapping:                                       # task label / ontology mapping (if needed)
    label_map: {"ext.catalog.v2:frb": "FRB", "ext.catalog.v2:rfi": "RFI"}
  sampling_binding:                              # sampling binding used for model training
    strategy: "<random|stratified|time-based|spatial-tiles|systematic>"
    strata: [{by: "class|region|snr_bin", buckets: {"A": 100, "B": 200}}]
    weights: {class: "inverse_freq"}             # training sampling weight strategy (if applicable)
  contamination_policy: "forbid-cross-split|allow-train-only"
  leakage_guards: ["per-object", "per-timewindow", "per-scene"]
  representativeness:                            # representativeness and bias records
    target_distribution: "uniform|empirical|custom"
    bias_notes: "class-long-tail; region-imbalance"
  licenses: ["<license-id>"]                     # cross-checked against the dataset card
  notes: "Data facts cited here defer to the dataset card; this section records only the binding and any strategy differences."
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
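The enumerated values in the structure above lend themselves to a mechanical check. The following is a minimal sketch, not a normative validator: it assumes the fragment has already been parsed into a Python dict, that contamination_policy and leakage_guards sit directly under training_data, that splits_ref follows a "<dataset_id>@vX.Y" pattern, and that at least one entry in refs and a non-empty strata list for the "stratified" strategy are required. The function name validate_training_data and the regular expression are illustrative only.

```python
import re

# Allowed values copied from the field definitions in this chapter.
STRATEGIES = {"random", "stratified", "time-based", "spatial-tiles", "systematic"}
CONTAMINATION_POLICIES = {"forbid-cross-split", "allow-train-only"}
LEAKAGE_GUARDS = {"per-object", "per-timewindow", "per-scene"}
TARGET_DISTRIBUTIONS = {"uniform", "empirical", "custom"}

# Assumption: a pinned split reference looks like "<dataset_id>@v<major>.<minor>".
SPLITS_REF_RE = re.compile(r"^[\w.\-]+@v\d+\.\d+$")


def validate_training_data(td):
    """Return a list of human-readable problems; an empty list means no findings."""
    errors = []
    if not td.get("refs"):
        errors.append("refs: cite at least one dataset-card chapter")
    if not SPLITS_REF_RE.match(str(td.get("splits_ref", ""))):
        errors.append("splits_ref: expected '<dataset_id>@vX.Y'")

    binding = td.get("sampling_binding") or {}
    strategy = binding.get("strategy")
    if strategy not in STRATEGIES:
        errors.append(f"sampling_binding.strategy: unknown value {strategy!r}")
    elif strategy == "stratified" and not binding.get("strata"):
        # Assumption: a stratified strategy is only meaningful with strata declared.
        errors.append("sampling_binding.strata: required for 'stratified'")

    policy = td.get("contamination_policy")
    if policy not in CONTAMINATION_POLICIES:
        errors.append(f"contamination_policy: unknown value {policy!r}")

    unknown = sorted(set(td.get("leakage_guards") or []) - LEAKAGE_GUARDS)
    if unknown:
        errors.append(f"leakage_guards: unknown entries {unknown}")

    target = (td.get("representativeness") or {}).get("target_distribution")
    if target not in TARGET_DISTRIBUTIONS:
        errors.append(f"representativeness.target_distribution: unknown value {target!r}")
    return errors
```

Returning a list of findings rather than raising on the first one lets a review tool report every problem in a card at once.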


IV. Data References and Frozen Splits


V. Contamination and Leakage Prevention
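As an illustration of how the "forbid-cross-split" policy and a "per-object" leakage guard from Section III might be checked, the sketch below reports every group that appears in more than one split. It rests on assumptions: a "per-object" guard is read here as "all records of one object must fall in a single split", and the record fields object_id and split, as well as the function name cross_split_groups, are hypothetical.

```python
from collections import defaultdict


def cross_split_groups(records, group_key):
    """Map each group value that spans several splits to the sorted splits it touches.

    `records` is an iterable of dicts that carry a "split" field (hypothetical name)
    and the grouping field named by `group_key`, e.g. "object_id" for a per-object
    guard or a time-window identifier for a per-timewindow guard.
    """
    splits_by_group = defaultdict(set)
    for rec in records:
        splits_by_group[rec[group_key]].add(rec["split"])
    return {
        group: sorted(splits)
        for group, splits in splits_by_group.items()
        if len(splits) > 1
    }


# Example: object "A" leaks across train and test, object "B" does not.
example = [
    {"object_id": "A", "split": "train"},
    {"object_id": "A", "split": "test"},
    {"object_id": "B", "split": "train"},
]
violations = cross_split_groups(example, "object_id")
```

Under "forbid-cross-split", a non-empty result would count as a finding to record in the leakage report.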


VI. Sampling Consistency and Representativeness


VII. Metrology and Units (where physical quantities, time, or frequency are involved)


VIII. Machine-Readable Fragment (directly embeddable)

training_data:
  refs:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
  splits_ref: "eift.radio.toa-set@v1.2"
  sampling_binding:
    strategy: "stratified"
    strata: [{by: "snr_bin", buckets: {"7-10": 300, "10-20": 500, "20+": 700}}]
    weights: {class: "inverse_freq"}
  contamination_policy: "forbid-cross-split"
  leakage_guards: ["per-object", "per-timewindow"]
  representativeness:
    target_distribution: "empirical"
    bias_notes: "long-tail on FRB; station-imbalance"
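To make the "stratified" strategy and the "inverse_freq" weight setting in the fragment above more concrete, here is a minimal sketch under stated assumptions. It treats each buckets value as a per-bucket sample quota, which the fragment does not state explicitly, and it uses one common inverse-frequency normalization, weight = N / (K * n_c) for N samples, K classes, and n_c samples in class c. The function names and the record layout are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, by, buckets, seed=0):
    """Draw exactly `quota` records from each bucket, reproducibly.

    Assumption: `buckets` maps a bucket label to the number of records to draw.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be re-audited
    pools = defaultdict(list)
    for rec in records:
        pools[rec[by]].append(rec)

    picked = []
    for label, quota in buckets.items():
        pool = pools.get(label, [])
        if len(pool) < quota:
            raise ValueError(f"bucket {label!r}: need {quota}, have {len(pool)}")
        picked.extend(rng.sample(pool, quota))
    return picked


def inverse_freq_weights(labels):
    """Per-class weights N / (K * n_c); rarer classes receive larger weights."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}
```

Raising on an under-filled bucket, instead of silently drawing fewer records, keeps the realized sample aligned with the quotas declared in the card.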


IX. Consistency with the Evaluation Protocol, Optimization, and Hyperparameters


X. Exported Artifacts and Audit Trail

export_manifest:
  artifacts:
    - {path: "data/splits/train.index", sha256: "..."}
    - {path: "data/sampling_binding.yaml", sha256: "..."}
    - {path: "audits/leakage_report.md", sha256: "..."}
  references:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

All artifacts related to binding or auditing must be listed in the export manifest and must be verifiable.
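One way to make the listed artifacts verifiable is to recompute each SHA-256 digest and compare it with the manifest entry. The sketch below assumes the manifest has been parsed into a dict with the export_manifest.artifacts layout shown above and that paths are relative to an export root directory; the function names and the root parameter are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root="."):
    """Return (path, reason) pairs for artifacts that are missing or mismatched."""
    failures = []
    for artifact in manifest["export_manifest"]["artifacts"]:
        target = Path(root) / artifact["path"]
        if not target.is_file():
            failures.append((artifact["path"], "missing"))
        elif sha256_of(target) != artifact["sha256"].lower():
            failures.append((artifact["path"], "digest mismatch"))
    return failures
```

An empty result means every listed artifact exists and matches its recorded digest; anything else can be attached to the audit trail as is.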

XI. Interface with Path-Dependent Quantities (if applicable)

When the training target or features involve path quantities such as T_arr:

XII. Chapter Compliance Self-Check


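Copyright and license: Unless otherwise stated, the copyright in "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author, Guanglin Tu (屠广林).
License (CC BY 4.0): Copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Attribution format (suggested): Author: Guanglin Tu (屠广林) | Work: "Energy Filament Theory" | Source: energyfilament.org | License: CC BY 4.0
Call for verification: The author works independently and is self-funded, with no employer and no sponsorship. The next phase will prioritize environments most willing to discuss, reproduce, and critique the work in public, regardless of country. Media and peers worldwide are welcome to use this window to organize verification and to contact us.
Version information: First published: 2025-11-11 | Current version: v6.0+5.05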