Contents - Documents - Technical White Paper 44 - EFT.WP.Data.ModelCards v1.0

Chapter 8 Training Data and Sampling Binding


I. Purpose and Scope

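This chapter fixes how training_data in the model card binds to the dataset card: citation conventions, frozen split mapping, sampling strategy alignment, contamination/leakage prevention, and representativeness and bias records. It ensures consistency with the evaluation protocol, the quality gates, and the metrology chapter.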

II. Terms and Dependencies


III. Fields and Structure (Normative)

training_data:
  refs:                               # data sources (cite only, do not copy)
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"    # provenance & sampling
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"   # splits & distribution
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"   # quality & baselines
  splits_ref: "<dataset_id@vX.Y>"     # frozen split reference (pinned to an exact version)
  mapping:                            # task label / ontology mapping (if needed)
    label_map: {"ext.catalog.v2:frb": "FRB", "ext.catalog.v2:rfi": "RFI"}
  sampling_binding:                   # binding to the sampling used in model training
    strategy: "<random|stratified|time-based|spatial-tiles|systematic>"
    strata: [{by: "class|region|snr_bin", buckets: {"A": 100, "B": 200}}]
    weights: {class: "inverse_freq"}  # training sampling weight policy (if applicable)
  contamination_policy: "forbid-cross-split|allow-train-only"
  leakage_guards: ["per-object", "per-timewindow", "per-scene"]
  representativeness:                 # representativeness and bias records
    target_distribution: "uniform|empirical|custom"
    bias_notes: "class-long-tail; region-imbalance"
  licenses: ["<license-id>"]          # cross-checked against the dataset card
  notes: "Cited data facts are governed by the dataset card; only bindings and strategy differences are recorded here."
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
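The enumerated values in the field template lend themselves to a mechanical check. The following is a minimal sketch, not part of the specification: the function name validate_training_data, the regular expression for splits_ref, and the rule that a stratified strategy needs a non-empty strata list are all assumptions made for illustration. It takes an already-parsed training_data mapping and returns a list of findings.

```python
import re

# Allowed values copied from the field template in this section.
STRATEGIES = {"random", "stratified", "time-based", "spatial-tiles", "systematic"}
CONTAMINATION_POLICIES = {"forbid-cross-split", "allow-train-only"}
LEAKAGE_GUARDS = {"per-object", "per-timewindow", "per-scene"}
TARGET_DISTRIBUTIONS = {"uniform", "empirical", "custom"}

# Assumption: "<dataset_id@vX.Y>" means an identifier, then "@v", then two integers.
SPLITS_REF_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*@v\d+\.\d+$")


def validate_training_data(td: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    if not td.get("refs"):
        problems.append("refs: at least one dataset card reference is expected")
    if not SPLITS_REF_RE.match(str(td.get("splits_ref", ""))):
        problems.append("splits_ref: expected '<dataset_id>@vX.Y'")

    binding = td.get("sampling_binding", {})
    if binding.get("strategy") not in STRATEGIES:
        problems.append("sampling_binding.strategy: unknown value")
    # Hypothetical rule: a stratified strategy should declare its strata.
    if binding.get("strategy") == "stratified" and not binding.get("strata"):
        problems.append("sampling_binding.strata: expected for 'stratified'")

    if td.get("contamination_policy") not in CONTAMINATION_POLICIES:
        problems.append("contamination_policy: unknown value")
    unknown = set(td.get("leakage_guards", [])) - LEAKAGE_GUARDS
    if unknown:
        problems.append(f"leakage_guards: unknown guards {sorted(unknown)}")

    rep = td.get("representativeness", {})
    if rep.get("target_distribution") not in TARGET_DISTRIBUTIONS:
        problems.append("representativeness.target_distribution: unknown value")
    return problems


# Illustrative input; the values are placeholders, not real dataset facts.
example = {
    "refs": ["EFT.WP.Data.DatasetCards v1.0:Ch.11"],
    "splits_ref": "demo.dataset@v1.0",
    "sampling_binding": {
        "strategy": "stratified",
        "strata": [{"by": "class", "buckets": {"A": 100, "B": 200}}],
    },
    "contamination_policy": "forbid-cross-split",
    "leakage_guards": ["per-object"],
    "representativeness": {"target_distribution": "empirical"},
}
print(validate_training_data(example))
```

A check of this kind only confirms that the binding is well-formed. Whether the cited dataset card version actually exists still has to be resolved against the dataset card itself.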


IV. Data References and Frozen Splits


V. Contamination and Leakage Prevention


VI. Sampling Consistency and Representativeness


VII. Metrology and Units (where physical quantities, time, or frequency are involved)


VIII. Machine-Readable Fragment (ready to embed)

training_data:
  refs:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
  splits_ref: "eift.radio.toa-set@v1.2"
  sampling_binding:
    strategy: "stratified"
    strata: [{by: "snr_bin", buckets: {"7-10": 300, "10-20": 500, "20+": 700}}]
    weights: {class: "inverse_freq"}
  contamination_policy: "forbid-cross-split"
  leakage_guards: ["per-object", "per-timewindow"]
  representativeness:
    target_distribution: "empirical"
    bias_notes: "long-tail on FRB; station-imbalance"
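To illustrate how a binding like the one in this fragment might be applied, here is a minimal sketch under stated assumptions. It reads the bucket numbers as per-stratum sample quotas, which the fragment does not state explicitly. It also assumes each record already carries its snr_bin label and a hypothetical object_id field used for the per-object guard. The function names stratified_draw and cross_split_overlap are illustrative only.

```python
import random
from collections import defaultdict


def stratified_draw(records, by, buckets, seed=0):
    """Draw up to buckets[label] records per stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    groups = defaultdict(list)
    for rec in records:
        groups[rec[by]].append(rec)
    sample = []
    for label, quota in buckets.items():
        pool = groups.get(label, [])
        # A stratum smaller than its quota contributes everything it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


def cross_split_overlap(splits, key="object_id"):
    """Return {(split_a, split_b): shared keys} for each overlapping pair of splits."""
    names = sorted(splits)
    overlaps = {}
    for i, a in enumerate(names):
        keys_a = {rec[key] for rec in splits[a]}
        for b in names[i + 1:]:
            shared = keys_a & {rec[key] for rec in splits[b]}
            if shared:
                overlaps[(a, b)] = shared
    return overlaps


# Toy records: odd ids fall in "7-10", even ids in "20+".
records = [{"object_id": i, "snr_bin": "7-10" if i % 2 else "20+"} for i in range(10)]
train = stratified_draw(records, by="snr_bin", buckets={"7-10": 3, "10-20": 4, "20+": 2})
test = [{"object_id": 100, "snr_bin": "20+"}]

# Under "forbid-cross-split" with a per-object guard, this mapping should be empty.
print(cross_split_overlap({"train": train, "test": test}))
```

The empty "10-20" stratum in the toy data contributes nothing to the sample. In a real run, such a shortfall would arguably belong in bias_notes rather than passing silently.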


IX. Consistency with the Evaluation Protocol, Optimization, and Hyperparameters


X. Exported Artifacts and Audit Trail

export_manifest:
  artifacts:
    - {path: "data/splits/train.index", sha256: "..."}
    - {path: "data/sampling_binding.yaml", sha256: "..."}
    - {path: "audits/leakage_report.md", sha256: "..."}
  references:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

All artifacts related to binding or auditing must be listed in the export manifest and must be verifiable.
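One way to make the manifest verifiable is to recompute each artifact's SHA-256 digest and compare it with the recorded value. The sketch below assumes the manifest has already been parsed into a dictionary of the shape shown in this section and that artifact paths are relative to an export root directory. The function names sha256_of and verify_manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large indexes need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: dict, root: Path) -> list[str]:
    """Return one message per missing or mismatching artifact; empty means all match."""
    failures = []
    for art in manifest["export_manifest"]["artifacts"]:
        target = root / art["path"]
        if not target.is_file():
            failures.append(f"{art['path']}: missing")
        elif sha256_of(target) != art["sha256"]:
            failures.append(f"{art['path']}: sha256 mismatch")
    return failures
```

A call such as verify_manifest(manifest, Path("export")) could then run as part of an audit step, with any non-empty result treated as a failed check.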

XI. Interface with Path-Dependent Quantities (if applicable)

When training targets or features involve path quantities such as T_arr:

XII. Chapter Compliance Self-Check


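Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright of "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author (Mr. "屠广林").
License terms: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Copying, reposting, excerpting, adapting, and redistributing are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/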