46-EFT.WP.Data.Benchmarks v1.0 | Chapter 5: Data Sources, Sampling, and Frozen Splits | Energy Filament Theory

Chapter 5: Data Sources, Sampling, and Frozen Splits

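I. Purpose and Scope

This chapter fixes the conventions for data sources, sampling, and frozen splits: source compliance and citation, sampling strategy and stratification, frozen indices and consistency, and leakage prevention with audit export. It keeps these consistent with the dataset cards, model cards, pipeline, metrology, and cross-reference anchors.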

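II. Terms and Dependencies

Terms: dataset_ref, sampling.strategy/strata/weights, splits.train/val/test, freeze_indices, leakage_guard, items_ref, index.
Dependencies: data contract and export (Core.DataSpec v1.0); unit and dimension checks (Core.Metrology v1.0); splits, coverage, and quality (DatasetCards v1.0); evaluation protocol (ModelCards v1.0); distribution and mirroring (Pipeline v1.0).
Math and notation: inline symbols are always written in backticks (e.g. QPS, T_inf, ρ, ψ); expressions containing a division sign, an integral, or a compound operator must be parenthesized; where the path quantity T_arr is involved, declare gamma(ell) and d ell; formulas, symbols, and definitions must not contain Chinese.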

III. Fields and Structure (Normative)

data:
  dataset_ref: "datasets/<name>@vX.Y"    # reference, do not copy
  sources: ["<uri-or-citation>", "..."]  # data sources and citations
  licensing: "CC-BY-4.0|ODC-BY|custom"
  provenance:
    collection_window: "<YYYY-MM-DD..YYYY-MM-DD>"
    geography: ["<region>"]
    permits: ["<ethics/permit-ref>"]
sampling:
  strategy: "random|stratified|time-based|spatial-tiles|systematic"
  strata: [{by: "<label|locale|domain|difficulty|snr_bin>", buckets: {"A": 100, "B": 200}}]
  weights: {class: "inverse_freq|none"}  # describes training-side reweighting
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
  val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
  test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow", "per-scene"]
  audits:
    report: "splits/leakage_report.csv"
    sha256: "<hex>"

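IV. Source Compliance and Citation Conventions

Reference, do not copy: all data facts are referred to through dataset_ref and the citations in sources[]; the export manifest records the license and the ethics-approval anchors.
Window and geography: state collection_window and geography explicitly; they feed risk and distribution-shift assessment and are cross-checked against the coverage matrix.
License consistency: licensing must agree with the distribution and mirroring policy; any restricted fields must also be reflected in the task constraints.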

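V. Sampling Strategy and Stratification

Strategies: random (globally uniform), stratified (by class, geography, language, difficulty, or signal-to-noise bin), time-based (sliding or rolling windows), spatial-tiles (spatial tiles), systematic (fixed-stride systematic sampling).
Stratification: strata[] must list buckets and quotas explicitly; report the realized counts and the deviation (%) after sampling, together with the seed.
Resampling and reweighting: when weights is used on the training side, record the strategy and explain its effect and significance in the evaluation section of the model card.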
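The stratification rule can be illustrated with a minimal sketch, assuming records are plain dictionaries and quotas are absolute counts per bucket. The function name stratified_sample, the record layout, and the report fields are hypothetical; only seed, strata buckets, and the deviation (%) come from this chapter.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, quotas, seed=1701):
    """Draw quotas[bucket] items per bucket with a fixed seed.

    items  : list of dicts (hypothetical record layout)
    key    : field used for stratification (the `by` value of a stratum)
    quotas : mapping bucket -> requested count (the `buckets` mapping)
    """
    rng = random.Random(seed)  # a private generator keeps the draw repeatable
    by_bucket = defaultdict(list)
    for item in items:
        by_bucket[item[key]].append(item)

    sample, report = [], {}
    for bucket, quota in quotas.items():
        pool = by_bucket.get(bucket, [])
        take = min(quota, len(pool))  # a bucket may hold fewer items than its quota
        sample.extend(rng.sample(pool, take))
        deviation_pct = 100.0 * (take - quota) / quota if quota else 0.0
        report[bucket] = {"quota": quota, "realized": take,
                          "deviation_pct": deviation_pct}
    return sample, report

# Toy data: 100 items labelled "A" and 200 labelled "B".
items = [{"id": i, "label": "A" if i % 3 == 0 else "B"} for i in range(300)]
sample, report = stratified_sample(items, key="label",
                                   quotas={"A": 100, "B": 250})
```

In this toy run bucket B cannot meet its quota, so the report records a negative deviation for it. Repeating the call with the same seed and the same input order returns the same sample.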

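VI. Frozen Splits and Consistency

Ratio consistency: ratio.train + ratio.val + ratio.test = 1±1e-6.
Frozen indices: freeze_indices=true; index files (row, file, or object IDs) fix train/val/test; all coverage statistics and evaluations are computed on the frozen indices.
Cross-volume consistency: model-card evaluation must use the frozen indices of this chapter; pipeline distribution and mirroring use the same indices, and the frozen splits in the dataset card are aligned with them.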
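A minimal sketch of freezing and re-verifying an index file follows. It assumes an index is a UTF-8 text file with one ID per line in sorted order; that layout and the helper names write_frozen_index, verify_frozen_index, and ratio_ok are illustrative assumptions, not part of the specification.

```python
import hashlib
import tempfile
from pathlib import Path

def write_frozen_index(path, ids):
    """Write one ID per line (sorted) and return the sha256 hex digest of the file."""
    data = "\n".join(sorted(ids)).encode("utf-8") + b"\n"
    Path(path).write_bytes(data)
    return hashlib.sha256(data).hexdigest()

def verify_frozen_index(path, expected_sha256):
    """Recompute the digest from disk and compare it with the registered value."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest() == expected_sha256

def ratio_ok(ratio, tol=1e-6):
    """Check that train + val + test sums to 1 within the stated tolerance."""
    return abs(ratio["train"] + ratio["val"] + ratio["test"] - 1.0) <= tol

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "train.index"
    digest = write_frozen_index(index_path, ["obj-003", "obj-001", "obj-002"])
    unchanged = verify_frozen_index(index_path, digest)
    # Any later edit to the index changes the digest and is detected.
    index_path.write_text("obj-001\nobj-999\n", encoding="utf-8")
    tampered_detected = not verify_frozen_index(index_path, digest)
```

Sorting the IDs before hashing makes the digest independent of the order in which the IDs were collected, which is one possible way to keep the registered sha256 stable across regenerations.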

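VII. Leakage Prevention and Audit Export

Granularity: per-object|per-timewindow|per-scene; the same object, adjacent time windows, or the same scene must not appear in more than one split.
Blocking: any single leakage hit is a blocking failure.
Audit: generate splits/leakage_report.csv and summary metrics (overlap count and ratio, broken down by granularity), and record the sha256.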
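The leakage check can be sketched as a group-overlap test, assuming each item carries its split name and one grouping field per policy. The field names object_id and scene_id, the CSV columns, and both helper names are hypothetical. The sketch only catches identical group values; detecting adjacent time windows would additionally need a binning or neighborhood rule that this chapter does not define.

```python
import csv
import io
from collections import defaultdict

def find_leaks(assignments, group_field):
    """Return {group value: sorted split names} for groups seen in more than one split."""
    splits_by_group = defaultdict(set)
    for row in assignments:
        splits_by_group[row[group_field]].add(row["split"])
    return {g: sorted(s) for g, s in splits_by_group.items() if len(s) > 1}

def leakage_report_csv(assignments, policies):
    """Build the report text and the hit count; policies maps policy name -> field."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["policy", "group", "splits"])
    hits = 0
    for policy, field in policies.items():
        for group, splits in sorted(find_leaks(assignments, field).items()):
            writer.writerow([policy, group, "|".join(splits)])
            hits += 1
    return buf.getvalue(), hits

rows = [
    {"id": "a1", "split": "train", "object_id": "obj1", "scene_id": "s1"},
    {"id": "a2", "split": "test", "object_id": "obj1", "scene_id": "s2"},
    {"id": "a3", "split": "val", "object_id": "obj2", "scene_id": "s3"},
]
policies = {"per-object": "object_id", "per-scene": "scene_id"}
report_text, hits = leakage_report_csv(rows, policies)
blocked = hits > 0  # any single hit blocks
```

Here obj1 appears in both train and test, so the report contains one per-object row and the run is blocked. The report text would then be written to splits/leakage_report.csv and hashed before registration.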

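VIII. Metrology and Units (SI)

Performance and volume: QPS(1/s), T_inf(ms), ρ(dimensionless), net_mbps, size_bytes.
Mandatory: metrology:{units:"SI", check_dim:true}; normalize units before combining compound quantities.
Path quantities (e.g. T_arr): if splitting or sampling is coupled to a path-dependent quantity, record delta_form, path="gamma(ell)", and measure="d ell"; use
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ), and pass check_dim.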
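As a numerical illustration, the two forms of T_arr agree when c_ref is constant along the path. The sketch below assumes exactly that, plus SI units (ell in m, c_ref in m/s, n_eff dimensionless, result in s). The value of c_ref, the linear n_eff profile, and the trapezoid helper are illustrative assumptions, not values taken from this document.

```python
def trapezoid(ys, xs):
    """Composite trapezoid rule for samples ys taken at positions xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

c_ref = 299_792_458.0                              # m/s, assumed constant reference speed
ell = [i * 10.0 for i in range(101)]               # path samples from 0 m to 1000 m
n_eff = [1.0 + 1e-3 * (x / 1000.0) for x in ell]   # dimensionless, assumed linear profile

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_factored = (1.0 / c_ref) * trapezoid(n_eff, ell)

# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_inline = trapezoid([n / c_ref for n in n_eff], ell)
```

Both results are in seconds, since a length in m divided by a speed in m/s leaves s. If c_ref varied along the path, only the second form would remain meaningful.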

IX. Machine-Readable Fragment (Ready to Embed)

data:
  dataset_ref: "datasets/core_cls@v1.0"
  sources: ["doi:10.1234/core-ds", "arXiv:2501.01234"]
  licensing: "CC-BY-4.0"
  provenance: {collection_window: "2024-01-01..2025-06-30", geography: ["EU", "US"], permits: ["ethics-IRB-2024-09"]}
sampling:
  strategy: "stratified"
  strata: [{by: "label", buckets: {"A": 520, "B": 2100, "C": 12380}}]
  weights: {class: "inverse_freq"}
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "..."}
  val: {frozen: true, index: "splits/val.index", sha256: "..."}
  test: {frozen: true, index: "splits/test.index", sha256: "..."}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow"]
  audits: {report: "splits/leakage_report.csv", sha256: "..."}
metrology: {units: "SI", check_dim: true}

X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: DATA.REF_FORMAT
    when: "$.data.dataset_ref"
    assert: "matches('^datasets/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: SAMPLE.STRATEGY_ALLOWED
    when: "$.sampling.strategy"
    assert: "value in ['random','stratified','time-based','spatial-tiles','systematic']"
    level: error
  - id: SPLITS.RATIO_SUM
    when: "$.splits.ratio"
    assert: "abs(value.train + value.val + value.test - 1) <= 1e-6"
    level: error
  - id: SPLITS.FROZEN_REQUIRED
    when: "$.splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen and splits.freeze_indices == true"
    level: error
  - id: LEAKAGE.GUARD_PRESENT
    when: "$.leakage_guard.policy"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: AUDIT.REPORT_HASH
    when: "$.leakage_guard.audits"
    assert: "has_keys(report, sha256)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
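The lint rules above can be approximated in plain Python. This is a sketch under stated assumptions: the document has already been parsed into nested dictionaries, a rule is skipped when its `when` path is absent, and the helper names dig and lint are hypothetical. It does not reproduce the actual lint engine or its expression language.

```python
import copy
import re

ALLOWED_STRATEGIES = {"random", "stratified", "time-based", "spatial-tiles", "systematic"}
GRANULARITIES = {"per-object", "per-timewindow", "per-scene"}
REF_PATTERN = re.compile(r"^datasets/[a-z0-9_\-]+@v\d+\.\d+$")

def dig(doc, *keys):
    """Follow nested keys; return None when any step is missing."""
    node = doc
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def lint(doc):
    """Return the ids of violated rules, in rule order."""
    failed = []
    ref = dig(doc, "data", "dataset_ref")
    if ref is not None and not REF_PATTERN.match(ref):
        failed.append("DATA.REF_FORMAT")
    strategy = dig(doc, "sampling", "strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        failed.append("SAMPLE.STRATEGY_ALLOWED")
    ratio = dig(doc, "splits", "ratio")
    if ratio is not None:
        total = sum(ratio.get(k, 0.0) for k in ("train", "val", "test"))
        if abs(total - 1) > 1e-6:
            failed.append("SPLITS.RATIO_SUM")
    splits = dig(doc, "splits")
    if splits is not None:
        frozen = all(dig(splits, n, "frozen") is True for n in ("train", "val", "test"))
        if not (frozen and splits.get("freeze_indices") is True):
            failed.append("SPLITS.FROZEN_REQUIRED")
    policy = dig(doc, "leakage_guard", "policy")
    if policy is not None and not GRANULARITIES.intersection(policy):
        failed.append("LEAKAGE.GUARD_PRESENT")
    audits = dig(doc, "leakage_guard", "audits")
    if audits is not None and not {"report", "sha256"} <= set(audits):
        failed.append("AUDIT.REPORT_HASH")
    metrology = dig(doc, "metrology")
    if metrology is not None and not (
            metrology.get("units") == "SI" and metrology.get("check_dim") is True):
        failed.append("METROLOGY.SI_AND_CHECKDIM")
    return failed

good = {
    "data": {"dataset_ref": "datasets/core_cls@v1.0"},
    "sampling": {"strategy": "stratified"},
    "splits": {"train": {"frozen": True}, "val": {"frozen": True},
               "test": {"frozen": True},
               "ratio": {"train": 0.8, "val": 0.1, "test": 0.1},
               "freeze_indices": True},
    "leakage_guard": {"policy": ["per-object", "per-timewindow"],
                      "audits": {"report": "splits/leakage_report.csv", "sha256": "..."}},
    "metrology": {"units": "SI", "check_dim": True},
}
bad = copy.deepcopy(good)
bad["sampling"]["strategy"] = "bootstrap"   # not in the allowed list
bad["splits"]["ratio"]["test"] = 0.2        # ratios now sum to 1.1
good_result, bad_result = lint(good), lint(bad)
```

The fragment from Section IX passes every rule in this sketch, while the modified copy fails the strategy and ratio rules only.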

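XI. Cross-Reference Anchors

Data sources and contract: see EFT.WP.Core.DataSpec v1.0:EXPORT.
Frozen splits and distribution: see EFT.WP.Data.DatasetCards v1.0, Chapter 11; EFT.WP.Data.Pipeline v1.0, Chapter 9.
Evaluation protocol and metrics: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
Unit and dimension checks: see EFT.WP.Core.Metrology v1.0:check_dim.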

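XII. Chapter Compliance Checklist

dataset_ref/sources/licensing/provenance are stated explicitly and only reference data facts, never copy them.
sampling.strategy/strata/weights/seed are all present; the sampling deviation has been reported.
The splits ratios sum to 1±1e-6 and freeze_indices=true; every index file has a sha256.
The leakage guard is enabled and its audit is exported; any single leakage hit is a blocking failure.
SI metrology and check_dim=true are in effect; if T_arr is involved, delta_form/path/measure are recorded and checked.
The export manifest lists the index and leakage-report artifacts with their reference anchors and sha256 values, meeting the release gate.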