Chapter 6: Splits, Versioning, and Freshness


I. Purpose & Scope


II. Prerequisites & Inputs


III. Splits Strategy

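  1. Split sets: train / val / test / holdout / slice_k. Each split has a unique name in split.yaml, with its intent recorded.
  2. Leakage prevention
    • Temporal leakage: split by time window (TS → {train < val < test}); windows must never overlap or interleave.
    • Entity leakage: split by entity/site/device using group_by(entity), so that no entity appears in more than one split.
    • Path-quantity consistency: the path arrays gamma_ell/d_ell/n_eff have the same length and sampling step within any given split.
  3. Stratification and sampling: stratify by batch/device/region/quality.flags; where needed, keep class and hard-example proportions consistent across splits.
  4. Slices: for key subpopulations (extreme operating conditions / low SNR / specific regions), generate slice_k and record the filter conditions explicitly.
  5. Reproducibility: record the random seed, the algorithm, and its parameters in split.yaml; generate split_manifest.json with sample counts and checksums.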
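
The leakage and reproducibility rules above can be sketched in Python as follows. This is a minimal illustration, not the chapter's reference implementation: the record schema (dicts with "entity_id" and "ts" keys) and the function names split_by_entity, check_entity_leakage, and check_time_order are assumptions introduced here. The entity split assigns whole entities to splits using a seeded shuffle; the time check is a separate guard intended for splits built from time windows.

```python
import random
from collections import defaultdict


def split_by_entity(records, ratios, seed):
    """Assign whole entities to splits so that no entity spans two splits.

    records: list of dicts with an "entity_id" key (hypothetical schema).
    ratios:  ordered dict such as {"train": 0.70, "val": 0.15, "test": 0.15}.
    seed:    integer seed; the same seed always yields the same assignment.
    """
    entities = sorted({r["entity_id"] for r in records})
    random.Random(seed).shuffle(entities)  # seeded shuffle for reproducibility

    # Convert cumulative ratios into boundaries over the shuffled entity list.
    assignment, start, cumulative = {}, 0, 0.0
    names = list(ratios)
    for i, name in enumerate(names):
        cumulative += ratios[name]
        is_last = i == len(names) - 1
        end = len(entities) if is_last else round(cumulative * len(entities))
        for entity in entities[start:end]:
            assignment[entity] = name
        start = end

    out = defaultdict(list)
    for r in records:
        out[assignment[r["entity_id"]]].append(r)
    return dict(out)


def check_entity_leakage(splits):
    """Raise ValueError if any entity appears in more than one split."""
    owner = {}
    for name, rows in splits.items():
        for r in rows:
            first = owner.setdefault(r["entity_id"], name)
            if first != name:
                raise ValueError(
                    f"entity {r['entity_id']!r} appears in {first} and {name}"
                )


def check_time_order(splits, order=("train", "val", "test"), key="ts"):
    """Raise ValueError unless every split lies strictly after the previous one."""
    previous_name, previous_max = None, None
    for name in order:
        rows = splits.get(name, [])
        if not rows:
            continue
        earliest = min(r[key] for r in rows)
        if previous_max is not None and earliest <= previous_max:
            raise ValueError(f"{name} overlaps in time with {previous_name}")
        previous_name, previous_max = name, max(r[key] for r in rows)
```

Calling split_by_entity(records, {"train": 0.70, "val": 0.15, "test": 0.15}, seed=20250924) twice returns identical splits, which is the property item 5 asks split.yaml to make recoverable. Ratios here apply to entity counts, so row-level proportions can drift when entities differ in size.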

IV. Versioning (SemVer)


V. Freshness / Validity Policy


VI. Gates Mapping


VII. Machine-Readable Configs
A. split.yaml

version: "1.0.0"
seed: 20250924
strategy:
  group_by: ["entity_id"]
  time_ordered: true
splits:
  train: 0.70
  val: 0.15
  test: 0.15
constraints:
  leakage:
    time: { enforce: true }
    entity: { enforce: true }
  path:
    require_alignment: true
    delta_form: "general"
  coverage:
    mode: "k" # k|alpha|quantile
    k: 2
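
A config like this can be sanity-checked before any split is generated. The sketch below is one possible validator, written against a plain dict so that it needs no YAML library. The nesting of the dict, the function name validate_split_config, and the specific rules (ratios summing to 1, both leakage checks enforced, coverage mode drawn from k|alpha|quantile) are assumptions made for illustration; the source does not state which checks are mandatory.

```python
import math

ALLOWED_COVERAGE_MODES = {"k", "alpha", "quantile"}

# Dict mirror of split.yaml; the nesting shown is an assumed reading of the file.
config = {
    "version": "1.0.0",
    "seed": 20250924,
    "strategy": {"group_by": ["entity_id"], "time_ordered": True},
    "splits": {"train": 0.70, "val": 0.15, "test": 0.15},
    "constraints": {
        "leakage": {"time": {"enforce": True}, "entity": {"enforce": True}},
        "path": {"require_alignment": True, "delta_form": "general"},
        "coverage": {"mode": "k", "k": 2},
    },
}


def validate_split_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []

    if not isinstance(cfg.get("seed"), int):
        errors.append("seed must be an integer")

    ratios = cfg.get("splits") or {}
    if not ratios:
        errors.append("no splits defined")
    elif any(not 0 < r < 1 for r in ratios.values()):
        errors.append("each split ratio must lie strictly between 0 and 1")
    elif not math.isclose(sum(ratios.values()), 1.0, abs_tol=1e-9):
        errors.append("split ratios must sum to 1")

    constraints = cfg.get("constraints") or {}
    leakage = constraints.get("leakage") or {}
    for kind in ("time", "entity"):
        if not (leakage.get(kind) or {}).get("enforce"):
            errors.append(f"constraints.leakage.{kind}.enforce must be true")

    mode = (constraints.get("coverage") or {}).get("mode")
    if mode not in ALLOWED_COVERAGE_MODES:
        errors.append(f"unknown coverage mode: {mode!r}")

    return errors
```

Returning a list of messages rather than raising on the first failure lets a CI job report every problem in one pass.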


B. split_manifest.json (excerpt)

{
  "dataset_version": "1.2.0",
  "splits": {
    "train": { "count": 120345, "checksum": "sha256:..." },
    "val": { "count": 25780, "checksum": "sha256:..." },
    "test": { "count": 25812, "checksum": "sha256:..." }
  },
  "slices": { "low_snr": { "count": 8142, "rule": "snr<5" } },
  "freshness": {
    "valid_from": "2025-09-01T00:00:00Z",
    "valid_to": "2026-03-01T00:00:00Z",
    "policy": { "tau_calib_s_max": 86400, "clock_state": "locked" }
  }
}
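
To show how the count, checksum, and freshness fields might be produced and consumed, here is a small Python sketch. The source does not say what the checksum is computed over or whether valid_to is inclusive, so two assumptions are made here: the checksum is a SHA-256 over the sorted sample IDs, and the validity window is half-open (valid_from ≤ now < valid_to). The helper names split_entry, parse_utc, and is_fresh are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def split_entry(sample_ids):
    """Build one manifest entry: sample count plus an order-independent checksum.

    Assumption: the checksum covers the sorted sample IDs, one per line.
    """
    ids = sorted(sample_ids)
    digest = hashlib.sha256()
    for sample_id in ids:
        digest.update(sample_id.encode("utf-8"))
        digest.update(b"\n")
    return {"count": len(ids), "checksum": "sha256:" + digest.hexdigest()}


def parse_utc(timestamp):
    """Parse a timestamp such as '2025-09-01T00:00:00Z' into an aware datetime."""
    # Older Python versions reject a trailing 'Z', so rewrite it as an offset.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def is_fresh(freshness, now):
    """True if `now` lies in [valid_from, valid_to); the half-open bound is assumed."""
    return parse_utc(freshness["valid_from"]) <= now < parse_utc(freshness["valid_to"])


freshness = {
    "valid_from": "2025-09-01T00:00:00Z",
    "valid_to": "2026-03-01T00:00:00Z",
}
inside_window = datetime(2025, 12, 1, tzinfo=timezone.utc)
after_window = datetime(2026, 3, 2, tzinfo=timezone.utc)
```

Sorting the IDs before hashing means two runs that select the same samples in a different order still produce the same checksum, which keeps the manifest comparable across regenerations.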

C. version_matrix.yaml (compatibility matrix)

dataset: "ds-core"
current: "1.2.0"
compatibility:
  "1.2.x": { api: ">=1.2,<2.0", schema: ">=1.2,<2.0" }
  "1.1.x": { api: ">=1.1,<1.3", schema: ">=1.1,<1.3" }
migration:
  from: "1.1.x"
  to: "1.2.x"
  steps:
    - change: "add slice 'low_snr'"
    - change: "add field quality.score_Q"
rollback:
  tag: "v1.1.3-lock"
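
The range strings in the compatibility matrix (for example ">=1.2,<2.0") can be evaluated with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions: versions are purely numeric dotted strings, clauses are comma-separated and combined with AND, and a dataset version maps to its matrix row by its "major.minor.x" key. The names parse_version, satisfies, and is_compatible are hypothetical, and pre-release or build tags from full SemVer are not handled.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]

matrix = {
    "dataset": "ds-core",
    "current": "1.2.0",
    "compatibility": {
        "1.2.x": {"api": ">=1.2,<2.0", "schema": ">=1.2,<2.0"},
        "1.1.x": {"api": ">=1.1,<1.3", "schema": ">=1.1,<1.3"},
    },
}


def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0); numeric components only."""
    return tuple(int(part) for part in text.strip().split("."))


def _pad(a, b):
    """Right-pad the shorter tuple with zeros so '1.3' compares equal to '1.3.0'."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version, spec):
    """True if `version` meets every comma-separated clause in `spec`."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                left, right = _pad(candidate, parse_version(clause[len(symbol):]))
                if not compare(left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


def is_compatible(matrix, dataset_version, api_version, schema_version):
    """Look up the 'major.minor.x' row and test both the api and schema ranges."""
    major, minor = dataset_version.split(".")[:2]
    row = matrix["compatibility"].get(f"{major}.{minor}.x")
    if row is None:
        return False  # no row for this dataset line: treated as incompatible
    return satisfies(api_version, row["api"]) and satisfies(schema_version, row["schema"])
```

Treating an unlisted dataset line as incompatible is a conservative default chosen for this sketch; a project could instead raise an error to force the matrix to be updated.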


VIII. Anti-Patterns & Fixes


IX. Cross-References


X. Checklist