Catalog Document - Technical White Paper 46 - EFT.WP.Data.Benchmarks v1.0

Chapter 5 Data Sources, Sampling, and Frozen Splits


I. Purpose and Scope

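This chapter fixes the specification for data sources, sampling, and frozen splits: source compliance and citation, sampling strategy and stratification, frozen indices and consistency, and leakage prevention and audit export. It also ensures consistency with data cards, model cards, pipelines, metrology, and cross-reference anchors.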

II. Terms and Dependencies


III. Fields and Structure (Normative)

data:
  dataset_ref: "datasets/<name>@vX.Y" # reference only; do not copy
  sources: ["<uri-or-citation>", "..."] # data sources and citations
  licensing: "CC-BY-4.0|ODC-BY|custom"
  provenance:
    collection_window: "<YYYY-MM-DD..YYYY-MM-DD>"
    geography: ["<region>"]
    permits: ["<ethics/permit-ref>"]
sampling:
  strategy: "random|stratified|time-based|spatial-tiles|systematic"
  strata: [{by: "<label|locale|domain|difficulty|snr_bin>", buckets: {"A": 100, "B": 200}}]
  weights: {class: "inverse_freq|none"} # training reweighting note
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
  val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
  test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow", "per-scene"]
  audits:
    report: "splits/leakage_report.csv"
    sha256: "<hex>"
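The sha256 fields next to each frozen index suggest that a consumer can re-hash the index and compare it with the recorded digest. The following is a minimal sketch of that check, not the chapter's normative procedure. It assumes that an index is a list of sample IDs serialized as sorted, newline-joined UTF-8 text; the on-disk layout of splits/*.index is not defined in this chapter, and the helper names index_sha256 and verify_frozen are hypothetical.

```python
import hashlib


def index_sha256(sample_ids):
    """Hash a split index as sorted, newline-joined sample IDs (assumed layout)."""
    payload = "\n".join(sorted(sample_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_frozen(sample_ids, expected_hex):
    """Return True when the index still matches the digest recorded in the config."""
    return index_sha256(sample_ids) == expected_hex


# Hypothetical sample IDs, for illustration only.
train_ids = ["obj-0003", "obj-0001", "obj-0002"]
digest = index_sha256(train_ids)

# Reordering the same IDs leaves the digest unchanged; adding an ID does not.
same_after_reorder = verify_frozen(list(reversed(train_ids)), digest)
same_after_growth = verify_frozen(train_ids + ["obj-0004"], digest)
```

Sorting before hashing is a design choice of this sketch: it makes the digest depend on membership only, so a reordered index still counts as the same frozen split.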


IV. Source Compliance and Citation Conventions


V. Sampling Strategy and Stratification


VI. Frozen Splits and Consistency


VII. Leakage Prevention and Audit Export


VIII. Metrology and Units (SI)

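  1. Performance and volume: QPS (1/s), T_inf (ms), ρ (dimensionless), net_mbps, size_bytes;
  2. Mandatory: metrology: {units: "SI", check_dim: true}; normalize units before combining composite quantities;
  3. Path quantities (e.g., T_arr): if splitting or sampling is coupled to a path-dependent quantity, register delta_form, path="gamma(ell)", and measure="d ell"; use either
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ), and pass check_dim.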
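The two forms of T_arr agree whenever c_ref is constant along the path, since a constant factor can be moved in or out of the integral. The sketch below checks this numerically with a simple trapezoid rule. Everything concrete in it is an assumption for illustration: the value of c_ref (taken here as the SI speed of light in m/s), the 1 m path, and the linear n_eff profile are not given in this chapter.

```python
def trapezoid(ys, xs):
    """Integrate samples ys over abscissae xs with the trapezoid rule."""
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total


C_REF = 299_792_458.0                      # m/s, assumed reference speed
ell = [i * 0.1 for i in range(11)]         # path length in m, 0.0 .. 1.0 (assumed)
n_eff = [1.0 + 0.5 * s for s in ell]       # dimensionless, hypothetical profile

# Form 1: constant factor outside the integral.
t_arr_outside = (1.0 / C_REF) * trapezoid(n_eff, ell)

# Form 2: factor inside the integrand.
t_arr_inside = trapezoid([n / C_REF for n in n_eff], ell)

# Dimension check by hand: (dimensionless * m) / (m/s) = s.
relative_gap = abs(t_arr_outside - t_arr_inside) / abs(t_arr_outside)
```

If c_ref varied along the path, only the second form would remain meaningful, so a registered path quantity should state which form was used.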

IX. Machine-Readable Fragment (Directly Embeddable)

data:
  dataset_ref: "datasets/core_cls@v1.0"
  sources: ["doi:10.1234/core-ds", "arXiv:2501.01234"]
  licensing: "CC-BY-4.0"
  provenance: {collection_window: "2024-01-01..2025-06-30", geography: ["EU", "US"], permits: ["ethics-IRB-2024-09"]}
sampling:
  strategy: "stratified"
  strata: [{by: "label", buckets: {"A": 520, "B": 2100, "C": 12380}}]
  weights: {class: "inverse_freq"}
  seed: 1701
splits:
  train: {frozen: true, index: "splits/train.index", sha256: "..."}
  val: {frozen: true, index: "splits/val.index", sha256: "..."}
  test: {frozen: true, index: "splits/test.index", sha256: "..."}
  ratio: {train: 0.8, val: 0.1, test: 0.1}
  freeze_indices: true
leakage_guard:
  policy: ["per-object", "per-timewindow"]
  audits: {report: "splits/leakage_report.csv", sha256: "..."}
metrology: {units: "SI", check_dim: true}
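To illustrate how the strategy, strata, seed, and ratio fields above could combine into a reproducible split, here is a minimal sketch of a seeded, per-label stratified split. It is one possible reading, not the chapter's prescribed algorithm: the function name stratified_split, the synthetic sample IDs, and the rounding rule (round per stratum, remainder goes to test) are all assumptions.

```python
import random


def stratified_split(labels_by_id, ratio, seed):
    """Split sample IDs per label so each stratum follows the train/val/test ratio."""
    rng = random.Random(seed)              # fixed seed gives a repeatable shuffle
    by_label = {}
    for sample_id, label in sorted(labels_by_id.items()):
        by_label.setdefault(label, []).append(sample_id)

    result = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratio["train"])
        n_val = round(len(ids) * ratio["val"])
        result["train"] += ids[:n_train]
        result["val"] += ids[n_train:n_train + n_val]
        result["test"] += ids[n_train + n_val:]   # remainder of the stratum
    return result


# Synthetic IDs sized like the example buckets (A: 520, B: 2100, C: 12380).
buckets = {"A": 520, "B": 2100, "C": 12380}
labels = {f"{lab}-{i:05d}": lab for lab, n in buckets.items() for i in range(n)}
ratio = {"train": 0.8, "val": 0.1, "test": 0.1}

splits = stratified_split(labels, ratio, seed=1701)
sizes = {name: len(ids) for name, ids in splits.items()}
```

Sorting IDs and labels before shuffling keeps the result independent of dictionary insertion order, which matters if the frozen index is later re-derived for an audit.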


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: DATA.REF_FORMAT
    when: "$.data.dataset_ref"
    assert: "matches('^datasets/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: SAMPLE.STRATEGY_ALLOWED
    when: "$.sampling.strategy"
    assert: "value in ['random','stratified','time-based','spatial-tiles','systematic']"
    level: error
  - id: SPLITS.RATIO_SUM
    when: "$.splits.ratio"
    assert: "abs(value.train + value.val + value.test - 1) <= 1e-6"
    level: error
  - id: SPLITS.FROZEN_REQUIRED
    when: "$.splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen and splits.freeze_indices == true"
    level: error
  - id: LEAKAGE.GUARD_PRESENT
    when: "$.leakage_guard.policy"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: AUDIT.REPORT_HASH
    when: "$.leakage_guard.audits"
    assert: "has_keys(report, sha256)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
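The rules above are written in an assertion mini-language whose evaluator is not specified in this chapter. As a rough illustration of what they check, the sketch below hand-translates each rule into plain Python over an already parsed config dictionary. The function name lint and the dictionary layout are assumptions; a real linter would interpret the when/assert strings rather than hard-code them.

```python
import copy
import re

ALLOWED_STRATEGIES = {"random", "stratified", "time-based", "spatial-tiles", "systematic"}
GUARD_POLICIES = {"per-object", "per-timewindow", "per-scene"}
REF_RE = re.compile(r"^datasets/[a-z0-9_\-]+@v\d+\.\d+$")


def lint(cfg):
    """Return the IDs of the rules that fail for a parsed config dict."""
    failed = []
    if not REF_RE.match(cfg["data"]["dataset_ref"]):
        failed.append("DATA.REF_FORMAT")
    if cfg["sampling"]["strategy"] not in ALLOWED_STRATEGIES:
        failed.append("SAMPLE.STRATEGY_ALLOWED")
    splits = cfg["splits"]
    ratio = splits["ratio"]
    if abs(ratio["train"] + ratio["val"] + ratio["test"] - 1) > 1e-6:
        failed.append("SPLITS.RATIO_SUM")
    all_frozen = all(splits[name]["frozen"] for name in ("train", "val", "test"))
    if not (all_frozen and splits["freeze_indices"] is True):
        failed.append("SPLITS.FROZEN_REQUIRED")
    if not GUARD_POLICIES.intersection(cfg["leakage_guard"]["policy"]):
        failed.append("LEAKAGE.GUARD_PRESENT")
    if not {"report", "sha256"} <= set(cfg["leakage_guard"]["audits"]):
        failed.append("AUDIT.REPORT_HASH")
    metrology = cfg["metrology"]
    if not (metrology["units"] == "SI" and metrology["check_dim"] is True):
        failed.append("METROLOGY.SI_AND_CHECKDIM")
    return failed


def frozen_split(name):
    """Build one split entry; the digest placeholder is left elided on purpose."""
    return {"frozen": True, "index": f"splits/{name}.index", "sha256": "..."}


good = {
    "data": {"dataset_ref": "datasets/core_cls@v1.0"},
    "sampling": {"strategy": "stratified"},
    "splits": {
        "train": frozen_split("train"),
        "val": frozen_split("val"),
        "test": frozen_split("test"),
        "ratio": {"train": 0.8, "val": 0.1, "test": 0.1},
        "freeze_indices": True,
    },
    "leakage_guard": {
        "policy": ["per-object", "per-timewindow"],
        "audits": {"report": "splits/leakage_report.csv", "sha256": "..."},
    },
    "metrology": {"units": "SI", "check_dim": True},
}

# A deliberately broken copy: unknown strategy and ratios that sum to 0.9.
bad = copy.deepcopy(good)
bad["sampling"]["strategy"] = "bootstrap"
bad["splits"]["ratio"]["train"] = 0.7

good_failures = lint(good)
bad_failures = lint(bad)
```

Note that this sketch assumes every path exists; the when clauses in the rules imply that a rule is skipped when its path is absent, which a fuller implementation would need to handle.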


XI. Cross-Reference Anchors


XII. Chapter Compliance Self-Check


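Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (including text, charts, illustrations, symbols, and formulas) is held by the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided that the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: Energy Filament Theory; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/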