Technical White Paper 45: EFT.WP.Data.Pipeline v1.0

Chapter 9 Sampling, Splitting, and Distribution


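I. Purpose and Scope

This chapter fixes the pipeline's specification for sampling, splits, and distribution. It covers split definitions and ratios; frozen indices and consistency; stratification and resampling/reweighting; leakage prevention and audit; package construction with mirrors, rate limits, and regional compliance; and integrity verification and the export manifest. It also ensures consistency with the data contract, the frozen splits of the dataset card, the evaluation protocol of the model card, the metrology chapter, and the reference anchors.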

II. Terms and Dependencies


III. Fields and Structure (Normative)

stage:
  name: "split.package|split.export"
  type: "export.splits"
  impl: "I16-5.split_package"
  inputs: ["<Σ_in_feat>"]
  outputs: ["train_pkg", "val_pkg", "test_pkg"]
  splits:
    train: {ratio: 0.8, count: null}        # ratio is primary; count is optional
    validation: {ratio: 0.1, count: null}
    test: {ratio: 0.1, count: null}
  policy:
    sampling:
      strategy: "random|stratified|time-based|spatial-tiles|systematic"
      strata: [{by: "class|region|snr_bin", buckets: {"A": 100, "B": 200}}]
      weights: {class: "inverse_freq|none"}
    leakage_guard: ["per-object", "per-timewindow", "per-scene"]
    freeze_indices: true                    # freeze indices for reproducibility
  distribution:
    packaging:
      format: "tgz|zip|parquet|zarr"
      shard_bytes: 134217728                # 128 MiB
      layout: ["train", "validation", "test"]
    mirrors: ["https://mirror-a.example/foo/", "s3://bucket/foo/"]
    rate_limit: {mbps: 50}
    regional_compliance: ["EU-GDPR"]        # if applicable
  checksums:
    package: {sha256: "<hex>"}              # top-level package checksum
    shards:
      - {path: "train-000.tgz", sha256: "<hex>"}
      - {path: "train-001.tgz", sha256: "<hex>"}
  on_fail: "block|quarantine|skip"
  timeout_s: 1800
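
The interaction of splits, leakage_guard, and freeze_indices can be illustrated with a minimal Python sketch. It is not the I16-5.split_package implementation: it assumes hash-based assignment (a technique chosen here for illustration), a hypothetical row field object_id as the "per-object" group key, and a hypothetical salt string. Hashing the group key keeps every row of one object in the same split and makes the result reproducible. The ratios are only met approximately, and stratification is not covered.

```python
import hashlib

def assign_split(group_id: str, ratios: dict, salt: str = "v1.0") -> str:
    """Map a group key (e.g. an object ID) to a split via a stable hash."""
    digest = hashlib.sha256(f"{salt}:{group_id}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # deterministic value in [0, 1)
    cumulative = 0.0
    name = None
    for name, ratio in ratios.items():
        cumulative += ratio
        if u < cumulative:
            return name
    return name  # fallback if the ratio sum rounds slightly below 1

def freeze_indices(rows: list, ratios: dict, group_key: str = "object_id") -> dict:
    """Return {split name: sorted row indices}; identical on every run."""
    index = {name: [] for name in ratios}
    for i, row in enumerate(rows):
        index[assign_split(str(row[group_key]), ratios)].append(i)
    return index

# Hypothetical input: 300 rows, three rows per object.
rows = [{"object_id": f"obj{i // 3}"} for i in range(300)]
frozen = freeze_indices(rows, {"train": 0.8, "validation": 0.1, "test": 0.1})
```

Because the assignment depends only on the salt and the group key, re-running the stage yields the same index files, which is the property freeze_indices: true asks for.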


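IV. Sampling Strategy and Stratification Conventions


V. Split Definitions and Freeze Consistency


VI. Leakage Prevention and Audit


VII. Distribution and Integrity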


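VIII. Metrology and Units (SI)

  1. Performance: QPS (1/s), T_inf (ms, {p50,p95,p99}), utilization ρ (dimensionless); network net_mbps; package size size_bytes.
  2. metrology: {units:"SI", check_dim:true} is mandatory; normalize units before any composition or aggregation.
  3. If distribution or splitting involves a path quantity (such as T_arr), register delta_form, path="gamma(ell)", and measure="d ell", and use one of the following equivalent forms, which must pass check_dim:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell )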
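
The equivalence of the two T_arr forms can be checked numerically with a small sketch. Everything concrete here is an assumption for illustration: the value of C_REF, the n_eff profile, and the path length are hypothetical, and the two forms agree only because c_ref is treated as constant along the path. With n_eff dimensionless, d ell in metres, and c_ref in m/s, both forms come out in seconds.

```python
import math

C_REF = 299_792_458.0  # m/s; assumed constant reference speed (illustrative)

def n_eff(ell: float) -> float:
    """Hypothetical dimensionless effective index at path position ell (m)."""
    return 1.0 + 0.1 * math.sin(ell / 1000.0)

def trapezoid(f, a: float, b: float, steps: int = 10_000) -> float:
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

def t_arr_factored(length_m: float) -> float:
    """T_arr = (1 / c_ref) * ∫ n_eff d ell, in seconds."""
    return trapezoid(n_eff, 0.0, length_m) / C_REF

def t_arr_inline(length_m: float) -> float:
    """T_arr = ∫ (n_eff / c_ref) d ell, in seconds."""
    return trapezoid(lambda ell: n_eff(ell) / C_REF, 0.0, length_m)

# The two forms should agree up to floating-point rounding.
agree = math.isclose(t_arr_factored(5000.0), t_arr_inline(5000.0), rel_tol=1e-9)
```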

IX. Machine-Readable Fragment (Ready to Embed)

layers:
  - name: "export"
    stages:
      - name: "split.package"
        type: "export.splits"
        impl: "I16-5.split_package"
        inputs: ["feat_rows"]
        outputs: ["train_pkg", "val_pkg", "test_pkg"]
        splits:
          train: {ratio: 0.8}
          validation: {ratio: 0.1}
          test: {ratio: 0.1}
        policy:
          sampling:
            strategy: "stratified"
            strata: [{by: "class", buckets: {"A": 520, "B": 2100, "C": 12380}}]
          leakage_guard: ["per-object", "per-timewindow"]
          freeze_indices: true
        distribution:
          packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
          mirrors: ["https://mirror-a.example/datasets/foo/", "s3://bucket/foo/"]
          rate_limit: {mbps: 50}
        checksums:
          package: {sha256: "…"}
          shards:
            - {path: "train-000.tgz", sha256: "…"}
            - {path: "train-001.tgz", sha256: "…"}
        on_fail: "block"
        timeout_s: 1800
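
One way to fill the checksums block is sketched below in Python. The function names are hypothetical, and so is the rule for the top-level package digest: this chapter does not state how package.sha256 is derived, so the sketch assumes it is the SHA-256 of the ordered "digest, two spaces, path" lines of the shards. Treat that rule as a placeholder for whatever the packaging tool actually defines.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_checksums(shard_dir: Path, pattern: str = "*.tgz") -> dict:
    """Build a dict shaped like the `checksums` block for all matching shards."""
    shards = [
        {"path": p.name, "sha256": sha256_file(p)}
        for p in sorted(shard_dir.glob(pattern))
    ]
    # Assumption: the package digest covers the ordered shard digest lines.
    listing = "\n".join(f'{s["sha256"]}  {s["path"]}' for s in shards)
    package = hashlib.sha256(listing.encode("utf-8")).hexdigest()
    return {"package": {"sha256": package}, "shards": shards}
```

Sorting the shard paths before hashing keeps the package digest independent of directory listing order.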


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SPLIT.RATIO_SUM
    when: "$.layers[*].stages[?(@.type=='export.splits')].splits"
    assert: "abs(train.ratio + validation.ratio + test.ratio - 1) <= 1e-6"
    level: error
  - id: SPLIT.FREEZE_REQUIRED
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.freeze_indices"
    assert: "value == true"
    level: error
  - id: SPLIT.LEAKAGE_GUARDS
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: DIST.PACKAGING_ALLOWED
    when: "$.layers[*].stages[?(@.type=='export.splits')].distribution.packaging.format"
    assert: "value in ['tgz','zip','parquet','zarr']"
    level: error
  - id: DIST.CHECKSUMS_PRESENT
    when: "$.layers[*].stages[?(@.type=='export.splits')].checksums"
    assert: "has_key('package') and len(shards) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
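
The stage-level rules above can be approximated in plain Python, as in the sketch below. It is not the lint engine itself: it hard-codes the assertions instead of evaluating the when/assert expressions, the function name lint_stage is hypothetical, and the input is assumed to be one export.splits stage already parsed into a dict. METROLOGY.SI_AND_CHECKDIM is pipeline-level and is left out.

```python
ALLOWED_GUARDS = {"per-object", "per-timewindow", "per-scene"}
ALLOWED_FORMATS = {"tgz", "zip", "parquet", "zarr"}

def lint_stage(stage: dict) -> list:
    """Return the IDs of the stage-level rules that the given stage violates."""
    errors = []

    splits = stage.get("splits", {})
    total = sum(
        (splits.get(name, {}).get("ratio") or 0.0)
        for name in ("train", "validation", "test")
    )
    if abs(total - 1) > 1e-6:
        errors.append("SPLIT.RATIO_SUM")

    policy = stage.get("policy", {})
    if policy.get("freeze_indices") is not True:
        errors.append("SPLIT.FREEZE_REQUIRED")
    if not ALLOWED_GUARDS & set(policy.get("leakage_guard", [])):
        errors.append("SPLIT.LEAKAGE_GUARDS")

    fmt = stage.get("distribution", {}).get("packaging", {}).get("format")
    if fmt not in ALLOWED_FORMATS:
        errors.append("DIST.PACKAGING_ALLOWED")

    checksums = stage.get("checksums", {})
    if "package" not in checksums or len(checksums.get("shards", [])) < 1:
        errors.append("DIST.CHECKSUMS_PRESENT")

    return errors
```

An empty result means the stage passes these five checks; since every rule has level error, any returned ID would be expected to fail the build.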


XI. Export Manifest and Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "splits/validation.index", sha256: "..."}
    - {path: "splits/test.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "packages/train-001.tgz", sha256: "..."}
    - {path: "splits/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
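
An audit pass over the manifest might look like the following sketch. It assumes the manifest has been parsed into a dict with the shape shown above and that artifact paths are relative to a local export root. The function name and the three status strings are hypothetical; how each status maps onto on_fail ("block", "quarantine", "skip") is left to the pipeline.

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest: dict, root: Path) -> dict:
    """Return {artifact path: "ok" | "missing" | "mismatch"} for each artifact."""
    report = {}
    for entry in manifest["export_manifest"]["artifacts"]:
        target = root / entry["path"]
        if not target.is_file():
            report[entry["path"]] = "missing"
            continue
        digest = hashlib.sha256()
        with target.open("rb") as fh:
            # Read in 1 MiB chunks so large shards are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        matches = digest.hexdigest() == entry["sha256"].lower()
        report[entry["path"]] = "ok" if matches else "mismatch"
    return report
```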


XII. Chapter Compliance Self-Check


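Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. 屠广林).
License: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Provided that the author and source are credited, copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes.
Suggested attribution format: Author: 屠广林; Work: "Energy Filament Theory" (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/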