Technical White Paper 43: EFT.WP.Data.DatasetCards v1.0

Chapter 11 Data Splits and Distribution


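I. Purpose and Scope

This chapter fixes the definitions, ratios, and consistency constraints of the train/validation/test splits, and specifies the distribution manifest, the mirror and sharding strategies, integrity checks, and rate-limit and regional compliance. All key names use snake_case; cross-volume references take the form "volume name + version + anchor".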

II. Terms and Dependencies


III. Fields and Structure (Normative)

splits:
  train: {count: <int>, ratio: <0..1>}
  validation: {count: <int>, ratio: <0..1>}
  test: {count: <int>, ratio: <0..1>}
  policy:
    leakage_guard: ["per-object", "per-timewindow"]  # leakage-guard granularity
    stratify_by: ["class", "region", "snr_bin"]      # aligned with sampling.strata
    freeze_indices: true                             # freeze indices to ensure reproducibility
  audit:
    coverage: {by: "class", report: true}
    leakage: {cross_split: "forbid"}
    imbalance: {metric: "gini", threshold: 0.2}
distribution:
  packaging:
    format: "tgz"                # tgz | zip | parquet | zarr | other
    shard_bytes: 134217728       # example: 128 MiB
    layout: ["train", "validation", "test"]
  mirrors: ["https://mirror-a.example/foo/", "s3://bucket/foo/"]
  rate_limit: {mbps: 50}
  regional_compliance: ["EU-GDPR", "CN-DSR"]  # examples only
  checksums:
    package: {sha256: "<hex>"}   # checksum of the top-level package
    shards:
      - {path: "train-000.tgz", sha256: "<hex>"}
      - {path: "train-001.tgz", sha256: "<hex>"}
see:
  - "EFT.WP.Core.DataSpec v1.0:EXPORT"
  - "EFT.WP.Core.Metrology v1.0:check_dim"

(Exported artifacts and reference anchors are recorded in export_manifest and can be verified.)
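
The audit fields above can be illustrated with a small Python sketch. It rests on two assumptions that the text does not spell out: that metric "gini" means the Gini coefficient computed over per-class sample counts, and that cross_split "forbid" means no object identifier may appear in more than one split. The function names gini and cross_split_leaks, and the shape of their inputs, are hypothetical and not part of this specification.

```python
def gini(counts):
    """Gini coefficient of non-negative class counts (0.0 = perfectly balanced).

    Assumption: the card's imbalance metric is the Gini coefficient over
    per-class sample counts; the specification does not define it further.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending values: G = 2*sum(i*x_i)/(n*total) - (n+1)/n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def cross_split_leaks(split_to_object_ids):
    """Return the set of object ids that occur in more than one split."""
    first_seen = {}
    leaks = set()
    for split, ids in split_to_object_ids.items():
        for object_id in set(ids):
            if first_seen.setdefault(object_id, split) != split:
                leaks.add(object_id)
    return leaks


if __name__ == "__main__":
    # Hypothetical per-class counts and object ids, for illustration only.
    class_counts = [500, 450, 550]
    threshold = 0.2  # mirrors imbalance.threshold in the schema above
    print("imbalance ok:", gini(class_counts) <= threshold)
    print("leaks:", cross_split_leaks({"train": ["a", "b"], "test": ["b", "c"]}))
```

A card generator could run both checks before freezing indices and refuse to export when either one fails; whether a failure should block export or only be reported is a policy choice this sketch leaves open.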


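IV. Split Definitions and Consistency Constraints


V. Distribution and Artifact Organization


VI. Linkage with Quality and Baselines


VII. Metrology and Units (when splitting by space-time or frequency)


VIII. Export Manifest and References (Normative)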

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "splits/validation.index", sha256: "..."}
    - {path: "splits/test.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "packages/train-001.tgz", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

(All artifacts must be listed in the export manifest and be verifiable; every reference carries volume name + version + anchor.)
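
As one way to make "listed and verifiable" concrete, the sketch below recomputes the SHA-256 of every artifact named in an export_manifest and reports files that are missing or whose digest differs. It assumes the manifest has already been loaded into a Python dict with the structure shown above and that artifact paths are relative to a local root directory; the function names sha256_of and verify_manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return a list of (path, problem) tuples; an empty list means every artifact verified.

    Assumption: `manifest` is the content of export_manifest as a dict, and
    each artifact path is relative to `root`.
    """
    problems = []
    for artifact in manifest["artifacts"]:
        target = Path(root) / artifact["path"]
        if not target.is_file():
            problems.append((artifact["path"], "missing"))
        elif sha256_of(target) != artifact["sha256"].lower():
            problems.append((artifact["path"], "sha256 mismatch"))
    return problems
```

Calling verify_manifest(manifest, "/path/to/export") after a download or mirror sync gives a simple pass/fail signal; the placeholder digests ("...") in the listing above would of course be reported as mismatches until real values are filled in.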


IX. Example Snippet (can be embedded directly in a card)

splits:
  train: {count: 12000, ratio: 0.8}
  validation: {count: 1500, ratio: 0.1}
  test: {count: 1500, ratio: 0.1}
  policy:
    leakage_guard: ["per-object", "per-timewindow"]
    stratify_by: ["class", "snr_bin"]
    freeze_indices: true
distribution:
  packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
  mirrors: ["https://mirror-a.example/datasets/foo/", "s3://bucket/foo/"]
  rate_limit: {mbps: 50}
  checksums:
    package: {sha256: "…"}
    shards:
      - {path: "train-000.tgz", sha256: "…"}
      - {path: "train-001.tgz", sha256: "…"}
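
The counts and ratios in a snippet like the one above can be cross-checked mechanically. The following Python sketch assumes two consistency rules that this chapter implies but does not state in detail: the three ratios sum to 1, and each declared ratio matches count / total within a tolerance. The function name check_splits and the default tolerance of 1e-3 are hypothetical choices for illustration.

```python
import math

SPLIT_NAMES = ("train", "validation", "test")


def check_splits(splits, tol=1e-3):
    """Return a list of human-readable inconsistencies; empty means consistent.

    Assumed rules (not spelled out in the specification): ratios sum to 1.0,
    and each ratio equals count / total within `tol`.
    """
    errors = []
    total = sum(splits[name]["count"] for name in SPLIT_NAMES)
    ratio_sum = sum(splits[name]["ratio"] for name in SPLIT_NAMES)
    if not math.isclose(ratio_sum, 1.0, abs_tol=tol):
        errors.append(f"ratios sum to {ratio_sum}, expected 1.0")
    for name in SPLIT_NAMES:
        observed = splits[name]["count"] / total if total else 0.0
        declared = splits[name]["ratio"]
        if not math.isclose(observed, declared, abs_tol=tol):
            errors.append(
                f"{name}: declared ratio {declared}, counts give {observed:.4f}"
            )
    return errors


if __name__ == "__main__":
    # Values copied from the example snippet above.
    example = {
        "train": {"count": 12000, "ratio": 0.8},
        "validation": {"count": 1500, "ratio": 0.1},
        "test": {"count": 1500, "ratio": 0.1},
    }
    print(check_splits(example))  # → []
```

The tolerance matters because ratios are often rounded for display while counts are exact integers; a stricter or looser value may suit a given dataset.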


X. Chapter Compliance Self-Check


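Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Provided the author and source are credited, copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes.
Attribution format (suggested): Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/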