Contents / Documents - Technical White Papers / 45-EFT.WP.Data.Pipeline v1.0
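I. Purpose and Scope
This chapter fixes the pipeline's specification for distribution, splits, and sampling: split definitions and ratios; frozen indices and consistency; stratification and resampling/reweighting; leakage prevention and audit; package building with mirror, rate, and regional-compliance requirements; and integrity checks and the export manifest. It keeps these consistent with the data contract, the frozen splits in the dataset cards, the evaluation protocol in the model cards, the metrology chapter, and the reference anchors.
II. Terms and Dependencies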
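- Terms: sampling.strategy/strata/weights, splits.train/validation/test, freeze_indices, leakage_guard, shard, mirror, checksum.
- Dependencies: contract and export (Core.DataSpec v1.0); unit and dimension checks (Core.Metrology v1.0); splits, coverage, and quality (DatasetCards v1.0); evaluation protocol and frozen splits (ModelCards v1.0).
- Math and notation: inline symbols are written in backticks (e.g. QPS, T_inf, ρ); expressions containing division, integrals, or composite operators must be parenthesized; if the path quantity T_arr is involved, register gamma(ell) and d ell; Chinese characters are not permitted in formulas, symbols, or definitions.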
III. Fields and Structure (normative)
stage:
  name: "split.package|split.export"
  type: "export.splits"
  impl: "I16-5.split_package"
  inputs: ["<Σ_in_feat>"]
  outputs: ["train_pkg", "val_pkg", "test_pkg"]
  splits:
    train: {ratio: 0.8, count: null}  # ratio is primary; count is optional
    validation: {ratio: 0.1, count: null}
    test: {ratio: 0.1, count: null}
  policy:
    sampling:
      strategy: "random|stratified|time-based|spatial-tiles|systematic"
      strata: [{by: "class|region|snr_bin", buckets: {"A": 100, "B": 200}}]
      weights: {class: "inverse_freq|none"}
    leakage_guard: ["per-object", "per-timewindow", "per-scene"]
    freeze_indices: true  # frozen indices for reproducibility
  distribution:
    packaging:
      format: "tgz|zip|parquet|zarr"
      shard_bytes: 134217728  # 128 MiB
      layout: ["train", "validation", "test"]
    mirrors: ["https://mirror-a.example/foo/", "s3://bucket/foo/"]
    rate_limit: {mbps: 50}
    regional_compliance: ["EU-GDPR"]  # if applicable
  checksums:
    package: {sha256: "<hex>"}  # top-level package checksum
    shards:
      - {path: "train-000.tgz", sha256: "<hex>"}
      - {path: "train-001.tgz", sha256: "<hex>"}
  on_fail: "block|quarantine|skip"
  timeout_s: 1800
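IV. Sampling Strategies and Stratification Conventions
- Strategies: random (globally uniform), stratified (by class, region, signal-to-noise bin, and so on), time-based (time windows or periods), spatial-tiles (spatial tiles), systematic (fixed stride).
- Stratification: strata[] must explicitly list buckets and quotas and must match sampling.strata in the dataset card; the actual shares and their deviations must be reported.
- Resampling/reweighting: when weights are applied on the training side, the strategy must be recorded, and its impact and significance must be explained in the evaluation section of the model card.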
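The stratified strategy and the inverse_freq weighting described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function names are hypothetical, the weight formula `total / (n_classes * count)` is one common reading of "inverse frequency" that the specification does not define, and the bucket sizes reuse the A=100, B=200 placeholders from Section III.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split item indices stratum by stratum so each split keeps the class mix."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0])
        n_val = round(len(members) * ratios[1])
        splits["train"] += members[:n_train]
        splits["validation"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]  # remainder goes to test
    return {name: sorted(ids) for name, ids in splits.items()}

def inverse_freq_weights(labels):
    """Assumed formula: weight = total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def class_shares(labels, ids):
    """Actual share of each class inside one split, for deviation reporting."""
    counts = Counter(labels[i] for i in ids)
    return {c: n / len(ids) for c, n in counts.items()}

labels = ["A"] * 100 + ["B"] * 200
splits = stratified_split(labels)
weights = inverse_freq_weights(labels)
train_shares = class_shares(labels, splits["train"])
```

With very small strata, rounding can move an item or two between splits, which is one reason the reported actual shares matter.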
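V. Split Definitions and Freeze Consistency
- Ratio consistency: the sum of train/validation/test.ratio must be 1±1e-6;
- Frozen splits: with freeze_indices=true, export index lists of file/row/object IDs so that results are comparable and reproducible;
- Cross-volume consistency: model-card evaluation must use the frozen indices from this chapter, and those indices must match the frozen splits in the dataset card.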
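The ratio check and the frozen index lists could look like the sketch below. The helper names and the choice of hashing a newline-joined, sorted ID list are assumptions made for illustration; the specification only requires that the index lists be exported and reproducible.

```python
import hashlib

def check_ratio_sum(splits, tol=1e-6):
    """True when the split ratios sum to 1 within the tolerance."""
    total = sum(spec["ratio"] for spec in splits.values())
    return abs(total - 1.0) <= tol

def freeze_indices(split_ids):
    """Sort each split's IDs and attach a sha256 digest of the sorted list.

    Sorting first makes the digest independent of the order in which
    the IDs were produced, so two runs can be compared byte for byte.
    """
    frozen = {}
    for name, ids in split_ids.items():
        ordered = sorted(ids)
        payload = "\n".join(ordered).encode("utf-8")
        frozen[name] = {
            "ids": ordered,
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
    return frozen

ratios_ok = check_ratio_sum(
    {"train": {"ratio": 0.8}, "validation": {"ratio": 0.1}, "test": {"ratio": 0.1}}
)
# Hypothetical object IDs, used only to show the order independence.
frozen_a = freeze_indices({"train": ["obj-3", "obj-1"], "test": ["obj-4"]})
frozen_b = freeze_indices({"train": ["obj-1", "obj-3"], "test": ["obj-4"]})
```

A model-card evaluation can then compare its own digest against the registered one before running.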
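VI. Leakage Prevention and Audit
- Guard granularity: leakage_guard: ["per-object","per-timewindow","per-scene"]; the same object, adjacent time windows, or the same scene must not appear in more than one split;
- Blocking condition: any leakage hit is blocking;
- Audit export: generate splits/leakage_report.csv together with summary statistics, and register its sha256.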
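A per-object guard of the kind listed above can be sketched as a group-overlap check. The record layout (`id`, `split`, `object_id`) and the CSV columns are hypothetical; the specification names the report file but not its schema. The sketch only detects exact key matches, so the "adjacent time windows" rule would need additional logic.

```python
import csv
import hashlib
import io
from collections import defaultdict

def find_leaks(records, guard_key):
    """Return groups (e.g. object IDs) that appear in more than one split."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[guard_key]].add(rec["split"])
    return {group: sorted(s) for group, s in seen.items() if len(s) > 1}

def leakage_report_csv(leaks, guard):
    """Render the leaks as CSV text and return it with its sha256 digest."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["guard", "group", "splits"])
    for group in sorted(leaks):
        writer.writerow([guard, group, "|".join(leaks[group])])
    text = buf.getvalue()
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()

records = [
    {"id": "r1", "split": "train", "object_id": "obj-1"},
    {"id": "r2", "split": "test", "object_id": "obj-1"},  # same object, other split
    {"id": "r3", "split": "validation", "object_id": "obj-2"},
]
leaks = find_leaks(records, "object_id")
report_text, report_sha256 = leakage_report_csv(leaks, "per-object")
blocked = bool(leaks)  # any hit is blocking
```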
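VII. Distribution and Integrity
- Packaging: formats that support streaming or parallel reads are recommended; for tgz/zip archives, provide shards and a checksum table;
- Mirrors and rate: provide at least two mirror endpoints; publish the recommended concurrency and rate limits;
- Integrity: provide sha256 for both the top-level package and each shard (SIZE/LASTMOD may be attached);
- Regional compliance: where privacy-sensitive or geographically sensitive data is involved, declare the regions where distribution is permitted and the compliance basis.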
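Shard integrity checks against a checksum table might be implemented along these lines. The function names and the table layout (a list of `path`/`sha256` entries, mirroring `checksums.shards`) are assumptions; the file contents are placeholders written to a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large shards never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_shards(root, expected):
    """Return the paths whose file is missing or whose digest does not match."""
    failures = []
    for entry in expected:
        target = Path(root) / entry["path"]
        if not target.is_file() or sha256_file(target) != entry["sha256"]:
            failures.append(entry["path"])
    return failures

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train-000.tgz"
    shard.write_bytes(b"example shard bytes")  # placeholder content
    table = [{"path": "train-000.tgz", "sha256": sha256_file(shard)}]
    ok_failures = verify_shards(tmp, table)
    shard.write_bytes(b"corrupted")  # simulate a damaged download
    bad_failures = verify_shards(tmp, table)
```

A consumer would run the same check after downloading from any mirror, which is what makes the mirrors interchangeable.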
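VIII. Metrology and Units (SI)
- Performance: QPS (1/s), T_inf (ms {p50,p95,p99}), utilization ρ (dimensionless); network net_mbps, package size size_bytes;
- metrology: {units:"SI", check_dim:true} is mandatory; normalize units before any composition or aggregation.
- If distribution or splitting involves a path quantity (e.g. T_arr), register delta_form, path="gamma(ell)", and measure="d ell", and use one of the following equivalent forms, which must pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell )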
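The two forms of T_arr agree when c_ref is constant along the path, since a constant factor can be moved outside the integral. The sketch below checks this numerically with a trapezoid rule; the sample values for ell, n_eff, and c_ref are placeholders chosen for illustration, not data from the specification.

```python
import math

def trapezoid(ys, xs):
    """Trapezoid-rule integral of samples ys over positions xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )

# Placeholder inputs: ell in m, n_eff dimensionless, c_ref in m/s,
# so both results carry the dimension of time (s).
c_ref = 299_792_458.0
ell = [0.0, 100.0, 250.0, 400.0]
n_eff = [1.0, 1.2, 1.1, 1.0]

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)

# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_form2 = trapezoid([n / c_ref for n in n_eff], ell)

forms_agree = math.isclose(t_form1, t_form2, rel_tol=1e-12)
```

If c_ref varied along gamma(ell), only the second form would remain meaningful, which is worth recording alongside delta_form.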
IX. Machine-Readable Fragment (directly embeddable)
layers:
  - name: "export"
    stages:
      - name: "split.package"
        type: "export.splits"
        impl: "I16-5.split_package"
        inputs: ["feat_rows"]
        outputs: ["train_pkg", "val_pkg", "test_pkg"]
        splits:
          train: {ratio: 0.8}
          validation: {ratio: 0.1}
          test: {ratio: 0.1}
        policy:
          sampling:
            strategy: "stratified"
            strata: [{by: "class", buckets: {"A": 520, "B": 2100, "C": 12380}}]
          leakage_guard: ["per-object", "per-timewindow"]
          freeze_indices: true
        distribution:
          packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
          mirrors: ["https://mirror-a.example/datasets/foo/", "s3://bucket/foo/"]
          rate_limit: {mbps: 50}
        checksums:
          package: {sha256: "…"}
          shards:
            - {path: "train-000.tgz", sha256: "…"}
            - {path: "train-001.tgz", sha256: "…"}
        on_fail: "block"
        timeout_s: 1800
X. Lint Rules (excerpt, normative)
lint_rules:
  - id: SPLIT.RATIO_SUM
    when: "$.layers[*].stages[?(@.type=='export.splits')].splits"
    assert: "abs(train.ratio + validation.ratio + test.ratio - 1) <= 1e-6"
    level: error
  - id: SPLIT.FREEZE_REQUIRED
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.freeze_indices"
    assert: "value == true"
    level: error
  - id: SPLIT.LEAKAGE_GUARDS
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: DIST.PACKAGING_ALLOWED
    when: "$.layers[*].stages[?(@.type=='export.splits')].distribution.packaging.format"
    assert: "value in ['tgz','zip','parquet','zarr']"
    level: error
  - id: DIST.CHECKSUMS_PRESENT
    when: "$.layers[*].stages[?(@.type=='export.splits')].checksums"
    assert: "has_key('package') and len(shards) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
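The rules above can be approximated in plain Python without a JSONPath engine. This is a sketch under stated assumptions: the function names are hypothetical, the document is taken to be an already-parsed dict with both `layers` and `pipeline` keys, and missing fields are treated as failures, which is stricter than a pure `when`-match reading of the rules.

```python
import copy

ALLOWED_FORMATS = {"tgz", "zip", "parquet", "zarr"}
KNOWN_GUARDS = {"per-object", "per-timewindow", "per-scene"}

def lint_stage(stage):
    """Return the IDs of the rules that one export.splits stage violates."""
    errors = []
    splits = stage.get("splits", {})
    total = sum(splits.get(k, {}).get("ratio", 0.0)
                for k in ("train", "validation", "test"))
    if abs(total - 1) > 1e-6:
        errors.append("SPLIT.RATIO_SUM")
    policy = stage.get("policy", {})
    if policy.get("freeze_indices") is not True:
        errors.append("SPLIT.FREEZE_REQUIRED")
    if not KNOWN_GUARDS & set(policy.get("leakage_guard", [])):
        errors.append("SPLIT.LEAKAGE_GUARDS")
    fmt = stage.get("distribution", {}).get("packaging", {}).get("format")
    if fmt not in ALLOWED_FORMATS:
        errors.append("DIST.PACKAGING_ALLOWED")
    checksums = stage.get("checksums", {})
    if "package" not in checksums or len(checksums.get("shards", [])) < 1:
        errors.append("DIST.CHECKSUMS_PRESENT")
    return errors

def lint_pipeline(doc):
    """Apply the stage rules to every export.splits stage, then the metrology rule."""
    errors = []
    for layer in doc.get("layers", []):
        for stage in layer.get("stages", []):
            if stage.get("type") == "export.splits":
                errors += [(stage.get("name"), e) for e in lint_stage(stage)]
    metrology = doc.get("pipeline", {}).get("metrology", {})
    if metrology.get("units") != "SI" or metrology.get("check_dim") is not True:
        errors.append(("pipeline", "METROLOGY.SI_AND_CHECKDIM"))
    return errors

good = {
    "pipeline": {"metrology": {"units": "SI", "check_dim": True}},
    "layers": [{"name": "export", "stages": [{
        "name": "split.package", "type": "export.splits",
        "splits": {"train": {"ratio": 0.8}, "validation": {"ratio": 0.1},
                   "test": {"ratio": 0.1}},
        "policy": {"leakage_guard": ["per-object", "per-timewindow"],
                   "freeze_indices": True},
        "distribution": {"packaging": {"format": "tgz"}},
        "checksums": {"package": {"sha256": "<hex>"},
                      "shards": [{"path": "train-000.tgz", "sha256": "<hex>"}]},
    }]}],
}
bad = copy.deepcopy(good)
bad["layers"][0]["stages"][0]["splits"]["train"]["ratio"] = 0.7
bad["layers"][0]["stages"][0]["policy"]["freeze_indices"] = False
good_errors = lint_pipeline(good)
bad_errors = lint_pipeline(bad)
```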
XI. Export Manifest and Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "splits/validation.index", sha256: "..."}
    - {path: "splits/test.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "packages/train-001.tgz", sha256: "..."}
    - {path: "splits/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
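A manifest of this shape could be assembled from in-memory artifacts as sketched below. The `build_manifest` helper and the byte contents are hypothetical placeholders; only the key names (`version`, `artifacts`, `references`, `path`, `sha256`) come from the fragment above.

```python
import hashlib

def build_manifest(artifacts, references, version="v1.0"):
    """Build an export_manifest dict from a mapping of relative path -> bytes.

    Paths are sorted so the manifest is stable across runs.
    """
    entries = [
        {"path": path, "sha256": hashlib.sha256(artifacts[path]).hexdigest()}
        for path in sorted(artifacts)
    ]
    return {
        "export_manifest": {
            "version": version,
            "artifacts": entries,
            "references": list(references),
        }
    }

# Placeholder contents standing in for real index and report files.
artifacts = {
    "splits/train.index": b"obj-1\nobj-3\n",
    "splits/test.index": b"obj-4\n",
    "splits/leakage_report.csv": b"guard,group,splits\n",
}
manifest = build_manifest(
    artifacts,
    ["EFT.WP.Core.DataSpec v1.0:EXPORT", "EFT.WP.Core.Metrology v1.0:check_dim"],
)
paths = [a["path"] for a in manifest["export_manifest"]["artifacts"]]
```

Serializing the result to YAML or JSON is left out here, since the specification does not fix the on-disk encoding of the manifest.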
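XII. Chapter Compliance Self-Check
- Split ratios sum to 1±1e-6 and freeze_indices=true; the index lists have been exported and are reproducible.
- sampling.strategy/strata/weights match the dataset card; if training uses resampling/reweighting, it has been recorded and its impact explained in the model-card evaluation.
- Leakage guards are in effect, and any overlap across splits is blocking; the leakage audit report has been generated and its sha256 registered.
- Distribution packages and shards carry sha256, and mirrors and rate limits are stated; any applicable regional compliance is shown in references[].
- metrology.units="SI" and check_dim=true; the units of QPS/T_inf/ρ/net_mbps/size_bytes and similar quantities are consistent.
- export_manifest lists the indices, packages, reports, and reference anchors, and meets the release threshold.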
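Copyright and License (CC BY 4.0)
Copyright notice: unless otherwise stated, the copyright in 《能量丝理论》 (Energy Filament Theory), including its text, charts, illustrations, symbols, and formulas, belongs to the author, Mr. 屠广林.
License: this work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0); copying, reposting, excerpting, adapting, and redistributing for commercial or non-commercial purposes are permitted, provided the author and source are credited.
Attribution format (suggested): Author: 屠广林; Work: 《能量丝理论》 (Energy Filament Theory); Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/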