43-EFT.WP.Data.DatasetCards v1.0 | Chapter 17: Examples and Best Practices | Energy Filament Theory

Chapter 17: Examples and Best Practices

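I. Purpose and Scope

This chapter provides end-to-end examples, from a minimal card to a complete card, covering source/sampling, cleaning/preprocessing, labels/ontology, metrology/uncertainty, splits/distribution, quality and baselines, privacy and compliance, release and versioning, and the machine-readable Schema/Lint and API integration. It also lists common failure modes and their fixes. Key names use snake_case throughout, references take the form "volume name + version + anchor", and mathematical expressions are written in backticks with explicit parentheses; Chinese characters are not allowed inside formulas.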

II. Minimal Usable Card (release-grade example that passes Schema and Lint)

dataset_id: "eift.obs.demo"
title: "EIFT Demo Dataset"
version: "v1.0"
summary: "Minimal, release-grade card passing schema and lint with SI metrology and frozen splits."
modality: ["time_series"]
sources: ["doi:10.1234/demo"]
license: "CC-BY-4.0"
access: "open"
provenance:
  collection_method: "simulation"
  time_coverage: "2024-01-01..2024-12-31"
  selection_bias: "none"
sampling:
  strategy: "random"
  rates: {train: 0.8, validation: 0.1, test: 0.1}
  seed: 1701
splits:
  train: {count: 800, ratio: 0.8}
  validation: {count: 100, ratio: 0.1}
  test: {count: 100, ratio: 0.1}
  policy: {leakage_guard: ["per-object"], freeze_indices: true}
preprocess:
  pipeline_id: "prep-minimal-v1"
  steps:
    - {name: "normalize", enabled: true, idempotent: true,
       params: {type: "zscore", stats_from: "train-only"}}
labels:
  schema_version: "v1.0"
  taxonomy:
    root: "event"
    nodes:
      - {id: "Signal", parent: "event", kind: "class", definition: "presence of pattern"}
      - {id: "Noise", parent: "event", kind: "background"}
  class_map: {include: ["Signal", "Noise"], exclude: []}
  encoding: {type: "multi_class", policy: {positive_rules: ["explicit-evidence"], negative_rules: ["contradiction-or-missing-signal"], tie_breaker: "lowest-risk"}}
metrology:
  units: "SI"
  c_ref: 299792458
  time_standard: "UTC"
  angle_unit: "deg"
  check_dim: true
quality:
  gates:
    - {name: "leakage", metric: "leakage_rate", threshold: 0.0}
    - {name: "coverage_min", metric: "split_coverage", threshold: 0.99}
  coverage:
    samples: 1000
    per_class: {"Signal": 420, "Noise": 580}
    ci_method: "bootstrap-bca"
    target_ci: 0.95
baseline:
  tasks: [{name: "cls_signal_vs_noise", type: "classification", split: "test"}]
  metrics: [{name: "f1_macro"}, {name: "roc_auc"}]
  eval_protocol: {splits: "frozen", seeds: [0, 1, 2, 3, 4], repeats: 5,
                  ci: {method: "bootstrap-bca", level: 0.95},
                  significance: {test: "permutation", alpha: 0.05}}
checksums: {}
export_manifest:
  version: "v1.0"
  artifacts: []
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

(Reference anchors take the form "volume name vX.Y:anchor", consistent with the export manifest conventions.)
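The minimal card carries the same split information in three places (sampling.rates, splits, and quality.coverage), so a quick cross-check is useful before running the full Lint. The following is a minimal sketch, not the official Lint implementation: the function name check_split_consistency, the tolerance argument, and the exact set of checks are assumptions made for illustration; the numbers are copied from the card above.

```python
import math

# Values copied from the minimal card; the checking logic below is an assumption.
card = {
    "sampling": {"rates": {"train": 0.8, "validation": 0.1, "test": 0.1}},
    "splits": {
        "train": {"count": 800, "ratio": 0.8},
        "validation": {"count": 100, "ratio": 0.1},
        "test": {"count": 100, "ratio": 0.1},
    },
    "quality": {
        "coverage": {"samples": 1000, "per_class": {"Signal": 420, "Noise": 580}}
    },
}

SPLIT_NAMES = ("train", "validation", "test")


def check_split_consistency(card, tol=1e-6):
    """Return a list of problems; an empty list means the card is consistent."""
    problems = []
    splits = card["splits"]
    rates = card["sampling"]["rates"]
    coverage = card["quality"]["coverage"]

    # Ratios must sum to 1 within the tolerance.
    ratio_sum = math.fsum(splits[name]["ratio"] for name in SPLIT_NAMES)
    if abs(ratio_sum - 1.0) > tol:
        problems.append(f"ratio sum {ratio_sum} is outside 1 +/- {tol}")

    # Split counts and per-class counts must both add up to coverage.samples.
    total = sum(splits[name]["count"] for name in SPLIT_NAMES)
    if total != coverage["samples"]:
        problems.append(f"split counts sum to {total}, expected {coverage['samples']}")
    per_class_total = sum(coverage["per_class"].values())
    if per_class_total != coverage["samples"]:
        problems.append(f"per_class sums to {per_class_total}, expected {coverage['samples']}")

    # Each split ratio must agree with sampling.rates and with its own count.
    for name in SPLIT_NAMES:
        if abs(splits[name]["ratio"] - rates[name]) > tol:
            problems.append(f"{name}: ratio differs from sampling.rates")
        if total and abs(splits[name]["count"] / total - splits[name]["ratio"]) > tol:
            problems.append(f"{name}: count/total differs from ratio")
    return problems


print(check_split_consistency(card))  # prints [] for the card above
```

Returning a list of messages rather than raising on the first problem keeps the output usable as a CI log entry.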

III. Complete Example with a Path-Dependent Quantity (including T_arr)

dataset_id: "eift.radio.toa-set"
title: "Arrival-Time (TOA) Measurement Set"
version: "v1.2"
summary: "Path-dependent arrival-time dataset with iono/tropo/instrumental corrections and full uncertainty budget."
modality: ["radio", "time_series"]
sources: ["doi:10.5678/toa2025"]
provenance:
  collection_method: "beamformed-array"
  instruments: [{name: "LOFAR", station: "DE601"}]
  time_coverage: "2023-01-01..2025-06-30"
  spatial_coverage: "RA[120..240],Dec[-30..+30]"
  selection_bias: "flux-limited, SNR>=7"
sampling:
  strategy: "stratified"
  strata: [{by: "snr_bin", buckets: {"7-10": 300, "10-20": 500, "20+": 700}}]
  rates: {train: 0.8, validation: 0.1, test: 0.1}
  seed: 1701
  audits: ["coverage", "leakage", "class-imbalance"]
preprocess:
  pipeline_id: "toa-prep-v2"
  steps:
    - {name: "rfi_clean", enabled: true, idempotent: true,
       params: {method: "spectral-kurtosis", window: 256, thr_sigma: 5}}
    - {name: "filter", enabled: true, idempotent: true,
       params: {type: "bandpass", f_lo_hz: 1.2e6, f_hi_hz: 3.8e6, order: 5, phase: "zero"}}
    - {name: "normalize", enabled: true, idempotent: true,
       params: {type: "zscore", stats_from: "train-only"}}
labels:
  schema_version: "v1.0"
  taxonomy:
    root: "event"
    nodes:
      - {id: "FRB", parent: "event", kind: "class", definition: "fast radio burst"}
      - {id: "RFI", parent: "event", kind: "artifact", definition: "radio frequency interference"}
  class_map: {include: ["FRB", "RFI"], exclude: []}
  encoding: {type: "multi_class", policy: {positive_rules: ["explicit-evidence"], negative_rules: ["contradiction-or-missing-signal"], tie_breaker: "lowest-risk"}}
path_dependence:
  applies_to: ["T_arr"]
  delta_form: "const-factor" # or "general"
  path: "gamma(ell)"
  measure: "d ell"
  see:
    - "EFT.WP.Core.Equations v1.1:S20-1"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
metrology:
  units: "SI"
  c_ref: 299792458
  time_standard: "UTC"
  angle_unit: "deg"
  check_dim: true
uncertainty:
  model: "GUM"
  components:
    - {name: "thermal", type: "random", value: 2.1, unit: "K", distribution: "normal", coverage: {k: 1.0}}
    - {name: "iono", type: "systematic", value: 0.7e-9, unit: "s", distribution: "normal", coverage: {k: 2.0}, corr_group: "path"}
    - {name: "tropo", type: "systematic", value: 0.3e-9, unit: "s", distribution: "normal", coverage: {k: 2.0}, corr_group: "path"}
  correlation: {posture: "groups", groups: [{name: "path", pairwise: "rho=0.5"}]}
  propagation: {rule: "linear", linearization: "first-order"}
  coverage_policy: {target_p: 0.95, k: 2.0}
splits:
  train: {count: 12000, ratio: 0.8}
  validation: {count: 1500, ratio: 0.1}
  test: {count: 1500, ratio: 0.1}
  policy: {leakage_guard: ["per-object", "per-timewindow"], stratify_by: ["snr_bin"], freeze_indices: true}
distribution:
  packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
  mirrors: ["https://mirror-a.example/toa/", "s3://bucket/toa/"]
checksums:
  package: {sha256: "..."}
  shards:
    - {path: "train-000.tgz", sha256: "..."}
quality:
  gates:
    - {name: "leakage", metric: "leakage_rate", threshold: 0.0}
    - {name: "coverage_min", metric: "split_coverage", threshold: 0.99}
baseline:
  tasks: [{name: "cls_frb_vs_rfi", type: "classification", split: "test"}]
  metrics: [{name: "f1_macro"}, {name: "roc_auc"}, {name: "ece"}]
  eval_protocol: {splits: "frozen", seeds: [0, 1, 2, 3, 4], repeats: 5,
                  ci: {method: "bootstrap-bca", level: 0.95},
                  significance: {test: "permutation", alpha: 0.05}}
privacy:
  policy: "no-PII"
export_manifest:
  version: "v1.2"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "quality/summary.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Core.Equations v1.1:S20-1"

(The path/measure registration and the two equivalent expressions of T_arr follow the metrology conventions.)
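To make the uncertainty block of this card concrete, the sketch below combines the two components in the "path" correlation group (iono and tropo) using first-order linear propagation with the pairwise correlation rho=0.5 from the card. Several points are assumptions, not statements from the specification: each value is read as an expanded uncertainty at its stated coverage factor k, sensitivity coefficients are taken as 1, and the thermal component is left out because it is given in K and would need a sensitivity coefficient to be expressed in seconds. The function name is hypothetical.

```python
import math

# (name, value in seconds, coverage factor k) copied from the card above.
path_components = [("iono", 0.7e-9, 2.0), ("tropo", 0.3e-9, 2.0)]
RHO = 0.5        # pairwise correlation of the "path" group
K_TARGET = 2.0   # coverage_policy.k


def combined_standard_uncertainty(components, rho):
    """First-order combination with unit sensitivities and one common pairwise rho."""
    # Assumption: each value is an expanded uncertainty, so divide by k first.
    u = [value / k for _, value, k in components]
    variance = sum(ui * ui for ui in u)
    # Add the covariance terms 2 * rho * u_i * u_j for every distinct pair.
    for i in range(len(u)):
        for j in range(i + 1, len(u)):
            variance += 2.0 * rho * u[i] * u[j]
    return math.sqrt(variance)


u_c = combined_standard_uncertainty(path_components, RHO)
U = K_TARGET * u_c
print(f"u_c = {u_c:.3e} s, U(k={K_TARGET:g}) = {U:.3e} s")
```

Under these assumptions the combined standard uncertainty of the path group is roughly 0.44 ns and the expanded uncertainty at k=2 roughly 0.89 ns. If the card values were instead meant as standard uncertainties, the division by k would be dropped.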

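IV. Common Failure Modes and Fixes (mapped to Lint rules)

Ratios do not sum to 1: the sum of splits.*.ratio falls outside 1±1e-6 → Fix: align sampling.rates with splits and update the counts.
Leakage: the same object or time window appears in more than one split → Fix: enable policy.leakage_guard, rebuild the indices, and record the audit in quality.gates.
Non-SI units or check_dim not enabled → Fix: set metrology.units="SI" and check_dim=true, and fill in units at the field level.
Non-compliant reference anchor: "Core.DataSpec:EXPORT" → Fix: "EFT.WP.Core.DataSpec v1.0:EXPORT".
Mathematical expression contains Chinese or lacks parentheses → Fix: wrap it in backticks and use parentheses to make operator precedence explicit.
T_arr without a registered path/measure → Fix: fill in delta_form, path="gamma(ell)", and measure="d ell", then pass check_dim.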
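Two of the failure modes above, the reference anchor format and the T_arr path/measure registration, lend themselves to a simple local pre-check. The sketch below is illustrative only: the regular expression is an assumed reading of "volume name vX.Y:anchor" (it also assumes every volume name starts with EFT.WP.), and the function names are hypothetical; the actual Lint rules may be stricter.

```python
import re

# Assumed pattern for "<volume name> vX.Y:<anchor>"; not the official Lint regex.
REFERENCE_RE = re.compile(
    r"^EFT\.WP\.[A-Za-z]+(?:\.[A-Za-z]+)* v\d+\.\d+:[A-Za-z0-9_\-]+$"
)
TARR_REQUIRED = ("delta_form", "path", "measure")


def lint_references(references):
    """Return the references that do not match the assumed anchor format."""
    return [ref for ref in references if not REFERENCE_RE.match(ref)]


def lint_tarr(card):
    """Return the path_dependence keys that are missing when T_arr is declared."""
    block = card.get("path_dependence")
    if not block or "T_arr" not in block.get("applies_to", []):
        return []  # the rule does not apply to this card
    return [key for key in TARR_REQUIRED if not block.get(key)]


refs = [
    "EFT.WP.Core.DataSpec v1.0:EXPORT",
    "EFT.WP.Core.Equations v1.1:S20-1",
    "Core.DataSpec:EXPORT",  # the non-compliant form from the list above
]
print(lint_references(refs))  # only the last entry is reported

card = {"path_dependence": {"applies_to": ["T_arr"], "delta_form": "const-factor"}}
print(lint_tarr(card))  # path and measure are reported as missing
```

This check can only see cards that declare T_arr under path_dependence.applies_to; a card that uses T_arr without declaring it would need a separate scan of its fields.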

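V. Best-Practice Checklist (required reading before release)

References and versions: always use "volume name + version + anchor"; for external material, pin to a stable minor-version line (e.g. v1.2.*).
Metrology consistency: use the same units across the Schema, the examples, and the explanatory text; for composite quantities, normalize the units first and then combine the uncertainties.
Frozen splits: export the indices and ship them with the package; baseline scripts consume the frozen indices so that results stay comparable.
Reproducible cleaning: make every parameter in the pipeline's steps[] explicit, set parameter_lock=true, and export the logs together with the locked configuration.
Privacy compliance: if PII is involved, enable privacy/ethics and enforce regional restrictions and retention policies at the distribution layer.
Transparent reporting: present quality gates, coverage, and significance alongside uncertainty; include tables and figures in export_manifest and verify them.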
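The frozen-splits practice can be sketched as follows: whole objects are assigned to splits with a fixed seed (a per-object leakage guard), and each index is hashed so the digest can be listed in export_manifest. This is one possible approach under stated assumptions, not the procedure prescribed by the specification: the function names, the obj-NNNN identifiers, and the newline-delimited index format are all hypothetical.

```python
import hashlib
import random


def freeze_splits(object_ids, rates, seed):
    """Assign whole objects to splits so that no object crosses a split boundary."""
    unique = sorted(set(object_ids))       # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)    # seeded, independent of global random state
    n = len(unique)
    n_train = round(n * rates["train"])
    n_val = round(n * rates["validation"])
    return {
        "train": sorted(unique[:n_train]),
        "validation": sorted(unique[n_train:n_train + n_val]),
        "test": sorted(unique[n_train + n_val:]),  # remainder goes to test
    }


def index_sha256(ids):
    """SHA-256 of a newline-delimited index (assumed index file format)."""
    payload = "\n".join(ids).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical object identifiers, for illustration only.
objects = [f"obj-{i:04d}" for i in range(1000)]
splits = freeze_splits(objects, {"train": 0.8, "validation": 0.1, "test": 0.1}, seed=1701)
for name, ids in splits.items():
    print(name, len(ids), index_sha256(ids)[:12])
```

Re-running with the same seed, inputs, and Python version reproduces the same indices and digests, which is what lets a baseline script verify that it is consuming the frozen split rather than a regenerated one.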

VI. Example Integration with the Validation API

# Structural and cross-volume validation
curl -s -X POST https://api.eift.org/api/v1/validate_card \
  -H "Authorization: Bearer <token>" \
  -H "x-eift-idempotency: a0d7b6c4-9478-4e4a-9f13-8c1b9f77f111" \
  -H "Content-Type: application/json" \
  -d @card.json

# Metrology consistency
curl -s -X POST https://api.eift.org/api/v1/check_units \
  -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
  -d @card.json

(The API returns a unified structure; all blocking items must be cleared before release.)
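For pipelines that prefer a script over curl, the same requests can be prepared with the Python standard library. The sketch below only builds the request object and does not send it; the endpoint paths and header names are taken from the curl examples, while the helper name build_request, the truncated card, and the choice to generate a fresh UUID as the idempotency key are assumptions. The response structure is not described in this chapter, so it is not parsed here.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.eift.org/api/v1"  # from the curl examples


def build_request(endpoint, card, token, idempotency_key=None):
    """Prepare a POST request carrying the card as JSON; nothing is sent here."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    if idempotency_key is not None:
        headers["x-eift-idempotency"] = idempotency_key
    body = json.dumps(card).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}", data=body, headers=headers, method="POST"
    )


# Truncated card for illustration; a real call would load the full card.json.
card = {"dataset_id": "eift.obs.demo", "version": "v1.0"}
req = build_request(
    "validate_card", card, token="<token>", idempotency_key=str(uuid.uuid4())
)
print(req.get_method(), req.full_url)

# Sending would be: urllib.request.urlopen(req), with a real token in place of <token>.
```

Keeping the request construction separate from the network call makes it straightforward to unit-test the headers and payload in CI without contacting the service.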

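VII. Appendix: Failure-to-Fix Reference Table (can be included in CI logs)

Failed rule	Typical symptom	Fix	Reference
SPLIT.RATIO_SUM	The three split ratios do not sum to 1	Align sampling.rates with splits	Chapter 11; Lint rule set
REFERENCES.FORMAT	Reference lacks version/anchor	Change to "volume name vX.Y:anchor"	Reference conventions; DataSpec export
METROLOGY.SI_AND_CHECKDIM	Inconsistent units / dimension check not enabled	Set units="SI", check_dim=true	Core.Metrology v1.0
PATH.TARR_FIELDS	T_arr without registered path/measure	Fill in delta_form/path/measure	Core.Equations v1.1:S20-*
SYMBOLS.CONFLICT	T_fil and T_trans conflated	Keep tension and transmission coefficient separate	Writing checklist conventions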

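VIII. Chapter Compliance Self-Check

The example cards pass the Chapter 15 Schema and Lint as written, and list their reference anchors and artifact sha256 values in export_manifest.
The example involving T_arr registers delta_form/path/measure and passes check_dim; mathematical expressions use backticks and parentheses and contain no Chinese.
Splits and sampling, cleaning and quality, and privacy and distribution follow consistent conventions; frozen indices and the baseline protocol are in place.
The validation API call examples can be used as written; release is allowed only after all blocking items have been cleared.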