Document index - Technical White Paper (V5.05) 44 - EFT.WP.Data.ModelCards v1.0

Chapter 11 Evaluation Protocol and Metrics


I. Purpose and Scope

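This chapter fixes the normative definition of evaluation: frozen splits and randomness, repeat counts and significance tests, interval estimation and reporting format, robustness and fairness items, and the requirements for online consistency and reproduction experiments. The metric definitions cover classification, retrieval, detection, regression, time-series, and calibration metrics, and must stay consistent with "Tasks and I/O", "Training Data and Sampling Binding", "Preprocessing and Feature Engineering", "Objective Functions, Optimization, and Hyperparameters", and the metrology chapter.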

II. Fields and Structure (Normative)

evaluation:
  protocol:
    splits: "frozen"                  # frozen splits are mandatory
    seeds: [0, 1, 2, 3, 4]            # set of random seeds
    repeats: 5                        # number of repetitions
    batch_size_eval: 256              # evaluation batch size (stated explicitly)
    significance: {test: "permutation|bootstrap", alpha: 0.05}
    ci: {method: "bootstrap-bca", level: 0.95, samples: 1000}
    reporting_time: "wallclock|device-normalized"
    online_consistency: {shadow_mode: true, window: "7d"}
  metrics:                            # metric list (grouped by task)
    classification: ["accuracy", "f1_macro", "roc_auc", "pr_auc", "ece", "brier"]
    retrieval: ["map", "recall@k", "mrr"]
    detection: ["mAP@0.50:0.95", "mAP@0.50", "AR@k"]
    regression: ["rmse", "mae", "mape", "nll"]
    timeseries: ["rmse", "mae", "qloss@{0.1,0.5,0.9}"]
  fairness:                           # fairness evaluation
    axes: ["class", "region", "device"]
    gap_metric: "abs_diff|ratio"
    threshold: 0.05
  robustness:                         # robustness / distribution shift / adversarial
    shift_tests: ["snr_drop", "time_jitter", "spec_notch"]
    adversarial: {enabled: false, norm: "Linf", epsilon: 0.01}
    drop_rel_max: 0.10
  calibration:                        # calibration and coverage
    report: ["ece", "brier", "calibration_curve"]
    coverage: {target_p: 0.95, method: "tolerance|bayes"}
  leakage_checks: ["object", "timewindow", "scene"]
  see:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
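The significance field above allows a permutation test at alpha 0.05. The following is a minimal sketch of one such test, assuming paired per-example scores from two models evaluated on the same frozen split. The function name paired_permutation_test and the n_perm and seed parameters are illustrative assumptions; this specification does not define them.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    scores_a / scores_b: per-example scores of two models on the same
    examples (assumption: higher is better, same ordering in both lists).
    Returns a Monte Carlo p-value.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference
        # is exchangeable, so flip each sign with probability 0.5.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A result below the configured alpha (0.05 here) would be reported as a significant difference between the two models.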


III. Frozen Splits and Randomness


IV. Metric Definitions and Reporting Conventions
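The ci field in Section II specifies method "bootstrap-bca" with level 0.95 and samples 1000. The sketch below uses the simpler percentile bootstrap, not BCa, only to illustrate how level and samples enter the computation. The function name and the seed parameter are assumptions made for illustration.

```python
import random


def bootstrap_percentile_ci(values, level=0.95, samples=1000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    NOTE: this is the plain percentile method, a simpler stand-in for the
    BCa variant named in the configuration.
    """
    if not values:
        raise ValueError("values must be non-empty")

    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement `samples` times and record each mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(samples))

    tail = (1.0 - level) / 2.0
    lower = means[int(tail * (samples - 1))]
    upper = means[int(round((1.0 - tail) * (samples - 1)))]
    return lower, upper
```

A report line would then carry the point estimate together with the returned lower and upper bounds.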


V. Robustness and Distribution Shift


VI. Fairness Evaluation


VII. Calibration and Uncertainty
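Section II lists ece under calibration.report. One common definition groups predictions into confidence bins and averages the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below assumes equal-width bins and a default of 10 bins; both choices, and the function name, are assumptions and are not fixed by this chapter.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|.

    confidences: predicted confidence of the chosen class, each in [0, 1].
    correct: 1/0 (or True/False) flags, one per prediction.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Per bin: [count, sum of confidences, sum of correct flags]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 -> last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += float(hit)

    total = len(confidences)
    ece = 0.0
    for count, sum_conf, sum_hit in bins:
        if count:  # empty bins contribute nothing
            ece += (count / total) * abs(sum_hit / count - sum_conf / count)
    return ece
```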


VIII. Online Consistency and Replay


IX. Machine-Readable Fragment (Directly Embeddable)

evaluation:
  protocol:
    splits: "frozen"
    seeds: [0, 1, 2, 3, 4]
    repeats: 5
    batch_size_eval: 256
    significance: {test: "permutation", alpha: 0.05}
    ci: {method: "bootstrap-bca", level: 0.95, samples: 1000}
    reporting_time: "device-normalized"
  metrics:
    classification: ["f1_macro", "roc_auc", "ece", "brier"]
    detection: ["mAP@0.50:0.95", "mAP@0.50"]
  fairness: {axes: ["class", "region"], gap_metric: "abs_diff", threshold: 0.05}
  robustness: {shift_tests: ["snr_drop", "time_jitter", "spec_notch"], drop_rel_max: 0.10}
  calibration: {report: ["ece", "brier", "calibration_curve"], coverage: {target_p: 0.95, method: "tolerance"}}
  leakage_checks: ["object", "timewindow"]
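The fragment above sets a fairness threshold of 0.05 with gap_metric "abs_diff" and a robustness drop_rel_max of 0.10. The sketch below shows how these two values might be applied as pass/fail gates. It assumes a higher-is-better metric, a strictly positive clean score, and inclusive thresholds. The function name check_gates, the group names, and the result keys are hypothetical.

```python
def check_gates(metric_by_group, clean_score, shifted_scores,
                threshold=0.05, drop_rel_max=0.10):
    """Apply the fairness and robustness limits to already-computed metrics.

    metric_by_group: metric value per group along one fairness axis.
    clean_score: metric on the unshifted frozen test split (must be > 0).
    shifted_scores: metric per shift test, e.g. {"snr_drop": 0.85}.
    """
    if clean_score <= 0:
        raise ValueError("clean_score must be positive for a relative drop")

    # abs_diff gap: spread between the best and worst group.
    values = list(metric_by_group.values())
    gap = max(values) - min(values)

    # Relative drop of each shifted score with respect to the clean score.
    drops = {
        name: (clean_score - score) / clean_score
        for name, score in shifted_scores.items()
    }

    return {
        "fairness_gap": gap,
        "fairness_ok": gap <= threshold,          # assumption: inclusive limit
        "drops": drops,
        "robustness_ok": all(d <= drop_rel_max for d in drops.values()),
    }
```

For example, calling check_gates({"north": 0.90, "south": 0.87}, 0.90, {"snr_drop": 0.85, "time_jitter": 0.78}) passes the fairness gate but fails the robustness gate, because the time_jitter score falls by more than 10% relative to the clean score.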


X. Export Manifest and Audit Trail

export_manifest:
  artifacts:
    - {path: "eval/summary.csv", sha256: "..."}
    - {path: "eval/by_axis_fairness.csv", sha256: "..."}
    - {path: "eval/robustness_shift.csv", sha256: "..."}
    - {path: "eval/calibration_curve.png", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"

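Evaluation tables and figures must be verifiable and consistent with the model card fields; every reference carries the form "Volume name vX.Y:anchor".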
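One way to make the exported artifacts verifiable is to recompute each file's SHA-256 digest and compare it with the manifest entry. The sketch below assumes the artifacts list has already been parsed into Python dictionaries with path and sha256 keys; the function names and the root parameter are illustrative and are not part of the specification.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(artifacts, root="."):
    """Check every manifest entry against the file on disk.

    artifacts: list of {"path": ..., "sha256": ...} dictionaries.
    Returns a list of (path, reason) tuples; an empty list means all
    artifacts exist and match their recorded digests.
    """
    problems = []
    for entry in artifacts:
        target = Path(root) / entry["path"]
        if not target.is_file():
            problems.append((entry["path"], "missing"))
        elif sha256_of(target) != entry["sha256"]:
            problems.append((entry["path"], "sha256 mismatch"))
    return problems
```

An audit step could fail the export whenever verify_manifest returns a non-empty list.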

XI. Chapter Compliance Self-Check


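Copyright and license: Unless otherwise stated, the copyright of "Energy Filament Theory" (including text, tables, illustrations, symbols, and formulas) belongs to the author, 屠广林 (Tu Guanglin).
License (CC BY 4.0): Copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Attribution format (suggested): Author: 屠广林 | Work: "Energy Filament Theory" | Source: energyfilament.org | License: CC BY 4.0
Call for verification: The author works independently and is self-funded, with no employer and no sponsorship. In the next phase, priority will go to the environments most willing to discuss, reproduce, and find faults in public, with no restriction by country. Media and peers worldwide are welcome to use this window to organize verification and to contact us.
Version information: First published: 2025-11-11 | Current version: v6.0+5.05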