Contents | Documents - Technical White Paper 44 - EFT.WP.Data.ModelCards v1.0

Chapter 11 Evaluation Protocol and Metrics


I. Purpose and Scope

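This chapter fixes the normative definition of evaluation: frozen splits and randomness, repetition counts and significance testing, interval estimation and report format, robustness and fairness items, and the requirements for online consistency and reproduction experiments. The metric conventions cover classification, retrieval, detection, regression, time-series, and calibration metrics, and are kept consistent with "Tasks and I/O", "Training Data and Sampling Binding", "Preprocessing and Feature Engineering", "Objective Functions, Optimization, and Hyperparameters", and the metrology chapter.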

II. Fields and Structure (Normative)

evaluation:
  protocol:
    splits: "frozen"                   # frozen splits are mandatory
    seeds: [0,1,2,3,4]                 # set of random seeds
    repeats: 5                         # number of repetitions
    batch_size_eval: 256               # evaluation batch size (stated explicitly)
    significance: {test: "permutation|bootstrap", alpha: 0.05}
    ci: {method: "bootstrap-bca", level: 0.95, samples: 1000}
    reporting_time: "wallclock|device-normalized"
    online_consistency: {shadow_mode: true, window: "7d"}
  metrics:                             # metric list (grouped by task)
    classification: ["accuracy","f1_macro","roc_auc","pr_auc","ece","brier"]
    retrieval: ["map","recall@k","mrr"]
    detection: ["mAP@0.50:0.95","mAP@0.50","AR@k"]
    regression: ["rmse","mae","mape","nll"]
    timeseries: ["rmse","mae","qloss@{0.1,0.5,0.9}"]
  fairness:                            # fairness evaluation
    axes: ["class","region","device"]
    gap_metric: "abs_diff|ratio"
    threshold: 0.05
  robustness:                          # robustness / distribution shift / adversarial
    shift_tests: ["snr_drop","time_jitter","spec_notch"]
    adversarial: {enabled: false, norm: "Linf", epsilon: 0.01}
    drop_rel_max: 0.10
  calibration:                         # calibration and coverage
    report: ["ece","brier","calibration_curve"]
    coverage: {target_p: 0.95, method: "tolerance|bayes"}
  leakage_checks: ["object","timewindow","scene"]
  see:
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
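The protocol fields significance (test, alpha) and ci (method, level, samples) can be illustrated with a minimal standard-library sketch. It is not the normative procedure. The function names, the n_perm default, and the use of per-example paired scores are assumptions made for illustration. The interval shown is a plain percentile bootstrap, which is simpler than the "bootstrap-bca" method named in the spec; a BCa interval would add bias and acceleration corrections on top of it.

```python
import random


def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired sign-flip permutation test on mean(a - b).

    a, b: paired scores of two systems on the same frozen test items.
    Returns a Monte Carlo p-value with the usual +1 correction.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def percentile_bootstrap_ci(values, level=0.95, samples=1000, seed=0):
    """Percentile bootstrap interval for the mean (not BCa)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(samples))
    lo_idx = int((1 - level) / 2 * samples)
    hi_idx = min(samples - 1, int((1 + level) / 2 * samples))
    return means[lo_idx], means[hi_idx]
```

A result would be called significant when the returned p-value is below alpha (0.05 in the spec). The sketch assumes per-example scores: with only five paired per-seed values, an exact sign-flip test cannot produce a two-sided p-value below 2/32 = 0.0625, so it could never reach alpha = 0.05.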


III. Frozen Splits and Randomness


IV. Metric Definitions and Reporting Conventions


V. Robustness and Distribution Shift
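As an illustration of the robustness fields in Section II (shift_tests and drop_rel_max), the following sketch compares a clean score with the score under each shift. It assumes a higher-is-better metric and that drop_rel_max bounds the drop relative to the clean score; the source does not spell out either point, and the function names are hypothetical.

```python
def relative_drop(clean_score, shifted_score):
    """Relative degradation of a higher-is-better metric under a shift."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - shifted_score) / clean_score


def check_robustness(clean_score, shifted_scores, drop_rel_max=0.10):
    """Return per-shift relative drop and a pass/fail flag.

    shifted_scores: mapping from shift test name (e.g. "snr_drop")
    to the metric value measured under that shift.
    """
    report = {}
    for name, score in shifted_scores.items():
        drop = relative_drop(clean_score, score)
        report[name] = {"drop_rel": drop, "passed": drop <= drop_rel_max}
    return report
```

With a clean score of 0.90, a shifted score of 0.85 is a relative drop of about 0.056 and passes, while 0.78 is a drop of about 0.133 and fails the 0.10 limit.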


VI. Fairness Evaluation
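The fairness fields in Section II (axes, gap_metric, threshold) suggest a per-axis comparison between the best and worst group. The sketch below is one possible reading, not a definition taken from the source. In particular, the "ratio" gap is assumed here to mean 1 - worst/best so that the same threshold can be applied to both gap metrics. The group names in the usage note are hypothetical.

```python
def fairness_gap(scores_by_group, gap_metric="abs_diff"):
    """Gap between the best and worst group along one fairness axis."""
    values = list(scores_by_group.values())
    best, worst = max(values), min(values)
    if gap_metric == "abs_diff":
        return best - worst
    if gap_metric == "ratio":
        # Assumption: the ratio gap is expressed as a shortfall from parity.
        if best <= 0:
            raise ValueError("best group score must be positive")
        return 1.0 - worst / best
    raise ValueError(f"unknown gap_metric: {gap_metric}")


def check_fairness(scores_by_group, gap_metric="abs_diff", threshold=0.05):
    """Return (gap, passed) for one axis."""
    gap = fairness_gap(scores_by_group, gap_metric)
    return gap, gap <= threshold
```

For example, check_fairness({"north": 0.91, "south": 0.88, "west": 0.90}) gives a gap of about 0.03, which is within the 0.05 threshold.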


VII. Calibration and Uncertainty
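The calibration report in Section II lists ece and brier. As a rough sketch for the binary case only, both can be computed from predicted positive-class probabilities and 0/1 labels. The equal-width binning, the default of 10 bins, and the function names are assumptions; the source does not fix a binning scheme.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins (binary case)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| by its share of samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that predicts 0.9 on every item but is right only half the time has an ECE of about 0.4 under this definition, while a model that predicts 0.8 and is right 80% of the time has an ECE near 0.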


VIII. Online Consistency and Replay


IX. Machine-Readable Snippet (Ready to Embed)

evaluation:
  protocol:
    splits: "frozen"
    seeds: [0,1,2,3,4]
    repeats: 5
    batch_size_eval: 256
    significance: {test: "permutation", alpha: 0.05}
    ci: {method: "bootstrap-bca", level: 0.95, samples: 1000}
    reporting_time: "device-normalized"
  metrics:
    classification: ["f1_macro","roc_auc","ece","brier"]
    detection: ["mAP@0.50:0.95","mAP@0.50"]
  fairness: {axes: ["class","region"], gap_metric: "abs_diff", threshold: 0.05}
  robustness: {shift_tests: ["snr_drop","time_jitter","spec_notch"], drop_rel_max: 0.10}
  calibration: {report: ["ece","brier","calibration_curve"], coverage: {target_p: 0.95, method: "tolerance"}}
  leakage_checks: ["object","timewindow"]
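The leakage_checks entry names the keys to be checked ("object", "timewindow"). One plausible reading, assumed here and not stated in the source, is that no value of these keys may appear in both the training and the evaluation split. The record layout (one dict per sample with a field named after each key) and the function name are hypothetical.

```python
def find_leakage(train_records, test_records, keys=("object", "timewindow")):
    """Return, per key, the sorted values shared by both splits.

    An empty dict means no overlap was found for any checked key.
    """
    leaks = {}
    for key in keys:
        train_ids = {record[key] for record in train_records}
        test_ids = {record[key] for record in test_records}
        shared = train_ids & test_ids
        if shared:
            leaks[key] = sorted(shared)
    return leaks
```

Any non-empty result would be treated as a failed check under this reading, and the overlapping values point to the samples that need to move or be dropped.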


X. Export Manifest and Audit Trail

export_manifest:
  artifacts:
    - {path: "eval/summary.csv", sha256: "..."}
    - {path: "eval/by_axis_fairness.csv", sha256: "..."}
    - {path: "eval/robustness_shift.csv", sha256: "..."}
    - {path: "eval/calibration_curve.png", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"

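Evaluation tables and figures must be verifiable and consistent with the model-card fields; references take the form "Volume name vX.Y:anchor".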
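One way to make exported artifacts verifiable is to recompute each file's SHA-256 digest and compare it with the sha256 value recorded in export_manifest. The sketch below assumes the manifest has already been parsed into a list of {path, sha256} dicts; the function names and the root parameter are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(artifacts, root="."):
    """Return the paths whose file is missing or whose digest differs."""
    mismatches = []
    for item in artifacts:
        target = Path(root) / item["path"]
        if not target.is_file() or sha256_of_file(target) != item["sha256"]:
            mismatches.append(item["path"])
    return mismatches
```

An empty return value means every listed artifact exists and matches its recorded digest; anything else can be logged in the audit trail.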

XI. Chapter Compliance Self-Check


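Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Provided the author and source are credited, it may be copied, reposted, excerpted, adapted, and redistributed for commercial or non-commercial purposes.
Suggested attribution format: Author: "屠广林"; Work: "Energy Filament Theory" (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/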