Chapter 8 Benchmarks and Comparative Scoring (Bench/Score)


I. Purpose & Scope


II. Prerequisites & Inputs


III. Bench Tasks & Comparability


IV. Leakage Prevention & Consistency


V. Metrics & Intervals

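  1. Primary metrics (examples): AUC, ACC, MAE, RMSE, r_phi, ε_flux, Q_res, and Latency_P95/Throughput (if performance constraints apply).
  2. Interval rules:
    • k coverage: U = k·u_c;
    • alpha: t_{ν,1−α/2} or the normal approximation;
    • quantile: e.g. [0.025, 0.975]. Choose one of the three rules for the whole volume and apply it consistently.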
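
The three interval rules above can be sketched in a few lines of Python. This is a minimal illustration, not a normative implementation: the function names are hypothetical, the alpha rule uses the normal approximation rather than the t quantile, and the quantile rule assumes linear interpolation between order statistics, which the text does not specify.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def interval_k(estimate, u_c, k=2.0):
    """k coverage: expanded uncertainty U = k * u_c around the estimate."""
    U = k * u_c
    return estimate - U, estimate + U


def interval_alpha_normal(samples, alpha=0.05):
    """alpha rule, using the normal approximation instead of t_{nu,1-alpha/2}."""
    n = len(samples)
    m = mean(samples)
    se = stdev(samples) / sqrt(n)            # standard error of the mean
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile
    return m - z * se, m + z * se


def interval_quantile(samples, p=(0.025, 0.975)):
    """quantile rule: empirical quantiles (linear interpolation is an assumption)."""
    xs = sorted(samples)

    def q(prob):
        pos = prob * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    return q(p[0]), q(p[1])


# Illustrative data only; these values are not taken from any benchmark run.
samples = [0.70, 0.74, 0.69, 0.73, 0.71, 0.75, 0.72]
print(interval_k(mean(samples), u_c=0.01, k=2))
print(interval_alpha_normal(samples, alpha=0.05))
print(interval_quantile(samples))
```

For small samples the t quantile is wider than the normal one, so the normal approximation shown here understates the interval when ν is small.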

VI. Scoring Mapping


VII. Gates & Decisions

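  1. Align with the thresholds in the Error Budget Card:
    • |ΔT_arr| + U(T_arr) ≤ τ_T;
    • LB(r_phi) ≥ r_phi_min;
    • P95(ε_flux) ≤ ε_flux_guard;
    • p_dim = 1.0, Σ PD.
  2. Release decision: if the core gates pass and Q ≥ Q_base + δQ_min → Pass; otherwise Fail / [Restricted] (publish only qualitative charts and diagnostics).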
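
The gate comparison and release decision above can be expressed as a small function. The sketch below is illustrative only: the `Gates` class, the field names, and every threshold value are hypothetical placeholders rather than values from the Error Budget Card. The Σ PD condition is not modeled because this chapter does not define it, and because the text does not say how to choose between Fail and [Restricted], the two are returned as one combined outcome.

```python
from dataclasses import dataclass


@dataclass
class Gates:
    tau_T: float           # bound on |dT_arr| + U(T_arr)
    r_phi_min: float       # minimum lower bound of r_phi
    eps_flux_guard: float  # guard on P95 of epsilon_flux
    dQ_min: float          # minimum required gain over the baseline Q


def decide(m, Q, Q_base, g):
    """Return (decision, per-gate results) for one method."""
    core = {
        "T_arr": abs(m["dT_arr"]) + m["U_T_arr"] <= g.tau_T,
        "r_phi": m["r_phi_lb"] >= g.r_phi_min,
        "eps_flux": m["eps_flux_p95"] <= g.eps_flux_guard,
        "p_dim": m["p_dim"] == 1.0,
    }
    improved = Q >= Q_base + g.dQ_min
    decision = "pass" if all(core.values()) and improved else "fail_or_restricted"
    return decision, core


# Hypothetical thresholds, chosen only so the example runs.
gates = Gates(tau_T=5e-9, r_phi_min=0.60, eps_flux_guard=0.02, dQ_min=0.05)

# Metric values mirror the scorecard.json example in Section IX.
metrics = {
    "dT_arr": -2.3e-9,
    "U_T_arr": 1.5e-9,
    "r_phi_lb": 0.61,
    "eps_flux_p95": 0.011,
    "p_dim": 1.0,
}

decision, core = decide(metrics, Q=0.78, Q_base=0.62, g=gates)
print(decision, core)
```

Returning the per-gate results alongside the decision makes it easy to report which gate blocked a release.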

VIII. Normative Path Forms

The body text writes gamma(ell) and d ell explicitly; the data side records delta_form; all expressions are parenthesized.


IX. Machine-Readable Configuration and Manifests
A. bench_plan.yaml

version: "1.0.0"
tasks:
  - id: "bench-arrival"
    split: "test"
    metrics: ["DeltaT_arr_s", "Q_res", "p_dim"]
    coverage: { mode: "k", k: 2 }
  - id: "bench-phase"
    split: "test"
    metrics: ["r_phi", "epsilon_flux"]
    coverage: { mode: "quantile", p: [0.025, 0.975] }
baseline: { id: "base-001", version: "1.2.3" }
weights: { DeltaT_arr_s: 0.35, r_phi: 0.25, epsilon_flux: 0.15, p_dim: 0.15, Q_res: 0.10 }
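
A plan like this can be checked before any benchmark runs. The sketch below mirrors the plan as a Python dict to stay free of external dependencies, and it tests two properties that are assumptions rather than stated rules: that the weights sum to 1, and that every weighted metric is produced by some task and vice versa. The helper name `check_plan` is hypothetical.

```python
from math import isclose

# Mirror of bench_plan.yaml as a plain dict (parsing the YAML file is left out).
plan = {
    "version": "1.0.0",
    "tasks": [
        {"id": "bench-arrival", "split": "test",
         "metrics": ["DeltaT_arr_s", "Q_res", "p_dim"],
         "coverage": {"mode": "k", "k": 2}},
        {"id": "bench-phase", "split": "test",
         "metrics": ["r_phi", "epsilon_flux"],
         "coverage": {"mode": "quantile", "p": [0.025, 0.975]}},
    ],
    "baseline": {"id": "base-001", "version": "1.2.3"},
    "weights": {"DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15,
                "p_dim": 0.15, "Q_res": 0.10},
}


def check_plan(plan):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    total = sum(plan["weights"].values())
    if not isclose(total, 1.0, abs_tol=1e-9):  # assumption: weights are normalized
        problems.append(f"weights sum to {total}, expected 1.0")
    measured = {m for task in plan["tasks"] for m in task["metrics"]}
    weighted = set(plan["weights"])
    for name in sorted(weighted - measured):
        problems.append(f"weight given for '{name}' but no task measures it")
    for name in sorted(measured - weighted):
        problems.append(f"metric '{name}' is measured but has no weight")
    return problems


print(check_plan(plan))
```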


B. scorecard.json (example)

{
  "version": "1.0.0",
  "baseline": { "id": "base-001", "Q": 0.62 },
  "method": { "id": "mdl-core", "Q": 0.78 },
  "weights": { "DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15, "p_dim": 0.15, "Q_res": 0.1 },
  "metrics": {
    "DeltaT_arr_s": { "mean": -2.3e-09, "Uk2": 1.5e-09 },
    "r_phi": { "value": 0.72, "lb95": 0.61, "ub95": 0.8 },
    "epsilon_flux": { "median": 0.004, "p95": 0.011 },
    "p_dim": 1.0,
    "Q_res": 0.13
  },
  "decision": "pass",
  "see": [ "EFT.WP.Core.Equations v1.1:S20-1", "Error Budget Card v1.0:Ch.8" ]
}
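
A scorecard can be sanity-checked after it is written. The following sketch parses a trimmed copy of the example and verifies a few internal consistency conditions; the helper name `sanity_check` and the specific checks are assumptions chosen for illustration, not requirements stated in this chapter.

```python
import json

# Trimmed copy of the scorecard example (the "see" field is omitted).
SCORECARD = """
{
  "version": "1.0.0",
  "baseline": { "id": "base-001", "Q": 0.62 },
  "method": { "id": "mdl-core", "Q": 0.78 },
  "weights": { "DeltaT_arr_s": 0.35, "r_phi": 0.25, "epsilon_flux": 0.15,
               "p_dim": 0.15, "Q_res": 0.1 },
  "metrics": {
    "DeltaT_arr_s": { "mean": -2.3e-09, "Uk2": 1.5e-09 },
    "r_phi": { "value": 0.72, "lb95": 0.61, "ub95": 0.8 },
    "epsilon_flux": { "median": 0.004, "p95": 0.011 },
    "p_dim": 1.0,
    "Q_res": 0.13
  },
  "decision": "pass"
}
"""


def sanity_check(card):
    """Return a list of inconsistencies found in a parsed scorecard."""
    issues = []
    m = card["metrics"]
    r = m["r_phi"]
    if not (r["lb95"] <= r["value"] <= r["ub95"]):
        issues.append("r_phi value lies outside [lb95, ub95]")
    e = m["epsilon_flux"]
    if e["median"] > e["p95"]:
        issues.append("epsilon_flux median exceeds p95")
    if m["DeltaT_arr_s"]["Uk2"] < 0:
        issues.append("expanded uncertainty Uk2 is negative")
    if set(card["weights"]) != set(m):
        issues.append("weights and metrics do not cover the same names")
    return issues


card = json.loads(SCORECARD)
print(sanity_check(card))
```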

C. eval_report.md (outline)

# Evaluation Report

- Tasks, splits, seeds

- Metrics with intervals & convergence

- Score mapping, weights, final Q

- Gate comparison & decision


X. Anti-Patterns & Fixes


XI. Cross-References


XII. Checklist