46-EFT.WP.Data.Benchmarks v1.0 | Chapter 8: Scoring, Normalization, and Ranking

Chapter 8: Scoring, Normalization, and Ranking

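I. Purpose and Scope

This chapter fixes the specification for scoring, normalization, and ranking: the aggregation conventions across multiple metrics and tasks, weights and anchor baselines, bucketing and tie handling, the linkage between significance and confidence intervals, and leaderboard governance and the stability line. It ensures consistency with task definitions, the metric system, evaluation protocols, metrology, and cross-reference anchors.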

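II. Terms and Dependencies

Terms: score_raw, score_norm, aggregation.levels (task|suite), weights.scheme (uniform|sample_share|expert), normalization.scheme (zscore|minmax|fixed-anchor), anchors, tie_break, rank, stability_line, gating.
Dependencies: metrics and units (Benchmarks v1.0, Chapter 6), evaluation protocol (ModelCards v1.0, Chapter 11), monitoring and metering (Pipeline v1.0, Chapter 12), units and dimensions (Core.Metrology v1.0:check_dim).
Math and symbols: inline symbols are written in backticks; any expression containing a division sign or a compound operator must be parenthesized; the path quantity T_arr takes one of the two forms
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ), with gamma(ell) and d ell declared explicitly; Chinese is not allowed in formulas, symbols, or definitions.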

III. Fields and Structure (Normative)

scoring:
  aggregation:
    levels: ["task", "suite"]                      # aggregation levels
    weights:
      scheme: "uniform|sample_share|expert"        # cross-task weights
      w_i: null                                    # given explicitly when scheme=expert
    metrics:
      include: ["F1_macro", "ECE", "QPS", "latency_ms.p99"]
      directions: {F1_macro: "max", ECE: "min", QPS: "max", "latency_ms.p99": "min"}
      combine: "weighted_mean|geomean|harmonic"    # how metrics are combined
  normalization:
    scheme: "zscore|minmax|fixed-anchor"
    params:
      zscore: {μ_ref: "suite|task|anchors", σ_ref: "suite|task|anchors"}
      minmax: {min_ref: "anchors", max_ref: "anchors"}
      fixed_anchor:
        anchors: ["baseline.logreg", "baseline.rf"]   # baseline IDs
        anchor_scores:
          baseline.logreg: {F1_macro: 0.72, ECE: 0.06}
          baseline.rf: {F1_macro: 0.75, ECE: 0.05}
  ranking:
    objective: "score_norm|score_raw"              # what the ranking is based on
    tie_break: ["F1_macro", "-latency_ms.p99", "model_id"]   # compared in order; a "-" prefix means ascending, lower values first
    buckets: {gold: "top5%", silver: "top20%", bronze: "top40%"}   # buckets
  stability:
    stability_line: "v1.*"
    gating:
      require_ci: true
      min_runs: 3
      significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
  audit:
    export: ["reports/score_breakdown.json", "reports/leaderboard.csv"]

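IV. Scoring and Aggregation Conventions

Per-metric direction: declare a direction max|min for every metric and align directions before combining (for example, use -ECE for ECE, or unify everything to "higher is better" after normalization).
Metric combination: weighted_mean|geomean|harmonic; when metrics differ in dimension, normalize first and combine afterwards.
Hierarchical aggregation: task → suite; the cross-task weights w_i follow one of uniform|sample_share|expert, and for expert the source and owner of the weights must be recorded in the export manifest.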
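
The conventions above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names align, combine, and suite_score are hypothetical, and the sketch assumes that inputs to geomean and harmonic have already been normalized to strictly positive values.

```python
import math

def align(value, direction):
    """Flip 'min' metrics so that higher is always better (the -ECE option)."""
    if direction not in ("max", "min"):
        raise ValueError(f"unknown direction: {direction}")
    return value if direction == "max" else -value

def combine(values, weights, how="weighted_mean"):
    """Combine already-normalized metric values with one of the three modes."""
    if how not in ("weighted_mean", "geomean", "harmonic"):
        raise ValueError(f"unknown combine mode: {how}")
    total = sum(weights)
    if how == "weighted_mean":
        return sum(w * v for w, v in zip(weights, values)) / total
    # ASSUMPTION: geomean and harmonic are only defined here for positive inputs.
    if any(v <= 0 for v in values):
        raise ValueError("geomean/harmonic need strictly positive inputs")
    if how == "geomean":
        return math.exp(sum(w * math.log(v) for w, v in zip(weights, values)) / total)
    return total / sum(w / v for w, v in zip(weights, values))

def suite_score(task_scores, sample_counts=None, scheme="uniform", w_expert=None):
    """Aggregate task-level scores into one suite-level score (task -> suite)."""
    tasks = sorted(task_scores)
    if scheme == "uniform":
        weights = [1.0] * len(tasks)
    elif scheme == "sample_share":
        weights = [float(sample_counts[t]) for t in tasks]
    elif scheme == "expert":
        weights = [float(w_expert[t]) for t in tasks]  # source and owner must be recorded
    else:
        raise ValueError(f"unknown weights.scheme: {scheme}")
    return combine([task_scores[t] for t in tasks], weights, "weighted_mean")
```

With sample_share, a task holding three quarters of the samples contributes three quarters of the suite score, which is why the weight source has to be traceable.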

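V. Normalization Strategies

zscore: score_norm = ( score_raw - μ_ref ) / σ_ref; the source of μ_ref/σ_ref (suite|task|anchors) must be stated explicitly.
minmax: score_norm = ( score_raw - min_ref ) / ( max_ref - min_ref ); min_ref/max_ref should preferably come from fixed anchors.
fixed-anchor: scores are expressed relative to anchor baselines; anchors must be public, reproducible baselines, and the exported artifacts must include their sha256 and an environment lock.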
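
A minimal Python sketch of the three schemes follows. The zscore and minmax functions transcribe the formulas above directly. The text does not give a formula for fixed-anchor, so the version here is an assumption for illustration only: it reports the signed relative gain over the mean anchor score, flipped for min metrics so that higher is better.

```python
def zscore(score_raw, mu_ref, sigma_ref):
    """score_norm = ( score_raw - mu_ref ) / sigma_ref"""
    if sigma_ref <= 0:
        raise ValueError("sigma_ref must be positive")
    return (score_raw - mu_ref) / sigma_ref

def minmax(score_raw, min_ref, max_ref):
    """score_norm = ( score_raw - min_ref ) / ( max_ref - min_ref ), not clipped."""
    if max_ref <= min_ref:
        raise ValueError("max_ref must exceed min_ref")
    return (score_raw - min_ref) / (max_ref - min_ref)

def fixed_anchor(score_raw, anchor_values, direction="max"):
    """ASSUMPTION: relative gain over the mean anchor score (hypothetical formula)."""
    if not anchor_values:
        raise ValueError("at least one anchor is required")
    anchor_mean = sum(anchor_values) / len(anchor_values)
    if anchor_mean == 0:
        raise ValueError("anchor mean must be non-zero")
    gain = (score_raw - anchor_mean) / abs(anchor_mean)
    return gain if direction == "max" else -gain

# Anchor values taken from the example config: F1_macro 0.72 / 0.75, ECE 0.06 / 0.05.
f1_norm = fixed_anchor(0.78, [0.72, 0.75], "max")   # candidate above both anchors
ece_norm = fixed_anchor(0.04, [0.06, 0.05], "min")  # lower ECE counts as a gain
```

Because minmax is left unclipped, a candidate that beats the best anchor scores above 1, which keeps the ordering informative at the top of the leaderboard.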

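VI. Ranking and Tie Handling

Ranking objective: score_norm by default; if score_raw is used, the reason must be stated.
Tie rules: the entries of tie_break[] are compared in order; -metric means lower values win; if everything is equal, order by model_id lexicographically.
Buckets: assign gold/silver/bronze by percentile or by fixed threshold; bucket boundaries and sample size are fixed in the report.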
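
The tie-break and bucketing rules can be sketched as a sort key plus a percentile cutoff. The function names are hypothetical, and two details are assumptions rather than part of the specification: percentile cutoffs are rounded up to a whole rank, and models outside every bucket get None. Fixed-threshold buckets are not covered.

```python
def sort_key(entry, objective, tie_break):
    """Build a tuple so that plain ascending sort yields the ranking order."""
    key = [-entry[objective]]              # higher objective ranks first
    for field in tie_break:
        if field == "model_id":
            key.append(entry["model_id"])  # lexicographic fallback
        elif field.startswith("-"):
            key.append(entry[field[1:]])   # "-metric": lower value wins
        else:
            key.append(-entry[field])      # plain metric: higher value wins
    return tuple(key)

def rank(entries, objective="score_norm",
         tie_break=("F1_macro", "-latency_ms.p99", "model_id")):
    ordered = sorted(entries, key=lambda e: sort_key(e, objective, tie_break))
    return [dict(e, rank=i) for i, e in enumerate(ordered, start=1)]

def bucket(rank_pos, n_models, buckets=(("gold", "top5%"), ("silver", "top20%"),
                                        ("bronze", "top40%"))):
    """Map a 1-based rank to a bucket name using 'topN%' specs."""
    for name, spec in buckets:
        if not (spec.startswith("top") and spec.endswith("%")):
            raise ValueError(f"unsupported bucket spec: {spec}")
        pct = int(spec[3:-1])
        cutoff = (pct * n_models + 99) // 100  # ASSUMPTION: integer ceiling
        if rank_pos <= cutoff:
            return name
    return None

models = [
    {"model_id": "A", "score_norm": 0.9, "F1_macro": 0.80, "latency_ms.p99": 120},
    {"model_id": "B", "score_norm": 0.9, "F1_macro": 0.80, "latency_ms.p99": 100},
    {"model_id": "C", "score_norm": 0.9, "F1_macro": 0.82, "latency_ms.p99": 150},
    {"model_id": "D", "score_norm": 0.5, "F1_macro": 0.90, "latency_ms.p99": 90},
]
leaderboard = rank(models)
```

In this made-up example all of A, B, and C tie on score_norm; C wins on F1_macro, and B beats A on the lower latency_ms.p99.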

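VII. Significance and Confidence Intervals

Significance: bootstrap by default, B≥10k; report Δ and CI_95; apply Holm–Bonferroni when comparing across multiple models.
Gate: if a candidate beats the baseline but p≥α, it must not be promoted; leaderboard updates must satisfy min_runs and require_ci=true.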
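
The gate can be illustrated with a small standard-library sketch. Several choices here are assumptions, not requirements of this chapter: the bootstrap is paired over per-example scores, CI_95 is the percentile interval, the one-sided p-value is approximated by the share of resampled mean differences that are ≤ 0, and the names paired_bootstrap, holm_reject, and may_promote are hypothetical.

```python
import random

def paired_bootstrap(cand, base, B=10000, alpha=0.05, seed=0):
    """Return (delta, (ci_lo, ci_hi), p) for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [c - b for c, b in zip(cand, base)]
    n = len(diffs)
    delta = sum(diffs) / n
    boots = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(B))
    lo = boots[int((alpha / 2) * B)]
    hi = boots[int((1 - alpha / 2) * B) - 1]
    p = sum(1 for d in boots if d <= 0) / B  # ASSUMPTION: simple one-sided estimate
    return delta, (lo, hi), p

def holm_reject(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down; strict '<' mirrors the 'p >= alpha fails' gate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] < alpha / (m - k):
            reject[i] = True
        else:
            break  # once one hypothesis survives, all larger p-values survive too
    return reject

def may_promote(delta, significant, n_runs, has_ci, min_runs=3, require_ci=True):
    """A candidate is promoted only if every gate condition holds."""
    return (delta > 0 and significant and n_runs >= min_runs
            and (has_ci or not require_ci))
```

For three candidates with p-values 0.01, 0.04, and 0.03 at alpha=0.05, only the first passes: 0.01 is below 0.05/3, but 0.03 is not below 0.05/2, so the procedure stops there.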

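VIII. Metrology and Units (SI)

Mandatory: metrology: {units: "SI", check_dim: true}; performance and resource metrics (QPS(1/s), T_inf(ms), ρ(—), net_mbps, size_bytes) follow the Chapter 6 conventions; normalize units before combining.
Path quantities: if scoring involves T_arr, register delta_form/path/measure and pass check_dim using one of the two equivalent forms.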
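
As a numerical illustration of why the two T_arr forms are interchangeable, the sketch below evaluates both with a trapezoid rule. It assumes c_ref is constant along the path, and the sample values of n_eff and ell are made up; it does not reproduce check_dim itself.

```python
def trapz(ys, xs):
    """Trapezoid-rule approximation of the integral of ys over xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )"""
    return (1.0 / c_ref) * trapz(n_eff, ell)

def t_arr_inside(n_eff, ell, c_ref):
    """T_arr = ( ∫ ( n_eff / c_ref ) d ell )"""
    return trapz([n / c_ref for n in n_eff], ell)

ell = [0.0, 1.0, 2.0]    # path coordinate samples (hypothetical, in m)
n_eff = [1.0, 1.5, 2.0]  # dimensionless effective index along the path
c_ref = 2.0              # reference speed (hypothetical, in m/s)

t1 = t_arr_factored(n_eff, ell, c_ref)
t2 = t_arr_inside(n_eff, ell, c_ref)
```

Both forms carry the dimension of length divided by speed, that is, time, which is what a dimension check on T_arr would be expected to confirm.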

IX. Machine-Readable Snippet (Ready to Embed)

scoring:
  aggregation:
    levels: ["task", "suite"]
    weights: {scheme: "sample_share"}
    metrics:
      include: ["F1_macro", "ECE", "latency_ms.p99"]
      directions: {F1_macro: "max", ECE: "min", "latency_ms.p99": "min"}
      combine: "weighted_mean"
  normalization:
    scheme: "fixed-anchor"
    params:
      fixed_anchor:
        anchors: ["baseline.logreg", "baseline.rf"]
        anchor_scores:
          baseline.logreg: {F1_macro: 0.72, ECE: 0.06, "latency_ms.p99": 180}
          baseline.rf: {F1_macro: 0.75, ECE: 0.05, "latency_ms.p99": 170}
  ranking:
    objective: "score_norm"
    tie_break: ["F1_macro", "-latency_ms.p99", "model_id"]
    buckets: {gold: "top5%", silver: "top20%", bronze: "top40%"}
  stability:
    stability_line: "v1.*"
    gating:
      require_ci: true
      min_runs: 3
      significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
  audit:
    export: ["reports/score_breakdown.json", "reports/leaderboard.csv"]
metrology: {units: "SI", check_dim: true}

X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SCORE.METRICS_DIRECTIONS
    when: "$.scoring.aggregation.metrics"
    assert: "has_key(include) and has_key(directions) and len(include) == len(directions)"
    level: error
  - id: NORM.SCHEME_ALLOWED
    when: "$.scoring.normalization.scheme"
    assert: "value in ['zscore','minmax','fixed-anchor']"
    level: error
  - id: NORM.ANCHORS_REQUIRED
    when: "$.scoring.normalization.scheme == 'fixed-anchor'"
    assert: "len($.scoring.normalization.params.fixed_anchor.anchors) >= 1"
    level: error
  - id: RANK.OBJECTIVE_ALLOWED
    when: "$.scoring.ranking.objective"
    assert: "value in ['score_norm','score_raw']"
    level: error
  - id: STABILITY.BOOTSTRAP_PARAMS
    when: "$.scoring.stability.gating.significance"
    assert: "has_keys(method, B, alpha)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
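
The six rules can be checked against a parsed configuration with a short Python function. This is a hand-written sketch, assuming the configuration has already been loaded into nested dicts; it hard-codes each rule rather than interpreting the when/assert expression language, whose exact grammar is not defined in this excerpt, and the function name lint is hypothetical.

```python
def lint(cfg):
    """Return the ids of the rules that the config violates (empty list = pass)."""
    errors = []
    scoring = cfg.get("scoring", {})

    metrics = scoring.get("aggregation", {}).get("metrics")
    if metrics is not None:
        ok = ("include" in metrics and "directions" in metrics
              and len(metrics["include"]) == len(metrics["directions"]))
        if not ok:
            errors.append("SCORE.METRICS_DIRECTIONS")

    norm = scoring.get("normalization", {})
    scheme = norm.get("scheme")
    if scheme is not None and scheme not in ("zscore", "minmax", "fixed-anchor"):
        errors.append("NORM.SCHEME_ALLOWED")
    if scheme == "fixed-anchor":
        anchors = norm.get("params", {}).get("fixed_anchor", {}).get("anchors", [])
        if len(anchors) < 1:
            errors.append("NORM.ANCHORS_REQUIRED")

    objective = scoring.get("ranking", {}).get("objective")
    if objective is not None and objective not in ("score_norm", "score_raw"):
        errors.append("RANK.OBJECTIVE_ALLOWED")

    sig = scoring.get("stability", {}).get("gating", {}).get("significance")
    if sig is not None and not all(k in sig for k in ("method", "B", "alpha")):
        errors.append("STABILITY.BOOTSTRAP_PARAMS")

    met = cfg.get("metrology")
    if met is not None and not (met.get("units") == "SI"
                                and met.get("check_dim") is True):
        errors.append("METROLOGY.SI_AND_CHECKDIM")
    return errors

good = {
    "scoring": {
        "aggregation": {"metrics": {"include": ["F1_macro", "ECE"],
                                    "directions": {"F1_macro": "max", "ECE": "min"}}},
        "normalization": {"scheme": "fixed-anchor",
                          "params": {"fixed_anchor": {"anchors": ["baseline.logreg"]}}},
        "ranking": {"objective": "score_norm"},
        "stability": {"gating": {"significance":
                                 {"method": "bootstrap", "B": 10000, "alpha": 0.05}}},
    },
    "metrology": {"units": "SI", "check_dim": True},
}
```

Note that SCORE.METRICS_DIRECTIONS as written only compares lengths; a stricter variant could also require the key sets of include and directions to match.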

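XI. Cross-Reference Anchors

Metric system and units: see EFT.WP.Data.Benchmarks v1.0, Chapter 6.
Evaluation protocol and statistical settings: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
Online windows and monitoring: see EFT.WP.Data.Pipeline v1.0, Chapter 12.
Unit and dimension checks: see EFT.WP.Core.Metrology v1.0:check_dim.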

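XII. Chapter Compliance Checklist

Aggregation levels, metric set, directions, and combination mode are stated explicitly; the source of cross-task weights is traceable.
The normalization scheme and its parameters (including anchors and reference statistics) are explicit; normalization precedes combination.
The ranking objective, tie rules, and bucket boundaries are fixed; detailed and summary outputs are reproducible.
The significance method and confidence intervals are reported; leaderboard updates meet the minimum number of runs and the significance gate.
SI metrology and check_dim=true are in effect; path quantities are registered and checked using one of the equivalent forms.
The machine-readable snippet can be saved as-is and passes Lint; export_manifest.references[] uses the form "VolumeName vX.Y:anchor".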

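Copyright and license: unless otherwise noted, the copyright of Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (屠广林).
License (CC BY 4.0): copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Attribution format (suggested): Author: 屠广林 | Work: Energy Filament Theory | Source: energyfilament.org | License: CC BY 4.0
Call for verification: the author works independently and is self-funded, with no employer and no sponsorship. The next stage will prioritize environments most willing to discuss, reproduce, and challenge the work in public, regardless of country. Media and peers worldwide are welcome to use this window to organize verification and to contact us.
Version information: first published 2025-11-11 | current version: v6.0+5.05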