Document Catalog - Technical White Paper 46 - EFT.WP.Data.Benchmarks v1.0

Chapter 8 Scoring, Normalization, and Ranking


I. Purpose and Scope

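This chapter fixes the specification for scoring, normalization, and ranking: multi-metric and multi-task aggregation conventions, weights and anchor baselines, bucketing and tie handling, linkage with significance testing and confidence intervals, and leaderboard governance and the stability line. It ensures consistency with the task definitions, the metric system, the evaluation protocol, metrology, and the citation anchors.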

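II. Terms and Dependencies

  1. Terms: score_raw, score_norm, aggregation.levels (task|suite), weights.scheme (uniform|sample_share|expert), normalization.scheme (zscore|minmax|fixed-anchor), anchors, tie_break, rank, stability_line, gating.
  2. Dependencies: metrics and units (Benchmarks v1.0, Chapter 6), evaluation protocol (ModelCards v1.0, Chapter 11), monitoring and metering (Pipeline v1.0, Chapter 12), units and dimensions (Core.Metrology v1.0: check_dim).
  3. Mathematics and notation: inline symbols use backticks; any expression containing a division sign or a compound operator must be parenthesized; the path quantity T_arr is written as
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ), with gamma(ell) and d ell declared. Chinese is not permitted in formulas, symbols, or definitions.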

III. Fields and Structure (Normative)

scoring:
  aggregation:
    levels: ["task", "suite"] # aggregation levels
    weights:
      scheme: "uniform|sample_share|expert" # cross-task weights
      w_i: null # given explicitly when scheme=expert
    metrics:
      include: ["F1_macro", "ECE", "QPS", "latency_ms.p99"]
      directions: {F1_macro: "max", ECE: "min", QPS: "max", "latency_ms.p99": "min"}
      combine: "weighted_mean|geomean|harmonic" # how metrics are combined
  normalization:
    scheme: "zscore|minmax|fixed-anchor"
    params:
      zscore: {μ_ref: "suite|task|anchors", σ_ref: "suite|task|anchors"}
      minmax: {min_ref: "anchors", max_ref: "anchors"}
      fixed_anchor:
        anchors: ["baseline.logreg", "baseline.rf"] # baseline IDs
        anchor_scores:
          baseline.logreg: {F1_macro: 0.72, ECE: 0.06}
          baseline.rf: {F1_macro: 0.75, ECE: 0.05}
  ranking:
    objective: "score_norm|score_raw" # what the ranking is based on
    tie_break: ["F1_macro", "-latency_ms.p99", "model_id"] # compared in order; the prefix "-" means ascending order, lower values rank first
    buckets: {gold: "top5%", silver: "top20%", bronze: "top40%"} # tiers
  stability:
    stability_line: "v1.*"
    gating:
      require_ci: true
      min_runs: 3
      significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
  audit:
    export: ["reports/score_breakdown.json", "reports/leaderboard.csv"]
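
The fixed_anchor and combine fields above can be illustrated with a minimal Python sketch. This chapter does not state the fixed-anchor formula, so the mapping used here (worst anchor maps to 0, best anchor maps to 1, direction-aware) is an assumption, as are the helper names normalize_fixed_anchor and weighted_mean and the candidate's raw scores. The anchor scores and directions are copied from the structure above.

```python
from typing import Dict, Optional

# Anchor scores and metric directions copied from the field structure above.
ANCHOR_SCORES: Dict[str, Dict[str, float]] = {
    "baseline.logreg": {"F1_macro": 0.72, "ECE": 0.06},
    "baseline.rf": {"F1_macro": 0.75, "ECE": 0.05},
}
DIRECTIONS: Dict[str, str] = {"F1_macro": "max", "ECE": "min"}


def normalize_fixed_anchor(
    raw: Dict[str, float],
    anchor_scores: Dict[str, Dict[str, float]],
    directions: Dict[str, str],
) -> Dict[str, float]:
    """ASSUMED formula: (value - worst_anchor) / (best_anchor - worst_anchor).

    For a "max" metric the best anchor is the highest anchor score; for a
    "min" metric it is the lowest. Values above 1 beat every anchor.
    """
    norm: Dict[str, float] = {}
    for metric, value in raw.items():
        anchors = [s[metric] for s in anchor_scores.values() if metric in s]
        if not anchors:
            continue  # no anchor score for this metric; it is left out
        if directions[metric] == "max":
            worst, best = min(anchors), max(anchors)
        else:
            worst, best = max(anchors), min(anchors)
        if best == worst:
            raise ValueError(f"anchors for {metric} span a zero range")
        norm[metric] = (value - worst) / (best - worst)
    return norm


def weighted_mean(
    scores: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """combine="weighted_mean"; uniform weights when none are given."""
    if weights is None:
        weights = {metric: 1.0 for metric in scores}
    total = sum(weights[metric] for metric in scores)
    return sum(scores[metric] * weights[metric] for metric in scores) / total


if __name__ == "__main__":
    # Hypothetical candidate scores, for illustration only.
    candidate = {"F1_macro": 0.78, "ECE": 0.04}
    score_norm = normalize_fixed_anchor(candidate, ANCHOR_SCORES, DIRECTIONS)
    print(score_norm, weighted_mean(score_norm))
```

Under this assumed mapping, every metric points the same way after normalization (higher is better), so "min" metrics such as ECE can be combined with "max" metrics without a separate sign flip.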


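IV. Scoring and Aggregation Conventions


V. Normalization Strategy


VI. Ranking and Tie Handling


VII. Significance and Confidence Intervals


VIII. Metrology and Units (SI)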


IX. Machine-Readable Snippet (Ready to Embed)

scoring:
  aggregation:
    levels: ["task", "suite"]
    weights: {scheme: "sample_share"}
    metrics:
      include: ["F1_macro", "ECE", "latency_ms.p99"]
      directions: {F1_macro: "max", ECE: "min", "latency_ms.p99": "min"}
      combine: "weighted_mean"
  normalization:
    scheme: "fixed-anchor"
    params:
      fixed_anchor:
        anchors: ["baseline.logreg", "baseline.rf"]
        anchor_scores:
          baseline.logreg: {F1_macro: 0.72, ECE: 0.06, "latency_ms.p99": 180}
          baseline.rf: {F1_macro: 0.75, ECE: 0.05, "latency_ms.p99": 170}
  ranking:
    objective: "score_norm"
    tie_break: ["F1_macro", "-latency_ms.p99", "model_id"]
    buckets: {gold: "top5%", silver: "top20%", bronze: "top40%"}
  stability:
    stability_line: "v1.*"
    gating:
      require_ci: true
      min_runs: 3
      significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
  audit:
    export: ["reports/score_breakdown.json", "reports/leaderboard.csv"]
metrology: {units: "SI", check_dim: true}
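
The ranking fields in the snippet above (objective, tie_break, buckets) could be applied roughly as in the following Python sketch. Several points are assumptions rather than rules stated in this chapter: a tie_break key without the "-" prefix is taken to mean higher values rank first, text fields such as model_id are compared in ascending order as a deterministic last resort, and a "topN%" bucket is taken to cover the first ceil(N% of total) positions. The function names rank_models and assign_bucket are hypothetical.

```python
import re
from typing import Dict, List, Optional, Sequence


def rank_models(
    rows: Sequence[Dict[str, object]],
    objective: str = "score_norm",
    tie_break: Sequence[str] = ("F1_macro", "-latency_ms.p99", "model_id"),
) -> List[Dict[str, object]]:
    """Order rows by objective (descending), then by each tie_break key."""
    ordered = list(rows)
    # Python's sort is stable, so sort by the least significant key first.
    for key in reversed(tie_break):
        lower_first = key.startswith("-")
        field = key[1:] if lower_first else key
        # ASSUMPTION: text fields (e.g. model_id) sort ascending.
        is_text = any(isinstance(row[field], str) for row in ordered)
        ordered.sort(key=lambda row: row[field], reverse=not (lower_first or is_text))
    ordered.sort(key=lambda row: row[objective], reverse=True)
    return ordered


def assign_bucket(position: int, total: int, buckets: Dict[str, str]) -> Optional[str]:
    """Return the tightest bucket containing a 1-based rank position.

    ASSUMPTION: "topN%" covers the first ceil(N * total / 100) positions.
    """
    best = None
    for name, spec in buckets.items():
        match = re.fullmatch(r"top(\d+)%", spec)
        if match is None:
            raise ValueError(f"unsupported bucket spec: {spec!r}")
        pct = int(match.group(1))
        cutoff = (pct * total + 99) // 100  # integer ceiling of pct% of total
        if position <= cutoff and (best is None or pct < best[0]):
            best = (pct, name)
    return best[1] if best else None


if __name__ == "__main__":
    # Hypothetical leaderboard rows, for illustration only.
    rows = [
        {"model_id": "m-a", "score_norm": 1.5, "F1_macro": 0.80, "latency_ms.p99": 150},
        {"model_id": "m-b", "score_norm": 1.5, "F1_macro": 0.80, "latency_ms.p99": 120},
        {"model_id": "m-c", "score_norm": 1.8, "F1_macro": 0.79, "latency_ms.p99": 160},
    ]
    buckets = {"gold": "top5%", "silver": "top20%", "bronze": "top40%"}
    for pos, row in enumerate(rank_models(rows), start=1):
        print(pos, row["model_id"], assign_bucket(pos, len(rows), buckets))
```

Because model_id is unique, ending tie_break with it makes the order fully deterministic, which keeps leaderboard exports reproducible across runs.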


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SCORE.METRICS_DIRECTIONS
    when: "$.scoring.aggregation.metrics"
    assert: "has_key(include) and has_key(directions) and len(include) == len(directions)"
    level: error
  - id: NORM.SCHEME_ALLOWED
    when: "$.scoring.normalization.scheme"
    assert: "value in ['zscore','minmax','fixed-anchor']"
    level: error
  - id: NORM.ANCHORS_REQUIRED
    when: "$.scoring.normalization.scheme == 'fixed-anchor'"
    assert: "len($.scoring.normalization.params.fixed_anchor.anchors) >= 1"
    level: error
  - id: RANK.OBJECTIVE_ALLOWED
    when: "$.scoring.ranking.objective"
    assert: "value in ['score_norm','score_raw']"
    level: error
  - id: STABILITY.BOOTSTRAP_PARAMS
    when: "$.scoring.stability.gating.significance"
    assert: "has_keys(method, B, alpha)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
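
The rules above can be checked against an already-parsed configuration with a small hand-written validator. The Python sketch below is one possible reading, not the reference linter: it assumes a rule applies only when its `when` path exists, it takes the configuration as a plain dict (loading the YAML is out of scope), and the names get_path and lint are hypothetical.

```python
from typing import Any, Dict, List

ALLOWED_SCHEMES = ("zscore", "minmax", "fixed-anchor")
ALLOWED_OBJECTIVES = ("score_norm", "score_raw")


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Return the value at a dotted path, or None if any segment is missing."""
    node: Any = doc
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def lint(doc: Dict[str, Any]) -> List[str]:
    """Return the ids of violated rules, in the order the rules are listed."""
    errors: List[str] = []

    metrics = get_path(doc, "scoring.aggregation.metrics")
    if metrics is not None:
        ok = (
            "include" in metrics
            and "directions" in metrics
            and len(metrics["include"]) == len(metrics["directions"])
        )
        if not ok:
            errors.append("SCORE.METRICS_DIRECTIONS")

    scheme = get_path(doc, "scoring.normalization.scheme")
    if scheme is not None and scheme not in ALLOWED_SCHEMES:
        errors.append("NORM.SCHEME_ALLOWED")
    if scheme == "fixed-anchor":
        anchors = get_path(doc, "scoring.normalization.params.fixed_anchor.anchors") or []
        if len(anchors) < 1:
            errors.append("NORM.ANCHORS_REQUIRED")

    objective = get_path(doc, "scoring.ranking.objective")
    if objective is not None and objective not in ALLOWED_OBJECTIVES:
        errors.append("RANK.OBJECTIVE_ALLOWED")

    significance = get_path(doc, "scoring.stability.gating.significance")
    if significance is not None and not {"method", "B", "alpha"} <= set(significance):
        errors.append("STABILITY.BOOTSTRAP_PARAMS")

    metrology = get_path(doc, "metrology")
    if metrology is not None:
        if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
            errors.append("METROLOGY.SI_AND_CHECKDIM")

    return errors
```

All six rules are at level error, so a caller would typically reject the configuration whenever lint returns a non-empty list.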


XI. Cross-Reference Anchors


XII. Chapter Compliance Self-Check


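Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/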