46-EFT.WP.Data.Benchmarks v1.0 | Chapter 6: Metrics System and Units

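Chapter 6: Metrics System and Units

I. Purpose and Scope

This chapter fixes the volume's specification of the metrics system and units: metric definitions for classification, regression, ranking, retrieval, detection, generation, multimodal, and ASR tasks; aggregation and windows; thresholds and gates; links to calibration and uncertainty; and performance and resource measurement. It ensures consistency with the data cards, model cards, pipeline, the metrology chapter, and the reference anchors.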

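II. Terms and Dependencies

Terms: higher_is_better, agg (macro|micro|weighted|quant|max|min|mean|sum), window, thresholds, target_ci, calibration (ECE|Brier), perf (QPS|T_inf|ρ|net_mbps|size_bytes|power_w).
Dependencies: metrology and dimensional checking (Core.Metrology v1.0:check_dim); evaluation protocol and aggregation (ModelCards v1.0, Chapter 11); monitoring measurement (Pipeline v1.0, Chapter 12).
Math and symbols: inline symbols are always written in backticks; any expression containing a division sign, an integral, or a compound operator must be parenthesized; the path quantity T_arr takes the form
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ), with gamma(ell) and d ell declared; Chinese is not allowed in formulas, symbols, or definitions.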

III. Metric Fields and Structure (Normative)

metrics:
  - name: "<metric_name>"
    unit: "—|ms|1/s|dB|W|bytes|%"   # SI or dimensionless (—)
    higher_is_better: true|false
    window: "N/A|1m|5m|15m"         # streaming/online scenarios only
    thresholds:
      warn: "<expr>"                # e.g. p99<=200
      block: "<expr>"               # e.g. ECE<=0.05
    weighting:
      scheme: "uniform|sample_share|expert"
      w_i: null                     # fill in when weights are given explicitly
    target_ci:
      method: "bootstrap|t|bayes"
      level: 0.95
    see:
      - "EFT.WP.Core.Metrology v1.0:check_dim"
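
As an illustration of how a consumer might hold one entry of this template in memory, the following is a minimal Python sketch. The class name `MetricSpec`, the flattened field names (`scheme`, `ci_method`, `ci_level`), and the `problems()` helper are hypothetical and not part of this specification; the allowed values are taken from the template and from the window rule in Section VII.

```python
from dataclasses import dataclass, field
from typing import Optional

# Allowed values copied from the template and from the window rule in this chapter.
ALLOWED_WINDOWS = {"N/A", "1m", "5m", "15m"}
ALLOWED_SCHEMES = {"uniform", "sample_share", "expert"}
ALLOWED_CI_METHODS = {"bootstrap", "t", "bayes"}


@dataclass
class MetricSpec:
    """Hypothetical in-memory form of one `metrics:` entry."""
    name: str
    unit: str
    higher_is_better: bool
    window: str = "N/A"
    thresholds: dict = field(default_factory=dict)  # keys: "warn", "block"
    scheme: str = "uniform"                         # weighting.scheme
    w_i: Optional[list] = None                      # weighting.w_i
    ci_method: str = "bootstrap"                    # target_ci.method
    ci_level: float = 0.95                          # target_ci.level

    def problems(self) -> list:
        """Return human-readable violations; an empty list means well-formed."""
        found = []
        if self.window not in ALLOWED_WINDOWS:
            found.append(f"window {self.window!r} is not allowed")
        if self.scheme not in ALLOWED_SCHEMES:
            found.append(f"weighting scheme {self.scheme!r} is not allowed")
        if self.ci_method not in ALLOWED_CI_METHODS:
            found.append(f"target_ci method {self.ci_method!r} is not allowed")
        if not 0.0 < self.ci_level < 1.0:
            found.append(f"target_ci level {self.ci_level} is outside (0, 1)")
        extra = sorted(set(self.thresholds) - {"warn", "block"})
        if extra:
            found.append(f"unknown threshold keys: {extra}")
        return found
```

Calling `MetricSpec(name="latency_ms.p99", unit="ms", higher_is_better=False, window="1m").problems()` returns an empty list, while an entry with `window="2m"` reports a violation.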

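IV. Common Metric Definitions and Conventions

Classification: Acc, F1_macro/F1_micro, ROC_AUC/PR_AUC; the averaging convention (macro, micro, or weighted) must be stated explicitly.
Regression: RMSE, MAE, MAPE, R^2; RMSE and MAE carry the unit of the target quantity, while MAPE and R^2 are dimensionless; normalize units before combining metrics.
Ranking/retrieval: mAP, mAR, MRR, recall@k/precision@k; declare the candidate pool and the deduplication strategy.
Detection: mAP@IoU (e.g. mAP@0.50:0.95), AR@k; the IoU threshold list and the matching criterion are fixed.
NLP/generation/ASR: BLEU/ROUGE/chrF, WER/CER, BERTScore; the annotation guidelines and the text-normalization preprocessing must be fixed in the protocol.
Calibration: ECE, Brier, calibration_curve; for tasks with probabilistic outputs, reporting both ECE and Brier is recommended.
Performance and resources (perf): QPS (1/s), latency_ms.{p50,p95,p99}, ρ (—), net_mbps, size_bytes, power_w; conventions match those of pipeline monitoring.
Uncertainty: when an interval is reported, give the method (bootstrap|t|bayes) and the level; where necessary, align with the composition rules in Chapter 12 of the model cards.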
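
To make the calibration and uncertainty items concrete, the sketch below computes the Brier score and ECE for a binary task and attaches a percentile bootstrap interval. It is a minimal illustration under stated assumptions: binary labels in {0, 1}, probabilities for the positive class, equal-width bins with `n_bins=10`, and a percentile bootstrap with 1000 resamples. The function names are hypothetical, and the evaluation protocol itself is defined in the model cards volume, not here.

```python
import random


def brier_score(probs, labels):
    """Mean squared difference between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins (assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        mean_pos = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(mean_pos - mean_conf)
    return err


def bootstrap_ci(metric, probs, labels, level=0.95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for `metric`, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(probs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([probs[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (n_boot - 1))]
    hi = stats[int((1.0 - alpha) * (n_boot - 1))]
    return lo, hi
```

A report would then carry the point value together with `method: "bootstrap"` and `level: 0.95`, as the target_ci field requires.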

V. Task Family to Metric Mapping (Normative)

families:
  classification: ["Acc","F1_macro","F1_micro","ROC_AUC","PR_AUC","ECE","Brier"]
  regression: ["RMSE","MAE","MAPE","R2"]
  ranking: ["NDCG@k","MRR","precision@k","recall@k"]
  retrieval: ["mAP","mAR","MRR","recall@k","latency_ms.p99","QPS"]
  detection: ["mAP@0.50:0.95","AR@k"]
  nlp: ["BLEU","ROUGE-L","chrF","BERTScore"]
  asr: ["WER","CER","latency_ms.p95"]
  generation: ["BLEU","ROUGE-L","NLL","ECE"]
  perf: ["QPS","latency_ms.p50","latency_ms.p95","latency_ms.p99","ρ","net_mbps","size_bytes","power_w"]
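
The mapping can be checked mechanically. The sketch below tests whether a concrete metric name belongs to a family. It rests on one assumption that this chapter does not state: a trailing `@k` in the mapping is treated as a placeholder for a positive integer cutoff, so that `recall@10` matches `recall@k`. The helper names are hypothetical.

```python
import re

# Copied from the family mapping above.
FAMILIES = {
    "classification": ["Acc", "F1_macro", "F1_micro", "ROC_AUC", "PR_AUC", "ECE", "Brier"],
    "regression": ["RMSE", "MAE", "MAPE", "R2"],
    "ranking": ["NDCG@k", "MRR", "precision@k", "recall@k"],
    "retrieval": ["mAP", "mAR", "MRR", "recall@k", "latency_ms.p99", "QPS"],
    "detection": ["mAP@0.50:0.95", "AR@k"],
    "nlp": ["BLEU", "ROUGE-L", "chrF", "BERTScore"],
    "asr": ["WER", "CER", "latency_ms.p95"],
    "generation": ["BLEU", "ROUGE-L", "NLL", "ECE"],
    "perf": ["QPS", "latency_ms.p50", "latency_ms.p95", "latency_ms.p99",
             "ρ", "net_mbps", "size_bytes", "power_w"],
}


def _pattern(template):
    """Compile a mapping entry into a regex; '@k' becomes a positive integer (assumption)."""
    if template.endswith("@k"):
        return re.compile("^" + re.escape(template[:-1]) + r"[1-9][0-9]*$")
    return re.compile("^" + re.escape(template) + "$")


def metric_in_family(name, family):
    """True if `name` matches any entry listed for `family`; unknown families match nothing."""
    return any(_pattern(t).match(name) for t in FAMILIES.get(family, []))
```

For example, `metric_in_family("recall@10", "retrieval")` is true, while `metric_in_family("WER", "classification")` is false.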

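VI. Aggregation, Weighting, and Normalization

Aggregation: macro = equal weight per class; micro = pooled over all samples; weighted = by sample share or by the specified w_i.
Cross-task weighting: weighting.scheme ∈ {uniform|sample_share|expert}; for expert, the source of the weights must be recorded in the export manifest.
Normalization: zscore|minmax|fixed-anchor; fixed-anchor requires the ID of the anchor baseline (e.g. baseline.logreg).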
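
The three aggregation modes and the normalization schemes can be sketched as follows. Per-class counts are given as (tp, fp, fn) tuples. Two points are assumptions rather than rules from this chapter: zscore uses the population standard deviation, and fixed-anchor is shown as the ratio to the anchor baseline's score, since the chapter only requires that the baseline ID be supplied and does not define the formula. All function names are hypothetical.

```python
import statistics


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def aggregate_f1(counts, agg="macro", w_i=None):
    """counts maps class -> (tp, fp, fn); agg is 'macro', 'micro', or 'weighted'."""
    classes = sorted(counts)
    per_class = [f1(*counts[c]) for c in classes]
    if agg == "macro":      # equal weight per class
        return sum(per_class) / len(per_class)
    if agg == "micro":      # pool the counts over all samples first
        tp, fp, fn = (sum(counts[c][i] for c in classes) for i in range(3))
        return f1(tp, fp, fn)
    if agg == "weighted":   # explicit w_i, else sample share (support = tp + fn)
        if w_i is None:
            support = [counts[c][0] + counts[c][2] for c in classes]
            w_i = [s / sum(support) for s in support]
        return sum(w * v for w, v in zip(w_i, per_class))
    raise ValueError(f"unsupported agg: {agg}")


def normalize(x, scheme, population=None, anchor_value=None):
    """Normalize one score; `population` holds the scores being compared."""
    if scheme == "zscore":
        return (x - statistics.mean(population)) / statistics.pstdev(population)
    if scheme == "minmax":
        lo, hi = min(population), max(population)
        return (x - lo) / (hi - lo)
    if scheme == "fixed-anchor":
        # Assumption: ratio to the anchor baseline's score (e.g. that of baseline.logreg).
        return x / anchor_value
    raise ValueError(f"unsupported scheme: {scheme}")
```

With counts `{"a": (8, 2, 2), "b": (1, 1, 3)}` the three modes give different values, which is why the convention has to be stated next to every reported number.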

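VII. Expressions and Thresholds

Threshold expressions: use Boolean expressions with quantile markers (e.g. latency_ms.p99<=200, ECE<=0.05); block stops a release, warn raises an early warning.
Windows and online metrics: window must be one of 1m|5m|15m|N/A; offline tasks use N/A.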
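
One way to evaluate such expressions without calling `eval` is sketched below. It assumes each expression is a single comparison of the form `<name><op><number>` and that the expression states the condition a metric must satisfy, so a failed block expression stops the release and a failed warn expression raises a warning. The chapter does not fix this grammar. The function names `check`, `gate`, and `window_seconds` are hypothetical.

```python
import operator
import re

_OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
        "<": operator.lt, ">": operator.gt}
# name, comparison operator, numeric limit; two-character operators are tried first
_EXPR = re.compile(r"^\s*([A-Za-z0-9_.@]+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def check(expr, values):
    """Evaluate one threshold expression against a {metric_name: value} mapping."""
    m = _EXPR.match(expr)
    if m is None:
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    name, op, limit = m.groups()
    return _OPS[op](values[name], float(limit))


def gate(thresholds, values):
    """Return 'block', 'warn', or 'pass' (assumed semantics: expressions must hold)."""
    if "block" in thresholds and not check(thresholds["block"], values):
        return "block"
    if "warn" in thresholds and not check(thresholds["warn"], values):
        return "warn"
    return "pass"


def window_seconds(window):
    """Map an allowed window label to seconds; 'N/A' (offline) maps to None."""
    if window == "N/A":
        return None
    if window not in {"1m", "5m", "15m"}:
        raise ValueError(f"window {window!r} is not allowed")
    return int(window[:-1]) * 60
```

For instance, `gate({"block": "ECE<=0.05"}, {"ECE": 0.08})` returns `"block"`, and `window_seconds("5m")` returns 300.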

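VIII. Metrology and Units (SI)

Mandatory: metrology: {units: "SI", check_dim: true}; metric units are expressed in SI or as dimensionless (—); normalize units before composing compound quantities.
Path quantities: if a metric depends on T_arr, register delta_form, path="gamma(ell)", and measure="d ell"; use
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) or
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ), and verify the result with check_dim.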
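
The two forms give the same value whenever c_ref is constant along the path, and the result carries the dimension of time. The sketch below illustrates both points numerically with a trapezoidal rule. The sample values are illustrative only, and the tuple-based dimension bookkeeping is a hypothetical stand-in for check_dim, whose actual interface is defined in Core.Metrology and is not reproduced here.

```python
def trapz(ys, xs):
    """Trapezoidal rule for samples ys taken at path positions xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), with c_ref constant."""
    return (1.0 / c_ref) * trapz(n_eff, ell)


def t_arr_inside(n_eff, ell, c_ref):
    """T_arr = ( ∫ ( n_eff / c_ref ) d ell ), with c_ref constant."""
    return trapz([n / c_ref for n in n_eff], ell)


# Dimensions as (length exponent, time exponent); a stand-in for check_dim.
DIMLESS, LENGTH, TIME, INV_SPEED = (0, 0), (1, 0), (0, 1), (-1, 1)


def dim_mul(a, b):
    """Dimension of a product: exponents add."""
    return tuple(x + y for x, y in zip(a, b))


def t_arr_dimension():
    """(1 / c_ref) [s/m] * n_eff [dimensionless] * d ell [m] should be time."""
    return dim_mul(INV_SPEED, dim_mul(DIMLESS, LENGTH))


# Illustrative samples along gamma(ell): ell in metres, c_ref in m/s.
ell = [0.0, 1.0, 2.0, 3.0]
n_eff = [1.0, 1.2, 1.2, 1.0]
c_ref = 2.0
```

Here `t_arr_factored(n_eff, ell, c_ref)` and `t_arr_inside(n_eff, ell, c_ref)` agree to rounding error, and `t_arr_dimension()` equals `TIME`.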

IX. Machine-Readable Fragment (Ready to Embed)

metrics:
  - name: "F1_macro"
    family: "classification"
    unit: "—"
    higher_is_better: true
    agg: "macro"
    window: "N/A"
    thresholds: {warn: "F1_macro>=0.75", block: "F1_macro>=0.80"}
    target_ci: {method: "bootstrap", level: 0.95}
  - name: "ECE"
    family: "calibration"
    unit: "—"
    higher_is_better: false
    agg: "mean"
    window: "N/A"
    thresholds: {block: "ECE<=0.05"}
  - name: "latency_ms.p99"
    family: "perf"
    unit: "ms"
    higher_is_better: false
    agg: "quant"
    window: "1m"
    thresholds: {warn: "latency_ms.p99<=200", block: "latency_ms.p99<=150"}
  - name: "QPS"
    family: "perf"
    unit: "1/s"
    higher_is_better: true
    agg: "sum"
    window: "1m"

X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: METRIC.NAME_FORMAT
    when: "$.metrics[*].name"
    assert: "matches('^[A-Za-z0-9_.@]+$')"
    level: error
  - id: METRIC.FAMILY_ALLOWED
    when: "$.metrics[*].family"
    assert: "value in ['classification','regression','ranking','retrieval','detection','nlp','asr','generation','multimodal','calibration','perf']"
    level: error
  - id: METRIC.UNIT_SI_OR_DIMLESS
    when: "$.metrics[*].unit"
    assert: "all_units_in_SI(value) or value in ['—','%']"
    level: error
  - id: METRIC.AGG_ALLOWED
    when: "$.metrics[*].agg"
    assert: "value in ['macro','micro','weighted','mean','quant','max','min','sum']"
    level: error
  - id: METRIC.WINDOW_FORMAT
    when: "$.metrics[*].window"
    assert: "value in ['N/A','1m','5m','15m']"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
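
The rules above could be applied to an already parsed document roughly as follows. This is a sketch, not the reference linter: the `lint` function and its return shape are hypothetical, the document is assumed to be a plain dict (for example the result of parsing the fragment in Section IX), and the unit check uses the unit list from the Section III template as a stand-in for `all_units_in_SI`, whose definition lives in Core.Metrology.

```python
import re

NAME_RE = re.compile(r"^[A-Za-z0-9_.@]+$")
FAMILIES = {"classification", "regression", "ranking", "retrieval", "detection",
            "nlp", "asr", "generation", "multimodal", "calibration", "perf"}
AGGS = {"macro", "micro", "weighted", "mean", "quant", "max", "min", "sum"}
WINDOWS = {"N/A", "1m", "5m", "15m"}
# Assumption: stand-in for all_units_in_SI(value), taken from the Section III template.
UNITS = {"—", "%", "ms", "1/s", "dB", "W", "bytes"}


def lint(doc):
    """Return (rule_id, subject) pairs for every violated rule; empty means clean."""
    errors = []
    for m in doc.get("metrics", []):
        name = m.get("name", "")
        if not NAME_RE.match(name):
            errors.append(("METRIC.NAME_FORMAT", name))
        if "family" in m and m["family"] not in FAMILIES:
            errors.append(("METRIC.FAMILY_ALLOWED", name))
        if "unit" in m and m["unit"] not in UNITS:
            errors.append(("METRIC.UNIT_SI_OR_DIMLESS", name))
        if "agg" in m and m["agg"] not in AGGS:
            errors.append(("METRIC.AGG_ALLOWED", name))
        if "window" in m and m["window"] not in WINDOWS:
            errors.append(("METRIC.WINDOW_FORMAT", name))
    metrology = doc.get("metrology")
    if metrology is not None:
        if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
            errors.append(("METROLOGY.SI_AND_CHECKDIM", "metrology"))
    return errors
```

Each rule only fires when its `when` path is present, mirroring the JSONPath selectors, so a missing field is left to the compliance checklist, not flagged here.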

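XI. Cross-Reference Anchors

Evaluation protocol and aggregation: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
Monitoring and online windows: see EFT.WP.Data.Pipeline v1.0, Chapter 12.
Units and dimensional checking: see EFT.WP.Core.Metrology v1.0:check_dim.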

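XII. Chapter Compliance Checklist

Each metric definition contains family/unit/higher_is_better/agg/window/thresholds/target_ci and is consistent with the task family mapping.
Aggregation, weighting, and normalization strategies are explicit, and the source of cross-task weights is traceable.
Calibration and performance metrics are reported together for the tasks where they apply; the interval method and confidence level are stated.
SI metrology and check_dim=true are in effect; if T_arr is involved, delta_form/path/measure are registered and the check passes.
The machine-readable fragment can be written to disk as-is and passes Lint; export_manifest.references[] uses the form "VolumeName vX.Y:anchor".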

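Copyright and license: unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, is held by the author, Tu Guanglin (屠广林).
License (CC BY 4.0): copying, reposting, excerpting, adapting, and redistributing are permitted provided the author and source are credited.
Attribution format (suggested): Author: Tu Guanglin (屠广林) | Work: Energy Filament Theory | Source: energyfilament.org | License: CC BY 4.0
Call for verification: the author works independently and is self-funded, with no employer and no grants. The next phase will prioritize deployment in whichever environment is most willing to discuss the work openly, reproduce it openly, and point out its errors openly, regardless of country. Media and peers in every country are welcome to use this window to organize verification and to contact us.
Version information: first published 2025-11-11 | current version: v6.0+5.05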