Contents / Documentation: Technical White Papers / 46-EFT.WP.Data.Benchmarks v1.0
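I. Chapter Purpose and Scope
Fix the specification for task and scenario modeling: io_mode, input priors and constraints, the evaluated object (model/system/pipeline), tracks and switching rules, and resource and tool availability; ensure consistency with data cards, model cards, pipelines, metrology, and reference anchors.
II. Task Objects and Relationships (Normative)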
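- Object hierarchy: suite.tasks[*] → subtasks[*] → items_ref.
- Minimum requirement: every task must declare dataset_ref, io_mode, protocol, metrics[], splits, and leakage_guard.
- Evaluated object: evaluatee may be model|system|pipeline; interface assumptions and boundaries must be stated explicitly in assumptions/constraints.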
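The minimum requirement above can be checked mechanically. The following is a minimal sketch, assuming a task has already been loaded into a Python dict; the helper name missing_required_keys and the sample task are hypothetical and are not defined by this specification.

```python
# Hypothetical helper (not part of this specification): report which of the
# required task keys listed in section II are absent from a loaded task.
REQUIRED_TASK_KEYS = (
    "dataset_ref", "io_mode", "protocol", "metrics", "splits", "leakage_guard",
)


def missing_required_keys(task: dict) -> list[str]:
    """Return the required keys the task does not declare, in listed order."""
    return [key for key in REQUIRED_TASK_KEYS if key not in task]


# Illustrative, deliberately incomplete task.
example_task = {
    "id": "retrieval.doc",
    "io_mode": "offline",
    "dataset_ref": "datasets/retr_core@v1.0",
    "metrics": [{"name": "mAP"}],
}
print(missing_required_keys(example_task))
```

An empty list means the task meets the minimum requirement; anything else names the keys that still have to be declared.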
III. Fields and Structure (Normative)
task:
  id: "<task.id>"
  title: "<Human-readable Title>"
  io_mode: "offline|online|stream|interactive"
  evaluatee: "model|system|pipeline"
  dataset_ref: "datasets/<name>@vX.Y"
  assumptions:
    inputs: ["<x schema or modality>"]
    outputs: ["<y schema or semantics>"]
    priors: ["<domain priors|knowledge>"]
  constraints:
    resources: {qps: "<=100", latency_ms: {p99: "<=200"}, memory_gb: "<=16"}
    tools_allowed: false
    retrieval: false
    open_book: false
  tracks: ["closed-book", "open-book?", "tools?"]
  sampling:
    strategy: "random|stratified|time-based|spatial-tiles|systematic"
    strata: [{by: "<label|locale|domain|difficulty>", buckets: {"A": 100, "B": 200}}]
  splits:
    train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
    val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
    test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
  leakage_guard: ["per-object", "per-timewindow", "per-scene"]
  protocol:
    seed: 1701
    repeats: 5
    temperature: 0.0
    tools_allowed: false
    runtime_limits: {timeout_s: 3600}
  metrics:
    - {name: "Acc|F1_macro|ROC_AUC|PR_AUC|RMSE|mAP|ECE|NLL|WER|BLEU|...", unit: "—|SI", higher_is_better: true}
  aggregation:
    levels: ["task", "suite"]
    weights: {task: "uniform|sample_share|expert"}
    normalize: {scheme: "zscore|minmax|fixed-anchor", anchors: ["<baselineA>", "<baselineB>"]}
  significance:
    method: "bootstrap|permutation"
    B: 10000
    alpha: 0.05
    correction: "Holm-Bonferroni|none"
  env:
    hardware: {cpu: "<Nc>", mem_gb: "<GB>", gpu: "<N|0>"}
    os: "ubuntu-22.04"
    containers: ["ghcr.io/eift/runner@sha256:<hex>"]
    deps_lock: "env.lock"
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
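Each entry under splits records an index path and a sha256 value. The sketch below shows one way a runner might confirm that a frozen split has not changed. It assumes, without confirmation from this specification, that sha256 is the digest of the raw bytes of the index file; the function names and the demo file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_is_intact(split: dict, root: Path) -> bool:
    """True when the split is frozen and its index file matches the recorded digest."""
    if not split.get("frozen", False):
        return False
    return sha256_of_file(root / split["index"]) == split["sha256"].lower()


# Demo with a throwaway index file; real index contents are not specified here.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "splits").mkdir()
    index_path = root / "splits" / "test.index"
    index_path.write_bytes(b"item-0001\nitem-0002\n")
    recorded = {"frozen": True, "index": "splits/test.index",
                "sha256": sha256_of_file(index_path)}
    print(split_is_intact(recorded, root))   # digest matches the recorded value
    index_path.write_bytes(b"item-0001\nitem-0003\n")
    print(split_is_intact(recorded, root))   # file changed after freezing
```

Reading in chunks keeps memory use flat for large index files.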
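IV. Scenario Modeling and Track Rules
- Scenario elements: domain|locale|difficulty|noise|constraints; sub-slices are expressed as slice:{dimension:[buckets]}.
- Track switching: closed-book|open-book|tools may be mutually exclusive or combinable; after a switch, quotas such as tools, retrieval, and context length must be synchronized in assumptions/constraints and in protocol.
- Interactive/streaming: io_mode:"interactive|stream" requires defining the number of rounds (rounds), the maximum context length, and the latency tolerance.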
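The synchronization rule above lends itself to an automated check. The sketch below is one possible reading, not a normative implementation. It flags unknown track names, a tools_allowed value that differs between constraints and protocol, and missing interaction settings for interactive or stream tasks. The field names max_context and latency_tolerance_ms are assumptions made for illustration; this chapter names only rounds explicitly.

```python
ALLOWED_TRACKS = {"closed-book", "open-book", "tools"}
# Assumed field names for interactive/stream settings; only "rounds" appears
# in the chapter text, the other two are hypothetical.
INTERACTION_KEYS = ("rounds", "max_context", "latency_tolerance_ms")


def track_sync_issues(task: dict) -> list[str]:
    """Return human-readable problems found in the task's track settings."""
    issues = []
    tracks = task.get("tracks") or []
    unknown = [t for t in tracks if t not in ALLOWED_TRACKS]
    if unknown:
        issues.append(f"unknown tracks: {unknown}")
    constraints = task.get("constraints", {})
    protocol = task.get("protocol", {})
    if constraints.get("tools_allowed") != protocol.get("tools_allowed"):
        issues.append("tools_allowed differs between constraints and protocol")
    if task.get("io_mode") in {"interactive", "stream"}:
        for key in INTERACTION_KEYS:
            if key not in protocol:
                issues.append(f"protocol.{key} missing for io_mode={task['io_mode']}")
    return issues


# Illustrative task with a deliberate mismatch.
demo = {
    "io_mode": "interactive",
    "tracks": ["closed-book", "tools"],
    "constraints": {"tools_allowed": True},
    "protocol": {"tools_allowed": False, "rounds": 4},
}
for issue in track_sync_issues(demo):
    print(issue)
```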
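V. Constraints and Resources
- Resource constraints: measured in SI units: QPS (1/s), latency_ms.{p50,p95,p99}, memory_gb, power_w.
- Tools and external knowledge: the truth values of tools_allowed/retrieval/open_book determine whether external APIs, retrieval, or knowledge bases are permitted; if permitted, the interface, latency budget, and caching policy must be fixed.
- Compliance module: when privacy, data residency, or third parties are involved, register anchors in see[] and export_manifest.references[].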
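Resource bounds in this chapter are written as strings such as "<=200". As a rough illustration, the sketch below parses such bounds and compares them with measured values. The parsing grammar (one comparison operator followed by a number), the function names, and the measurements are all assumptions made for this example.

```python
import operator
import re

_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}
_BOUND = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9]+(?:\.[0-9]+)?)\s*$")


def satisfies(bound: str, measured: float) -> bool:
    """Evaluate a bound string such as '<=200' against a measured value."""
    match = _BOUND.match(bound)
    if match is None:
        raise ValueError(f"unsupported bound: {bound!r}")
    op, limit = match.groups()
    return _OPS[op](measured, float(limit))


def check_resources(resources: dict, measured: dict, prefix: str = "") -> dict:
    """Walk nested bounds and return {dotted_name: passed} for every leaf."""
    results = {}
    for name, bound in resources.items():
        dotted = f"{prefix}{name}"
        if isinstance(bound, dict):
            results.update(check_resources(bound, measured, dotted + "."))
        else:
            results[dotted] = satisfies(bound, measured[dotted])
    return results


limits = {"qps": "<=200", "latency_ms": {"p99": "<=150"}}
observed = {"qps": 180.0, "latency_ms.p99": 162.5}  # made-up measurements
print(check_resources(limits, observed))
```

Flattening nested keys into dotted names mirrors the latency_ms.p99 naming already used for metrics in this chapter.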
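VI. Metrics and Aggregation Conventions
- Metric definitions: list common metrics for classification, retrieval, detection, regression, generation, ASR, NLP, and similar tasks; calibration metrics (ECE/Brier) are recommended for tasks with probabilistic outputs.
- Aggregation: macro/micro/weighted must be stated explicitly; cross-task weights w_i take one of uniform|sample_share|expert.
- Normalization: zscore|minmax|fixed-anchor; anchors include public baseline IDs.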
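To make the cross-task weighting concrete, the sketch below computes a suite score as a weighted sum of task scores under the uniform and sample_share schemes. It assumes the task scores have already been normalized to a common scale. The task id qa.short, the sample counts, and the scores are invented for illustration; expert weights are treated as externally supplied and are not derived here.

```python
def task_weights(sample_counts: dict[str, int], scheme: str) -> dict[str, float]:
    """Derive cross-task weights w_i that sum to 1 for the given scheme."""
    if scheme == "uniform":
        return {task: 1.0 / len(sample_counts) for task in sample_counts}
    if scheme == "sample_share":
        total = sum(sample_counts.values())
        return {task: count / total for task, count in sample_counts.items()}
    # "expert" weights cannot be computed from counts; they must be provided.
    raise ValueError(f"weights for scheme {scheme!r} must be supplied explicitly")


def suite_score(task_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of already-normalized task scores."""
    return sum(weights[task] * score for task, score in task_scores.items())


# Illustrative numbers only.
counts = {"retrieval.doc": 700, "qa.short": 300}
scores = {"retrieval.doc": 0.60, "qa.short": 0.80}
for scheme in ("uniform", "sample_share"):
    print(scheme, round(suite_score(scores, task_weights(counts, scheme)), 4))
```

With sample_share, the larger task pulls the suite score toward its own result, which is why the weight source has to be declared explicitly.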
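VII. Metrology and Units (SI)
- Mandatory: metrology:{units:"SI", check_dim:true}; all metrics and resources use SI; normalize units before combining composite quantities.
- Path quantities (e.g., T_arr): if the task involves arrival time, register delta_form, path="gamma(ell)", and measure="d ell", and adopt:
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell ); verify with check_dim.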
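The two T_arr expressions give the same value whenever c_ref is constant along the path, since a constant factor can be moved across the integral sign. The sketch below illustrates this numerically with a trapezoid rule. It assumes n_eff is dimensionless and takes c_ref to be the vacuum speed of light in m/s; both are assumptions for this example, as are the sample values. It does not reproduce the check_dim routine.

```python
import math

C_REF = 299_792_458.0  # m/s; assumption: c_ref taken as the vacuum speed of light


def trapezoid(ys: list[float], xs: list[float]) -> float:
    """Composite trapezoid rule for samples ys at increasing positions xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


# Illustrative path samples (not taken from the specification).
ell = [0.0, 1000.0, 2000.0, 3000.0]      # path length in metres
n_eff = [1.0, 1.0003, 1.0005, 1.0002]    # assumed dimensionless

# Form 1: constant factor outside the integral.
t_arr_outside = (1.0 / C_REF) * trapezoid(n_eff, ell)
# Form 2: factor inside the integrand.
t_arr_inside = trapezoid([n / C_REF for n in n_eff], ell)

# Dimension bookkeeping: [1] * [m] / [m/s] = [s] for both forms.
print(f"{t_arr_outside:.6e} s, {t_arr_inside:.6e} s")
print(math.isclose(t_arr_outside, t_arr_inside, rel_tol=1e-12))
```

If c_ref varied along the path, only the second form would remain meaningful, because the factor could no longer be pulled outside the integral.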
VIII. Machine-Readable Fragment (Directly Embeddable)
tasks:
  - id: "retrieval.doc"
    title: "Document Retrieval"
    io_mode: "offline"
    evaluatee: "system"
    dataset_ref: "datasets/retr_core@v1.0"
    assumptions: {inputs: ["query"], outputs: ["doc_ids"], priors: ["BM25-compatible semantics"]}
    constraints:
      resources: {qps: "<=200", latency_ms: {p99: "<=150"}}
      tools_allowed: true
      retrieval: true
      open_book: true
    tracks: ["closed-book", "open-book"]
    sampling: {strategy: "stratified", strata: [{by: "locale", buckets: {"en": 70, "zh": 30}}]}
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "..."}
      val: {frozen: true, index: "splits/val.index", sha256: "..."}
      test: {frozen: true, index: "splits/test.index", sha256: "..."}
    leakage_guard: ["per-object", "per-scene"]
    protocol: {seed: 1701, repeats: 5, temperature: 0.0, tools_allowed: true, runtime_limits: {timeout_s: 3600}}
    metrics:
      - {name: "mAP", unit: "—", higher_is_better: true}
      - {name: "recall@k", unit: "—", higher_is_better: true}
      - {name: "latency_ms.p99", unit: "ms", higher_is_better: false}
    aggregation: {levels: ["task", "suite"], weights: {task: "sample_share"}, normalize: {scheme: "fixed-anchor", anchors: ["baseline.bm25"]}}
    significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
    env: {hardware: {cpu: "16c", mem_gb: 64, gpu: 0}, os: "ubuntu-22.04", containers: ["ghcr.io/eift/runner@sha256:<hex>"], deps_lock: "env.lock"}
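The fragment above sets correction to "Holm-Bonferroni" with alpha 0.05. For readers unfamiliar with the procedure, here is a minimal sketch of the standard Holm step-down rule applied to a set of p-values. The p-values are invented for illustration, and nothing here describes how the evaluation harness actually produces them.

```python
def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Return {name: rejected} using the Holm step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # step-down: stop at the first hypothesis that is not rejected
    return rejected


# Made-up p-values for the three metrics in the fragment.
p_demo = {"mAP": 0.004, "recall@k": 0.030, "latency_ms.p99": 0.040}
print(holm_bonferroni(p_demo, alpha=0.05))
```

In this example only the smallest p-value clears its threshold (0.05 / 3); the next one, 0.030, exceeds 0.05 / 2, so it and every larger p-value are retained.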
IX. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: TASK.ID_FORMAT
    when: "$.tasks[*].id"
    assert: "matches('^[a-z0-9_.\\-]+$')"
    level: error
  - id: TASK.REQUIRED_KEYS
    when: "$.tasks[*]"
    assert: "has_keys(io_mode, dataset_ref, protocol, metrics, splits, leakage_guard)"
    level: error
  - id: TASK.TRACKS_SWITCH
    when: "$.tasks[*].tracks"
    assert: "value == null or all(track in ['closed-book','open-book','tools'] for track in value)"
    level: error
  - id: TASK.SPLITS_FROZEN
    when: "$.tasks[*].splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen"
    level: error
  - id: TASK.METRICS_UNITS_SI
    when: "$.tasks[*].metrics[*].unit"
    assert: "all_units_in_SI(value) or value == '—'"
    level: error
  - id: TASK.METROLOGY_REQUIRED
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
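The rule engine that evaluates these assert expressions is not described in this chapter. As a rough illustration of what two of the rules require, the sketch below hand-codes TASK.ID_FORMAT and TASK.SPLITS_FROZEN against a task dict. The function name lint_task and the sample tasks are hypothetical, and this is not the official linter.

```python
import re

# Same pattern as TASK.ID_FORMAT after YAML unescaping of "\\-".
ID_PATTERN = re.compile(r"^[a-z0-9_.\-]+$")


def lint_task(task: dict) -> list[tuple[str, str]]:
    """Return (rule_id, level) pairs for each rule the task violates."""
    findings = []
    if not ID_PATTERN.match(str(task.get("id", ""))):
        findings.append(("TASK.ID_FORMAT", "error"))
    splits = task.get("splits", {})
    all_frozen = all(
        splits.get(name, {}).get("frozen") is True
        for name in ("train", "val", "test")
    )
    if not all_frozen:
        findings.append(("TASK.SPLITS_FROZEN", "error"))
    return findings


good = {
    "id": "retrieval.doc",
    "splits": {name: {"frozen": True} for name in ("train", "val", "test")},
}
bad = {
    "id": "Retrieval Doc",  # uppercase letters and a space violate the pattern
    "splits": {"train": {"frozen": True}, "val": {"frozen": True},
               "test": {"frozen": False}},
}
print(lint_task(good))
print(lint_task(bad))
```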
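X. Cross-Reference Anchors
- Evaluation protocol and metrics: see EFT.WP.Data.ModelCards v1.0, Ch. 11.
- Frozen splits and distribution: see EFT.WP.Data.DatasetCards v1.0, Ch. 11.
- Performance and resource constraints: see EFT.WP.Data.Pipeline v1.0, Ch. 10/12/13.
- Units and dimensional checking: see EFT.WP.Core.Metrology v1.0:check_dim.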
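XI. Chapter Compliance Self-Check
- The task definition includes io_mode/dataset_ref/protocol/metrics/splits/leakage_guard and is consistent with the suite hierarchy.
- Scenario and track switching rules are clear; tool, retrieval, and openness settings are synchronized with constraints/protocol.
- Coverage and sampling strata are explicit, frozen splits are enabled, and the leakage guard is blocking.
- Metric, aggregation, and significance settings are complete; the source of cross-task weights is stated; normalization anchors are traceable.
- SI metrology and check_dim=true are in effect; if T_arr is involved, delta_form/path/measure are registered and pass verification.
- The machine-readable fragment can be saved to disk as-is and passes Lint; export_manifest.references[] uses the form "Volume Name vX.Y:anchor".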
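Copyright and License (CC BY 4.0)
Copyright notice: unless otherwise stated, the copyright of "Energy Filament Theory" (《能量丝理论》), including text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: this work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/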