46-EFT.WP.Data.Benchmarks v1.0 | Chapter 4: Task Definition and Scenario Modeling

Chapter 4: Task Definition and Scenario Modeling

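I. Purpose and Scope

This chapter fixes the specification for task and scenario modeling: io_mode, input priors and constraints, the evaluation object (model/system/pipeline), tracks and their switching rules, and the availability of resources and tools. It also ensures consistency with the dataset cards, model cards, pipeline, metrology, and reference anchors.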

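II. Task Objects and Relationships (Normative)

Object hierarchy: suite.tasks[*] → subtasks[*] → items_ref.
Minimum requirement: every task must declare dataset_ref, io_mode, protocol, metrics[], splits, and leakage_guard.
Evaluation object: evaluatee may be model|system|pipeline; interface assumptions and boundaries must be stated explicitly in assumptions/constraints.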
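
The minimum requirement above can be checked mechanically. The following is a minimal sketch, assuming a task has already been loaded into a plain Python dict; the function names `missing_task_keys` and `check_task` are hypothetical and not part of this specification.

```python
# Required keys and evaluatee values are taken from Section II;
# the function names below are illustrative only.
REQUIRED_KEYS = ("dataset_ref", "io_mode", "protocol", "metrics", "splits", "leakage_guard")
EVALUATEES = {"model", "system", "pipeline"}


def missing_task_keys(task: dict) -> list[str]:
    """Return the required keys that are absent from a task mapping."""
    return [key for key in REQUIRED_KEYS if key not in task]


def check_task(task: dict) -> list[str]:
    """Collect human-readable problems for one task mapping."""
    problems = [f"missing key: {key}" for key in missing_task_keys(task)]
    evaluatee = task.get("evaluatee")
    if evaluatee is not None and evaluatee not in EVALUATEES:
        problems.append(f"unknown evaluatee: {evaluatee}")
    return problems
```

An empty list from `check_task` means the task meets the minimum declaration requirement; it does not say anything about whether the declared values are themselves valid.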

III. Fields and Structure (Normative)

task:
  id: "<task.id>"
  title: "<Human-readable Title>"
  io_mode: "offline|online|stream|interactive"
  evaluatee: "model|system|pipeline"
  dataset_ref: "datasets/<name>@vX.Y"
  assumptions:
    inputs: ["<x schema or modality>"]
    outputs: ["<y schema or semantics>"]
    priors: ["<domain priors|knowledge>"]
  constraints:
    resources: {qps: "<=100", latency_ms: {p99: "<=200"}, memory_gb: "<=16"}
    tools_allowed: false
    retrieval: false
    open_book: false
  tracks: ["closed-book", "open-book?", "tools?"]
  sampling:
    strategy: "random|stratified|time-based|spatial-tiles|systematic"
    strata: [{by: "<label|locale|domain|difficulty>", buckets: {"A": 100, "B": 200}}]
  splits:
    train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
    val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
    test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
  leakage_guard: ["per-object", "per-timewindow", "per-scene"]
  protocol:
    seed: 1701
    repeats: 5
    temperature: 0.0
    tools_allowed: false
    runtime_limits: {timeout_s: 3600}
  metrics:
    - {name: "Acc|F1_macro|ROC_AUC|PR_AUC|RMSE|mAP|ECE|NLL|WER|BLEU|...", unit: "—|SI", higher_is_better: true}
  aggregation:
    levels: ["task", "suite"]
    weights: {task: "uniform|sample_share|expert"}
    normalize: {scheme: "zscore|minmax|fixed-anchor", anchors: ["<baselineA>", "<baselineB>"]}
  significance:
    method: "bootstrap|permutation"
    B: 10000
    alpha: 0.05
    correction: "Holm-Bonferroni|none"
  env:
    hardware: {cpu: "<Nc>", mem_gb: "<GB>", gpu: "<N|0>"}
    os: "ubuntu-22.04"
    containers: ["ghcr.io/eift/runner@sha256:<hex>"]
    deps_lock: "env.lock"
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
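
Each split in the structure above carries a frozen flag, an index path, and a sha256 digest. The following is a minimal sketch of how those three fields could be verified together, assuming the index files live under a local root directory; `sha256_of_file` and `verify_splits` are hypothetical helper names, and the comparison policy (lowercase hex, frozen must be true) is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_splits(splits: dict, root: Path) -> dict[str, bool]:
    """Map each split name to True when it is frozen and its digest matches."""
    result = {}
    for name, spec in splits.items():
        actual = sha256_of_file(root / spec["index"])
        expected = str(spec["sha256"]).lower()
        result[name] = bool(spec.get("frozen")) and actual == expected
    return result
```

A split whose index file was regenerated after freezing would show up as False here, which is the situation the frozen flag is meant to rule out.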

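IV. Scenario Modeling and Track Rules

Scenario elements: domain|locale|difficulty|noise|constraints; sub-slices are expressed as slice:{dimension:[buckets]}.
Track switching: closed-book|open-book|tools may be mutually exclusive or stackable; after a switch, the quotas for tools, retrieval, context length, and similar limits must be synchronized in assumptions/constraints and in protocol.
Interactive/streaming: io_mode:"interactive|stream" requires defining the number of rounds (rounds), the maximum context length, and the latency tolerance.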
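
The synchronization requirement above can be sketched as a consistency check. This is a minimal illustration under stated assumptions: a trailing "?" on a track name is read as "optional" (as in the field template), tools_allowed is compared between constraints and protocol, and rounds is assumed to live under protocol, which this chapter does not specify. The name `track_problems` is hypothetical.

```python
ALLOWED_TRACKS = {"closed-book", "open-book", "tools"}


def track_problems(task: dict) -> list[str]:
    """Return consistency problems between tracks, constraints, and protocol."""
    problems = []
    for track in task.get("tracks") or []:
        # Assumption: a trailing "?" marks an optional track.
        if track.rstrip("?") not in ALLOWED_TRACKS:
            problems.append(f"unknown track: {track}")
    constraints = task.get("constraints", {})
    protocol = task.get("protocol", {})
    # The tool switch appears in both places and must agree.
    if constraints.get("tools_allowed") != protocol.get("tools_allowed"):
        problems.append("tools_allowed differs between constraints and protocol")
    # Assumption: rounds is declared under protocol for interactive/stream tasks.
    if task.get("io_mode") in {"interactive", "stream"} and "rounds" not in protocol:
        problems.append("interactive/stream task does not define rounds")
    return problems
```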

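V. Constraints and Resources

Resource constraints: measured in SI units: QPS (1/s), latency_ms.{p50,p95,p99}, memory_gb, power_w.
Tools and external knowledge: the boolean values of tools_allowed/retrieval/open_book determine whether external APIs, retrieval, or knowledge bases are permitted; if they are, the interface, latency budget, and caching strategy must be fixed.
Compliance module: when privacy, data residency, or third parties are involved, register the anchors in see[] and export_manifest.references[].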
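
Resource bounds in this chapter are written as strings such as "<=200". The following is a minimal sketch of comparing measured latency samples against a latency_ms budget. The nearest-rank percentile definition is an assumption (the chapter does not fix one), only the "<=" form is handled, and all function names are hypothetical.

```python
import math


def parse_upper_bound(spec: str) -> float:
    """Parse a bound written as '<=N' into the number N."""
    if not spec.startswith("<="):
        raise ValueError(f"unsupported bound: {spec!r}")
    return float(spec[2:])


def percentile_nearest_rank(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_within_budget(samples_ms: list[float], budget: dict) -> dict[str, bool]:
    """Check each percentile key (e.g. 'p99') of a latency_ms budget."""
    result = {}
    for key, spec in budget.items():
        p = float(key[1:])  # 'p99' -> 99.0
        result[key] = percentile_nearest_rank(samples_ms, p) <= parse_upper_bound(spec)
    return result
```

For example, `latency_within_budget(samples, {"p99": "<=150"})` reports whether the observed p99 stays inside the 150 ms budget used in the retrieval example of this chapter.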

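VI. Metrics and Aggregation Conventions

Metric definitions: metrics for classification, retrieval, detection, regression, generation, ASR, NLP, and similar tasks are listed by their common names; calibration metrics (ECE/Brier) are recommended for tasks with probabilistic outputs.
Aggregation: macro/micro/weighted must be stated explicitly; the cross-task weight w_i takes one of uniform|sample_share|expert.
Normalization: zscore|minmax|fixed-anchor; the anchors include public baseline IDs.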

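VII. Metrology and Units (SI)

Mandatory: metrology:{units:"SI", check_dim:true}; all metrics and resources use SI; units are normalized before composite quantities are combined.
Path quantities (e.g., T_arr): if the task involves arrival time, register delta_form, path="gamma(ell)", and measure="d ell", and use:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ); both are verified via check_dim.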
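
The two forms of T_arr above can be evaluated numerically on a sampled path. The following is a minimal sketch, assuming n_eff is dimensionless and sampled at path positions ell (in metres), c_ref is a constant in m/s, and the integral is approximated with the trapezoidal rule. The function names are hypothetical, and this sketch does not implement check_dim itself.

```python
def trapezoid(ys: list[float], xs: list[float]) -> float:
    """Trapezoidal approximation of the integral of y over x."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1))


def t_arr_factored(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """First form: (1 / c_ref) * integral of n_eff d ell, in seconds."""
    return trapezoid(n_eff, ell) / c_ref


def t_arr_integrand(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """Second form: integral of (n_eff / c_ref) d ell, in seconds."""
    return trapezoid([n / c_ref for n in n_eff], ell)
```

With a constant c_ref the two functions agree up to floating-point rounding, since the constant factor can be moved in or out of the integral; the units work out as m divided by m/s, which gives seconds.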

VIII. Machine-Readable Snippet (Directly Embeddable)

tasks:
  - id: "retrieval.doc"
    title: "Document Retrieval"
    io_mode: "offline"
    evaluatee: "system"
    dataset_ref: "datasets/retr_core@v1.0"
    assumptions: {inputs: ["query"], outputs: ["doc_ids"], priors: ["BM25-compatible semantics"]}
    constraints:
      resources: {qps: "<=200", latency_ms: {p99: "<=150"}}
      tools_allowed: true
      retrieval: true
      open_book: true
    tracks: ["closed-book", "open-book"]
    sampling: {strategy: "stratified", strata: [{by: "locale", buckets: {"en": 70, "zh": 30}}]}
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "..."}
      val: {frozen: true, index: "splits/val.index", sha256: "..."}
      test: {frozen: true, index: "splits/test.index", sha256: "..."}
    leakage_guard: ["per-object", "per-scene"]
    protocol: {seed: 1701, repeats: 5, temperature: 0.0, tools_allowed: true, runtime_limits: {timeout_s: 3600}}
    metrics:
      - {name: "mAP", unit: "—", higher_is_better: true}
      - {name: "recall@k", unit: "—", higher_is_better: true}
      - {name: "latency_ms.p99", unit: "ms", higher_is_better: false}
    aggregation: {levels: ["task", "suite"], weights: {task: "sample_share"}, normalize: {scheme: "fixed-anchor", anchors: ["baseline.bm25"]}}
    significance: {method: "bootstrap", B: 10000, alpha: 0.05, correction: "Holm-Bonferroni"}
    env: {hardware: {cpu: "16c", mem_gb: 64, gpu: 0}, os: "ubuntu-22.04", containers: ["ghcr.io/eift/runner@sha256:<hex>"], deps_lock: "env.lock"}
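
The significance block in the snippet above names a bootstrap method with B resamples, a level alpha, and a Holm-Bonferroni correction. The following is a minimal sketch of those two pieces using only the standard library. The paired percentile bootstrap over per-item score differences is an assumption about how the method would be applied, and both function names are hypothetical.

```python
import random


def paired_bootstrap_ci(a: list[float], b: list[float], B: int = 10000,
                        alpha: float = 0.05, seed: int = 1701) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired differences a[i] - b[i]."""
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(B))
    low = means[int((alpha / 2) * B)]
    high = means[min(B - 1, int((1 - alpha / 2) * B))]
    return low, high


def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Holm step-down procedure: map each hypothesis name to True if rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {}
    still_rejecting = True
    for i, (name, p) in enumerate(ordered):
        # The i-th smallest p-value is compared against alpha / (m - i).
        if still_rejecting and p <= alpha / (m - i):
            rejected[name] = True
        else:
            still_rejecting = False
            rejected[name] = False
    return rejected
```

An interval from `paired_bootstrap_ci` that excludes zero suggests a difference between two systems on the same items; when several metrics or tasks are tested at once, `holm_bonferroni` controls the family-wise error rate at alpha.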

IX. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: TASK.ID_FORMAT
    when: "$.tasks[*].id"
    assert: "matches('^[a-z0-9_.\\-]+$')"
    level: error
  - id: TASK.REQUIRED_KEYS
    when: "$.tasks[*]"
    assert: "has_keys(io_mode, dataset_ref, protocol, metrics, splits, leakage_guard)"
    level: error
  - id: TASK.TRACKS_SWITCH
    when: "$.tasks[*].tracks"
    assert: "value == null or all(track in ['closed-book','open-book','tools'] for track in value)"
    level: error
  - id: TASK.SPLITS_FROZEN
    when: "$.tasks[*].splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen"
    level: error
  - id: TASK.METRICS_UNITS_SI
    when: "$.tasks[*].metrics[*].unit"
    assert: "all_units_in_SI(value) or value == '—'"
    level: error
  - id: TASK.METROLOGY_REQUIRED
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
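
Three of the rules above can be expressed directly in Python. The following is a minimal sketch, assuming the document has already been parsed into nested dicts and lists; it hard-codes the checks instead of interpreting the when/assert expressions, and `lint_tasks` is a hypothetical name, not the Lint engine this chapter refers to.

```python
import re

# Same pattern as TASK.ID_FORMAT, written as a Python raw string.
ID_PATTERN = re.compile(r"^[a-z0-9_.\-]+$")


def lint_tasks(doc: dict) -> list[tuple[str, str]]:
    """Return (rule id, location) pairs for every violated rule."""
    findings = []
    for task in doc.get("tasks", []):
        task_id = str(task.get("id", ""))
        if not ID_PATTERN.match(task_id):
            findings.append(("TASK.ID_FORMAT", task_id))
        splits = task.get("splits", {})
        frozen = all(splits.get(name, {}).get("frozen") is True
                     for name in ("train", "val", "test"))
        if not frozen:
            findings.append(("TASK.SPLITS_FROZEN", task_id))
    metrology = doc.get("metrology", {})
    if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
        findings.append(("TASK.METROLOGY_REQUIRED", "$.metrology"))
    return findings
```

All three rules are at level error in the excerpt, so any non-empty result would block acceptance of the document.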

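X. Cross-Reference Anchors

Evaluation protocol and metrics: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
Frozen splits and distribution: see EFT.WP.Data.DatasetCards v1.0, Chapter 11.
Performance and resource constraints: see EFT.WP.Data.Pipeline v1.0, Chapters 10, 12, and 13.
Units and dimensional checks: see EFT.WP.Core.Metrology v1.0:check_dim.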

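XI. Compliance Self-Check for This Chapter

The task definition includes io_mode/dataset_ref/protocol/metrics/splits/leakage_guard and is consistent with the suite hierarchy.
Scenario and track switching rules are clear; the tool, retrieval, and openness settings are synchronized with constraints/protocol.
Coverage and sampling strata are explicit, frozen splits are enabled, and the leakage guard is blocking.
Metric, aggregation, and significance settings are complete; the source of the cross-task weights is explicit; normalization anchors are traceable.
SI metrology and check_dim=true are in effect; if T_arr is involved, delta_form/path/measure are registered and pass the check.
The machine-readable snippet can be written to disk as-is and passes Lint; export_manifest.references[] uses the form "volume name vX.Y:anchor".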

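Copyright and license: unless otherwise stated, the copyright of Energy Filament Theory (《能量丝理论》, including text, charts, illustrations, symbols, and formulas) belongs to the author (屠广林).
License (CC BY 4.0): copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Attribution format (suggested): Author: 屠广林 | Work: Energy Filament Theory | Source: energyfilament.org | License: CC BY 4.0
Call for verification: the author works independently and is self-funded, with no employer and no sponsorship. The next phase will prioritize environments that are most willing to discuss, reproduce, and scrutinize the work in public, regardless of country. Media and peers worldwide are welcome to use this window to organize verification and to contact us.
Version information: first published 2025-11-11 | current version: v6.0+5.05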