46-EFT.WP.Data.Benchmarks v1.0 | Chapter 18 Appendix: Benchmark Templates

Chapter 18 Appendix: Benchmark Templates

I. Template Scope and Conventions

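This appendix provides two YAML/JSON templates that can be written to disk as-is: a minimal template and a full skeleton template. Key names are uniformly snake_case; cross-volume references use the form "Volume vX.Y:anchor"; the unit system follows SI with check_dim=true; all evaluations use frozen splits and leakage guards.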
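
The cross-volume reference form can be illustrated with a small parser. This is a minimal sketch: the names `parse_reference` and `VolumeRef` are hypothetical, and the anchor is taken as everything after the colon without further checks.

```python
import re
from typing import NamedTuple, Tuple

# "Volume vX.Y:anchor": the volume name has no colon, the version is major.minor.
REFERENCE = re.compile(
    r"(?P<volume>[^:]+) v(?P<major>\d+)\.(?P<minor>\d+):(?P<anchor>.+)"
)


class VolumeRef(NamedTuple):
    volume: str
    version: Tuple[int, int]
    anchor: str


def parse_reference(text: str) -> VolumeRef:
    """Split a cross-volume reference into volume, version, and anchor."""
    match = REFERENCE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a 'Volume vX.Y:anchor' reference: {text!r}")
    version = (int(match["major"]), int(match["minor"]))
    return VolumeRef(match["volume"], version, match["anchor"])


ref = parse_reference("EFT.WP.Core.DataSpec v1.0:EXPORT")
```

Parsing the version into integers makes it easy to reject references that omit the version or use a short code in place of the volume name.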

II. Minimal Template (copy-ready)

suite:
  id: "eift.bench.core"
  title: "EIFT Core Benchmarks"
  version: "v1.0"
  modalities: ["text"]
tasks:
  - id: "cls.binary"
    io_mode: "offline"
    dataset_ref: "datasets/core_cls@v1.0"
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
      val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
      test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
      ratio: {train: 0.8, val: 0.1, test: 0.1}
      freeze_indices: true
      leakage_guard: ["per-object"]
    protocol:
      mode: "offline"
      seed: 1701
      repeats: 5
    metrics:
      - {name: "F1_macro", family: "classification", unit: "—", higher_is_better: true, agg: "macro"}
aggregation: {levels: ["task"], weights: {scheme: "uniform"}}
significance: {method: "bootstrap", alpha: 0.05}
metrology: {units: "SI", check_dim: true}
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "benchmark.yaml", sha256: "<hex>"}
    - {path: "splits/train.index", sha256: "<hex>"}
    - {path: "splits/val.index", sha256: "<hex>"}
    - {path: "splits/test.index", sha256: "<hex>"}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
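
One way to produce the frozen split indices and their sha256 values is sketched below. It rests on assumptions not stated in this appendix: an index file is taken to be one item id per line in UTF-8, the shuffle uses the template's seed 1701, and `make_frozen_splits` and `index_digest` are hypothetical helper names.

```python
import hashlib
import random


def make_frozen_splits(item_ids, ratio, seed):
    """Shuffle ids deterministically and cut them by the given ratio."""
    if abs(sum(ratio.values()) - 1.0) > 1e-6:
        raise ValueError("split ratios must sum to 1 within 1e-6")
    ids = sorted(item_ids)            # canonical order before shuffling
    random.Random(seed).shuffle(ids)  # seeded, so reruns give the same order
    n_train = round(len(ids) * ratio["train"])
    n_val = round(len(ids) * ratio["val"])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the remainder absorbs rounding
    }


def index_digest(ids):
    """sha256 of an index serialized as one id per line (assumed format)."""
    payload = "".join(f"{item_id}\n" for item_id in ids).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


items = [f"item-{i:04d}" for i in range(1000)]
ratio = {"train": 0.8, "val": 0.1, "test": 0.1}
splits = make_frozen_splits(items, ratio, seed=1701)
digests = {name: index_digest(ids) for name, ids in splits.items()}
```

Sorting before the seeded shuffle keeps the result independent of the order in which ids were collected, which is what makes the digest stable across machines.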

III. Full Skeleton Template (release-grade, with optional extensions)

suite:
  id: "<org.project.bench>"
  title: "<Human-readable Title>"
  version: "v1.0.0"
  modalities: ["text", "image", "audio"]
  risks: ["leakage", "bias", "spurious_correlation"]
  coverage_matrix:
    modality: {"text": 9000, "image": 6000, "audio": 3000}  # counts or %
    locale: {"en": 60, "zh": 20, "es": 20}  # %
    domain: {"news": 40, "science": 30, "open": 30}  # %
tasks:
  - id: "<task.id>"
    title: "<Task Title>"
    io_mode: "offline|online|stream|interactive"
    evaluatee: "model|system|pipeline"
    dataset_ref: "datasets/<name>@vX.Y"
    sampling:
      strategy: "random|stratified|time-based|spatial-tiles|systematic"
      strata: [{by: "<label|locale|domain|difficulty>", buckets: {"A": 100, "B": 200}}]
      seed: 1701
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
      val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
      test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
      ratio: {train: 0.8, val: 0.1, test: 0.1}
      freeze_indices: true
      leakage_guard: ["per-object", "per-timewindow", "per-scene"]
    protocol:
      mode: "offline|online|stream|interactive"
      seed: 1701
      repeats: 5
      temperature: 0.0
      context: {length: 4096, template_ref: "prompts/<id>@vX.Y"}
      tools: {allowed: false, retrieval: false, open_book: false, registry_ref: null}
      runtime_limits: {timeout_s: 3600, memory_gb: 16}
      execution: {concurrency: 8, rate_limit_qps: 50, batching: {enabled: true, max_batch: 32}}
      stream: {window_ms: 1000, hop_ms: 250, max_latency_ms: 200, watermark: "event_time"}
      interactive: {rounds: 3, turn_timeout_s: 30, max_context_turns: 8}
      online: {traffic_allocation: {control: 0.5, treatment: 0.5}, exposure: {shadow: true, canary: 0.05}, guardrails: ["latency_ms.p99<=200", "error_rate<=0.01"]}
      logging: {format: "jsonl", fields: ["ts", "task_id", "item_id", "run_id", "latency_ms"], retention: "P30D"}
      reporting: {metrics: ["F1_macro", "ECE", "latency_ms.p99", "QPS"], target_ci: {method: "bootstrap", level: 0.95}}
    metrics:
      - {name: "F1_macro", family: "classification", unit: "—", higher_is_better: true, agg: "macro", window: "N/A"}
      - {name: "ECE", family: "calibration", unit: "—", higher_is_better: false, agg: "mean", window: "N/A"}
      - {name: "latency_ms.p99", family: "perf", unit: "ms", higher_is_better: false, agg: "quant", window: "1m"}
aggregation:
  levels: ["task", "suite"]
  weights: {scheme: "uniform|sample_share|expert", w_i: null}
  metrics:
    include: ["F1_macro", "ECE", "latency_ms.p99"]
    directions: {F1_macro: "max", ECE: "min", "latency_ms.p99": "min"}
    combine: "weighted_mean|geomean|harmonic"
normalization:
  scheme: "zscore|minmax|fixed-anchor"
  params:
    zscore: {μ_ref: "suite|task|anchors", σ_ref: "suite|task|anchors"}
    minmax: {min_ref: "anchors", max_ref: "anchors"}
    fixed_anchor:
      anchors: ["baseline.logreg", "baseline.rf"]
      anchor_scores:
        baseline.logreg: {F1_macro: 0.72, ECE: 0.06, "latency_ms.p99": 180}
        baseline.rf: {F1_macro: 0.75, ECE: 0.05, "latency_ms.p99": 170}
significance:
  method: "bootstrap|permutation|t|bayes"
  B: 10000
  alpha: 0.05
  correction: "Holm-Bonferroni|BH|none"
robustness:
  shift_tests:
    - {name: "snr_drop", severity: [3, 6, 9], unit: "dB", policy: "additive-noise"}
  thresholds: {drop_rel_max: 0.10, acc_robust_min: 0.80}
fairness_ethics:
  slices: [{axis: "locale", buckets: ["en", "zh", "es"]}]
  gap_metric: "abs_diff|ratio|stat_parity|eq_opp"
  thresholds: {fairness_warn: 0.03, fairness_block: 0.05}
env:
  hardware: {cpu: "16c@3.0GHz", mem_gb: 64, gpu: "0|A100-40GB:x1", storage: {type: "nvme", iops: ">=50k", size_gb: 512}, network: {nic_gbps: 10}}
  os: "ubuntu-22.04"
  kernel: "linux-6.8"
  containers: ["ghcr.io/eift/runner@sha256:<hex>"]
  deps_lock: "env.lock"
metrology: {units: "SI", check_dim: true}
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "benchmark.yaml", sha256: "<hex>"}
    - {path: "splits/train.index", sha256: "<hex>"}
    - {path: "splits/val.index", sha256: "<hex>"}
    - {path: "splits/test.index", sha256: "<hex>"}
    - {path: "reports/summary.json", sha256: "<hex>"}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
    - "EFT.WP.Data.Benchmarks v1.0:Ch.6"
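
The normalization and combine settings above could be applied along the following lines. This is a sketch of one possible reading, not a definition from this volume: it assumes min-max scaling against the range spanned by the anchor baselines, flips metrics whose direction is "min" so that larger is always better, and uses hypothetical function names `minmax_normalize` and `combine`.

```python
import math

ANCHOR_SCORES = {
    "baseline.logreg": {"F1_macro": 0.72, "ECE": 0.06, "latency_ms.p99": 180},
    "baseline.rf": {"F1_macro": 0.75, "ECE": 0.05, "latency_ms.p99": 170},
}
DIRECTIONS = {"F1_macro": "max", "ECE": "min", "latency_ms.p99": "min"}


def minmax_normalize(raw, anchors=ANCHOR_SCORES, directions=DIRECTIONS):
    """Scale each metric so the worst anchor maps to 0 and the best to 1."""
    out = {}
    for name, value in raw.items():
        ref = [scores[name] for scores in anchors.values()]
        lo, hi = min(ref), max(ref)
        if hi == lo:
            raise ValueError(f"anchors give no range for {name}")
        scaled = (value - lo) / (hi - lo)
        out[name] = scaled if directions[name] == "max" else 1.0 - scaled
    return out


def combine(scores, weights=None, how="weighted_mean"):
    """Combine normalized scores; weights default to the uniform scheme."""
    names = list(scores)
    weights = weights or {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    w = {n: weights[n] / total for n in names}
    if how == "weighted_mean":
        return sum(w[n] * scores[n] for n in names)
    if any(scores[n] <= 0 for n in names):
        raise ValueError("geomean and harmonic need strictly positive scores")
    if how == "geomean":
        return math.exp(sum(w[n] * math.log(scores[n]) for n in names))
    if how == "harmonic":
        return 1.0 / sum(w[n] / scores[n] for n in names)
    raise ValueError(f"unknown combine mode: {how}")


candidate = {"F1_macro": 0.78, "ECE": 0.04, "latency_ms.p99": 160}
normalized = minmax_normalize(candidate)
overall = combine(normalized, how="weighted_mean")
```

Under this reading a system that beats both anchors scores above 1 on every metric, so the scale is not clipped to [0, 1]; clipping would be a separate policy choice.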

IV. Field Placeholders and Minimal Regexes (quick reference)

suite.id: ^[a-z0-9_.\-]+$; suite.version: ^v\d+\.\d+(\.\d+)?$;
dataset_ref: ^datasets/[a-z0-9_\-]+@v\d+\.\d+$;
export_manifest.references[*]: ^[^:]+ v\d+\.\d+:[A-Za-z].+$;
Split ratio sum: 1±1e-6; freeze_indices: true; leakage_guard contains at least one of per-object|per-timewindow|per-scene;
Metrology: metrology.units="SI" and check_dim=true.
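
The quick-reference rules for suite.id, suite.version, dataset_ref, and the split settings can be turned into a small checker. The sketch below takes the values as plain arguments so it does not depend on where they are nested; `check_fields` and the error labels are hypothetical names.

```python
import re

SUITE_ID = re.compile(r"[a-z0-9_.\-]+")
SUITE_VERSION = re.compile(r"v\d+\.\d+(\.\d+)?")
DATASET_REF = re.compile(r"datasets/[a-z0-9_\-]+@v\d+\.\d+")
ALLOWED_GUARDS = {"per-object", "per-timewindow", "per-scene"}


def check_fields(suite_id, suite_version, dataset_ref,
                 ratio, freeze_indices, leakage_guard):
    """Return the labels of every rule that the given values violate."""
    errors = []
    # fullmatch anchors both ends, like ^...$ in the quick reference.
    if not SUITE_ID.fullmatch(suite_id):
        errors.append("suite.id")
    if not SUITE_VERSION.fullmatch(suite_version):
        errors.append("suite.version")
    if not DATASET_REF.fullmatch(dataset_ref):
        errors.append("dataset_ref")
    if abs(sum(ratio.values()) - 1.0) > 1e-6:
        errors.append("splits.ratio")
    if freeze_indices is not True:
        errors.append("freeze_indices")
    if not ALLOWED_GUARDS.intersection(leakage_guard):
        errors.append("leakage_guard")
    return errors


ok = check_fields("eift.bench.core", "v1.0", "datasets/core_cls@v1.0",
                  {"train": 0.8, "val": 0.1, "test": 0.1}, True, ["per-object"])
bad = check_fields("EIFT Bench", "v1", "datasets/core_cls",
                   {"train": 0.8, "val": 0.1, "test": 0.2}, False, [])
```

Returning a list of labels rather than raising on the first failure lets a CI job report every violated rule in one pass.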

V. Export Manifest Template (normative)

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "benchmark.yaml", sha256: "<hex>"}
    - {path: "splits/train.index", sha256: "<hex>"}
    - {path: "splits/val.index", sha256: "<hex>"}
    - {path: "splits/test.index", sha256: "<hex>"}
    - {path: "reports/summary.json", sha256: "<hex>"}
    - {path: "reports/leaderboard.csv", sha256: "<hex>"}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
    - "EFT.WP.Data.Benchmarks v1.0:Ch.8"
    - "EFT.WP.Data.Benchmarks v1.0:Ch.9"
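
A consumer of the export manifest might verify the listed artifacts as sketched below. The sketch assumes the manifest has already been loaded into a dict and that artifact paths are relative to a release root directory; `sha256_file` and `verify_manifest` are hypothetical helper names, and the demo runs against a throwaway directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return (path, problem) pairs for missing or mismatching artifacts."""
    problems = []
    for artifact in manifest["artifacts"]:
        target = Path(root) / artifact["path"]
        if not target.is_file():
            problems.append((artifact["path"], "missing"))
        elif sha256_file(target) != artifact["sha256"].lower():
            problems.append((artifact["path"], "sha256 mismatch"))
    return problems


# Demo: one artifact is present with the right hash, the other is absent.
with tempfile.TemporaryDirectory() as root:
    present = Path(root) / "benchmark.yaml"
    present.write_bytes(b"suite: {}\n")
    manifest = {
        "version": "v1.0",
        "artifacts": [
            {"path": "benchmark.yaml", "sha256": sha256_file(present)},
            {"path": "splits/train.index", "sha256": "0" * 64},
        ],
    }
    problems = verify_manifest(manifest, root)
```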

VI. Pre-release Blocking Self-check (checklist)

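Structure/required fields: suite, tasks, metrology, and export_manifest are all present, and schema validation passes.
References/versions: references[] use the form "Volume vX.Y:anchor", with no short codes and no missing versions.
Splits/leakage: splits are frozen, ratios sum to 1, and every index carries a sha256; leakage guards are in effect.
Protocol/metrics: protocol is consistent with the task constraints; each metric declares family/unit/higher_is_better/agg/window.
Scoring/statistics: aggregation/normalization and significance settings are complete; if fixed-anchor is used, the anchor baselines and their hashes are pinned.
Environment/metrology: containers are pinned by @digest and dependencies by env.lock; metrology.units="SI" & check_dim=true are in effect.
Verifiable artifacts: every file in the export manifest carries a sha256 and can be reproduced.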
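
The "leakage guards are in effect" item can be spot-checked mechanically. The sketch below assumes a per-object guard means that all items derived from the same object stay in a single split, and that each item has been mapped to a group key (object id, time window, or scene id) beforehand; `find_leaks` is a hypothetical helper.

```python
def find_leaks(split_groups):
    """Map each group key found in more than one split to those splits.

    split_groups: {"train": [group_key, ...], "val": [...], "test": [...]}
    """
    first_seen = {}  # group key -> first split it appeared in
    leaks = {}       # group key -> set of splits sharing it
    for split, keys in split_groups.items():
        for key in set(keys):  # repeats inside one split are not leaks
            if key in first_seen and first_seen[key] != split:
                leaks.setdefault(key, {first_seen[key]}).add(split)
            else:
                first_seen[key] = split
    return leaks


leaks = find_leaks({
    "train": ["obj-1", "obj-2", "obj-2"],
    "val": ["obj-3"],
    "test": ["obj-2", "obj-4"],
})
```

An empty result is what a release gate would require; any non-empty result names the objects to reassign before the indices are frozen.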

VII. Machine-readable Blank Template (no comments, CI-friendly)

suite: {id: "", title: "", version: "v1.0", modalities: []}
tasks: []
metrology: {units: "SI", check_dim: true}
export_manifest: {version: "v1.0", artifacts: [], references: ["EFT.WP.Core.DataSpec v1.0:EXPORT", "EFT.WP.Core.Metrology v1.0:check_dim"]}
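
In a CI job the blank template might be instantiated and gated as follows. This sketch mirrors the template as a Python dict instead of parsing YAML, so it needs only the standard library; `new_benchmark` and `missing_top_level` are hypothetical names.

```python
import copy

BLANK_TEMPLATE = {
    "suite": {"id": "", "title": "", "version": "v1.0", "modalities": []},
    "tasks": [],
    "metrology": {"units": "SI", "check_dim": True},
    "export_manifest": {
        "version": "v1.0",
        "artifacts": [],
        "references": [
            "EFT.WP.Core.DataSpec v1.0:EXPORT",
            "EFT.WP.Core.Metrology v1.0:check_dim",
        ],
    },
}
REQUIRED_TOP_LEVEL = ("suite", "tasks", "metrology", "export_manifest")


def new_benchmark(suite_id, title, modalities):
    """Fill a fresh copy of the blank template; the template stays untouched."""
    doc = copy.deepcopy(BLANK_TEMPLATE)
    doc["suite"].update(id=suite_id, title=title, modalities=list(modalities))
    return doc


def missing_top_level(doc):
    """List the required top-level keys that are absent from a document."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in doc]


doc = new_benchmark("eift.bench.core", "EIFT Core Benchmarks", ["text"])
incomplete = {"suite": {}, "tasks": []}
```

The deep copy matters because the template holds nested lists and dicts; a shallow copy would let one benchmark's edits bleed into the next.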

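Copyright and license: Unless otherwise stated, the copyright in "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author (屠广林).
License (CC BY 4.0): Copying, reposting, excerpting, adapting, and redistributing are permitted provided the author and source are credited.
Attribution format (suggested): Author: 屠广林 | Work: "Energy Filament Theory" | Source: energyfilament.org | License: CC BY 4.0
Call for verification: The author works independently and is self-funded, with no employer and no sponsorship. The next phase will prioritize environments most willing to discuss, reproduce, and critique the work in public, regardless of country. Media and peers from all countries are welcome to use this window to organize verification and to contact us.
Version information: First published: 2025-11-11 | Current version: v6.0+5.05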