Contents / Docs-Technical White Papers / 46-EFT.WP.Data.Benchmarks v1.0
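I. Purpose and Scope
This chapter fixes the benchmark hierarchy **suite → task → subtask → item**, together with the coverage matrix and the risk blind spots; specifies the machine-readable fields and the verification conventions; and keeps them consistent with the dataset cards, model cards, and pipeline, as well as with the metrology rules and citation anchors.
II. Hierarchy and Object Relationships (normative)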
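- Levels:
  - suite: overall definition, coverage, and governance;
  - task: scenario and I/O mode, protocol, and metrics;
  - subtask: finer-grained dimensions (modality/domain/locale/resource track);
  - item: the smallest evaluation unit (question/sample/segment/query).
- Relation constraints: suite.tasks[*].subtasks[*].items[*] is a directed containment chain; every task must have dataset_ref and splits; every subtask must declare track or slice; every item must be bound to a split ∈ {train,val,test}.
- Coverage matrix: coverage_matrix[dimension][bucket] = count/ratio; the dimensions include at least modality/locale/domain/difficulty.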
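The relation constraints above can be checked mechanically. The following is a minimal sketch, assuming the tasks have already been loaded into plain Python dictionaries and that items are inlined under a hypothetical `items` key (the chapter itself references them through `items_ref`); the function name `check_relations` is illustrative and not part of the specification.

```python
from typing import Any

VALID_SPLITS = {"train", "val", "test"}


def check_relations(tasks: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations of the containment constraints."""
    errors: list[str] = []
    for task in tasks:
        tid = task.get("id", "<no id>")
        # Every task must carry dataset_ref and splits.
        for key in ("dataset_ref", "splits"):
            if key not in task:
                errors.append(f"task {tid}: missing {key}")
        for sub in task.get("subtasks", []):
            sid = sub.get("id", "<no id>")
            # Every subtask must declare track or slice (either one is enough).
            if "track" not in sub and "slice" not in sub:
                errors.append(f"subtask {sid}: needs track or slice")
            # "items" is a hypothetical inline list used only for this sketch.
            for item in sub.get("items", []):
                split = item.get("split")
                if split not in VALID_SPLITS:
                    iid = item.get("id", "<no id>")
                    errors.append(f"item {iid}: bad split {split!r}")
    return errors


example_tasks = [
    {
        "id": "qa.extractive",
        "dataset_ref": "datasets/qa_core@v1.0",
        "splits": {"train": {}, "val": {}, "test": {}},
        "subtasks": [
            {
                "id": "qa.extractive.zh",
                "track": "closed-book",
                "items": [
                    {"id": "q1", "split": "test"},
                    {"id": "q2", "split": "dev"},  # not an allowed split
                ],
            },
            {"id": "qa.extractive.misc", "items": []},  # no track, no slice
        ],
    }
]

violations = check_relations(example_tasks)
for line in violations:
    print(line)
```

In this example the check reports two violations: the item bound to `dev` and the subtask that declares neither `track` nor `slice`.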
III. Fields and Structure (normative)
suite:
  id: "eift.benchmarks.core"
  title: "EIFT Core Benchmarks"
  version: "v1.0.0"
  modalities: ["text", "image", "audio"]
  risks: ["leakage", "bias", "spurious_correlation"]
  coverage_matrix:
    modality: {"text": 12000, "image": 8000, "audio": 3000}
    locale: {"en": 60, "zh": 20, "es": 20} # unit: %
    domain: {"news": 40, "science": 30, "open": 30} # unit: %
tasks:
  - id: "qa.extractive"
    io_mode: "offline|stream|interactive"
    dataset_ref: "datasets/qa_core@v1.0"
    sampling: {strategy: "stratified", strata: [{by: "difficulty", buckets: {"easy": 40, "med": 40, "hard": 20}}]}
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
      val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
      test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
    leakage_guard: ["per-object", "per-scene"]
    protocol:
      seed: 1701
      repeats: 5
      tools_allowed: false
      runtime_limits: {timeout_s: 3600}
    metrics:
      - {name: "F1_macro", unit: "—", higher_is_better: true}
      - {name: "ECE", unit: "—", higher_is_better: false}
    subtasks:
      - id: "qa.extractive.zh"
        track: "closed-book"
        slice: {locale: ["zh"]}
        items_ref: "lists/qa_zh_test.index"
      - id: "qa.extractive.en.open"
        track: "open-book"
        slice: {locale: ["en"], retrieval: true}
        items_ref: "lists/qa_en_open.index"
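Each split above records `frozen`, an `index` path, and a `sha256` digest. As a hedged illustration of how a consumer might confirm that a frozen split has not drifted, the sketch below hashes the index file and compares it with the recorded value. The helper names (`sha256_of_file`, `verify_split`) and the one-ID-per-line index format are assumptions made for the example; the chapter does not define the index file layout.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_split(split: dict, root: Path) -> bool:
    """True only if the split is frozen and its index matches the recorded hash."""
    if not split.get("frozen", False):
        return False
    return sha256_of_file(root / split["index"]) == split["sha256"].lower()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "splits").mkdir()
    index_path = root / "splits" / "test.index"
    # Hypothetical index content: one item ID per line.
    index_path.write_bytes(b"item-0001\nitem-0002\n")
    recorded = sha256_of_file(index_path)  # digest recorded at freeze time
    split = {"frozen": True, "index": "splits/test.index", "sha256": recorded}

    ok_before = verify_split(split, root)
    # Simulate an unintended change to the frozen index.
    index_path.write_bytes(b"item-0001\nitem-0002\nitem-0003\n")
    ok_after = verify_split(split, root)

print(ok_before, ok_after)
```

The check passes on the untouched index and fails once the file is modified, which is the behavior a frozen split is meant to guarantee.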
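IV. Coverage and Risk Conventions
- Coverage: for dimensions such as modality/locale/domain/difficulty, report both counts and ratios; ratios are given in % (dimensionless, unit "—").
- Risk: risks[] includes at least leakage|bias|spurious_correlation; for each risk, provide a detection rule and a threshold (for example, shift ψ<=0.2).
- Frozen consistency: all coverage reports are computed on the frozen splits S_val/S_test; estimating coverage from the training set is strictly prohibited.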
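The coverage rule above (counts plus percentages, computed only on the frozen val/test splits) can be sketched as follows. This is a minimal illustration under the assumption that each item is a dictionary carrying its `split` and the dimension values; the function name `coverage` and the output keys `count` and `ratio_pct` are hypothetical.

```python
from collections import Counter


def coverage(items, dimension, splits=("val", "test")):
    """Per-bucket count and percentage for one dimension, on eval splits only."""
    counts = Counter(it[dimension] for it in items if it["split"] in splits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        bucket: {"count": n, "ratio_pct": 100.0 * n / total}
        for bucket, n in counts.items()
    }


items = [
    {"split": "train", "locale": "en"},  # ignored: training data never counts
    {"split": "val", "locale": "en"},
    {"split": "test", "locale": "en"},
    {"split": "test", "locale": "en"},
    {"split": "test", "locale": "zh"},
]

cov = coverage(items, "locale")
print(cov)
```

Here the training item is excluded, so `en` covers 3 of 4 evaluation items (75 %) and `zh` covers 1 of 4 (25 %).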
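V. Protocol and Aggregation Alignment
- Protocol mapping: task.protocol is consistent with Chapter 11 of the model card (seed/repeats/tools/runtime_limits);
- Aggregation mapping: suite-level summaries are produced by the macro|micro|weighted rules defined by aggregation in Chapter 8; for cross-task weights w_i, state the source explicitly (uniform/sample share/expert weights).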
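To make the three weight sources concrete, the sketch below combines per-task scores into a suite-level number. It is an illustration only: the authoritative definitions live in Chapter 8, and the reading used here (macro = uniform weights, micro = weights proportional to sample counts, weighted = externally supplied weights) is an assumption. For metrics that are not per-sample averages, a true micro aggregate would pool the underlying counts instead. The scores and sample counts are made-up values.

```python
def aggregate(scores, n_samples, mode="macro", weights=None):
    """Combine per-task scores into one suite-level score."""
    tasks = list(scores)
    if mode == "macro":
        # Uniform weights: every task counts equally.
        w = {t: 1.0 for t in tasks}
    elif mode == "micro":
        # Weights proportional to each task's sample share.
        w = {t: float(n_samples[t]) for t in tasks}
    elif mode == "weighted":
        # Externally supplied weights w_i, e.g. expert weights.
        if weights is None:
            raise ValueError("weighted mode needs explicit weights")
        w = {t: float(weights[t]) for t in tasks}
    else:
        raise ValueError(f"unknown aggregation mode: {mode!r}")
    total = sum(w.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[t] * w[t] for t in tasks) / total


# Hypothetical per-task scores and evaluation-set sizes.
scores = {"qa.extractive": 0.80, "cls.multiclass": 0.90}
n_samples = {"qa.extractive": 1000, "cls.multiclass": 3000}
expert = {"qa.extractive": 3.0, "cls.multiclass": 1.0}

macro = aggregate(scores, n_samples, "macro")
micro = aggregate(scores, n_samples, "micro")
weighted = aggregate(scores, n_samples, "weighted", weights=expert)
print(round(macro, 3), round(micro, 3), round(weighted, 3))
```

The three modes give different suite scores for the same inputs, which is why the source of w_i has to be recorded alongside the result.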
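VI. Metrology and Units (SI)
- Metrics and resources share one system of measurement: QPS (1/s), T_inf (ms), ρ (—), size_bytes, net_mbps; metrology:{units:"SI", check_dim:true} is mandatory.
- If a task or feature involves the path quantity T_arr, register delta_form, path="gamma(ell)", and measure="d ell" on the object, and use either:
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell ), verified by check_dim.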
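The two forms of T_arr differ only in whether 1/c_ref sits outside or inside the integral, so they agree whenever c_ref is constant along the path. The numeric sketch below illustrates this under stated assumptions: c_ref is taken to be the vacuum speed of light in m/s, n_eff is a made-up dimensionless profile, and the path is a straight segment parameterized by ell in metres. None of these values come from the chapter. Dimensionally, (s/m) × m gives seconds in both forms.

```python
import math

C_REF = 299_792_458.0  # m/s; assumed value for c_ref, held constant on the path


def n_eff(ell: float) -> float:
    """Hypothetical dimensionless effective index along the path."""
    return 1.0 + 0.5 * math.exp(-ell / 1000.0)


def trapezoid(f, a: float, b: float, steps: int = 10_000) -> float:
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h


PATH_LENGTH = 5_000.0  # metres, illustrative

# Form 1: constant factor outside the integral.
t_outside = (1.0 / C_REF) * trapezoid(n_eff, 0.0, PATH_LENGTH)
# Form 2: factor inside the integrand.
t_inside = trapezoid(lambda ell: n_eff(ell) / C_REF, 0.0, PATH_LENGTH)

# Closed form of the integral for this particular n_eff, used as a cross-check.
t_exact = (PATH_LENGTH + 500.0 * (1.0 - math.exp(-5.0))) / C_REF

print(t_outside, t_inside, t_exact)
```

If c_ref varied with ell, only the second form would remain meaningful, which is one reason to register the chosen form in delta_form.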
VII. Machine-Readable Snippet (ready to embed)
suite:
  id: "eift.bench.core"
  title: "EIFT Core"
  version: "v1.0.0"
  modalities: ["text", "image"]
  risks: ["leakage", "bias"]
  coverage_matrix:
    modality: {"text": 9000, "image": 6000}
    locale: {"en": 70, "zh": 30}
tasks:
  - id: "cls.multiclass"
    io_mode: "offline"
    dataset_ref: "datasets/core_cls@v1.0"
    sampling: {strategy: "stratified", strata: [{by: "label"}]}
    splits:
      train: {frozen: true, index: "splits/train.index", sha256: "..."}
      val: {frozen: true, index: "splits/val.index", sha256: "..."}
      test: {frozen: true, index: "splits/test.index", sha256: "..."}
    leakage_guard: ["per-object"]
    protocol: {seed: 1701, repeats: 5, tools_allowed: false, runtime_limits: {timeout_s: 3600}}
    metrics: [{name: "Acc", unit: "—", higher_is_better: true}, {name: "ECE", unit: "—", higher_is_better: false}]
    subtasks:
      - {id: "cls.multiclass.en", track: "closed-book", slice: {locale: ["en"]}, items_ref: "lists/cls_en.index"}
      - {id: "cls.multiclass.zh", track: "closed-book", slice: {locale: ["zh"]}, items_ref: "lists/cls_zh.index"}
VIII. Lint Rules (excerpt, normative)
lint_rules:
  - id: SUITE.ID_FORMAT
    when: "$.suite.id"
    assert: "matches('^[a-z0-9_.\\-]+$')"
    level: error
  - id: SUITE.COVERAGE_DIM_REQUIRED
    when: "$.suite.coverage_matrix"
    assert: "has_keys(modality)"
    level: error
  - id: TASK.DATASET_AND_SPLITS
    when: "$.tasks[*]"
    assert: "has_key(dataset_ref) and has_key(splits) and splits.train.frozen and splits.val.frozen and splits.test.frozen"
    level: error
  - id: TASK.LEAKAGE_GUARD
    when: "$.tasks[*].leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: SUBTASK.TRACK_OR_SLICE
    when: "$.tasks[*].subtasks[*]"
    assert: "has_key(track) or has_key(slice)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
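The rules above are declarative; how an engine evaluates them is not specified in this chapter. As one possible reading, the sketch below hard-codes five of the six rules in plain Python against an already-parsed document (SUBTASK.TRACK_OR_SLICE is left out for brevity). It assumes that `when` means "apply the rule only if the path exists" and that `suite`, `tasks`, and `metrology` are top-level keys, as the `$`-rooted paths suggest. The function name `lint` and the sample documents are hypothetical.

```python
import copy
import re

ID_PATTERN = re.compile(r"^[a-z0-9_.\-]+$")
ALLOWED_GUARDS = {"per-object", "per-timewindow", "per-scene"}


def lint(doc: dict) -> list[tuple[str, str]]:
    """Return (rule_id, offending_object) pairs for every failed rule."""
    findings: list[tuple[str, str]] = []
    suite = doc.get("suite", {})
    if "id" in suite and not ID_PATTERN.fullmatch(suite["id"]):
        findings.append(("SUITE.ID_FORMAT", suite["id"]))
    if "coverage_matrix" in suite and "modality" not in suite["coverage_matrix"]:
        findings.append(("SUITE.COVERAGE_DIM_REQUIRED", "coverage_matrix"))
    for task in doc.get("tasks", []):
        tid = task.get("id", "<no id>")
        splits = task.get("splits", {})
        # All three splits must exist and be frozen.
        frozen = all(
            splits.get(name, {}).get("frozen") is True
            for name in ("train", "val", "test")
        )
        if "dataset_ref" not in task or "splits" not in task or not frozen:
            findings.append(("TASK.DATASET_AND_SPLITS", tid))
        # The guard list must contain at least one recognized guard.
        if "leakage_guard" in task and not ALLOWED_GUARDS & set(task["leakage_guard"]):
            findings.append(("TASK.LEAKAGE_GUARD", tid))
    if "metrology" in doc:
        m = doc["metrology"]
        if not (m.get("units") == "SI" and m.get("check_dim") is True):
            findings.append(("METROLOGY.SI_AND_CHECKDIM", "metrology"))
    return findings


good = {
    "suite": {"id": "eift.bench.core", "coverage_matrix": {"modality": {"text": 9000}}},
    "tasks": [{
        "id": "cls.multiclass",
        "dataset_ref": "datasets/core_cls@v1.0",
        "splits": {name: {"frozen": True} for name in ("train", "val", "test")},
        "leakage_guard": ["per-object"],
    }],
    "metrology": {"units": "SI", "check_dim": True},
}

bad = copy.deepcopy(good)
bad["suite"]["id"] = "EIFT Core!"                 # uppercase, space, and "!"
bad["tasks"][0]["leakage_guard"] = ["per-user"]   # not a recognized guard
bad["metrology"]["check_dim"] = False

print(lint(good))
print([rule for rule, _ in lint(bad)])
```

A production linter would more likely interpret the `when`/`assert` strings generically; hard-coding them keeps the sketch short and makes each rule's intent explicit.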
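IX. Cross-Reference Anchors
- Data splits and distribution: see EFT.WP.Data.DatasetCards v1.0, Chapter 11.
- Evaluation protocol and metrics: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
- Coverage/monitoring metrology: see EFT.WP.Data.Pipeline v1.0, Chapter 12.
- Units and dimensional checks: see EFT.WP.Core.Metrology v1.0:check_dim.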
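X. Chapter Compliance Checklist
- The suite/task/subtask hierarchy is complete; dataset_ref/splits/leakage_guard and the coverage matrix are all present.
- protocol/metrics align with the model card; aggregation rules are defined in Chapter 8 and referenced by the tasks.
- Frozen splits and leakage guards are enabled; coverage statistics are based on val/test.
- SI metrology is in effect with check_dim=true; where T_arr is involved, delta_form/path/measure are registered and pass verification.
- export_manifest.references[] uses the form "Volume name vX.Y:anchor"; the machine-readable snippet can be written to disk as-is and passes Lint.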
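Copyright and License (CC BY 4.0)
Copyright notice: unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, is held by the author (Mr. 屠广林).
License: this work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Suggested attribution: Author: 屠广林; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/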