Catalog Document - Technical Whitepaper 46 - EFT.WP.Data.Benchmarks v1.0

Chapter 7 Evaluation Protocols (Offline/Online/Streaming/Interactive)


I. Purpose and Scope

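This chapter fixes the specification of the **evaluation protocol** in its four modes (offline, online, streaming, interactive). It covers randomness and reproducibility, track and tool constraints, context and prompts, concurrency and rate limits, streaming windows and interaction rounds, A/B and shadow traffic, and logging and metric reporting. The protocol must stay consistent with the task definition, the metric system, the frozen data splits, and the metrology and citation anchors.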

II. Terms and Dependencies

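  1. Terms: mode (offline/online/stream/interactive), seed, repeats, temperature, context_length, rounds, canary/shadow, traffic_allocation, caching, tools_allowed, retrieval/open_book, runtime_limits, concurrency, rate_limit.
  2. Dependencies: frozen data splits and distribution (DatasetCards v1.0, Ch. 11), evaluation and aggregation (ModelCards v1.0, Ch. 11), monitoring and online windows (Pipeline v1.0, Ch. 12), units and dimensions (Core.Metrology v1.0:check_dim).
  3. Mathematics and notation: inline symbols are always written in backticks; any expression containing a division, an integral, or a compound operator must be parenthesized; where the path quantity `T_arr` is involved, use either
    • `T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )` or
    • `T_arr = ( ∫ ( n_eff / c_ref ) d ell )`, and declare `gamma(ell)` and `d ell`. Formulas, symbols, and definitions must not contain Chinese characters.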
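
For reference, the two admissible forms of `T_arr` can be typeset as below. This is a notational sketch only: the integration domain is written as the path `gamma`, which the rule above asks authors to declare explicitly, and the two forms coincide only when `c_ref` is constant along that path (an assumption the source does not state).

```latex
% Form 1: constant reference speed factored out of the path integral
T_{\mathrm{arr}} = \left( \frac{1}{c_{\mathrm{ref}}} \right)
                   \left( \int_{\gamma} n_{\mathrm{eff}} \, \mathrm{d}\ell \right)

% Form 2: reference speed kept inside the integrand
T_{\mathrm{arr}} = \left( \int_{\gamma} \left( \frac{n_{\mathrm{eff}}}{c_{\mathrm{ref}}} \right) \mathrm{d}\ell \right)
```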

III. Fields and Structure (Normative)

protocol:
  mode: "offline|online|stream|interactive"
  seed: 1701  # randomness control
  repeats: 5  # number of repeated evaluation runs (offline/stream)
  temperature: 0.0  # generation temperature (if applicable)
  max_tokens: 0  # generation cap (if applicable)
  context:
    length: 4096  # upper bound on context length
    template_ref: "prompts/<id>@vX.Y"  # prompt template (optional)
  tools:
    allowed: false
    retrieval: false
    open_book: false
    registry_ref: null  # registry of tool/retrieval interfaces (if allowed)
  runtime_limits:
    timeout_s: 3600
    memory_gb: 16
  execution:
    concurrency: 8
    rate_limit_qps: 50
    batching: {enabled: true, max_batch: 32}
    caching: {enabled: false, policy: "none|warm|full"}
  stream:  # mode=stream
    window_ms: 1000
    hop_ms: 250
    max_latency_ms: 200
    watermark: "event_time|processing_time"
  interactive:  # mode=interactive
    rounds: 3
    turn_timeout_s: 30
    max_context_turns: 8
  online:  # mode=online
    traffic_allocation: {control: 0.5, treatment: 0.5}
    exposure: {shadow: true, canary: 0.05}
    guardrails: ["latency_ms.p99<=200", "error_rate<=0.01"]
  logging:
    format: "jsonl"
    fields: ["ts", "task_id", "item_id", "run_id", "trace_id", "input_hash", "output_hash", "latency_ms", "error_code"]
    retention: "P30D"
  reporting:
    metrics: ["F1_macro", "ECE", "latency_ms.p99", "QPS"]
    target_ci: {method: "bootstrap", level: 0.95}
  see:
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
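
To make the `stream` fields concrete, the following Python sketch enumerates the sliding windows implied by `window_ms: 1000` and `hop_ms: 250`. The function name `sliding_windows`, the half-open `[lo, hi)` convention, and the choice to drop a trailing partial window are assumptions made for illustration; the source does not define window boundaries, and `watermark` and `max_latency_ms` handling are not modelled here.

```python
from typing import Iterator, Tuple


def sliding_windows(start_ms: int, end_ms: int,
                    window_ms: int = 1000, hop_ms: int = 250) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) windows fully contained in [start_ms, end_ms).

    Hypothetical helper: window semantics are an assumption, not part of the spec.
    """
    if window_ms <= 0 or hop_ms <= 0:
        raise ValueError("window_ms and hop_ms must be positive")
    lo = start_ms
    # Stop as soon as the next window would extend past the end of the stream.
    while lo + window_ms <= end_ms:
        yield lo, lo + window_ms
        lo += hop_ms


# Two seconds of stream time with the default window and hop sizes.
windows = list(sliding_windows(0, 2000))
for lo, hi in windows:
    print(f"[{lo}, {hi})")
```

With a hop smaller than the window, consecutive windows overlap, so each event is scored in several windows; that is worth keeping in mind when aggregating per-window metrics.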


IV. Protocol Mode Definitions


V. Track and Resource Constraints


VI. Statistics and Significance


VII. Machine-Readable Snippets (Ready to Embed)

# Offline protocol example
protocol:
  mode: "offline"
  seed: 1701
  repeats: 5
  temperature: 0.0
  context: {length: 4096, template_ref: "prompts/qa_v1@v1.0"}
  tools: {allowed: false, retrieval: false, open_book: false}
  runtime_limits: {timeout_s: 3600, memory_gb: 16}
  execution: {concurrency: 8, rate_limit_qps: 50, batching: {enabled: true, max_batch: 32}}
  logging: {format: "jsonl", fields: ["ts", "task_id", "item_id", "run_id", "latency_ms"], retention: "P30D"}
  reporting: {metrics: ["F1_macro", "ECE"], target_ci: {method: "bootstrap", level: 0.95}}
  see: ["EFT.WP.Data.ModelCards v1.0:Ch.11", "EFT.WP.Core.Metrology v1.0:check_dim"]
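
The offline example requests `target_ci: {method: "bootstrap", level: 0.95}` over `repeats: 5` runs with `seed: 1701`. A minimal Python sketch of one way to compute such an interval follows. It uses the percentile bootstrap on the mean, which is an assumption: the source names only "bootstrap" without fixing the variant. The helper name `bootstrap_ci`, the resample count, and the sample scores are likewise hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci(scores: List[float], level: float = 0.95,
                 n_resamples: int = 2000, seed: int = 1701) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical F1_macro values from five repeats (illustrative, not measured).
f1_runs = [0.812, 0.805, 0.819, 0.808, 0.815]
low, high = bootstrap_ci(f1_runs)
print(f"F1_macro 95% CI: [{low:.4f}, {high:.4f}]")
```

Because the generator is seeded from the protocol's `seed`, rerunning the computation on the same scores reproduces the same interval.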

# Online protocol example
protocol:
  mode: "online"
  seed: 1701
  repeats: 1
  online:
    traffic_allocation: {control: 0.5, treatment: 0.5}
    exposure: {shadow: true, canary: 0.05}
    guardrails: ["latency_ms.p99<=200", "error_rate<=0.01"]
  execution: {concurrency: 64, rate_limit_qps: 500, batching: {enabled: false}}
  logging: {format: "jsonl", fields: ["ts", "trace_id", "latency_ms", "error_code"], retention: "P30D"}
  reporting: {metrics: ["QPS", "latency_ms.p99"], target_ci: {method: "t", level: 0.95}}
  see: ["EFT.WP.Data.Pipeline v1.0:Ch.12", "EFT.WP.Core.Metrology v1.0:check_dim"]
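
The `guardrails` strings in the online example read as simple `metric op threshold` comparisons. The Python sketch below evaluates them against observed metrics. The grammar (one metric name, one of `<=`, `>=`, `<`, `>`, one numeric threshold) is inferred from the two examples and is an assumption, as are the helper name `violated_guardrails`, the observed values, and the choice to treat a missing metric as a violation.

```python
import re
from typing import Dict, List

# Assumed grammar: <metric name> <comparison operator> <number>
_GUARDRAIL = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*(<=|>=|<|>)\s*([0-9.eE+-]+)\s*$")
_OPS = {
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def violated_guardrails(guardrails: List[str], observed: Dict[str, float]) -> List[str]:
    """Return the guardrail expressions that the observed metrics fail."""
    failed = []
    for expr in guardrails:
        match = _GUARDRAIL.match(expr)
        if match is None:
            raise ValueError(f"unsupported guardrail syntax: {expr!r}")
        name, op, limit = match.group(1), match.group(2), float(match.group(3))
        # A missing metric counts as a violation (conservative assumption).
        if name not in observed or not _OPS[op](observed[name], limit):
            failed.append(expr)
    return failed


guardrails = ["latency_ms.p99<=200", "error_rate<=0.01"]
observed = {"latency_ms.p99": 187.0, "error_rate": 0.013}  # illustrative values
failed = violated_guardrails(guardrails, observed)
print(failed)
```

In a canary rollout, a non-empty result would typically be the signal to halt or roll back the treatment arm; the source does not specify that reaction, so it is left to the caller.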


VIII. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: PROTOCOL.MODE_ALLOWED
    when: "$.protocol.mode"
    assert: "value in ['offline','online','stream','interactive']"
    level: error
  - id: PROTOCOL.SEED_REPEATS
    when: "$.protocol"
    assert: "has_key(seed) and (mode != 'online' -> has_key(repeats))"
    level: error
  - id: PROTOCOL.FROZEN_SPLITS_REQUIRED
    when: "$.splits"
    assert: "splits.train.frozen and splits.val.frozen and splits.test.frozen"
    level: error
  - id: PROTOCOL.TOOLS_TRACK_CONSISTENCY
    when: "$.protocol.tools"
    assert: "value.allowed == false or has_key($.tracks)"
    level: error
  - id: ONLINE.TRAFFIC_SUM
    when: "$.protocol.online.traffic_allocation"
    assert: "abs(value.control + value.treatment - 1) <= 1e-6"
    level: error
  - id: STREAM.WINDOW_PARAMS
    when: "$.protocol.mode == 'stream'"
    assert: "has_keys($.protocol.stream.window_ms, $.protocol.stream.hop_ms, $.protocol.stream.max_latency_ms)"
    level: error
  - id: INTERACTIVE.ROUNDS_DEFINED
    when: "$.protocol.mode == 'interactive'"
    assert: "has_keys($.protocol.interactive.rounds, $.protocol.interactive.turn_timeout_s)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
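
The source does not specify an engine for the `when`/`assert` expressions. As a rough illustration, the Python sketch below hand-codes five of the rules above (`PROTOCOL.MODE_ALLOWED`, `PROTOCOL.SEED_REPEATS`, `ONLINE.TRAFFIC_SUM`, `STREAM.WINDOW_PARAMS`, `INTERACTIVE.ROUNDS_DEFINED`) against an already-parsed protocol document. The function name `lint_protocol` and both sample documents are hypothetical, and the remaining rules are omitted.

```python
from typing import Any, Dict, List

ALLOWED_MODES = {"offline", "online", "stream", "interactive"}


def lint_protocol(doc: Dict[str, Any]) -> List[str]:
    """Return the ids of the hand-coded lint rules that `doc` violates."""
    errors: List[str] = []
    protocol = doc.get("protocol", {})
    mode = protocol.get("mode")

    if mode not in ALLOWED_MODES:
        errors.append("PROTOCOL.MODE_ALLOWED")

    # seed is always required; repeats is required unless mode is 'online'.
    if "seed" not in protocol or (mode != "online" and "repeats" not in protocol):
        errors.append("PROTOCOL.SEED_REPEATS")

    # The traffic rule applies only when an allocation is present.
    allocation = protocol.get("online", {}).get("traffic_allocation")
    if allocation is not None:
        total = allocation.get("control", 0.0) + allocation.get("treatment", 0.0)
        if abs(total - 1) > 1e-6:
            errors.append("ONLINE.TRAFFIC_SUM")

    if mode == "stream":
        stream = protocol.get("stream", {})
        if not {"window_ms", "hop_ms", "max_latency_ms"}.issubset(stream):
            errors.append("STREAM.WINDOW_PARAMS")

    if mode == "interactive":
        interactive = protocol.get("interactive", {})
        if not {"rounds", "turn_timeout_s"}.issubset(interactive):
            errors.append("INTERACTIVE.ROUNDS_DEFINED")

    return errors


# Illustrative documents: one compliant, one violating three rules.
good = {"protocol": {"mode": "offline", "seed": 1701, "repeats": 5}}
bad = {"protocol": {"mode": "stream", "seed": 1701,
                    "online": {"traffic_allocation": {"control": 0.6, "treatment": 0.5}}}}
print(lint_protocol(good))
print(lint_protocol(bad))
```

Hand-coding keeps the sketch dependency-free; a production linter would more likely interpret the `when` paths and `assert` expressions generically.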


IX. Cross-Reference Anchors


X. Chapter Compliance Self-Check


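Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright of "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/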