Catalog Document - Technical White Paper 19 - EFT.WP.Methods.SynthData v1.0

Appendix C: Manifest Templates and Samples (synth manifest)


I. Scope and Objectives


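II. Key Space and Naming Conventions

  1. Root object: manifest.synth
  2. Naming hierarchy
    • trace.*: traceability and versioning
    • dataset.*: dataset and schema binding
    • engine.*: generation engine and reproducibility
    • generation.*: conditions, controls, and time-base path
    • metrics.*: fidelity/utility/drift metrics and their uncertainty
    • privacy.*: differential privacy and attack evaluation
    • contracts.*: contract evaluation and disposition
    • runtime.*: streaming operation and SLOs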
    • provenance.*: source and blob hashes
    • sign.*: verification and signing

III. Minimal Key Set and Measurement Conventions

  1. version: manifest version, following semantic versioning.
  2. trace.TraceID: global trace ID; trace.build, trace.commit, trace.timestamp.
  3. dataset.name, dataset.tag, dataset.modality ∈ {tabular,image,text,audio,graph,timeseries}, dataset.SRef, dataset.sref_hash.
  4. dataset.n_real, dataset.n_syn, dataset.split ∈ {train,valid,test,release}.
  5. engine.type ∈ {copula,glm,vae,gan,flow,diffusion,scm}, engine.version, engine.seed, engine.rng, engine.spec_uri, engine.train_data_ref.
  6. generation.condition (the declaration of c and its dictionary convention), generation.controls (e.g. cfg_scale, top_p, temperature), generation.schedule.
  7. generation.timepath.*:
    • ts_timezone, tau_mono_origin, offset, skew, J.
    • c_ref, T_arr_form1 = ( 1 / c_ref ) * ( ∫ n_eff d ell ), T_arr_form2 = ( ∫ ( n_eff / c_ref ) d ell ), delta_form = | T_arr_form1 − T_arr_form2 |.
  8. metrics.*: name, value, u(value), unit(value), window, details (kernel, embedding, bandwidth, etc.). In the manifest, u(value) and unit(value) are serialized as the keys u and unit.
  9. privacy.dp.eps_total, privacy.dp.delta_total, privacy.dp.accounting_method, privacy.attack_suite = {membership,linkability,attribute}, privacy.MI_risk.
  10. contracts[]: id, expr, tol, severity ∈ {info,warn,block}, result ∈ {pass,fail}, evidence_ref, action_plan.
  11. runtime.rho, runtime.latency_ms_p99, runtime.drop_rate, runtime.window.
  12. provenance.source_hash, provenance.blob_hash = hash_sha256(blob).
  13. sign.method, sign.signer, sign.signature, sign.timestamp.
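The enumerations in the key set above lend themselves to a mechanical check before a manifest is frozen. The following is a minimal sketch, assuming the manifest has already been parsed into nested Python dictionaries. The names validate_manifest, get_path, REQUIRED and ENUMS are hypothetical and not part of the specification, and REQUIRED lists only a small illustrative subset of the keys in this section.

```python
from typing import Any, Dict, List

# Enumerations copied from Section III. All names in this block are illustrative.
ENUMS = {
    "dataset.modality": {"tabular", "image", "text", "audio", "graph", "timeseries"},
    "dataset.split": {"train", "valid", "test", "release"},
    "engine.type": {"copula", "glm", "vae", "gan", "flow", "diffusion", "scm"},
}
# Assumption: only a subset of the minimal key set is checked in this sketch.
REQUIRED = [
    "version", "trace.TraceID", "dataset.name", "dataset.modality",
    "dataset.split", "engine.type", "engine.seed", "engine.rng",
]
SEVERITIES = {"info", "warn", "block"}
RESULTS = {"pass", "fail"}
_MISSING = object()


def get_path(doc: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted key path; return _MISSING if any segment is absent."""
    node: Any = doc
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def validate_manifest(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors: List[str] = []
    for key in REQUIRED:
        if get_path(doc, key) is _MISSING:
            errors.append(f"missing key: {key}")
    for key, allowed in ENUMS.items():
        value = get_path(doc, key)
        if value is not _MISSING and value not in allowed:
            errors.append(f"{key}={value!r} not in {sorted(allowed)}")
    for i, contract in enumerate(doc.get("contracts", [])):
        if contract.get("severity") not in SEVERITIES:
            errors.append(f"contracts[{i}].severity invalid")
        if contract.get("result") not in RESULTS:
            errors.append(f"contracts[{i}].result invalid")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a release report show every violation at once.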

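IV. Units, Dimensions, and Uncertainty

  1. Declare unit(x) and dim(x) explicitly for every field that enters an equation; examples:
    • unit(W1)="feature_unit", dim(W1)="-"
    • unit(T_arr)="s", dim(T_arr)="[T]"
  2. Publishing uncertainty: report u(metric) or an interval {lo, hi}; either may be derived from a bootstrap or from posterior quantiles, and the origin is recorded in metrics.details.source ∈ {bootstrap,posterior,analytic}.
  3. Dimensional check: publish only after check_dim( y - f(x) ) = pass.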
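The bootstrap route to u(metric) in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the white paper's reference procedure: the metric is a one-dimensional Wasserstein-1 distance between equal-size samples, u is taken as the standard deviation of the bootstrap replicates, and {lo, hi} is an approximate 95% percentile interval. The function names w1_1d and bootstrap_metric are hypothetical.

```python
import random
import statistics


def w1_1d(x, y):
    """Wasserstein-1 between two equal-size 1-D samples:
    the mean absolute gap between their sorted values."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def bootstrap_metric(real, syn, metric, n_boot=200, seed=0):
    """Resample both sets with replacement and summarize the metric's spread."""
    rng = random.Random(seed)  # fixed seed keeps the published u reproducible
    point = metric(real, syn)
    reps = []
    for _ in range(n_boot):
        r = rng.choices(real, k=len(real))
        s = rng.choices(syn, k=len(syn))
        reps.append(metric(r, s))
    reps.sort()
    lo = reps[int(0.025 * (n_boot - 1))]  # approximate 2.5th percentile
    hi = reps[int(0.975 * (n_boot - 1))]  # approximate 97.5th percentile
    return {
        "value": point,
        "u": statistics.stdev(reps),
        "lo": lo,
        "hi": hi,
        "details": {"source": "bootstrap", "n_boot": n_boot},
    }


if __name__ == "__main__":
    gen = random.Random(1)
    real = [gen.gauss(0.0, 1.0) for _ in range(300)]
    syn = [gen.gauss(0.1, 1.0) for _ in range(300)]
    print(bootstrap_metric(real, syn, w1_1d, n_boot=100, seed=7))
```

Resampling adds noise of its own, so for small samples the replicates of a distance metric can sit above the point estimate; the interval is best read as indicative.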

V. Template (Minimal YAML Skeleton)

version: "1.0.0"
trace:
  TraceID: "trc-xxxxxxxx"
  build: "2025.09.01"
  commit: "abcdef12"
  timestamp: "2025-09-01T12:00:00Z"
dataset:
  name: "synth-demo"
  tag: "r1"
  modality: "tabular"
  SRef: "SRef-2025A"
  sref_hash: "sha256:..."
  n_real: 120000
  n_syn: 120000
  split: "release"
engine:
  type: "diffusion"
  version: "2.1.0"
  seed: 20250901
  rng: "pcg64"
  spec_uri: "s3://specs/eng.json"
  train_data_ref: "lake://real/train@sha256:..."
generation:
  condition:
    c_schema: "prompt|policy|conditioning-keys"
    c_payload: {}
  controls:
    cfg_scale: 6.0
    temperature: 1.0
  schedule:
    batches: 240
    batch_size: 512
  timepath:
    ts_timezone: "UTC"
    tau_mono_origin: "2025-09-01T00:00:00Z"
    offset: 0.001
    skew: 1.0e-6
    J: 0.0005
    c_ref: 2.99792458e8
    T_arr_form1: "( 1 / c_ref ) * ( ∫ n_eff d ell )"
    T_arr_form2: "( ∫ ( n_eff / c_ref ) d ell )"
    delta_form: 1.2e-9
metrics:
  - name: "W1"
    value: 0.034
    u: 0.006
    unit: "feature_unit"
    window: "all"
    details: {distance: "Wasserstein-1", feature_space: "scaled-numeric"}
  - name: "MMD_RBF"
    value: 0.012
    u: 0.004
    unit: "-"
    window: "all"
    details: {kernel: "rbf", bandwidth: 1.0}
  - name: "utility_gap_auc"
    value: -0.005
    u: 0.003
    unit: "-"
    window: "valid"
    details: {model: "xgb", metric: "AUC"}
privacy:
  dp:
    eps_total: 2.0
    delta_total: 1.0e-6
    accounting_method: "moments"
  attack_suite: ["membership","linkability"]
  MI_risk: 0.03
contracts:
  - id: "C40-121"
    expr: "FID ≤ tol_FID ∧ KID ≤ tol_KID"
    tol: {FID: 15.0, KID: 0.02}
    severity: "warn"
    result: "pass"
    evidence_ref: ["metrics:FID","metrics:KID"]
    action_plan: "none"
  - id: "C40-141"
    expr: "eps_total ≤ eps_budget ∧ delta_total ≤ delta_budget"
    tol: {eps_budget: 3.0, delta_budget: 1.0e-5}
    severity: "block"
    result: "pass"
    evidence_ref: ["privacy.dp"]
    action_plan: "none"
runtime:
  rho: 0.73
  latency_ms_p99: 420
  drop_rate: 0.002
  window: "1h"
provenance:
  source_hash: "sha256:..."
  blob_hash: "sha256:..."
sign:
  method: "ed25519"
  signer: "release-bot@org"
  signature: "base64:..."
  timestamp: "2025-09-01T12:01:00Z"


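VI. Sample A (Tabular: DP Synthesis, Release Build)

  1. Setup
    • modality="tabular", engine.type="copula", privacy.dp=(eps_total=1.5, delta_total=1.0e-6).
    • Fidelity thresholds: W1 ≤ 0.05, MMD_RBF ≤ 0.02; utility non-inferiority: utility_gap_auc ≥ -0.01.
  2. Key differences in the persisted manifest
    • engine.type="copula", engine.version="1.4.2"; controls is left empty.
    • If samples are reweighted, metrics publishes the effective sample size n_eff = ( (∑ w)^2 ) / ( ∑ w^2 ), annotated in details.
  3. Snippet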

engine: {type: "copula", version: "1.4.2", seed: 42, rng: "pcg64", spec_uri: "s3://specs/copula.json"}
metrics:
  - {name: "W1", value: 0.028, u: 0.005, unit: "feature_unit", window: "all", details: {space: "num+onehot"}}
  - {name: "MMD_RBF", value: 0.010, u: 0.003, unit: "-", window: "all", details: {kernel: "rbf", bandwidth: 0.8}}
  - {name: "utility_gap_auc", value: -0.006, u: 0.002, unit: "-", window: "valid", details: {model: "logit"}}
privacy:
  dp: {eps_total: 1.5, delta_total: 1.0e-6, accounting_method: "advanced-composition"}
contracts:
  - {id: "C40-121", expr: "W1 ≤ 0.05 ∧ MMD_RBF ≤ 0.02", tol: {W1: 0.05, MMD_RBF: 0.02}, severity: "warn", result: "pass", evidence_ref: ["metrics:W1","metrics:MMD_RBF"], action_plan: "none"}
  - {id: "C40-141", expr: "eps_total ≤ 2.0 ∧ delta_total ≤ 1.0e-5", tol: {eps_budget: 2.0, delta_budget: 1.0e-5}, severity: "block", result: "pass", evidence_ref: ["privacy.dp"], action_plan: "none"}
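The effective sample size named in item 2 of this sample, n_eff = ( (∑ w)^2 ) / ( ∑ w^2 ), is straightforward to compute and attach to a metric entry. The sketch below assumes non-negative importance weights. The helper name and the details keys n_eff and reweighted are hypothetical, since the source only says the value is annotated in details.

```python
def effective_sample_size(weights):
    """Effective sample size: (sum of w)^2 / (sum of w^2)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    if any(w < 0 for w in weights):
        raise ValueError("this sketch assumes non-negative weights")
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s2 == 0:
        raise ValueError("at least one weight must be positive")
    return (s1 * s1) / s2


def annotate_metric(metric, weights):
    """Return a copy of a metrics entry with n_eff recorded under details.
    The key names 'n_eff' and 'reweighted' are assumptions for illustration."""
    out = dict(metric)
    details = dict(out.get("details", {}))
    details["reweighted"] = True
    details["n_eff"] = effective_sample_size(weights)
    out["details"] = details
    return out


if __name__ == "__main__":
    entry = {"name": "W1", "value": 0.028, "u": 0.005,
             "unit": "feature_unit", "window": "all",
             "details": {"space": "num+onehot"}}
    print(annotate_metric(entry, [2.0, 1.0, 1.0]))
```

Uniform weights give n_eff equal to the sample count, and a single dominant weight drives n_eff toward 1, which makes the value a quick indicator of how much the reweighting has concentrated the sample.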


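VII. Sample B (Imaging: Diffusion Generation + Arrival-Time Consistency)

  1. Setup
    • modality="image", engine.type="diffusion"; the embedding metrics are FID/KID, with the feature-extraction network and layer declared.
    • The path and both arrival-time forms are persisted side by side, with delta_form ≤ tol_Tarr.
  2. Snippet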

dataset: {name: "synth-imaging", tag: "r2", modality: "image", SRef: "SRef-IMG-2025B", sref_hash: "sha256:..."}
engine: {type: "diffusion", version: "2.3.0", seed: 1337, rng: "philox", spec_uri: "s3://specs/diff.json"}
generation:
  condition: {c_schema: "text-prompt", c_payload: {"scene": "factory floor", "illum": "D65"}}
  controls: {cfg_scale: 7.5, sampler: "ddim", steps: 30}
  timepath:
    ts_timezone: "UTC"
    tau_mono_origin: "2025-09-01T00:00:00Z"
    offset: 0.0007
    skew: 7.0e-7
    J: 0.0003
    c_ref: 2.99792458e8
    T_arr_form1: "( 1 / c_ref ) * ( ∫ n_eff d ell )"
    T_arr_form2: "( ∫ ( n_eff / c_ref ) d ell )"
    delta_form: 9.0e-10
metrics:
  - {name: "FID", value: 11.8, u: 1.4, unit: "-", window: "all", details: {embed_net: "InceptionV3", layer: "pool3"}}
  - {name: "KID", value: 0.013, u: 0.003, unit: "-", window: "all", details: {estimator: "poly-kernel"}}
  - {name: "coverage", value: 0.92, u: 0.02, unit: "-", window: "all", details: {bins: 64}}
contracts:
  - {id: "C40-022", expr: "delta_form ≤ tol_Tarr", tol: {tol_Tarr: 1.0e-9}, severity: "block", result: "pass", evidence_ref: ["generation.timepath"], action_plan: "none"}
  - {id: "C40-121", expr: "FID ≤ 15 ∧ KID ≤ 0.02", tol: {FID: 15.0, KID: 0.02}, severity: "warn", result: "pass", evidence_ref: ["metrics:FID","metrics:KID"], action_plan: "none"}
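The two arrival-time forms in generation.timepath differ only in whether 1 / c_ref is applied outside or inside the path integral, so delta_form measures numerical disagreement between the two evaluation orders. The sketch below illustrates this with a trapezoidal rule. The discretization, the sample profile, and the reading of n_eff as a dimensionless quantity sampled along a path ell in metres are all assumptions made for illustration; the source does not specify them.

```python
import math

C_REF = 2.99792458e8  # m/s, the c_ref value used in the manifest samples


def trapezoid(ys, xs):
    """Composite trapezoidal rule over possibly non-uniform nodes."""
    if len(ys) != len(xs) or len(xs) < 2:
        raise ValueError("need matching arrays with at least two nodes")
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def arrival_times(n_eff, ell, c_ref=C_REF):
    """Evaluate both forms and their absolute difference.

    form1 = (1 / c_ref) * integral of n_eff d ell
    form2 = integral of (n_eff / c_ref) d ell
    """
    form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)
    form2 = trapezoid([n / c_ref for n in n_eff], ell)
    return form1, form2, abs(form1 - form2)


if __name__ == "__main__":
    # Hypothetical 1 km path with a small sinusoidal perturbation of n_eff.
    ell = [float(i) for i in range(1001)]
    n_eff = [1.0 + 1e-4 * math.sin(x / 100.0) for x in ell]
    form1, form2, delta_form = arrival_times(n_eff, ell)
    tol_Tarr = 1.0e-9  # tolerance from contract C40-022
    print(form1, form2, delta_form, delta_form <= tol_Tarr)
```

With a constant c_ref the two forms are mathematically identical, so any nonzero delta_form in this sketch comes from floating-point rounding alone.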


VIII. Incremental Manifest for Streaming Operation (Windowed)

runtime:
  window: "5m"
  rho: 0.81
  latency_ms_p99: 510
  drop_rate: 0.004
metrics_windowed:
  - {name: "W1_cur", value: 0.041, u: 0.008, unit: "feature_unit", window: "2025-09-01T12:00Z/12:05Z"}
  - {name: "psi_cur", value: 0.07, u: 0.01, unit: "-", window: "2025-09-01T12:00Z/12:05Z"}
contracts_windowed:
  - {id: "C40-184", expr: "W1_cur ≤ 0.06 ∧ psi_cur ≤ 0.1", tol: {W1_cur: 0.06, psi_cur: 0.1}, severity: "warn", result: "pass", evidence_ref: ["metrics_windowed:W1_cur","metrics_windowed:psi_cur"], action_plan: "none"}
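A windowed contract such as C40-184 can be evaluated directly from its tol mapping without parsing expr. The sketch below assumes every entry in tol is an upper bound on the metric of the same name, which holds for this sample but is not stated as a general rule in the source. The function name and the optional coverage factor k, which tightens the check to value + k·u ≤ bound, are hypothetical additions.

```python
def evaluate_upper_bounds(metrics, tol, k=0.0):
    """Check value + k * u <= bound for every metric named in tol.

    metrics: list of entries shaped like metrics_windowed items
    tol:     mapping of metric name to upper bound (assumed semantics)
    k:       coverage factor on the published uncertainty; 0 ignores u
    """
    by_name = {m["name"]: m for m in metrics}
    failures = []
    for name, bound in tol.items():
        m = by_name.get(name)
        if m is None:
            failures.append(f"{name}: metric missing")
            continue
        guarded = m["value"] + k * m.get("u", 0.0)
        if guarded > bound:
            failures.append(f"{name}: {guarded:.6g} > {bound:.6g}")
    return {"result": "pass" if not failures else "fail", "failures": failures}


if __name__ == "__main__":
    window = "2025-09-01T12:00Z/12:05Z"
    metrics_windowed = [
        {"name": "W1_cur", "value": 0.041, "u": 0.008, "window": window},
        {"name": "psi_cur", "value": 0.07, "u": 0.01, "window": window},
    ]
    tol = {"W1_cur": 0.06, "psi_cur": 0.1}
    print(evaluate_upper_bounds(metrics_windowed, tol))
    print(evaluate_upper_bounds(metrics_windowed, tol, k=3.0))
```

With k=0 the check reproduces the plain comparison in expr; a positive k is one way to let the published uncertainty influence a pass near the threshold.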


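IX. Key Points for Verification, Signing, and Release

  1. Before freezing the manifest
    • Run assert_synth_contract(ds_syn, rules) -> report, populate every contracts.* field, and record result.
    • Traceability and signing: provenance.blob_hash = hash_sha256(blob); sign.signature is produced with sign.method.
    • Dimension and formula checks: check_dim( y - f(x) ) = pass; delta_form ≤ tol_Tarr.
  2. Release admission rule
    pass = ( ∧_i contracts[i].result = pass ) ∧ ( metrics valid ) ∧ ( privacy budget sufficient ) ∧ ( signature complete ).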
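The admission rule above can be sketched as a single predicate over a parsed manifest. Several interpretations here are assumptions: "metrics valid" is read as finite values with non-negative u, "privacy budget sufficient" as eps_total and delta_total not exceeding caller-supplied budgets, and "signature complete" as a non-empty sign.signature. The sketch does not verify the signature cryptographically, which for ed25519 would need a library outside the Python standard library. The "sha256:" prefix on the hash follows the manifest samples.

```python
import hashlib
import math


def hash_sha256(blob: bytes) -> str:
    """Hex SHA-256 of a byte blob, prefixed as in the manifest samples."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def release_gate(manifest, eps_budget, delta_budget):
    """Evaluate the four conjuncts of the release admission rule."""
    contracts = manifest.get("contracts", [])
    # An empty contract list passes vacuously, as a conjunction over no terms.
    contracts_ok = all(c.get("result") == "pass" for c in contracts)

    def metric_ok(m):
        value = m.get("value")
        u = m.get("u", 0.0)
        return isinstance(value, (int, float)) and math.isfinite(value) and u >= 0

    metrics_ok = all(metric_ok(m) for m in manifest.get("metrics", []))

    dp = manifest.get("privacy", {}).get("dp", {})
    privacy_ok = (dp.get("eps_total", math.inf) <= eps_budget
                  and dp.get("delta_total", math.inf) <= delta_budget)

    signed = bool(manifest.get("sign", {}).get("signature"))  # presence only

    checks = {"contracts": contracts_ok, "metrics": metrics_ok,
              "privacy": privacy_ok, "sign": signed}
    return {"pass": all(checks.values()), "checks": checks}


if __name__ == "__main__":
    manifest = {
        "metrics": [{"name": "W1", "value": 0.028, "u": 0.005}],
        "privacy": {"dp": {"eps_total": 1.5, "delta_total": 1.0e-6}},
        "contracts": [{"id": "C40-141", "severity": "block", "result": "pass"}],
        "provenance": {"blob_hash": hash_sha256(b"example blob")},
        "sign": {"method": "ed25519", "signature": "base64:..."},
    }
    print(release_gate(manifest, eps_budget=2.0, delta_budget=1.0e-5))
```

Returning the individual checks alongside the overall verdict makes it easier to see which conjunct blocked a release.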

X. Cross-References


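Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Suggested attribution: Author: "屠广林"; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/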