19-EFT.WP.Methods.SynthData v1.0
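I. Scope and Objectives
- Defines the minimal key set for manifest.synth, its measurement conventions and units, the mapping from contracts to evidence, ready-to-persist YAML/JSON templates, and two worked samples (tabular and imaging).
- Cross-references: for the cleaning manifest and signatures, see Methods.Cleaning v1.0, Chapter 10; for the time base and arrival time, see Chapters 6 and 12; for statistical metrics and intervals, see Methods.CrossStats v1.0, Appendices C/D/E; for geometric and physical consistency, see Methods.Imaging v1.0, Chapters 9 and 14.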
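II. Key Space and Naming Conventions
- Root object: manifest.synth
- Naming hierarchy
  - trace.*: traceability and versioning
  - dataset.*: dataset and schema binding
  - engine.*: generation engine and reproducibility
  - generation.*: conditions, controls, and time-base path
  - metrics.*: fidelity, utility, and drift metrics with their uncertainty
  - privacy.*: differential privacy and attack evaluation
  - contracts.*: contract evaluation and disposition
  - runtime.*: streaming operation and SLOs
  - provenance.*: source and blob hashes
  - sign.*: verification and signature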
III. Minimal Key Set and Measurement Conventions
- version: manifest version, following semantic versioning.
- trace.TraceID: global trace ID; also trace.build, trace.commit, trace.timestamp.
- dataset.name, dataset.tag, dataset.modality ∈ {tabular,image,text,audio,graph,timeseries}, dataset.SRef, dataset.sref_hash.
- dataset.n_real, dataset.n_syn, dataset.split ∈ {train,valid,test,release}.
- engine.type ∈ {copula,glm,vae,gan,flow,diffusion,scm}, engine.version, engine.seed, engine.rng, engine.spec_uri, engine.train_data_ref.
- generation.condition (the declaration of c and its dictionary convention), generation.controls (e.g. cfg_scale, top_p, temperature), generation.schedule.
- generation.timepath.*:
  - ts_timezone, tau_mono_origin, offset, skew, J.
  - c_ref, T_arr_form1 = ( 1 / c_ref ) * ( ∫ n_eff d ell ), T_arr_form2 = ( ∫ ( n_eff / c_ref ) d ell ), delta_form = | T_arr_form1 − T_arr_form2 |.
- metrics.*: name, value, u(value), unit(value), window, details (kernel, embedding, bandwidth, etc.). In the templates, u(value) and unit(value) are stored under the keys u and unit.
- privacy.dp.eps_total, privacy.dp.delta_total, privacy.dp.accounting_method, privacy.attack_suite = {membership,linkability,attribute}, privacy.MI_risk.
- contracts[]: id, expr, tol, severity ∈ {info,warn,block}, result ∈ {pass,fail}, evidence_ref, action_plan.
- runtime.rho, runtime.latency_ms_p99, runtime.drop_rate, runtime.window.
- provenance.source_hash, provenance.blob_hash = hash_sha256(blob).
- sign.method, sign.signer, sign.signature, sign.timestamp.
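The two arrival-time forms in generation.timepath can be compared numerically. The sketch below is one possible way to do it, assuming c_ref is a constant scalar (as the single manifest key suggests) and that n_eff is available as samples along the path ell. The helper names, the trapezoidal rule, and the sample path values are illustrative assumptions, not part of the specification.

```python
import math

C_REF = 2.99792458e8  # c_ref as used in the template (m/s)


def trapezoid(ys, xs):
    """Composite trapezoidal rule for samples ys taken at positions xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def arrival_time_forms(n_eff, ell, c_ref=C_REF):
    """Return (T_arr_form1, T_arr_form2, delta_form) for a sampled path."""
    # Form 1: integrate n_eff along the path, then divide once by c_ref.
    form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)
    # Form 2: divide every sample by c_ref first, then integrate.
    form2 = trapezoid([n / c_ref for n in n_eff], ell)
    return form1, form2, abs(form1 - form2)


# Hypothetical path: 1000 m sampled every metre, with made-up n_eff values.
ell = [float(i) for i in range(1001)]
n_eff = [1.0003 + 1.0e-5 * math.sin(x / 100.0) for x in ell]

form1, form2, delta_form = arrival_time_forms(n_eff, ell)
print(f"T_arr_form1={form1:.12e} s")
print(f"T_arr_form2={form2:.12e} s")
print(f"delta_form={delta_form:.3e} s")
```

With a constant c_ref the two forms are algebraically identical, so under this assumption delta_form only reflects floating-point and discretisation effects; a value well above that level would point to an inconsistency in how the two forms were produced.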
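IV. Units, Dimensions, and Uncertainty
- Declare unit(x) and dim(x) explicitly for every field that enters an equation; for example:
  - unit(W1)="feature_unit", dim(W1)="-"
  - unit(T_arr)="s", dim(T_arr)="[T]"
- Publishing uncertainty: report u(metric) or an interval {lo, hi}; either may be derived from bootstrap resampling or posterior quantiles, and the origin is recorded in metrics.details.source ∈ {bootstrap,posterior,analytic}.
- Dimensional check: publish only after check_dim( y - f(x) ) = pass.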
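As an illustration of the bootstrap route to u(metric), the following sketch resamples two one-dimensional samples and reports a metrics entry with value, u, and a percentile interval. It is a minimal sketch under stated assumptions: the function names, the "interval" key, the n_boot field, and the synthetic Gaussian data are all hypothetical, and the W1 shortcut only holds for equal-size one-dimensional samples.

```python
import random
import statistics


def w1_equal_size(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def bootstrap_metric(real, syn, metric, n_boot=200, seed=0):
    """Return value, u (bootstrap std. dev.), and a 95% percentile interval."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        # Resample each data set with replacement, keeping its size.
        r = rng.choices(real, k=len(real))
        s = rng.choices(syn, k=len(syn))
        reps.append(metric(r, s))
    reps.sort()
    return {
        "value": metric(real, syn),
        "u": statistics.stdev(reps),
        "interval": {
            "lo": reps[int(0.025 * (n_boot - 1))],
            "hi": reps[int(0.975 * (n_boot - 1))],
        },
        "details": {"source": "bootstrap", "n_boot": n_boot},
    }


# Made-up data standing in for one scaled numeric feature.
data_rng = random.Random(42)
real = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
syn = [data_rng.gauss(0.05, 1.0) for _ in range(500)]

entry = bootstrap_metric(real, syn, w1_equal_size)
entry.update({"name": "W1", "unit": "feature_unit", "window": "all"})
print(entry)
```

Fixing the seed keeps the reported u reproducible, which fits the manifest's emphasis on engine.seed and engine.rng.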
V. Template (minimal YAML skeleton)
version: "1.0.0"
trace:
  TraceID: "trc-xxxxxxxx"
  build: "2025.09.01"
  commit: "abcdef12"
  timestamp: "2025-09-01T12:00:00Z"
dataset:
  name: "synth-demo"
  tag: "r1"
  modality: "tabular"
  SRef: "SRef-2025A"
  sref_hash: "sha256:..."
  n_real: 120000
  n_syn: 120000
  split: "release"
engine:
  type: "diffusion"
  version: "2.1.0"
  seed: 20250901
  rng: "pcg64"
  spec_uri: "s3://specs/eng.json"
  train_data_ref: "lake://real/train@sha256:..."
generation:
  condition:
    c_schema: "prompt|policy|conditioning-keys"
    c_payload: {}
  controls:
    cfg_scale: 6.0
    temperature: 1.0
  schedule:
    batches: 240
    batch_size: 512
  timepath:
    ts_timezone: "UTC"
    tau_mono_origin: "2025-09-01T00:00:00Z"
    offset: 0.001
    skew: 1.0e-6
    J: 0.0005
    c_ref: 2.99792458e8
    T_arr_form1: "( 1 / c_ref ) * ( ∫ n_eff d ell )"
    T_arr_form2: "( ∫ ( n_eff / c_ref ) d ell )"
    delta_form: 1.2e-9
metrics:
  - name: "W1"
    value: 0.034
    u: 0.006
    unit: "feature_unit"
    window: "all"
    details: {distance: "Wasserstein-1", feature_space: "scaled-numeric"}
  - name: "MMD_RBF"
    value: 0.012
    u: 0.004
    unit: "-"
    window: "all"
    details: {kernel: "rbf", bandwidth: 1.0}
  - name: "utility_gap_auc"
    value: -0.005
    u: 0.003
    unit: "-"
    window: "valid"
    details: {model: "xgb", metric: "AUC"}
privacy:
  dp:
    eps_total: 2.0
    delta_total: 1.0e-6
    accounting_method: "moments"
  attack_suite: ["membership","linkability"]
  MI_risk: 0.03
contracts:
  - id: "C40-121"
    expr: "FID ≤ tol_FID ∧ KID ≤ tol_KID"
    tol: {FID: 15.0, KID: 0.02}
    severity: "warn"
    result: "pass"
    evidence_ref: ["metrics:FID","metrics:KID"]
    action_plan: "none"
  - id: "C40-141"
    expr: "eps_total ≤ eps_budget ∧ delta_total ≤ delta_budget"
    tol: {eps_budget: 3.0, delta_budget: 1.0e-5}
    severity: "block"
    result: "pass"
    evidence_ref: ["privacy.dp"]
    action_plan: "none"
runtime:
  rho: 0.73
  latency_ms_p99: 420
  drop_rate: 0.002
  window: "1h"
provenance:
  source_hash: "sha256:..."
  blob_hash: "sha256:..."
sign:
  method: "ed25519"
  signer: "release-bot@org"
  signature: "base64:..."
  timestamp: "2025-09-01T12:01:00Z"
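Before a manifest like the one above is frozen, its keys can be checked mechanically. The sketch below assumes the manifest has already been loaded into a Python dict (loading YAML is left out to keep the example dependency-free). REQUIRED_KEYS is an illustrative subset of the Section III key set, and validate_manifest and lookup are hypothetical helper names, not functions defined by this specification.

```python
# Illustrative subset of the Section III key set (an assumption, not the full list).
REQUIRED_KEYS = [
    "version", "trace.TraceID", "dataset.name", "dataset.modality",
    "dataset.split", "engine.type", "engine.seed",
    "privacy.dp.eps_total", "privacy.dp.delta_total",
    "provenance.blob_hash", "sign.signature",
]

# Enumerations copied from Section III.
ENUMS = {
    "dataset.modality": ("tabular", "image", "text", "audio", "graph", "timeseries"),
    "dataset.split": ("train", "valid", "test", "release"),
    "engine.type": ("copula", "glm", "vae", "gan", "flow", "diffusion", "scm"),
}

_MISSING = object()


def lookup(manifest, dotted):
    """Resolve a dotted key such as 'privacy.dp.eps_total' in nested dicts."""
    node = manifest
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def validate_manifest(manifest):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if lookup(manifest, key) is _MISSING:
            problems.append(f"missing key: {key}")
    for key, allowed in ENUMS.items():
        value = lookup(manifest, key)
        if value is not _MISSING and value not in allowed:
            problems.append(f"{key}={value!r} is not one of {list(allowed)}")
    return problems


manifest = {
    "version": "1.0.0",
    "trace": {"TraceID": "trc-xxxxxxxx"},
    "dataset": {"name": "synth-demo", "modality": "tabular", "split": "release"},
    "engine": {"type": "diffusion", "seed": 20250901},
    "privacy": {"dp": {"eps_total": 2.0, "delta_total": 1.0e-6}},
    "provenance": {"blob_hash": "sha256:..."},
    "sign": {"signature": "base64:..."},
}
problems = validate_manifest(manifest)
print(problems or "minimal key set satisfied")
```

Returning a list of problems rather than raising on the first one lets a release tool report every gap in a single pass.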
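VI. Sample A (tabular: DP synthesis, release build)
- Setup
  - modality="tabular", engine.type="copula", privacy.dp=(eps_total=1.5, delta_total=1.0e-6).
  - Fidelity thresholds: W1 ≤ 0.05, MMD_RBF ≤ 0.02; utility non-inferiority: utility_gap_auc ≥ -0.01.
- Key differences from the template
  - engine.type="copula", engine.version="1.4.2"; controls is left empty.
  - If samples are reweighted, metrics publishes the effective sample size n_eff = ( (∑ w)^2 ) / ( ∑ w^2 ) and notes it in details. This n_eff is a sample count and is distinct from the n_eff inside the arrival-time integrals of generation.timepath.
- Snippet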
engine: {type: "copula", version: "1.4.2", seed: 42, rng: "pcg64", spec_uri: "s3://specs/copula.json"}
metrics:
  - {name: "W1", value: 0.028, u: 0.005, unit: "feature_unit", window: "all", details: {space: "num+onehot"}}
  - {name: "MMD_RBF", value: 0.010, u: 0.003, unit: "-", window: "all", details: {kernel: "rbf", bandwidth: 0.8}}
  - {name: "utility_gap_auc", value: -0.006, u: 0.002, unit: "-", window: "valid", details: {model: "logit"}}
privacy:
  dp: {eps_total: 1.5, delta_total: 1.0e-6, accounting_method: "advanced-composition"}
contracts:
  - {id: "C40-121", expr: "W1 ≤ 0.05 ∧ MMD_RBF ≤ 0.02", tol: {W1: 0.05, MMD_RBF: 0.02}, severity: "warn", result: "pass", evidence_ref: ["metrics:W1","metrics:MMD_RBF"], action_plan: "none"}
  - {id: "C40-141", expr: "eps_total ≤ 2.0 ∧ delta_total ≤ 1.0e-5", tol: {eps_budget: 2.0, delta_budget: 1.0e-5}, severity: "block", result: "pass", evidence_ref: ["privacy.dp"], action_plan: "none"}
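The effective sample size formula listed under Sample A's key differences, n_eff = ( (∑ w)^2 ) / ( ∑ w^2 ), can be computed as follows. This is a minimal sketch; the function name and the example weights are hypothetical.

```python
def effective_sample_size(weights):
    """Effective sample size n_eff = (sum w)^2 / (sum w^2) for weights w."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return (total * total) / total_sq


# Equal weights keep the full sample size; a single dominant weight collapses it.
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))  # 4.0
print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # 1.0
```

A value far below the raw row count signals that a few heavily weighted rows dominate the reported metrics, which is why the sample notes it in details.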
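VII. Sample B (imaging: diffusion generation with arrival-time consistency)
- Setup
  - modality="image", engine.type="diffusion"; the embedding metrics are FID and KID, with the feature-extraction network and layer declared.
  - The path and both arrival-time forms are persisted side by side, subject to delta_form ≤ tol_Tarr.
- Snippet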
dataset: {name: "synth-imaging", tag: "r2", modality: "image", SRef: "SRef-IMG-2025B", sref_hash: "sha256:..."}
engine: {type: "diffusion", version: "2.3.0", seed: 1337, rng: "philox", spec_uri: "s3://specs/diff.json"}
generation:
  condition: {c_schema: "text-prompt", c_payload: {"scene": "factory floor", "illum": "D65"}}
  controls: {cfg_scale: 7.5, sampler: "ddim", steps: 30}
  timepath:
    ts_timezone: "UTC"
    tau_mono_origin: "2025-09-01T00:00:00Z"
    offset: 0.0007
    skew: 7.0e-7
    J: 0.0003
    c_ref: 2.99792458e8
    T_arr_form1: "( 1 / c_ref ) * ( ∫ n_eff d ell )"
    T_arr_form2: "( ∫ ( n_eff / c_ref ) d ell )"
    delta_form: 9.0e-10
metrics:
  - {name: "FID", value: 11.8, u: 1.4, unit: "-", window: "all", details: {embed_net: "InceptionV3", layer: "pool3"}}
  - {name: "KID", value: 0.013, u: 0.003, unit: "-", window: "all", details: {estimator: "poly-kernel"}}
  - {name: "coverage", value: 0.92, u: 0.02, unit: "-", window: "all", details: {bins: 64}}
contracts:
  - {id: "C40-022", expr: "delta_form ≤ tol_Tarr", tol: {tol_Tarr: 1.0e-9}, severity: "block", result: "pass", evidence_ref: ["generation.timepath"], action_plan: "none"}
  - {id: "C40-121", expr: "FID ≤ 15 ∧ KID ≤ 0.02", tol: {FID: 15.0, KID: 0.02}, severity: "warn", result: "pass", evidence_ref: ["metrics:FID","metrics:KID"], action_plan: "none"}
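The result field of the two contracts above can be derived from their expr strings. The sketch below is deliberately narrow: it assumes every expression is a conjunction (∧) of clauses of the form "name ≤ bound", where bound is either a numeric literal or a key of the contract's tol mapping. The evaluate_contract helper and the flat values dict are hypothetical, not the specification's assert_synth_contract.

```python
def evaluate_contract(contract, values):
    """Return 'pass' if every 'lhs ≤ rhs' clause of contract['expr'] holds."""
    for clause in contract["expr"].split("∧"):
        lhs, rhs = (part.strip() for part in clause.split("≤"))
        # The right-hand side is a named tolerance if tol has it, else a literal.
        tol = contract.get("tol", {})
        bound = tol[rhs] if rhs in tol else float(rhs)
        if not values[lhs] <= bound:
            return "fail"
    return "pass"


# Observed values taken from the Sample B snippet.
values = {"FID": 11.8, "KID": 0.013, "delta_form": 9.0e-10}

contracts = [
    {"id": "C40-022", "expr": "delta_form ≤ tol_Tarr", "tol": {"tol_Tarr": 1.0e-9}},
    {"id": "C40-121", "expr": "FID ≤ 15 ∧ KID ≤ 0.02", "tol": {"FID": 15.0, "KID": 0.02}},
]
results = {c["id"]: evaluate_contract(c, values) for c in contracts}
print(results)
```

Operators other than ≤ and ∧ (for example the ≥ used for utility_gap_auc in Sample A) would need additional handling that this sketch omits.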
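VIII. Incremental Manifest for Streaming Operation (windowed)
- A windowed summary for online generation; one incremental block is written per runtime.window.
- Snippet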
runtime:
  window: "5m"
  rho: 0.81
  latency_ms_p99: 510
  drop_rate: 0.004
metrics_windowed:
  - {name: "W1_cur", value: 0.041, u: 0.008, unit: "feature_unit", window: "2025-09-01T12:00Z/12:05Z"}
  - {name: "psi_cur", value: 0.07, u: 0.01, unit: "-", window: "2025-09-01T12:00Z/12:05Z"}
contracts_windowed:
  - {id: "C40-184", expr: "W1_cur ≤ 0.06 ∧ psi_cur ≤ 0.1", tol: {W1_cur: 0.06, psi_cur: 0.1}, severity: "warn", result: "pass", evidence_ref: ["metrics_windowed:W1_cur","metrics_windowed:psi_cur"], action_plan: "none"}
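The text does not define psi_cur. If it is read as the Population Stability Index of the current window against a reference histogram (an assumption), a per-window value could be computed as sketched below. The function name, the eps floor, and the bin counts are hypothetical.

```python
import math


def psi(reference_counts, current_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    if len(reference_counts) != len(current_counts):
        raise ValueError("both histograms must use the same bins")
    ref_total = sum(reference_counts)
    cur_total = sum(current_counts)
    score = 0.0
    for ref, cur in zip(reference_counts, current_counts):
        # Floor the proportions so empty bins do not produce log(0).
        p = max(ref / ref_total, eps)
        q = max(cur / cur_total, eps)
        score += (q - p) * math.log(q / p)
    return score


# Made-up bin counts: a reference window and the current 5-minute window.
reference = [250, 250, 250, 250]
current = [200, 250, 250, 300]

psi_cur = psi(reference, current)
print(f"psi_cur={psi_cur:.4f}, within tolerance: {psi_cur <= 0.1}")
```

The 0.1 bound used in the print statement is the psi_cur tolerance from contract C40-184 in the snippet above.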
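IX. Verification, Signing, and Release Workflow Essentials
- Run before freezing the manifest
  - assert_synth_contract(ds_syn, rules) -> report; populate every contracts.* field and emit result.
  - Traceability and signing: provenance.blob_hash = hash_sha256(blob); sign.signature is produced with sign.method.
  - Dimension and formula checks: check_dim( y - f(x) ) = pass; delta_form ≤ tol_Tarr.
- Release admission rule
  pass = ( ∧_i contracts[i].result = pass ) ∧ ( metrics valid ) ∧ ( privacy budget sufficient ) ∧ ( signing complete ).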
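The admission rule and the blob hash can be sketched in a few lines. The code below is an illustration under stated assumptions: "metrics valid" is interpreted as a finite numeric value with a non-negative u, "signing complete" as a non-empty sign.signature, and the "sha256:" prefix follows the placeholder format used in the template. release_gate is a hypothetical helper; signature generation itself (e.g. ed25519) is not shown because it needs a library outside the standard library.

```python
import hashlib
import math


def hash_sha256(blob: bytes) -> str:
    """Hex SHA-256 digest of the blob, prefixed as in the template."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def release_gate(manifest, eps_budget, delta_budget):
    """Evaluate the four conjuncts of the release admission rule."""
    contracts_ok = all(c["result"] == "pass" for c in manifest["contracts"])
    # Assumed meaning of 'metrics valid': finite value and non-negative u.
    metrics_ok = all(
        isinstance(m.get("value"), (int, float))
        and math.isfinite(m["value"])
        and m.get("u", -1) >= 0
        for m in manifest["metrics"]
    )
    dp = manifest["privacy"]["dp"]
    privacy_ok = dp["eps_total"] <= eps_budget and dp["delta_total"] <= delta_budget
    signed = bool(manifest.get("sign", {}).get("signature"))
    return contracts_ok and metrics_ok and privacy_ok and signed


manifest = {
    "metrics": [{"name": "W1", "value": 0.028, "u": 0.005}],
    "privacy": {"dp": {"eps_total": 1.5, "delta_total": 1.0e-6}},
    "contracts": [{"id": "C40-121", "result": "pass"},
                  {"id": "C40-141", "result": "pass"}],
    "provenance": {"blob_hash": hash_sha256(b"example blob bytes")},
    "sign": {"signature": "base64:..."},
}
print(manifest["provenance"]["blob_hash"])
print("release admitted:", release_gate(manifest, eps_budget=2.0, delta_budget=1.0e-5))
```

The budgets passed in the usage lines are the eps_budget and delta_budget values from Sample A's contract C40-141.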
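X. Cross-References
- Cleaning and contract persistence: see Methods.Cleaning v1.0, Chapter 10 and Appendix C.
- Imaging metrics and geometric consistency: see Methods.Imaging v1.0, Chapters 9 and 14 and Appendices D/E.
- Statistical thresholds, intervals, and drift: see Methods.CrossStats v1.0, Appendices C/D/E.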
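Copyright and License (CC BY 4.0)
Copyright notice: Unless otherwise stated, the copyright in Energy Filament Theory (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Suggested attribution: Author: "屠广林"; Work: Energy Filament Theory (《能量丝理论》); Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/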