Technical White Paper (V5.05) 44: EFT.WP.Data.ModelCards v1.0

Chapter 9 Preprocessing and Feature Engineering


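I. Purpose and Scope

This chapter fixes the normative definitions, parameter-locking rules, and environment-reproducibility requirements for preprocess and feature engineering in the model card. It covers training/inference consistency, data cleaning and standardization, feature construction and selection, leakage prevention, and metrology checks, and it must remain consistent with the chapters "Tasks and I/O", "Training Data and Sampling Binding", "Evaluation Protocol and Metrics", and the metrology chapter.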

II. Terms and Dependencies


III. Fields and Structure (Normative)

preprocess:
  pipeline_id: "<string>"          # semantic pipeline identifier
  steps:                           # ordered, idempotent steps
    - name: "<clean|filter|normalize|standardize|resample|impute|encode|tokenize|stft|specaugment|feature_map|pca|custom>"
      enabled: true
      idempotent: true
      params: { ... }              # listed explicitly, including units and dimensional conventions
      inputs: ["<field>"]
      outputs: ["<field>"]
      notes: "<non-normative>"
  feature_space:                   # feature-space declaration (identical for training and inference)
    type: "<dense|sparse|sequence|image|audio_spec|tabular|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"
    dictionary?: "<path-or-ref>"   # reference to the tokenizer, subword, or category dictionary
  parameter_lock: true             # freeze parameters (including statistics) before release
  randomness:
    seed: 1701
    libraries: {numpy: "1.26.4", torch: "2.3.1"}
  environment:
    os: "ubuntu22.04"
    toolchain: ["python3.11", "fftw3"]
    containers: ["ghcr.io/eift/model-prep:1.0.2"]
  audits: ["nan-check", "range-check", "leakage", "class-imbalance", "drift"]
  artifacts:
    - {path: "preprocess/logs/step-01.jsonl", sha256: "..."}
    - {path: "preprocess/configs/lock.yaml", sha256: "..."}
  see:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
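
The field template above can be checked mechanically once the block has been loaded into a dictionary. The following is a minimal sketch, not a reference validator: the function name validate_preprocess, the choice of required step keys (notes is treated as optional because it is non-normative), and the rule that at least one step must be present are assumptions made for illustration.

```python
# Hypothetical validator for a loaded `preprocess` block (illustrative only).
# The allowed step names are copied from the template; everything else
# (function name, which keys are required) is an assumption of this sketch.
ALLOWED_STEPS = {
    "clean", "filter", "normalize", "standardize", "resample", "impute",
    "encode", "tokenize", "stft", "specaugment", "feature_map", "pca", "custom",
}
REQUIRED_STEP_KEYS = ("name", "enabled", "idempotent", "params", "inputs", "outputs")


def validate_preprocess(block):
    """Return a list of readable problems; an empty list means no findings."""
    problems = []
    pipeline_id = block.get("pipeline_id")
    if not isinstance(pipeline_id, str) or not pipeline_id:
        problems.append("pipeline_id must be a non-empty string")

    steps = block.get("steps") or []
    if not steps:
        problems.append("steps must list at least one step")
    for i, step in enumerate(steps):
        for key in REQUIRED_STEP_KEYS:
            if key not in step:
                problems.append(f"steps[{i}] is missing '{key}'")
        if step.get("name") not in ALLOWED_STEPS:
            problems.append(f"steps[{i}] has unknown name {step.get('name')!r}")
        if step.get("idempotent") is not True:
            problems.append(f"steps[{i}] must declare idempotent: true")

    if block.get("parameter_lock") is not True:
        problems.append("parameter_lock must be true before release")
    return problems
```

Returning a list of findings rather than raising on the first error lets an audit log record every deviation in one pass.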


IV. Training/Inference Consistency and Leakage Prevention


V. Normative Conventions for Common Operations


VI. Feature Space and I/O Alignment


VII. Metrology and Units

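  1. All parameters that carry physical, time, or frequency quantities state their units in params and are verified by Metrology v1.0 with check_dim=true;
  2. If a feature or target involves a path quantity (such as T_arr), register delta_form, path="gamma(ell)", and measure="d ell", and run the consistency check using one of the two equivalent expressions:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ).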
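
The consistency check in item 2 can be illustrated numerically. The sketch below assumes that c_ref is constant along the path (which is what makes the two expressions equal), that n_eff is sampled at discrete path positions ell, and that a composite trapezoid rule is an acceptable quadrature. The function names and the tolerance are hypothetical and not part of the specification.

```python
import math


def trapezoid(values, ell):
    """Composite trapezoid rule for samples values[i] taken at positions ell[i]."""
    total = 0.0
    for i in range(1, len(ell)):
        total += 0.5 * (values[i] + values[i - 1]) * (ell[i] - ell[i - 1])
    return total


def t_arr_factored(n_eff, ell, c_ref):
    """First form: T_arr = (1 / c_ref) * integral of n_eff d ell."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_inside(n_eff, ell, c_ref):
    """Second form: T_arr = integral of (n_eff / c_ref) d ell."""
    return trapezoid([n / c_ref for n in n_eff], ell)


def forms_agree(n_eff, ell, c_ref, rel_tol=1e-9):
    """True when both forms match within a relative tolerance (assumed value)."""
    a = t_arr_factored(n_eff, ell, c_ref)
    b = t_arr_inside(n_eff, ell, c_ref)
    return math.isclose(a, b, rel_tol=rel_tol)
```

A disagreement beyond floating-point tolerance would suggest that c_ref was not treated as a constant somewhere in the pipeline, or that the two code paths sample the path differently.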

VIII. Machine-Readable Snippet (Ready to Embed)

preprocess:
  pipeline_id: "img-prep-v1"
  steps:
    - name: "clean"
      enabled: true
      idempotent: true
      params: {policy: "drop-out-of-range", lo: 0, hi: 255}
      inputs: ["raw_image"]
      outputs: ["cln_image"]
    - name: "standardize"
      enabled: true
      idempotent: true
      params: {type: "zscore", mean: [0.485, 0.456, 0.406], std: [0.229, 0.224, 0.225], stats_from: "train-only"}
      inputs: ["cln_image"]
      outputs: ["std_image"]
    - name: "feature_map"
      enabled: true
      idempotent: true
      params: {type: "hog", cell: 8, block: 2, bin: 9}
      inputs: ["std_image"]
      outputs: ["feat_hog"]
  feature_space:
    type: "dense"
    shape: "(H', W', C')"
    dtype: "float32"
    normalization: "zscore"
  parameter_lock: true
  randomness: {seed: 1701, libraries: {numpy: "1.26.4"}}
  environment: {os: "ubuntu22.04", containers: ["ghcr.io/eift/model-prep:1.0.2"]}
  audits: ["nan-check", "range-check", "leakage", "drift"]
  artifacts:
    - {path: "preprocess/configs/lock.yaml", sha256: "..."}
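
Because each step declares its inputs and outputs, the ordering of a pipeline such as the one above can be sanity-checked before it runs. The sketch below is an illustrative assumption rather than a rule stated in this chapter: it walks the enabled steps in order and reports any step that reads a field which neither the raw inputs nor an earlier enabled step has produced. The function name check_step_chain and the raw_fields argument are hypothetical.

```python
def check_step_chain(steps, raw_fields):
    """Report enabled steps whose inputs are not available when they run.

    steps      -- list of step dicts with 'name', 'enabled', 'inputs', 'outputs'
    raw_fields -- field names available before the first step (assumed given)
    """
    available = set(raw_fields)
    problems = []
    for i, step in enumerate(steps):
        if not step.get("enabled", True):
            continue  # a disabled step neither reads nor produces fields
        for field in step.get("inputs", []):
            if field not in available:
                problems.append(
                    f"step {i} ({step.get('name')}) reads undefined field '{field}'"
                )
        available.update(step.get("outputs", []))
    return problems
```

For the snippet above, calling check_step_chain with raw_fields=["raw_image"] walks raw_image, cln_image, std_image, and feat_hog in order. Disabling the standardize step would leave feature_map reading a field that no enabled step produces.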


IX. Consistency with the Evaluation Protocol, Optimization, and Hyperparameters


X. Export Manifest and Audit Trail

export_manifest:
  artifacts:
    - {path: "preprocess/logs/step-*.jsonl", sha256: "..."}
    - {path: "preprocess/configs/lock.yaml", sha256: "..."}
    - {path: "features/spec.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"

All preprocessing- and feature-related artifacts must be verifiable and cross-checked against the model card fields.
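
One way to make artifacts verifiable is to recompute each file's SHA-256 digest and compare it with the value recorded in the manifest. The sketch below uses only the Python standard library. The function names and the root argument are hypothetical, and entries whose path contains a glob pattern (such as step-*.jsonl) are not expanded here, since the manifest does not say how a single digest maps onto several files; such an entry is simply reported as missing.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(artifacts, root="."):
    """Return (path, reason) pairs for manifest entries that fail verification.

    artifacts -- list of {"path": ..., "sha256": ...} dicts from the manifest
    root      -- directory the manifest paths are relative to (assumed)
    """
    failures = []
    for entry in artifacts:
        target = Path(root) / entry["path"]
        if not target.is_file():
            failures.append((entry["path"], "missing"))
        elif sha256_of(target) != entry["sha256"].lower():
            failures.append((entry["path"], "sha256 mismatch"))
    return failures
```

Reading in fixed-size chunks keeps memory use flat for large log files, and an empty result means every listed file exists and matches its recorded digest.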

XI. Chapter Compliance Self-Check


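Copyright and license: Unless otherwise noted, the copyright in "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author (屠广林).
License (CC BY 4.0): Copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Attribution format (suggested): Author: 屠广林 | Work: "Energy Filament Theory" | Source: energyfilament.org | License: CC BY 4.0
Call for verification: The author works independently and is self-funded, with no employer and no sponsorship. In the next phase, priority will go to environments most willing to discuss openly, reproduce openly, and find faults openly, regardless of country. Media and peers in all countries are welcome to use this window to organize verification and to contact us.
Version information: First published: 2025-11-11 | Current version: v6.0+5.05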