Technical White Paper 45: EFT.WP.Data.Pipeline v1.0

Chapter 8 Feature Pipeline and Reuse


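I. Purpose and Scope

This chapter fixes the specification for the **feature pipeline**: feature extraction, aggregation, and alignment; dictionary and embedding management; materialization and caching; cross-task and multimodal reuse; and versioning and dependency mapping. It keeps the pipeline consistent with the data contracts, the model-card feature space and task I/O, the metrology chapter, and the reference anchors.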

II. Terms and Dependencies


III. Fields and Structure (Normative)

stage:
  name: "<feat.map|feat.aggregate|feat.join|feat.encode|feat.embed|feat.materialize>"
  type: "feature.<op>"
  impl: "I16-4.<impl_id>"
  inputs: ["<Σ_in>"]
  outputs: ["<Σ_out>"]
  params:
    key: ["<entity_id>", "<ts?>"]        # entity key / time key
    point_in_time:
      enabled: true
      lookback: "P7D|P30D|N/A"           # lookback window (ISO 8601 duration)
      tolerance: "PT5M"                  # alignment tolerance
    dict_ref: "dicts/<name>@vX.Y"        # dictionary / subword / category mapping
    embed:
      store: "faiss|annoy|milvus|custom"
      dim: 768
      metric: "cosine|l2"
      index_ref: "embeddings/<name>@vX.Y"
    aggregate:
      window: "PT1H|P1D"
      funcs: ["mean", "max", "count", "std"]
      fillna: {"method": "pad|zero|drop"}
    join:
      on: ["<entity_id>", "<ts?>"]
      how: "left|inner|asof"
    materialize:
      mode: "none|cache|persist"
      cache: {ttl: "P7D", max_gb: 128}
  idempotent: true
  schema_ref: "contracts/feat_<name>@vX.Y"
  feature_space:
    type: "<tabular|sequence|image|audio_spec|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"
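
The point_in_time parameters above can be read as a lookup rule: for each query time, take the latest feature row at or before that time, and discard it if it is older than the lookback window. The following Python sketch illustrates that reading only. The names parse_duration and pit_lookup are hypothetical, the duration parser covers just the day/hour/minute/second subset of ISO 8601, and tolerance is deliberately not modelled because its exact semantics are not defined in this section.

```python
import re
from bisect import bisect_right
from datetime import datetime, timedelta

# Assumed subset of ISO 8601 durations: PnDTnHnMnS (no years, months, weeks).
_DURATION = re.compile(
    r"^P(?:(?P<d>\d+)D)?(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?)?$"
)


def parse_duration(text):
    """Parse strings such as 'P7D', 'PT5M' or 'P1DT12H' into a timedelta."""
    match = _DURATION.match(text)
    # Reject empty designators such as 'P', 'PT' or 'P1DT'.
    if not match or text in ("P", "PT") or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(
        days=parts.get("d", 0),
        hours=parts.get("h", 0),
        minutes=parts.get("m", 0),
        seconds=parts.get("s", 0),
    )


def pit_lookup(rows, query_ts, lookback):
    """Point-in-time lookup over rows = [(ts, value), ...] sorted by ts.

    Returns the value of the latest row with ts <= query_ts, or None when
    no such row exists or the row is older than the lookback window.
    """
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, query_ts) - 1  # latest row at or before query
    if idx < 0:
        return None
    ts, value = rows[idx]
    if query_ts - ts > parse_duration(lookback):
        return None  # too stale for the configured window
    return value


rows = [(datetime(2025, 1, 1), 1.0), (datetime(2025, 1, 10), 2.0)]
```

Because only rows at or before the query time are eligible, a value recorded later can never leak into an earlier query, which is the property point_in_time is meant to protect.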


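IV. Feature Operators and Conventions


V. Reuse and Dependency Mapping


VI. Consistency and Temporal Alignment (PIT)


VII. Dictionary and Embedding Management


VIII. Metrology and Units (SI)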

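  1. Performance: QPS (1/s), T_inf (ms, reported at {p50, p95, p99}), ρ (dimensionless); bandwidth net_mbps; storage/index volume size_bytes.
  2. metrology: {units:"SI", check_dim:true} is mandatory; normalize units before any composition or aggregation.
  3. Features that involve path quantities (such as T_arr) must register delta_form, path="gamma(ell)", and measure="d ell", and must use one of the following equivalent forms and pass check_dim:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell )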
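
For a constant c_ref, the two forms differ only in whether the factor 1/c_ref is applied before or after integration, so they should agree to rounding error. The Python sketch below checks this numerically with a trapezoidal rule on a sampled path. The sample n_eff profile, the value chosen for c_ref, and the helper names are illustrative assumptions, not part of the specification.

```python
def trapezoid(ys, xs):
    """Trapezoidal integral of samples ys over abscissae xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def t_arr_factored(n_eff, ell, c_ref):
    """T_arr = (1 / c_ref) * integral of n_eff d ell."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_inline(n_eff, ell, c_ref):
    """T_arr = integral of (n_eff / c_ref) d ell."""
    return trapezoid([n / c_ref for n in n_eff], ell)


C_REF = 299_792_458.0                      # m/s, assumed value for this example
ell = [i * 10.0 for i in range(101)]       # path samples from 0 m to 1000 m
n_eff = [1.0 + 1e-4 * (x / 1000.0) for x in ell]  # hypothetical linear profile

a = t_arr_factored(n_eff, ell, C_REF)
b = t_arr_inline(n_eff, ell, C_REF)
```

Dimensionally, n_eff is unitless, d ell is in metres, and c_ref is in metres per second, so both forms yield seconds; this is what a check_dim pass would be expected to confirm.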

IX. Machine-Readable Fragment (Ready to Embed)

layers:
  - name: "feature"
    stages:
      - name: "feat.map.stats"
        type: "feature.map"
        impl: "I16-4.feature_map"
        inputs: ["std_rows"]
        outputs: ["feat_rows"]
        params:
          key: ["entity_id", "ts"]
          point_in_time: {enabled: true, lookback: "P30D", tolerance: "PT5M"}
          aggregate: {window: "P1D", funcs: ["mean", "std", "count"], fillna: {method: "pad"}}
        idempotent: true
        schema_ref: "contracts/feat_stats@v1.1"
        feature_space: {type: "tabular", shape: "(N,D)", dtype: "float32", normalization: "zscore"}
      - name: "feat.encode.cat"
        type: "feature.encode"
        impl: "I16-4.encode"
        inputs: ["feat_rows"]
        outputs: ["feat_enc"]
        params:
          dict_ref: "dicts/category_voc@v2.0"
          encode: {vocab_ref: "dicts/category_voc@v2.0", unk: "<UNK>", pad: "<PAD>"}
        idempotent: true
        schema_ref: "contracts/feat_enc@v1.0"
      - name: "feat.materialize"
        type: "feature.materialize"
        impl: "I16-4.materialize"
        inputs: ["feat_enc"]
        outputs: ["feat_pkg"]
        params:
          materialize: {mode: "cache", cache: {ttl: "P7D", max_gb: 256}}
        idempotent: true
        schema_ref: "contracts/feat_pkg@v1.0"
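
The fragment above chains its stages through inputs and outputs, so a simple dependency check is that every stage consumes only streams produced upstream. A minimal Python sketch of that check follows, assuming the fragment has already been loaded into plain dicts; the function name unresolved_inputs and the sources parameter (streams supplied from outside the layer, here std_rows) are hypothetical.

```python
def unresolved_inputs(stages, sources):
    """Return (stage_name, input) pairs whose input has no upstream producer.

    Stages are visited in declaration order; a stream becomes available
    once an earlier stage lists it under outputs.
    """
    available = set(sources)
    missing = []
    for stage in stages:
        for name in stage.get("inputs", []):
            if name not in available:
                missing.append((stage["name"], name))
        available.update(stage.get("outputs", []))
    return missing


stages = [
    {"name": "feat.map.stats", "inputs": ["std_rows"], "outputs": ["feat_rows"]},
    {"name": "feat.encode.cat", "inputs": ["feat_rows"], "outputs": ["feat_enc"]},
    {"name": "feat.materialize", "inputs": ["feat_enc"], "outputs": ["feat_pkg"]},
]
```

Calling unresolved_inputs(stages, ["std_rows"]) returns an empty list for this chain; dropping std_rows from sources reports the first stage as unresolved.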


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: FEAT.FS_REQUIRED
    when: "$.layers[*].stages[?(@.type^='feature.')]"
    assert: "has_key('feature_space')"
    level: error
  - id: FEAT.DICT_VERSIONED
    when: "$.layers[*].stages[?(@.type=='feature.encode')].params.dict_ref"
    assert: "matches('^dicts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: FEAT.PIT_PARAMS
    when: "$.layers[*].stages[*].params.point_in_time"
    assert: "value.enabled == true -> (has_key('lookback') and has_key('tolerance'))"
    level: error
  - id: FEAT.MATERIALIZE_POLICY
    when: "$.layers[*].stages[?(@.type=='feature.materialize')].params.materialize"
    assert: "value.mode in ['none','cache','persist']"
    level: error
  - id: FEAT.UNITS_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  - id: FEAT.LEAKAGE_GUARDS_FOR_TRAIN_EXPORT
    when: "$.layers[*].stages[*].outputs"
    assert: "produces_train_eval(outputs) -> has_leakage_guards()"
    level: error
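
Four of the rules above can be approximated without a JSONPath engine by walking the loaded document directly. The Python sketch below is one assumed reading of FEAT.FS_REQUIRED, FEAT.DICT_VERSIONED, FEAT.PIT_PARAMS, and FEAT.MATERIALIZE_POLICY; the lint function and its return shape are hypothetical, and the metrology and leakage-guard rules are left out because the helpers they reference are not defined in this excerpt.

```python
import re

# Same pattern as FEAT.DICT_VERSIONED, with the YAML string escapes resolved.
DICT_REF = re.compile(r"^dicts/[a-z0-9_\-]+@v\d+\.\d+$")
MODES = {"none", "cache", "persist"}


def lint(doc):
    """Return a list of (rule_id, stage_name) violations for a loaded pipeline."""
    found = []
    for layer in doc.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "?")
            stype = stage.get("type", "")
            params = stage.get("params", {})

            # FEAT.FS_REQUIRED: every feature.* stage declares feature_space.
            if stype.startswith("feature.") and "feature_space" not in stage:
                found.append(("FEAT.FS_REQUIRED", name))

            # FEAT.DICT_VERSIONED: dict_ref on encode stages carries @vX.Y.
            if stype == "feature.encode" and "dict_ref" in params:
                if not DICT_REF.match(str(params["dict_ref"])):
                    found.append(("FEAT.DICT_VERSIONED", name))

            # FEAT.PIT_PARAMS: enabled implies lookback and tolerance.
            pit = params.get("point_in_time")
            if pit and pit.get("enabled") is True:
                if not ("lookback" in pit and "tolerance" in pit):
                    found.append(("FEAT.PIT_PARAMS", name))

            # FEAT.MATERIALIZE_POLICY: mode is one of the allowed values.
            if stype == "feature.materialize" and "materialize" in params:
                if params["materialize"].get("mode") not in MODES:
                    found.append(("FEAT.MATERIALIZE_POLICY", name))
    return found
```

All four rules are declared at level error, so a caller would typically treat any non-empty result as a failed check.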


XI. Export Manifest and Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "features/feat_view.yaml", sha256: "..."}
    - {path: "features/dict_category_v2.hash", sha256: "..."}
    - {path: "features/feat_pkg.manifest.json", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.6"
    - "EFT.WP.Data.ModelCards v1.0:Ch.9"
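
The sha256 fields in the manifest above can be filled and later audited with the standard library alone. The Python sketch below assumes artifacts are files under a common root directory; the function names sha256_file, build_manifest, and verify_manifest are hypothetical, and the references list is omitted for brevity.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, rel_paths, version="v1.0"):
    """Build an export_manifest dict with one digest per artifact path."""
    artifacts = [
        {"path": rel, "sha256": sha256_file(os.path.join(root, rel))}
        for rel in rel_paths
    ]
    return {"export_manifest": {"version": version, "artifacts": artifacts}}


def verify_manifest(root, manifest):
    """Return the artifact paths whose current digest differs from the record."""
    return [
        entry["path"]
        for entry in manifest["export_manifest"]["artifacts"]
        if sha256_file(os.path.join(root, entry["path"])) != entry["sha256"]
    ]
```

An audit would call verify_manifest against the exported directory and treat any returned path as a tampered or stale artifact.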


XII. Chapter Compliance Self-Check


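Copyright and License (CC BY 4.0)

Copyright notice: Unless otherwise stated, the copyright in "Energy Filament Theory" (including text, charts, illustrations, symbols, and formulas) belongs to the author (Mr. "屠广林").
License: This work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). Provided the author and source are credited, copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes.
Suggested attribution: Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/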