Chapter 5 Training Data and Lineage


I. Purpose & Scope


II. Inputs & Dependencies


III. Training Data Sources & Licenses


IV. Schema & Splits Alignment


V. Sampling & Cleaning


VI. Lineage & Traceability


VII. Normative Path Forms

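The main text must write gamma(ell) and d ell explicitly; the data side records delta_form; the path and phase conventions used in training and evaluation must match those in the dataset card.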


VIII. Quality Gate Mapping


IX. Machine-Readable Artifacts
A. data_refs.yaml

version: "1.0.0"
datasets:
  - id: "ds-core"
    see:
      - "Dataset Card v1.0:Ch.3"
      - "Dataset Card v1.0:Ch.4"
      - "Dataset Card v1.0:Ch.6"
    manifest: "DS_EXPORT/manifests/report_manifest.yaml"
    splits: "DS_EXPORT/splits/split_manifest.json"
    license: "CC-BY-4.0"
    checksum: "sha256:..."
sampling:
  seed: 20250924
  strategy: { stratified: ["device","region","quality.flags"] }
preprocess_spec: "configs/preprocess_spec.yaml"
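
The sampling block above fixes a seed and names three stratification keys. As a minimal sketch of what a reproducible stratified draw could look like, the Python below groups records by those keys and samples the same fraction from each group. The function name stratified_sample, the fraction parameter, and the flat record layout (with "quality.flags" stored as a single hashable value under a dotted key) are assumptions for illustration and are not defined by data_refs.yaml.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, fraction, seed):
    """Draw roughly the same fraction from every stratum, reproducibly.

    records  : list of flat dicts (hypothetical layout)
    keys     : field names that define a stratum
    fraction : share of each stratum to keep (assumed parameter)
    seed     : integer seed, e.g. sampling.seed from data_refs.yaml
    """
    rng = random.Random(seed)  # local RNG so global state is untouched
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    sample = []
    # Iterate strata in sorted order so the draw does not depend on input order
    # of first appearance of each stratum.
    for stratum_key in sorted(strata):
        group = strata[stratum_key]
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    demo = [
        {"device": d, "region": "eu", "quality.flags": "ok", "i": i}
        for d in ("a", "b")
        for i in range(4)
    ]
    picked = stratified_sample(
        demo, ["device", "region", "quality.flags"], 0.5, 20250924
    )
    print(len(picked))
```

Using a local random.Random instance rather than the module-level functions keeps the draw reproducible even when other code consumes random numbers in the same process.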


B. preprocess_spec.yaml

version: "1.0.0"
missing: { numeric: "null", route_to: "quality.flags" }
normalize: { mean: "μ_train", std: "σ_train" }
path_align: { require: true, delta_form: "general", enforce_delta_ell: true }
filters:
  - name: "window_guard"
    rule: "drop if ts ∉ [ts_start, ts_end]"
audits: { write_to: "reports/audit.jsonl" }
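
One possible reading of the spec above, sketched in Python: window_guard drops rows whose ts falls outside [ts_start, ts_end], missing numeric values stay null and are noted under quality.flags, and normalization uses statistics fitted on the training split only. All function names, the row layout, the "missing:<field>" flag text, and the choice of population standard deviation are assumptions made for this sketch; the spec itself does not fix them.

```python
from statistics import mean, pstdev


def window_guard(rows, ts_start, ts_end):
    """Keep only rows with ts inside the closed interval [ts_start, ts_end]."""
    return [r for r in rows if ts_start <= r["ts"] <= ts_end]


def route_missing(rows, field):
    """Leave missing values as None and record a flag (hypothetical flag text)."""
    out = []
    for r in rows:
        r = dict(r)  # copy so the caller's rows are not mutated
        if r.get(field) is None:
            flags = list(r.get("quality.flags", []))
            flags.append(f"missing:{field}")
            r["quality.flags"] = flags
        out.append(r)
    return out


def fit_stats(train_rows, field):
    """Compute mean and population std on the training split only."""
    values = [r[field] for r in train_rows if r.get(field) is not None]
    return mean(values), pstdev(values)


def normalize(rows, field, mu, sigma):
    """Apply (x - mu) / sigma; None stays None, sigma == 0 maps to 0.0."""
    out = []
    for r in rows:
        r = dict(r)
        if r.get(field) is not None:
            r[field] = (r[field] - mu) / sigma if sigma else 0.0
        out.append(r)
    return out


if __name__ == "__main__":
    train = [{"ts": 2, "x": 1}, {"ts": 5, "x": 3}, {"ts": 11, "x": 9}]
    train = window_guard(train, ts_start=1, ts_end=10)  # drops ts=11
    mu, sigma = fit_stats(train, "x")
    print(normalize(train, "x", mu, sigma))
```

Fitting mu and sigma after the window filter, and only on training rows, keeps evaluation data from leaking into the normalization constants.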


C. lineage_graph.json (excerpt)

{
  "nodes": [
    { "id": "RAW-telemetry", "version": "1.0.0", "checksum": "sha256:..." },
    { "id": "CAL-telemetry", "version": "1.0.1", "checksum": "sha256:..." },
    { "id": "DER-train", "version": "1.0.0", "checksum": "sha256:..." }
  ],
  "edges": [
    { "from": "RAW-telemetry", "to": "CAL-telemetry", "type": "calibrate" },
    { "from": "CAL-telemetry", "to": "DER-train", "type": "derive" }
  ]
}
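
A graph in this shape can be used to answer traceability questions such as "which artifacts does DER-train depend on?". The Python sketch below walks the edges backwards from a node and also checks that every edge refers to a declared node. The helper names upstream and dangling_edges are hypothetical, and the inline graph repeats only the ids and edges from the excerpt above.

```python
def upstream(graph, node_id):
    """Return all ancestors of node_id, nearest first, by following edges backwards."""
    parents = {}
    for edge in graph["edges"]:
        parents.setdefault(edge["to"], []).append(edge["from"])

    seen, order, stack = set(), [], [node_id]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # guard against revisiting shared ancestors
                seen.add(parent)
                order.append(parent)
                stack.append(parent)
    return order


def dangling_edges(graph):
    """List edges whose endpoints are not declared under "nodes"."""
    known = {node["id"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["from"] not in known or edge["to"] not in known
    ]


if __name__ == "__main__":
    graph = {
        "nodes": [
            {"id": "RAW-telemetry"},
            {"id": "CAL-telemetry"},
            {"id": "DER-train"},
        ],
        "edges": [
            {"from": "RAW-telemetry", "to": "CAL-telemetry", "type": "calibrate"},
            {"from": "CAL-telemetry", "to": "DER-train", "type": "derive"},
        ],
    }
    print(upstream(graph, "DER-train"))
    print(dangling_edges(graph))
```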

X. Anti-Patterns & Fixes


XI. Cross-References


XII. Execution Checklist