Contents > Documents: Technical White Paper (V5.05) > 45-EFT.WP.Data.Pipeline v1.0

Chapter 4: Data Sources and Ingestion


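I. Purpose and Scope

This chapter fixes the specification and engineering practice for the **data source and ingestion (ingest)** layer: connector types; credentials and security; idempotency, retries, and checkpoint-based resume; deduplication and dedupe keys; throughput and latency metering; alignment with the data contracts (Σ_in/Σ_out); exception handling; and audit export. It keeps this layer consistent with the dataset cards, model cards, the metrology chapter, and the reference anchors.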

II. Terms and Dependencies


III. Fields and Structure (Normative)

stage:
  name: "<src.kind.name>"
  type: "source.<s3|gcs|fs|db|kafka|http|custom>"
  impl: "I16-1.<impl_id>"
  params:
    endpoint: "<url-or-bootstrap>"
    bucket_or_db: "<bucket|db>"
    prefix_or_table: "<prefix|schema.table>"
    query_or_pattern: "<sql|glob>"
    credentials_ref: "secrets://path/to/credential"
    format: "<json|parquet|csv|avro|binary>"
    watermark:
      field: "<updated_at|offset|lsn>"
      start: "<ISO8601|offset>"
      step: "<PT5M|1000>"
    checkpoint:
      path: "s3://.../chk/<stage>"
      mode: "exactly-once|at-least-once"
    dedupe_key: ["<pk>", "<ts>"]
  outputs: ["raw_blob|raw_rows|events"]
  idempotent: true
  retries: {max: 3, backoff: "expo", jitter_ms: 200}
  timeout_s: 1800
  on_fail: "quarantine|skip|block"
  schema_ref: "<contracts/raw@vX.Y>"
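
The `retries` field above (`max`, `backoff: "expo"`, `jitter_ms`) can be read as exponential backoff with random jitter. The following is a minimal Python sketch of that reading, not the I16-1 implementation: the helper names `backoff_delays` and `run_with_retries`, the 1-second base delay (`base_s`), and the choice to retry on any exception are all assumptions, since the schema fixes only the three fields.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, jitter_ms: int, base_s: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delay in seconds before each retry: base_s * 2**attempt plus jitter.

    base_s is an assumption; the schema does not define a base delay.
    Jitter is drawn uniformly from [0, jitter_ms] milliseconds.
    """
    rng = rng or random.Random()
    return [base_s * (2 ** attempt) + rng.uniform(0, jitter_ms) / 1000.0
            for attempt in range(max_retries)]


def run_with_retries(fn: Callable[[], T], retries: dict,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn once, then retry up to retries['max'] more times on failure."""
    if retries.get("backoff") != "expo":
        raise ValueError("only backoff='expo' is sketched here")
    delays = backoff_delays(retries["max"], retries.get("jitter_ms", 0))
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)  # wait, then fall through to the next attempt
    return fn()  # final attempt; its exception propagates to the caller
```

With `{max: 3, backoff: "expo", jitter_ms: 200}` this makes at most four calls, waiting roughly 1 s, 2 s, and 4 s between them. Retrying is only safe when the stage really is idempotent, which is what the `idempotent: true` flag declares.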


IV. Connector Types and Specifications


V. Idempotency, Retries, and Checkpoint Resume


VI. Deduplication and Ordering Guarantees


VII. Metrology and Units (SI)


VIII. Security, Credentials, and Compliance


IX. Machine-Readable Snippet (Directly Embeddable)

layers:
  - name: "ingest"
    stages:
      - name: "src.s3.pull"
        type: "source.s3"
        impl: "I16-1.s3_pull"
        params:
          endpoint: "https://s3.amazonaws.com"
          bucket_or_db: "eift-data"
          prefix_or_table: "raw/2025/09/"
          query_or_pattern: "*.jsonl"
          credentials_ref: "secrets://aws/ingest_ro"
          format: "json"
          watermark: {field: "updated_at", start: "2025-09-01T00:00:00Z", step: "PT5M"}
          checkpoint: {path: "s3://eift-meta/chk/src.s3.pull", mode: "at-least-once"}
          dedupe_key: ["id", "updated_at"]
        outputs: ["raw_blob"]
        idempotent: true
        retries: {max: 3, backoff: "expo", jitter_ms: 200}
        timeout_s: 1800
        on_fail: "quarantine"
        schema_ref: "contracts/raw_json@v1.2"
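
Because the snippet above uses `mode: "at-least-once"`, a record may be delivered more than once after a retry or a checkpoint replay, and `dedupe_key: ["id", "updated_at"]` is what lets a consumer drop the repeats. The following is a minimal Python sketch of that idea. The function name `dedupe` and the in-memory `seen` set are assumptions for illustration; a real stage would presumably persist seen keys alongside the checkpoint.

```python
from typing import Any, Dict, Iterable, Iterator, List, Set, Tuple


def dedupe(records: Iterable[Dict[str, Any]],
           key_fields: List[str]) -> Iterator[Dict[str, Any]]:
    """Yield each record only the first time its composite key is seen."""
    seen: Set[Tuple[Any, ...]] = set()  # assumption: keys fit in memory
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key in seen:
            continue  # replayed delivery of a record already emitted
        seen.add(key)
        yield rec


# Usage: the second record is a replay of the first; the third is a newer
# version of the same id and therefore has a different composite key.
records = [
    {"id": 1, "updated_at": "2025-09-01T00:00:00Z", "v": "a"},
    {"id": 1, "updated_at": "2025-09-01T00:00:00Z", "v": "a"},
    {"id": 1, "updated_at": "2025-09-01T00:05:00Z", "v": "b"},
]
unique = list(dedupe(records, ["id", "updated_at"]))
```

Including the timestamp in the key keeps successive versions of one `id` while still removing exact replays. Keeping only the latest version per `id` would be a different policy and would need a different key.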


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SRC.TYPE_ALLOWED
    when: "$.layers[*].stages[*].type"
    assert: "value in ['source.s3','source.gcs','source.fs','source.db','source.kafka','source.http','source.custom']"
    level: error
  - id: SRC.CREDENTIALS_REF
    when: "$.layers[*].stages[?(@.type^='source.')].params"
    assert: "has_key('credentials_ref') and not has_key('plain_secret')"
    level: error
  - id: SRC.CHECKPOINT_DEFINED
    when: "$.layers[*].stages[?(@.type^='source.')].params"
    assert: "has_key('checkpoint') and has_key('watermark')"
    level: error
  - id: SRC.DEDUPE_OR_EXACTLY_ONCE
    when: "$.layers[*].stages[?(@.type^='source.')]"
    assert: "has_key('params.dedupe_key') or $.params.checkpoint.mode == 'exactly-once'"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units=='SI' and check_dim==true"
    level: error
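
To make the intent of the four `SRC.*` rules concrete, here is a minimal Python sketch that applies them to an already-parsed pipeline dictionary. It is not the project's lint engine: the `when`/`assert` expression language is not defined in this chapter, so the sketch hard-codes each rule instead of evaluating the expressions. The function name `lint_sources` is hypothetical, and restricting `SRC.TYPE_ALLOWED` to stages whose type starts with `source.` is a simplifying assumption.

```python
from typing import Dict, List, Tuple

ALLOWED_TYPES = {"source.s3", "source.gcs", "source.fs", "source.db",
                 "source.kafka", "source.http", "source.custom"}


def lint_sources(pipeline: Dict) -> List[Tuple[str, str]]:
    """Return (rule_id, stage_name) for every violated SRC.* rule."""
    errors: List[Tuple[str, str]] = []
    for layer in pipeline.get("layers", []):
        for stage in layer.get("stages", []):
            stype = stage.get("type", "")
            if not stype.startswith("source."):
                continue  # assumption: only source stages are linted here
            name = stage.get("name", "<unnamed>")
            params = stage.get("params", {})
            if stype not in ALLOWED_TYPES:
                errors.append(("SRC.TYPE_ALLOWED", name))
            # Credentials must be referenced, never embedded in plain text.
            if "credentials_ref" not in params or "plain_secret" in params:
                errors.append(("SRC.CREDENTIALS_REF", name))
            if "checkpoint" not in params or "watermark" not in params:
                errors.append(("SRC.CHECKPOINT_DEFINED", name))
            # Either a dedupe key or exactly-once delivery is required.
            mode = params.get("checkpoint", {}).get("mode")
            if "dedupe_key" not in params and mode != "exactly-once":
                errors.append(("SRC.DEDUPE_OR_EXACTLY_ONCE", name))
    return errors
```

An empty result means the source stages pass all four checks. Since every rule is `level: error`, a caller would presumably fail the build on any non-empty result.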


XI. Export Manifest and Audit Trail

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "ingest/pulled.manifest.json", sha256: "..."}
    - {path: "ingest/checkpoint.meta.json", sha256: "..."}
    - {path: "security/audit.log", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
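
Each artifact in the manifest above carries a `sha256` digest so that an auditor can confirm the exported files are unchanged. The following is a minimal Python sketch of filling in those digests with the standard `hashlib` module. The function names `sha256_file` and `build_manifest`, and the assumption that artifact paths are relative to a single export root directory, are illustrative only.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, rel_paths: List[str],
                   version: str = "v1.0") -> Dict:
    """Build the export_manifest mapping with one digest per artifact."""
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [
                {"path": p, "sha256": sha256_file(root / p)}
                for p in rel_paths
            ],
        }
    }
```

Verification is the same computation in reverse: recompute each digest from the file on disk and compare it with the value recorded in the manifest.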


XII. Chapter Compliance Self-Check


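Copyright and license: Unless otherwise stated, the copyright in "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, belongs to the author, Guanglin Tu (屠广林).
License (CC BY 4.0): Copying, reposting, excerpting, adaptation, and redistribution are permitted provided the author and source are credited.
Suggested attribution: Author: Guanglin Tu (屠广林) | Work: "Energy Filament Theory" | Source: energyfilament.org | License: CC BY 4.0
Call for verification: The author works independently and is self-funded, with no employer and no sponsor. The next phase will give priority, regardless of country, to environments that are most willing to discuss, reproduce, and find faults in the work openly. Media and peers worldwide are welcome to use this window to organize verification and to contact us.
Version information: First published: 2025-11-11 | Current version: v6.0+5.05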