Contents / Documents: Technical White Papers / 45-EFT.WP.Data.Pipeline v1.0
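I. Chapter Purpose and Scope
This chapter fixes the pipeline's specifications for monitoring, logging, and observability: metrics and dimensions, logs and traces, dashboards and alerts, SLA/SLO and error classification, operational health and capacity trends, and audit and export. It ensures consistency with the data contract, quality gate, orchestration, and metrology chapters.
II. Terms and Dependencies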
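- Terms: metrics, logs, traces, SLA/SLO, error_budget, alert_rules, blackbox/whitebox, p50/p95/p99, AIOps, RCA (root cause analysis).
- Dependencies: contracts and export (Core.DataSpec v1.0); unit/dimension checks (Core.Metrology v1.0); quality gates (DatasetCards v1.0); orchestration/scheduling/resources (Chapter 10 of this volume).
- Math and symbols: inline symbols go in backticks (e.g. QPS, T_inf, ρ, p99, ψ); expressions containing division, integrals, or compound operators must be parenthesized; where the path quantity T_arr is involved, register gamma(ell) and d ell; formulas, symbols, and definitions must not use Chinese.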
III. Fields and Structure (normative)
monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p50", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p95", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
      - {name: "utilization_rho", unit: "ratio", agg: "mean", window: "5m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
      - {name: "drift.psi", unit: "—", agg: "mean", window: "15m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
      - {name: "net_mbps", unit: "Mbps", agg: "mean", window: "1m"}
      - {name: "disk_io_mbps", unit: "Mbps", agg: "mean", window: "1m"}
  logs:
    level: "info|warn|error"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://.../logs/", "kafka://.../topic"]
    pii_redaction: true
  traces:
    enabled: true
    sampler: "parent|probabilistic"
    ratio: 0.05
    propagator: "w3c|b3"
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
    dq: ["grafana:/boards/dq_quality"]
    cost: ["grafana:/boards/costs"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
    - {name: "dq_drop", rule: "dq.pass_rate<0.98 for 15m", severity: "medium", channel: "slack"}
    - {name: "drift_alert", rule: "drift.psi>0.2 for 30m", severity: "low", channel: "email"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
      - {name: "availability", target: 0.999, window: "30d"}
      - {name: "dq_pass_rate", target: 0.99, window: "30d"}
    error_budget_policy: "freeze_releases|throttle|page_on_call"
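The availability objective above (target 0.999 over a 30d window) implies an error budget of about 43.2 minutes of downtime per window. The following is a minimal Python sketch of that arithmetic, not a normative implementation. The function names and the rule "apply the policy only once the budget is exhausted" are illustrative assumptions; this chapter does not define how the three policy values are selected.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability target over the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - target) * window_days * 24 * 60


def budget_remaining(target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(target, window_days)
    return (budget - bad_minutes) / budget


def policy_action(remaining: float, policy: str = "freeze_releases") -> str:
    """Assumed mapping: return the configured policy only when the budget is used up."""
    return policy if remaining <= 0.0 else "none"


if __name__ == "__main__":
    remaining = budget_remaining(target=0.999, window_days=30, bad_minutes=50.0)
    print(round(error_budget_minutes(0.999, 30), 1), policy_action(remaining))
```

With 50 bad minutes in the window the budget is overspent, so the sketch returns the configured policy; with 10 bad minutes it returns "none".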
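IV. Metric System and Conventions
- Performance: QPS (1/s), latency_ms.{p50,p95,p99}, queueing delay, and throughput-latency curves;
- Quality: dq.pass_rate, drift.psi (or KL/KS), leakage counts and rates;
- Resources: CPU, memory, network, and disk I/O; cache hit rate and origin-fetch ratio;
- Aggregation and windows: use one agg/window convention throughout, supporting sum|mean|quant|max|min and similar;
- Units and dimensions: use SI throughout; normalize units before composing composite metrics, and pass check_dim.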
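As an illustration of the "normalize units first, then aggregate" convention, the sketch below converts mixed-unit latency samples to ms and then computes the quant aggregation for one window. It is a hedged example: the nearest-rank quantile definition, the helper names, and the sample values are assumptions, since this chapter does not fix a quantile estimator.

```python
import math
from typing import Iterable, List, Tuple

# Conversion factors to milliseconds (assumed set of accepted input units).
TO_MS = {"s": 1000.0, "ms": 1.0}


def to_ms(value: float, unit: str) -> float:
    """Normalize a latency sample to milliseconds before any aggregation."""
    return value * TO_MS[unit]


def nearest_rank_quantile(samples: Iterable[float], q: float) -> float:
    """Nearest-rank quantile: the sample at 1-based rank ceil(q * n)."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("empty window")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return ordered[math.ceil(q * len(ordered)) - 1]


# One aggregation window of raw samples in mixed units (made-up values).
window: List[Tuple[float, str]] = [(0.125, "s"), (80, "ms"), (95, "ms"), (250, "ms"), (1.5, "s")]
latencies_ms = [to_ms(v, u) for v, u in window]

summary = {
    "latency_ms.p50": nearest_rank_quantile(latencies_ms, 0.50),
    "latency_ms.p95": nearest_rank_quantile(latencies_ms, 0.95),
    "latency_ms.p99": nearest_rank_quantile(latencies_ms, 0.99),
}
```

With only five samples, p95 and p99 both land on the largest value, which shows why the window length matters when tail percentiles drive alerts.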
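V. Logs and Traces
- Logs: structured jsonl containing ts, level, stage, run_id, trace_id, span_id, error_code, message, artifact_hash; redaction or masking of sensitive information is enabled.
- Traces: distributed traces carry trace_id/span_id plus the stage name, I/O size, and a hash of key parameters; the sampling strategy is either parent (inherited from the parent span) or probabilistic (ratio-based).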
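The log fields and the two sampler modes described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, e-mail addresses stand in for whatever the deployment classifies as sensitive, and hashing trace_id to get a deterministic sampling decision is one possible way to realize probabilistic sampling, not something this chapter prescribes.

```python
import hashlib
import json
import re
from typing import Optional

# Assumption: e-mail addresses are the only sensitive pattern in this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Mask sensitive substrings before the record leaves the process."""
    return EMAIL_RE.sub("[REDACTED]", text)


def make_log_line(ts: str, level: str, stage: str, run_id: str, trace_id: str,
                  span_id: str, error_code: str, message: str, artifact_hash: str) -> str:
    """Serialize one record as a single JSON line with the fields listed above."""
    record = {
        "ts": ts, "level": level, "stage": stage, "run_id": run_id,
        "trace_id": trace_id, "span_id": span_id, "error_code": error_code,
        "message": redact(message), "artifact_hash": artifact_hash,
    }
    return json.dumps(record, ensure_ascii=False)


def should_sample(trace_id: str, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Parent mode: inherit the parent's decision. Otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    # Map the trace_id to a stable value in [0, 1) so every stage agrees.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 0x100000000
    return bucket < ratio


line = make_log_line("2025-01-01T00:00:00Z", "error", "ingest", "run-1", "trace-1",
                     "span-1", "E_IO", "upload failed for alice@example.com", "sha256:...")
```

Deriving the decision from trace_id means a given trace is either kept or dropped by every stage, so sampled traces stay complete.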
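VI. Dashboards and Alerts
- Dashboards: three classes (system, quality, cost); each board contains core time series, TopK hotspots, and RCA links that jump to the relevant traces/logs.
- Alerts: rules use a "metric threshold + duration" syntax; inhibition and grouping are supported to avoid alert storms; severity levels and notification channels are fixed.
- SLO/error budget: when latency_p99, availability, or dq_pass_rate is breached, freeze releases or throttle according to the error budget policy.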
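To make the "metric threshold + duration" syntax concrete, the sketch below parses a rule string such as latency_ms.p99>20000 for 10m and decides whether it fires on a series of samples. It is illustrative only: the parser accepts numeric thresholds only, and "fires when the most recent unbroken run of breaching samples spans at least the duration" is an assumed reading of the semantics, not a definition taken from this chapter.

```python
import operator
import re
from dataclasses import dataclass
from typing import List, Tuple

RULE_RE = re.compile(
    r"^(?P<metric>[a-z0-9_.]+)(?P<op>>=|<=|>|<|==)(?P<threshold>[0-9.]+)"
    r" for (?P<n>\d+)(?P<unit>[smhd])$"
)
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}


@dataclass
class AlertRule:
    metric: str
    op: str
    threshold: float
    duration_s: int


def parse_rule(text: str) -> AlertRule:
    """Parse '<metric><op><threshold> for <n><unit>' into its parts."""
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"invalid alert rule: {text!r}")
    return AlertRule(m["metric"], m["op"], float(m["threshold"]),
                     int(m["n"]) * UNIT_SECONDS[m["unit"]])


def is_firing(rule: AlertRule, samples: List[Tuple[int, float]]) -> bool:
    """samples are (timestamp_s, value) pairs in ascending time order."""
    compare = OPS[rule.op]
    run_start = None
    for ts, value in reversed(samples):  # walk back from the newest sample
        if not compare(value, rule.threshold):
            break
        run_start = ts
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= rule.duration_s


rule = parse_rule("latency_ms.p99>20000 for 10m")
sustained = [(t * 60, 25000.0) for t in range(11)]             # 10 minutes above threshold
with_dip = [(t * 60, 1000.0 if t == 5 else 25000.0) for t in range(11)]
```

Here is_firing(rule, sustained) is true, while the single dip in with_dip resets the run and keeps the rule silent, which is the behavior the duration clause is meant to provide against transient spikes.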
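VII. Observability and Health
- Health score: a weighted combination of perf/quality/resources metrics yields health_score∈[0,1];
- Capacity trends: provide 30/90-day capacity and cost trends to support scaling and budget adjustments;
- Blackbox/whitebox monitoring: blackbox probes cover the end-to-end path; whitebox monitoring exposes each stage's internal metrics and thread/queue depth.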
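One possible reading of the weighted health score is a normalized weighted mean of per-category sub-scores, each already scaled to [0,1]. The sketch below assumes exactly that; the weights, the sub-score values, and the clamping step are illustrative choices, since this chapter does not fix the weighting scheme.

```python
from typing import Mapping


def clamp01(x: float) -> float:
    """Keep a sub-score inside [0, 1] so the result stays in range."""
    return max(0.0, min(1.0, x))


def health_score(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of sub-scores; weights are normalized by their sum."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * clamp01(components[name]) for name, w in weights.items()) / total


# Assumed sub-scores, e.g. perf derived from how close latency_ms.p99 is to its target.
components = {"perf": 0.5, "quality": 1.0, "resources": 0.75}
weights = {"perf": 2.0, "quality": 1.0, "resources": 1.0}
score = health_score(components, weights)  # (2*0.5 + 1*1.0 + 1*0.75) / 4 = 0.6875
```

Normalizing by the weight sum keeps health_score in [0,1] regardless of how the weights are scaled.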
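VIII. Metrology and Units (SI)
- Mandatory: metrology:{units:"SI", check_dim:true};
- Performance/resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (dimensionless), net_mbps, size_bytes;
- Path quantities: if monitoring/logging involves T_arr, register delta_form, path="gamma(ell)", and measure="d ell", and adopt one of the following equivalent forms, which must pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).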
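The two forms agree whenever c_ref is constant along the path, because a constant factor can be moved outside the integral. The sketch below checks this numerically with a trapezoidal rule and performs a toy dimension check. Everything in it is an assumption for illustration: the sample values of n_eff, the numeric value used for c_ref, and the exponent-tuple representation of dimensions are not taken from Core.Metrology.

```python
import math
from typing import List, Sequence, Tuple


def trapezoid(ys: Sequence[float], xs: Sequence[float]) -> float:
    """Composite trapezoidal rule for samples ys taken at positions xs."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0 for i in range(len(xs) - 1))


# Dimensions as (length, time) exponents: n_eff is dimensionless,
# d ell is a length, c_ref is a speed (length / time).
DIM = {"n_eff": (0, 0), "d_ell": (1, 0), "c_ref": (1, -1)}


def dim_mul(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    return (a[0] + b[0], a[1] + b[1])


def dim_div(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    return (a[0] - b[0], a[1] - b[1])


def t_arr_dimension() -> Tuple[int, int]:
    """Dimension of (n_eff * d ell) / c_ref; a time is (0, 1)."""
    return dim_div(dim_mul(DIM["n_eff"], DIM["d_ell"]), DIM["c_ref"])


C_REF = 299_792_458.0                        # m/s, placeholder value for c_ref
ell: List[float] = [0.0, 1.0e3, 2.0e3, 3.0e3]  # path positions along gamma(ell), in m
n_eff: List[float] = [1.0, 1.0002, 1.0005, 1.0003]

t_form_a = (1.0 / C_REF) * trapezoid(n_eff, ell)       # ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_form_b = trapezoid([n / C_REF for n in n_eff], ell)  # ( ∫ ( n_eff / c_ref ) d ell )
```

The two results differ only by floating-point rounding, and t_arr_dimension() evaluates to (0, 1), a pure time. If c_ref varied along the path, only the second form would remain valid.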
IX. Machine-Readable Snippet (ready to embed)
monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
  logs:
    level: "info"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://eift/logs/", "kafka://obs/logs"]
    pii_redaction: true
  traces: {enabled: true, sampler: "probabilistic", ratio: 0.1, propagator: "w3c"}
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
    error_budget_policy: "freeze_releases"
X. Lint Rules (excerpt, normative)
lint_rules:
  - id: MON.METRICS_UNIT_SI
    when: "$.monitoring.metrics..unit"
    assert: "all_units_in_SI(value)"
    level: error
  - id: LOG.STRUCTURED_JSONL
    when: "$.monitoring.logs.format"
    assert: "value == 'jsonl'"
    level: error
  - id: TRACE.SAMPLER_VALID
    when: "$.monitoring.traces"
    assert: "value.enabled == false or value.sampler in ['parent','probabilistic']"
    level: error
  - id: ALERT.SYNTAX_VALID
    when: "$.monitoring.alert_rules[*].rule"
    assert: "matches('^[a-z0-9_\\.]+[><=].+ for \\d+[smhd]$')"
    level: error
  - id: SLO.OBJECTIVES_DEFINED
    when: "$.monitoring.slo.objectives"
    assert: "len(value) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
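A few of the rules above can be exercised against a parsed config with plain Python, as in the hedged sketch below. It hard-codes four rules instead of interpreting the when/assert expressions, and it treats a missing field as a failure, which is stricter than a selector-based linter that would skip a path that does not exist. The function name and the sample configs are hypothetical.

```python
import re
from typing import Any, Dict, List

# Same pattern as ALERT.SYNTAX_VALID, written as a Python raw string.
ALERT_RE = re.compile(r"^[a-z0-9_\.]+[><=].+ for \d+[smhd]$")


def lint(config: Dict[str, Any]) -> List[str]:
    """Return the ids of the rules that the config violates."""
    mon = config.get("monitoring", {})
    failures: List[str] = []

    if mon.get("logs", {}).get("format") != "jsonl":
        failures.append("LOG.STRUCTURED_JSONL")

    traces = mon.get("traces", {})
    if not (traces.get("enabled") is False
            or traces.get("sampler") in ("parent", "probabilistic")):
        failures.append("TRACE.SAMPLER_VALID")

    if any(not ALERT_RE.match(r.get("rule", "")) for r in mon.get("alert_rules", [])):
        failures.append("ALERT.SYNTAX_VALID")

    if len(mon.get("slo", {}).get("objectives", [])) < 1:
        failures.append("SLO.OBJECTIVES_DEFINED")

    return failures


good = {"monitoring": {
    "logs": {"format": "jsonl"},
    "traces": {"enabled": True, "sampler": "probabilistic", "ratio": 0.1},
    "alert_rules": [{"name": "p99_latency_breach", "rule": "latency_ms.p99>20000 for 10m"}],
    "slo": {"objectives": [{"name": "latency_p99", "target_ms": 20000, "window": "30d"}]},
}}
bad = {"monitoring": {
    "logs": {"format": "text"},
    "traces": {"enabled": True, "sampler": "always"},
    "alert_rules": [{"name": "broken", "rule": "latency high"}],
    "slo": {"objectives": []},
}}
```

lint(good) returns an empty list, and lint(bad) reports all four rule ids. Note that a template value such as sampler: "parent|probabilistic" would fail TRACE.SAMPLER_VALID, so placeholders must be resolved to a single value before linting.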
XI. Export Manifest and Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "monitoring/dashboards.json", sha256: "..."}
    - {path: "monitoring/alert_rules.yaml", sha256: "..."}
    - {path: "monitoring/slo_objectives.yaml", sha256: "..."}
    - {path: "logs/index.manifest.json", sha256: "..."}
    - {path: "traces/config.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
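The sha256 fields in the manifest can be filled and later audited with the standard library, as sketched below. The helper names are hypothetical and the sketch covers only the path and sha256 fields; how references are resolved is left to Core.DataSpec.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict, Iterable, List


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Hex digest of a file, read in chunks so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, rel_paths: Iterable[str], version: str = "v1.0") -> Dict[str, Any]:
    """Build the export_manifest structure for artifacts located under root."""
    return {"export_manifest": {
        "version": version,
        "artifacts": [{"path": p, "sha256": sha256_file(root / p)} for p in rel_paths],
    }}


def verify_manifest(root: Path, manifest: Dict[str, Any]) -> List[str]:
    """Return the paths whose current digest no longer matches the manifest."""
    return [a["path"] for a in manifest["export_manifest"]["artifacts"]
            if sha256_file(root / a["path"]) != a["sha256"]]
```

An audit step would call verify_manifest on the exported directory and treat any returned path as a failed release gate.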
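XII. Chapter Compliance Self-Check
- Metric, log, and trace configuration is complete; units are SI and pass check_dim; aggregation and window conventions are consistent.
- Dashboards cover the core system/quality/cost views; alert rules are syntactically valid, with explicit severity levels and channels.
- SLO objectives and the error budget policy are defined; a breach can trigger a release freeze or throttling.
- Log redaction is enabled, and trace sampling and the propagator are explicitly configured; RCA supports jumping to the relevant traces/logs.
- The export manifest lists the dashboards, alert rules, SLO objectives, log index, and trace configuration, each with a sha256; reference anchors are complete and meet the release gate.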
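Copyright and License (CC BY 4.0)
Copyright notice: unless otherwise stated, the copyright of "Energy Filament Theory" (《能量丝理论》), including its text, tables, illustrations, symbols, and formulas, belongs to the author, Mr. 屠广林.
License: this work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0); copying, reposting, excerpting, adaptation, and redistribution are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Suggested attribution: Author: 屠广林; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.
First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/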