Document Index - Technical White Paper 45 - EFT.WP.Data.Pipeline v1.0

Chapter 12 Monitoring, Logging, and Observability


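I. Purpose and Scope

Codify the pipeline's monitoring, logging, and observability specifications, covering metrics and their dimensions, logs and traces, dashboards and alerts, SLA/SLO and error classification, operational health and capacity trends, and audit and export; and ensure consistency with the data contract, quality gate, orchestration, and metrology chapters.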

II. Terms and Dependencies


III. Fields and Structure (Normative)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p50", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p95", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
      - {name: "utilization_rho", unit: "ratio", agg: "mean", window: "5m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
      - {name: "drift.psi", unit: "—", agg: "mean", window: "15m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
      - {name: "net_mbps", unit: "Mbps", agg: "mean", window: "1m"}
      - {name: "disk_io_mbps", unit: "Mbps", agg: "mean", window: "1m"}
  logs:
    level: "info|warn|error"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://.../logs/", "kafka://.../topic"]
    pii_redaction: true
  traces:
    enabled: true
    sampler: "parent|probabilistic"
    ratio: 0.05
    propagator: "w3c|b3"
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
    dq: ["grafana:/boards/dq_quality"]
    cost: ["grafana:/boards/costs"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
    - {name: "dq_drop", rule: "dq.pass_rate<0.98 for 15m", severity: "medium", channel: "slack"}
    - {name: "drift_alert", rule: "drift.psi>0.2 for 30m", severity: "low", channel: "email"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
      - {name: "availability", target: 0.999, window: "30d"}
      - {name: "dq_pass_rate", target: 0.99, window: "30d"}
    error_budget_policy: "freeze_releases|throttle|page_on_call"
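To make the SLO objectives above concrete, the following minimal sketch converts a ratio target and a window into an error budget. It assumes the window string uses a day suffix such as "30d", as in the objectives; the function names `parse_window_days`, `error_budget_minutes`, and `budget_remaining` are hypothetical and are not part of this specification.

```python
def parse_window_days(window: str) -> int:
    """Parse a window such as "30d" into days (only the 'd' suffix is handled here)."""
    if not window.endswith("d") or not window[:-1].isdigit():
        raise ValueError(f"unsupported window: {window!r}")
    return int(window[:-1])


def error_budget_minutes(target: float, window: str) -> float:
    """Allowed 'bad' minutes for a ratio target (e.g. availability) over the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    total_minutes = parse_window_days(window) * 24 * 60
    return (1.0 - target) * total_minutes


def budget_remaining(target: float, window: str, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once it is exhausted)."""
    budget = error_budget_minutes(target, window)
    if budget == 0:
        return 0.0 if bad_minutes == 0 else float("-inf")
    return 1.0 - bad_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, "30d")
    print(f"availability budget: {budget:.1f} min per 30d")
    print(f"remaining after 10 min: {budget_remaining(0.999, '30d', 10):.3f}")
```

For the availability objective (target 0.999 over "30d") this gives a budget of about 43.2 minutes. A policy such as "freeze_releases" could be triggered once the remaining fraction drops to zero, although the exact trigger is not defined in this chapter.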


IV. Metric System and Measurement Conventions


V. Logging and Tracing


VI. Dashboards and Alerting


VII. Observability and Health


VIII. Metrology and Units (SI)

  1. Mandatory: metrology:{units:"SI", check_dim:true};
  2. Performance/resources: QPS (1/s), T_inf (ms; {p50, p95, p99}), ρ (—), net_mbps, size_bytes;
  3. Path quantities: if monitoring or logging involves T_arr, register delta_form, path="gamma(ell)", and measure="d ell", then adopt one of the following equivalent forms and pass check_dim:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
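The two T_arr forms above coincide when c_ref is constant along the path, and a quick numerical check can illustrate this. The sketch below is illustrative only: the value of `C_REF`, the profile `n_eff`, and the helper `trapezoid` are assumptions introduced here and are not defined by this chapter.

```python
from typing import Callable

C_REF = 299_792_458.0  # m/s; assumed constant reference speed, for illustration only


def trapezoid(f: Callable[[float], float], a: float, b: float, steps: int = 1000) -> float:
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


def n_eff(ell: float) -> float:
    """Hypothetical dimensionless effective index along the path (ell in metres)."""
    return 1.0 + 1e-6 * ell


def t_arr_factored(length_m: float) -> float:
    """T_arr = (1 / c_ref) * integral of n_eff d ell.  Dimension: (s/m) * m = s."""
    return (1.0 / C_REF) * trapezoid(n_eff, 0.0, length_m)


def t_arr_inline(length_m: float) -> float:
    """T_arr = integral of (n_eff / c_ref) d ell.  Dimension: (s/m) * m = s."""
    return trapezoid(lambda ell: n_eff(ell) / C_REF, 0.0, length_m)


if __name__ == "__main__":
    a = t_arr_factored(1000.0)
    b = t_arr_inline(1000.0)
    print(f"factored: {a:.12e} s, inline: {b:.12e} s, diff: {abs(a - b):.3e} s")
```

Both functions return seconds, which is the dimension that check_dim would be expected to confirm; if c_ref varied along the path, only the second form would remain meaningful.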

IX. Machine-Readable Snippet (Directly Embeddable)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
  logs:
    level: "info"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://eift/logs/", "kafka://obs/logs"]
    pii_redaction: true
  traces: {enabled: true, sampler: "probabilistic", ratio: 0.1, propagator: "w3c"}
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
    error_budget_policy: "freeze_releases"
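As an illustration of how a rule string such as "latency_ms.p99>20000 for 10m" might be interpreted, the sketch below parses the rule and checks whether the condition held throughout the trailing hold period. The parsing pattern, the `Rule` type, and the firing semantics (every sample in the trailing window must breach) are assumptions made for this example; the chapter does not define an evaluation algorithm.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumption: the threshold is a plain number (stricter than the lint pattern).
RULE_RE = re.compile(r"^([a-z0-9_.]+)(>=|<=|>|<|=)(-?\d+(?:\.\d+)?) for (\d+)([smhd])$")
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


class Rule(NamedTuple):
    metric: str
    op: str
    threshold: float
    hold_seconds: int


def parse_rule(text: str) -> Rule:
    """Split '<metric><op><threshold> for <n><unit>' into its parts."""
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"unparseable alert rule: {text!r}")
    metric, op, threshold, count, unit = m.groups()
    return Rule(metric, op, float(threshold), int(count) * UNIT_SECONDS[unit])


def compare(op: str, value: float, threshold: float) -> bool:
    """Apply the comparison operator taken from the rule."""
    if op == ">":
        return value > threshold
    if op == "<":
        return value < threshold
    if op == ">=":
        return value >= threshold
    if op == "<=":
        return value <= threshold
    return value == threshold  # op == "="


def should_fire(rule: Rule, samples: List[Tuple[float, float]], now: float) -> bool:
    """True if the trailing hold window has samples and every one of them breaches.

    samples are (timestamp_seconds, value) pairs for rule.metric.
    """
    start = now - rule.hold_seconds
    window = [value for ts, value in samples if start <= ts <= now]
    return bool(window) and all(compare(rule.op, v, rule.threshold) for v in window)


if __name__ == "__main__":
    rule = parse_rule("latency_ms.p99>20000 for 10m")
    series = [(float(t), 25000.0) for t in range(0, 601, 60)]
    print(rule, should_fire(rule, series, now=600.0))
```

A production evaluator would also need to decide how to treat gaps in the series; this sketch simply ignores missing samples.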


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: MON.METRICS_UNIT_SI
    when: "$.monitoring.metrics..unit"
    assert: "all_units_in_SI(value)"
    level: error
  - id: LOG.STRUCTURED_JSONL
    when: "$.monitoring.logs.format"
    assert: "value == 'jsonl'"
    level: error
  - id: TRACE.SAMPLER_VALID
    when: "$.monitoring.traces"
    assert: "value.enabled == false or value.sampler in ['parent','probabilistic']"
    level: error
  - id: ALERT.SYNTAX_VALID
    when: "$.monitoring.alert_rules[*].rule"
    assert: "matches('^[a-z0-9_\\.]+[><=].+ for \\d+[smhd]$')"
    level: error
  - id: SLO.OBJECTIVES_DEFINED
    when: "$.monitoring.slo.objectives"
    assert: "len(value) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
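The rules above could be applied to an already-parsed configuration roughly as follows. This is a sketch under stated assumptions: the document is a plain Python dict, a rule is skipped when its `when` path is absent, and MON.METRICS_UNIT_SI is omitted because `all_units_in_SI` is not defined in this chapter. The function name `lint` is hypothetical.

```python
import re
from typing import Any, Dict, List

# Pattern copied from ALERT.SYNTAX_VALID (YAML escaping removed).
ALERT_RE = re.compile(r"^[a-z0-9_\.]+[><=].+ for \d+[smhd]$")


def lint(doc: Dict[str, Any]) -> List[str]:
    """Return the ids of violated rules; an empty list means the document passes."""
    failed: List[str] = []
    mon = doc.get("monitoring", {})

    # LOG.STRUCTURED_JSONL: the log format must be exactly 'jsonl'.
    logs = mon.get("logs", {})
    if "format" in logs and logs["format"] != "jsonl":
        failed.append("LOG.STRUCTURED_JSONL")

    # TRACE.SAMPLER_VALID: either tracing is disabled or the sampler is a known one.
    traces = mon.get("traces")
    if traces is not None:
        disabled = traces.get("enabled") is False
        if not (disabled or traces.get("sampler") in ("parent", "probabilistic")):
            failed.append("TRACE.SAMPLER_VALID")

    # ALERT.SYNTAX_VALID: every rule string must match the pattern.
    rules = [r["rule"] for r in mon.get("alert_rules", []) if "rule" in r]
    if any(ALERT_RE.match(text) is None for text in rules):
        failed.append("ALERT.SYNTAX_VALID")

    # SLO.OBJECTIVES_DEFINED: at least one objective when the key is present.
    slo = mon.get("slo", {})
    if "objectives" in slo and len(slo["objectives"] or []) < 1:
        failed.append("SLO.OBJECTIVES_DEFINED")

    # METROLOGY.SI_AND_CHECKDIM: SI units with dimension checking switched on.
    metrology = doc.get("metrology")
    if metrology is not None:
        if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
            failed.append("METROLOGY.SI_AND_CHECKDIM")

    return failed
```

Returning rule ids rather than raising on the first failure lets a caller report every violation in one pass, which suits rules that all carry `level: error`.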


XI. Export Manifest and Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "monitoring/dashboards.json", sha256: "..."}
    - {path: "monitoring/alert_rules.yaml", sha256: "..."}
    - {path: "monitoring/slo_objectives.yaml", sha256: "..."}
    - {path: "logs/index.manifest.json", sha256: "..."}
    - {path: "traces/config.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
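The sha256 fields in the manifest above could be populated and later re-checked during an audit along the following lines. This is a minimal sketch that assumes the artifacts are regular files under a common root directory; the function names `sha256_file`, `build_manifest`, and `verify_manifest` are hypothetical, and the `references` list is left out for brevity.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, rel_paths: List[str], version: str = "v1.0") -> Dict[str, Any]:
    """Build an export_manifest mapping with one artifact entry per relative path."""
    artifacts = [{"path": p, "sha256": sha256_file(root / p)} for p in rel_paths]
    return {"export_manifest": {"version": version, "artifacts": artifacts}}


def verify_manifest(root: Path, manifest: Dict[str, Any]) -> List[str]:
    """Return the artifact paths that are missing or whose digest no longer matches."""
    mismatched: List[str] = []
    for entry in manifest["export_manifest"]["artifacts"]:
        target = root / entry["path"]
        if not target.is_file() or sha256_file(target) != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched
```

An audit step could call `verify_manifest` on the export directory and treat any non-empty result as a failed check.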


XII. Chapter Compliance Self-Check


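Copyright and License (CC BY 4.0)

Copyright notice: unless otherwise stated, the copyright in "Energy Filament Theory" (《能量丝理论》), including its text, charts, illustrations, symbols, and formulas, is held by the author (Mr. "屠广林").
License: this work is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0); copying, reposting, excerpting, adapting, and redistributing are permitted for commercial or non-commercial purposes, provided the author and source are credited.
Attribution format (suggested): Author: "屠广林"; Work: "Energy Filament Theory"; Source: energyfilament.org; License: CC BY 4.0.

First published: 2025-11-11 | Current version: v5.1
License link: https://creativecommons.org/licenses/by/4.0/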