Observability
Status: Pass 1 implementation spec. This doc tells the implementer what to build. The strategic rationale and option analysis live in `research/daedalus-observability.md` in `raykao/dark-factory`. Read that first if you need to understand why the design has this shape; read this doc to understand what to build.
Scope
In scope (Pass 1):
- Trace ID end-to-end audit and gap-fill
- Per-agent-type fleet dashboard plus a fleet overview, authored against agent-type labels (not hardcoded to copilot)
- Structural alert rules with no SLO thresholds
- Runbook entries for each alert
- Phase 5 acceptance-criteria coverage section
Out of scope (Pass 2, separate epic):
- SLO threshold alerts (cold-start P99, task latency P99, error rate)
- Rate-of-change alerts (P99 doubled week-over-week, error rate up 3x)
- Trace sampling policy beyond 100% sampling
- Log-retention extension past Loki defaults
Out of scope (different track entirely):
- DLQ policy decision (auto-retry vs escalate vs expire) - separate scoping doc
- Multi-tenant dashboard segregation - waits on Phase 6+ multi-tenancy
- Audit-trail logging - production-hardening track (Option E)
Architecture summary
Today the platform has:
- OTel scaffolding in `internal/telemetry/` (provider, NATS header propagation, structured logging, span instrumentation at every queue/ACP hop)
- PR #30 production overlay deploying kube-prometheus-stack, Tempo, Loki, an OTel collector, ServiceMonitors, PodMonitors, and Grafana datasources
What is missing is the part that makes the stack useful: dashboards, alerts, runbook entries, and a proof that the trace ID survives the queue boundary under load.
Implementation work items
1. Trace ID end-to-end audit
- Verify trace context propagates across every hop: Mattermost ingress -> orchestrator -> NATS publish -> NATS consume -> ACP `session/new` -> ACP `session/prompt` -> ACP `session/update` stream -> result publish -> orchestrator collector.
- Gap-fill any drop. Most likely candidates:
  - Proxy stdout pipe to the agent CLI (no propagation today).
  - Result `Artifact` envelope (verify the `trace_id` field is populated, not regenerated downstream).
- Add an integration test under `test/integration/trace-propagation/` that:
  - Publishes 100 tasks concurrently
  - Collects all spans from a test OTLP collector
  - Asserts every task has exactly one root trace with all expected child spans
  - Asserts no orphan spans (parent `trace_id` not present in any other span)
Acceptance: the integration test is green and the README of that test directory documents the expected span tree.
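The two assertions above can be sketched as a pure function over the collected spans. This is a minimal illustration in Python, not the repo's test code: the span dictionary keys (`trace_id`, `span_id`, `parent_span_id`, `name`) and the span names in the usage note are hypothetical, and "orphan" is interpreted here as a span whose parent reference matches no collected span in the same trace.

```python
from collections import defaultdict


def check_span_trees(spans, expected_children):
    """Return a list of violations; an empty list means the audit passes.

    Each span is a dict with hypothetical keys: trace_id, span_id,
    parent_span_id (None for a root span) and name.
    """
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    violations = []
    for trace_id, trace_spans in by_trace.items():
        span_ids = {s["span_id"] for s in trace_spans}

        # Exactly one root per trace.
        roots = [s for s in trace_spans if s["parent_span_id"] is None]
        if len(roots) != 1:
            violations.append(f"{trace_id}: expected 1 root span, found {len(roots)}")

        # An orphan points at a parent that was never collected in this trace.
        for s in trace_spans:
            parent = s["parent_span_id"]
            if parent is not None and parent not in span_ids:
                violations.append(f"{trace_id}: orphan span {s['name']}")

        # Every expected child span name must appear somewhere in the trace.
        missing = set(expected_children) - {s["name"] for s in trace_spans}
        if missing:
            violations.append(f"{trace_id}: missing spans {sorted(missing)}")
    return violations
```

A test built on this would also assert that the number of distinct trace IDs equals the number of published tasks (100), since the function only validates each trace it is given.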
2. Per-agent-type fleet dashboards
Author Grafana dashboards as JSON committed under
deploy/helm/daedalus/dashboards/ and provisioned via the Grafana
sidecar config map.
Per-agent dashboard (one file per agent type, parameterized by
agent_type template variable):
| Panel | PromQL sketch |
|---|---|
| Cold-start latency histogram | histogram_quantile(0.50/0.95/0.99, rate(daedalus_cold_start_seconds_bucket{agent_type="$agent_type"}[5m])) |
| Queue depth | nats_consumer_num_pending{stream=~"agent_$agent_type.*"} |
| Active workers | kube_job_status_active{job_name=~"daedalus-worker-$agent_type-.*"} |
| Task throughput (success/failure split) | sum by (status) (rate(daedalus_tasks_total{agent_type="$agent_type"}[5m])) |
| Task-to-artifact latency histogram | histogram_quantile(...) over daedalus_task_duration_seconds |
| Error rate (5min rolling) | success/failure derived from daedalus_tasks_total |
| Top 10 slowest tasks (last hour) | Tempo TraceQL link |
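The `0.50/0.95/0.99` notation in the table is shorthand: a panel needs one query per quantile. A small helper can expand the shorthand when generating dashboard JSON. This is a sketch under assumptions: the helper name is hypothetical, and the `sum by (le)` aggregation is an addition not present in the table, included so that buckets from multiple pods collapse into a single series per quantile.

```python
def quantile_queries(metric, agent_type_var="$agent_type", window="5m",
                     quantiles=(0.50, 0.95, 0.99)):
    """Expand the quantile shorthand into one PromQL query per quantile.

    metric is the histogram base name without the _bucket suffix.
    """
    selector = f'{metric}_bucket{{agent_type="{agent_type_var}"}}'
    return {
        q: f"histogram_quantile({q}, sum by (le) (rate({selector}[{window}])))"
        for q in quantiles
    }


if __name__ == "__main__":
    for q, expr in quantile_queries("daedalus_cold_start_seconds").items():
        print(q, expr)
```

The same helper covers the task-to-artifact latency panel by passing `daedalus_task_duration_seconds` as the metric.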
Fleet overview dashboard renders the same panels aggregated across
agent types with an agent_type legend breakdown.
Every metric must carry an agent_type label. If a metric does not
carry it today, this is a code-side fix in internal/telemetry/ first,
not a dashboard hack.
Acceptance:
- Dashboards committed as JSON.
- Helm chart provisions them via Grafana sidecar.
- A fresh make deploy-aks-test run renders both dashboards with live
data without manual import.
- The dashboards have an "SLO panels: baseline collection in progress"
marker on any panel intended to gain a threshold in Pass 2.
3. Structural alert rules
Add Prometheus / Alertmanager rules under
deploy/helm/daedalus/templates/prometheusrule.yaml:
| Alert | Condition | Severity | Runbook anchor |
|---|---|---|---|
| `WorkerImagePullBackOff` | `kube_pod_container_status_waiting_reason{reason="ImagePullBackOff", pod=~"daedalus-worker-.*"} > 0` for 2m | page | `worker-image-pull-backoff` |
| `WorkerCrashLoopBackOff` | `rate(kube_pod_container_status_restarts_total{pod=~"daedalus-worker-.*"}[10m]) > 0` for 10m | page | `worker-crashloop` |
| `NATSConsumerLagUnbounded` | `delta(nats_consumer_num_pending{stream="<keda.natsStream>"}[10m]) > 0 AND nats_consumer_num_pending{stream="<keda.natsStream>"} > 100`. The description references `$labels.consumer_name` and `$labels.stream` to match nats-surveyor's actual label set: per `collector_statz.go`, the JSZ-derived consumer metric is `nats_consumer_num_pending`, with labels including `stream` and `consumer_name`. The `stream=` filter scopes the alert to the daedalus task stream so a shared nats-surveyor watching unrelated streams does not page this rule. | page | `nats-consumer-lag` |
| `KEDAScalerError` | `rate(keda_scaler_errors_total{namespace="<release-ns>"}[5m]) > 0`. Scoped to the chart's release namespace so other teams' ScaledObjects do not page this rule. | page | `keda-scaler-error` |
| `OTelCollectorDown` | `absent(up{job="otel-collector"}) OR up{job="otel-collector"} == 0` for 5m. The expr selector is intentionally not scoped by `namespace=`: the OTel collector typically runs in its own namespace (e.g. `monitoring`), not the chart's release namespace, so a `namespace=` matcher would never match and `absent()` would be permanently true. The alert carries a static `namespace="<release-ns>"` rule label so the AlertmanagerConfig sub-route routes it to the chart's receiver regardless of where the source `up{}` series actually lives. | page | `otel-collector-down` |
| `OrchestratorDown` | `absent(kube_deployment_status_replicas_available{namespace="<release-ns>", deployment="daedalus-orchestrator"}) OR kube_deployment_status_replicas_available{namespace="<release-ns>", deployment="daedalus-orchestrator"} == 0` for 2m. The alert also carries a static `namespace="<release-ns>"` label so the AlertmanagerConfig sub-route's namespace matcher matches even on the `absent()` leg. | page | `orchestrator-down` |
| `NATSStreamUnhealthy` | `nats_stream_consumer_count{stream="<keda.natsStream>"} == 0 AND nats_stream_total_messages{stream="<keda.natsStream>"} > 0` for 10m. Both legs filter on the daedalus task stream so a shared nats-surveyor cannot fire this rule on an unrelated team's stream. The original spec used `nats_jetstream_stream_messages_lost_total > 0` OR a capacity ratio of `nats_jetstream_stream_storage_bytes` to `nats_jetstream_stream_max_bytes`; none of those metrics exist in nats-surveyor's exposition (verified against `collector_statz.go`). The replacement signal catches the same drainage-failure mode using `nats_stream_consumer_count` and `nats_stream_total_messages`, which surveyor does emit. | warn | `nats-stream-unhealthy` |
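The two-leg shape of `NATSConsumerLagUnbounded` (pending count growing over the window AND above a floor) can be reasoned about with a small model. This is an illustrative sketch only: the function name is hypothetical, and it approximates PromQL's `delta()` as last-minus-first sample, whereas the real function extrapolates to the window boundaries.

```python
def consumer_lag_unbounded(pending_samples, floor=100):
    """Model the alert condition over one 10m window.

    pending_samples: num_pending gauge values inside the window, oldest first.
    """
    if len(pending_samples) < 2:
        return False  # delta() needs at least two samples to return a value
    growing = pending_samples[-1] - pending_samples[0] > 0
    above_floor = pending_samples[-1] > floor
    return growing and above_floor


if __name__ == "__main__":
    print(consumer_lag_unbounded([50, 120, 300]))   # growing backlog above the floor
    print(consumer_lag_unbounded([3, 5, 9]))        # growing, but a single-digit blip
    print(consumer_lag_unbounded([400, 350, 300]))  # large backlog that is draining
```

The floor keeps small transient growth from paging, and the growth leg keeps a large but draining backlog from paging.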
Spec corrections from Pass 1 implementation
A fresh-eyes cross-check of the rendered PrometheusRule against this
spec table caught five PromQL defects in the original draft. The table
above is the corrected form. Originals and rationale:
- Finding 1 - `WorkerCrashLoopBackOff` threshold (was `> 0.3`). `0.3` restart events per second sustained over 10 minutes is 180 restarts in 10 minutes; Kubernetes CrashLoopBackOff caps the gap between restarts at ~5 minutes, so worst-case real flapping reaches ~0.005 r/s. The original threshold was unreachable. Corrected to `> 0` held for 10m.
- Finding 2 - `NATSConsumerLagUnbounded` referenced a non-existent metric. The original AND-leg used `rate(nats_jetstream_consumer_acks_total[10m]) == 0`, but nats-surveyor exposes only gauges - that counter does not exist under any name. Rewritten using gauges surveyor reliably exposes: `delta(...num_pending[10m]) > 0` AND `...num_pending > 100`. The 100-message floor avoids paging on transient single-digit blips.
- Finding 3 - `NATSStreamUnhealthy` divided by zero on unlimited streams. Unlimited-storage streams report `max_bytes == 0`; the unguarded `storage_bytes / max_bytes` produced `+Inf`, and `+Inf > 0.9` evaluates true, so every unlimited stream became a permanent false positive. The capacity ratio is now guarded by `max_bytes > 0`.
- Finding 4 - `OrchestratorDown` did not fire on absent deployment. Bare `kube_deployment_status_replicas_available{...} == 0` matches zero series when the deployment does not exist (which is today's state, before the orchestrator deployment lands). OR'd with `absent(<same selector>)` so the alert covers both "deployment exists with 0 available replicas" and "deployment is missing entirely".
- Finding 5 - `NATSStreamUnhealthy` implicit join. Vector-to-vector division relied on implicit auto-matching, which can silently produce empty results if surveyor adds extra labels (e.g. `server_id`) on one side. Now uses explicit `on(account, stream)` matching.

  Note (CC2-3 follow-up): this `on()` clause was for an intermediate division expression that Cross-check 2 finding CC2-3 later replaced with `nats_stream_consumer_count == 0 and nats_stream_total_messages > 0`. Both metrics carry the same label set, so no explicit matching clause is needed in the current implementation. The original Pass 1 finding text used `stream_name`; the actual surveyor label is `stream` (no `_name` suffix) - the corrected label name is reflected here.
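The arithmetic behind Findings 1 and 3 can be checked directly. The sketch below is a model, not PromQL: `promql_div` is a hypothetical helper that mimics IEEE 754 float division (PromQL returns `+Inf` or `NaN` rather than raising on a zero denominator, unlike Python's `/` operator).

```python
import math


def restarts_in_window(rate_per_second, window_seconds=600):
    """Number of restarts implied by a sustained per-second rate."""
    return rate_per_second * window_seconds


def promql_div(numerator, denominator):
    """Mimic IEEE 754 float division, which does not raise on zero."""
    if denominator == 0:
        if numerator == 0:
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def unguarded_capacity_alert(storage_bytes, max_bytes, threshold=0.9):
    # +Inf > 0.9 is true, so max_bytes == 0 always fires.
    return promql_div(storage_bytes, max_bytes) > threshold


def guarded_capacity_alert(storage_bytes, max_bytes, threshold=0.9):
    # The max_bytes > 0 guard drops unlimited streams before the comparison.
    return max_bytes > 0 and promql_div(storage_bytes, max_bytes) > threshold


if __name__ == "__main__":
    print(restarts_in_window(0.3))            # the original threshold, per 10 minutes
    print(restarts_in_window(0.005))          # the worst-case rate quoted in Finding 1
    print(unguarded_capacity_alert(1024, 0))  # unlimited stream, unguarded ratio
    print(guarded_capacity_alert(1024, 0))    # unlimited stream, guarded ratio
```

The gap between roughly 180 and roughly 3 restarts per window is why the original `> 0.3` threshold could never be reached.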
Spec corrections from Cross-check 2
A second fresh-eyes cross-check, run against nats-surveyor's actual
metric exposition (collector_statz.go @ commit 725f52d) and against
the AlertmanagerConfig sub-route routing semantics, surfaced six
additional defects. The table above is the corrected form.
- CC2-1 (HIGH, was a merge blocker) - `OrchestratorDown` lacked a namespace label on the `absent()` leg. The Prometheus Operator wraps each AlertmanagerConfig in a sub-route that injects a `namespace=<AMC-namespace>` matcher. The synthesised series from `absent()` carries only the labels in the inner selector; without a `namespace` matcher there, the series escaped the AMC and fell through to the global Alertmanager default receiver. Both legs now carry an explicit `namespace="<release-ns>"` matcher and the alert carries a static `namespace` label.
- CC2-2 - `NATSConsumerLagUnbounded` used a non-existent metric and the wrong consumer label. `nats_jetstream_consumer_num_pending` does not exist; surveyor's JSZ-derived metric is `nats_consumer_num_pending`. The consumer label is `consumer_name` (with `_name`), not `consumer`. Both expr and description now match surveyor's actual exposition.
- CC2-3 - `NATSStreamUnhealthy` referenced three fictional metrics. None of `nats_jetstream_stream_messages_lost_total`, `nats_jetstream_stream_max_bytes`, or `nats_jetstream_stream_storage_bytes` exist in surveyor's exposition. The rule was silently never firing on either leg. Replaced with a surveyor-real signal that catches the same drainage-failure mode: `nats_stream_consumer_count == 0 and nats_stream_total_messages > 0` for 10m. The capacity-ratio leg is dropped because no per-stream storage cap is exposed.
- CC2-4 - `OTelCollectorDown` had no `absent()` guard. Same class of bug as CC2-1 / pre-existing finding 4. If the collector and its ServiceMonitor are missing entirely, `up{}` returns no series and the bare `== 0` does not fire. Now wraps in `absent()` and carries a static `namespace` label so the AMC sub-route routes correctly regardless of the source `up{}` series' label set.
- CC2-5 - `KEDAScalerError` was cluster-scoped. KEDA emits `keda_scaler_errors_total` cluster-wide; without a namespace filter, errors from any other team's ScaledObject paged this rule with `daedalus_component=keda` attribution. Now scoped to `namespace="<release-ns>"`.
- CC2-6 - PagerDuty example secret name diverged across surfaces. `values.yaml`, `docs/runbook.md`, and the chart test used three different example names. Aligned to `pagerduty-routing-key` (the runbook's literal `kubectl create secret` invocation).
- CC2 in-loop follow-up - same CC2-1 class of bug on both NATS alerts. `NATSConsumerLagUnbounded` and `NATSStreamUnhealthy` fire on `nats_consumer_*` / `nats_stream_*` metrics from nats-surveyor, which typically runs in its own namespace; the source series' `namespace` label is surveyor's pod namespace, not the release namespace, so the AMC sub-route matcher does not match and the alerts escape to the global default. Both alerts now carry a static `namespace="<release-ns>"` rule label. Defense-in-depth: the same static label is applied to `WorkerImagePullBackOff` and `WorkerCrashLoopBackOff` (whose kube-state-metrics source label already resolves to the release namespace) so the chart is hardened against any future ServiceMonitor relabeling drift. A new test in `alerts_test.sh` asserts every rendered alert carries a static `namespace` label so future additions cannot regress this class.
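The routing failure shared by CC2-1, CC2-4, and the in-loop follow-up comes down to which labels the final alert carries. The model below is a simplified sketch with hypothetical function names and a hypothetical release namespace of `daedalus`; it encodes three behaviours: `absent()` synthesises a series carrying only the selector's equality matchers, static rule labels overwrite conflicting series labels, and the operator-injected sub-route matches on the `namespace` label.

```python
def absent_labels(selector_matchers):
    """Labels on the series absent() synthesises: only equality matchers survive.

    selector_matchers: iterable of (label, operator, value) tuples.
    """
    return {name: value for name, op, value in selector_matchers if op == "="}


def alert_labels(series_labels, rule_labels):
    """Static rule labels are merged over the series labels and win on conflict."""
    return {**series_labels, **rule_labels}


def reaches_amc_receiver(labels, amc_namespace):
    """The injected sub-route only matches alerts labelled with the AMC namespace."""
    return labels.get("namespace") == amc_namespace


if __name__ == "__main__":
    release_ns = "daedalus"  # hypothetical release namespace
    collector_absent = absent_labels([("job", "=", "otel-collector")])
    print(reaches_amc_receiver(alert_labels(collector_absent, {}), release_ns))
    print(reaches_amc_receiver(
        alert_labels(collector_absent, {"namespace": release_ns}), release_ns))
```

The same check explains the NATS case: a series labelled with surveyor's namespace fails the matcher until the static rule label overwrites it.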
No SLO threshold alerts in Pass 1. Cold-start, task latency, and error-rate alerts wait for Pass 2.
On-call channel routing is a product decision and is not picked
in this epic. Default routing wires every alert to a null Alertmanager
receiver and exposes a values.yaml flag
(alerting.receiver.{type,config}) that switches it to Mattermost,
GitHub issues, or PagerDuty in three lines of values overrides. The
runbook documents how.
Acceptance:
- All seven rules committed.
- A fresh make deploy-aks-test provisions them.
- An induced failure (e.g. set the worker image to a broken tag) fires
WorkerImagePullBackOff within 3 minutes.
4. Runbook entries
For each alert above, append a section to docs/runbook.md matching
the anchor in the alert table. Each section follows the same template:
```
<heading: runbook anchor from the alert table>

Means: one-sentence description of the failure mode.

Reproduce in test cluster:
- exact commands or config change

Diagnose:
- which Grafana dashboard panel
- which Loki log query (LogQL)
- which Tempo trace query (TraceQL)

Mitigate:
- short-term action (e.g. roll back image, scale to zero)
- ticket-the-fix link (GH issue template URL)
```
Acceptance: every alert has a complete runbook entry. An on-call engineer who has never seen the alert can resolve it from the runbook plus the dashboard.
5. Phase 5 validation appendix
Add a section to this doc titled "Phase 5 Acceptance Criteria Coverage" (below) that maps each observability-gated AC to the specific dashboard panel and PromQL query that proves it:
- AC1: deploy in 25 minutes. Query `histogram_quantile(0.99, daedalus_cold_start_seconds_bucket)` against the post-deploy window; assert the value is below 1500 seconds. Dashboard panel: cold-start latency histogram on the fleet overview.
- AC2: e2e task asserted. Query `daedalus_task_duration_seconds{task_id="<sentinel>"}` for the sentinel task published by the AKS e2e harness. Cross-link the Tempo trace by `trace_id` from the test output.
- AC5: workflow_dispatch passes. The GHA job uploads a Grafana snapshot URL of the fleet overview dashboard, scoped to the run window, as a workflow artifact. The snapshot is the proof.
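The AC1 assertion can be automated against the JSON that the Prometheus HTTP API returns for an instant query. This is a minimal sketch: the function name is hypothetical, fetching the response is left out, and treating an empty or `NaN` result as a failure is an assumption about how strict the check should be.

```python
def cold_start_p99_within_budget(response, budget_seconds=25 * 60):
    """Check a Prometheus instant-query response against the AC1 budget.

    response: parsed JSON body of /api/v1/query for the histogram_quantile query.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    results = response["data"]["result"]
    if not results:
        return False  # no data is treated as a failure, not a pass
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    # float("NaN") < budget is False, so an empty histogram also fails.
    return all(float(sample["value"][1]) < budget_seconds for sample in results)


if __name__ == "__main__":
    example = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [1700000000, "812.5"]}],
        },
    }
    print(cold_start_p99_within_budget(example))
```

The default budget of `25 * 60` seconds is the same 1500-second bound stated in AC1.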
Acceptance: an engineer following this section can validate the three observability-gated Phase 5 ACs without reading any other doc. Once validated, mark epic #33 closeable and tag v0.3.0.
6. On-call channel rollout (deferred decision)
Wire all alerts to the null receiver by default. Document the three
lines of Helm values that switch the receiver to Mattermost / GitHub /
PagerDuty when the human picks. Do not pick in this epic.
Open questions for the implementer
- Cardinality of `agent_type` label. Confirm with the orchestrator team that `agent_type` is bounded (one of a known set, not free-form) before publishing it on every metric. If unbounded, propose a normalization layer.
- Grafana dashboard authoring tool. Hand-authored JSON for Pass 1 (small surface). If you reach for Grafonnet or `grafana-foundation-sdk`, document why in the PR.
- Test OTLP collector. The trace-propagation integration test needs an in-process collector. Recommend the otelcol/configtelemetrycollectortest harness or a small custom collector backed by an in-memory exporter. Pick one and note the choice in the test README.
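If `agent_type` turns out to be free-form, the normalization layer mentioned in the first question could be as small as an allow-list with a fallback bucket. This is a sketch of that idea only: the allow-list contents and the `other` fallback value are hypothetical (only `copilot` is named in this spec), and the real set would come from the orchestrator team.

```python
# Hypothetical allow-list; only "copilot" is named in this spec.
KNOWN_AGENT_TYPES = frozenset({"copilot"})


def normalize_agent_type(raw, known=KNOWN_AGENT_TYPES, fallback="other"):
    """Map a raw agent type onto a bounded label value.

    Unknown or missing values collapse into a single fallback bucket so the
    label's cardinality is at most len(known) + 1.
    """
    candidate = (raw or "").strip().lower()
    return candidate if candidate in known else fallback


if __name__ == "__main__":
    print(normalize_agent_type(" Copilot "))
    print(normalize_agent_type("user-supplied-123"))
```

Applying this at the point where metrics are recorded keeps the bound in one place instead of in every dashboard query.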
References
- `internal/telemetry/{provider.go,nats.go,context.go,logging.go}` - existing OTel scaffolding (do not redo)
- `deploy/helm/values-example-production.yaml` - the PR #30 overlay that this work consumes
- `docs/plan.md` § Phase 5 Acceptance Criteria
- `docs/runbook.md` - target for new alert sections
- `docs/phase6-options.md` § Option C - the parent-doc framing
- `research/daedalus-observability.md` (in `raykao/dark-factory`) - the research doc this spec implements