[observability] Observability Coverage Report - 2026-03-04 #160
Closed
This discussion was automatically closed because it expired on 2026-03-05T09:06:05.747Z.
Executive Summary
This daily observability report covers the last 7 days of workflow activity in the norrietaylor/tt2 repository, analyzing 30 total runs across 85 registered agentic workflows. The window contains a single active day (2026-03-04); all runs occurred on that date. Of the 30 runs, 6 reached the agent execution stage (firewall + MCP gateway stack), 3 encountered pre-agent startup failures, 15 were intentionally skipped (event-triggered workflows with no matching criteria), and the remaining 6 were maintenance or CI builds.
For all 6 runs that reached the agent execution stage, AWF Firewall observability achieved 100% coverage: every run completed the "Print firewall logs" step successfully and uploaded engine artifacts. MCP Gateway telemetry achieved 100% coverage for runs where the gateway was actually started (5 of 5 successful agentic runs). One agent job failure (Discussion Task Miner) caused the MCP gateway to be skipped, resulting in no MCP session telemetry for that run; this is classified as a warning.
Overall observability health is HEALTHY with no critical gaps in completed runs. The main concern for the week is 3 startup_failure runs that indicate pre-agent infrastructure failures, preventing any observability data from being collected for those workflows.
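The run breakdown above amounts to a tally by outcome, with startup failures pulled out for follow-up. A minimal sketch of that tally, assuming each run is a record with hypothetical `workflow` and `conclusion` fields (the record shape and the sample list are illustrative, not the report's actual data model):

```python
from collections import Counter


def tally_runs(runs):
    """Return (counts per conclusion, workflows that ended in startup_failure)."""
    counts = Counter(run["conclusion"] for run in runs)
    startup_failures = [
        run["workflow"] for run in runs if run["conclusion"] == "startup_failure"
    ]
    return counts, startup_failures


# Illustrative sample only: the workflow names appear in this report, but the
# record shape and the pairing of names with conclusions are assumptions.
sample_runs = [
    {"workflow": "Architecture Diagram Generator", "conclusion": "startup_failure"},
    {"workflow": "Agent Performance Analyzer", "conclusion": "startup_failure"},
    {"workflow": "Daily Secrets Analysis Agent", "conclusion": "startup_failure"},
    {"workflow": "Discussion Task Miner", "conclusion": "failure"},
    {"workflow": "Plan Command", "conclusion": "skipped"},
]

counts, startup_failures = tally_runs(sample_runs)
print(dict(counts))
print(startup_failures)
```

Listing the startup failures separately is useful because those runs execute 0 jobs and therefore leave no telemetry to inspect.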
Key Alerts and Anomalies
🔴 Critical Issues:
- Discussion Task Miner: the "Start MCP Gateway" step was skipped, resulting in no MCP session telemetry. Firewall logs were still collected. Root cause: upstream failure in agent pre-boot configuration steps.
- 3 runs ended with startup_failure conclusions and 0 jobs executed, meaning no observability data is available at all for these runs: Architecture Diagram Generator, Agent Performance Analyzer, and Daily Secrets Analysis Agent.
ℹ️ Informational:
- Maintenance runs (close-expired-entities) are non-agentic housekeeping jobs; no firewall or MCP applies.
Coverage Summary
- AWF Firewall (access.log / "Print firewall logs"): 6 of 6 agent-executed runs (100%)
- MCP Gateway (gateway.jsonl / rpc-messages.jsonl): 5 of 5 runs where the gateway was started (100%)
Note: Startup failure runs (0 jobs executed) are excluded from coverage percentages as no observability infrastructure could be initialized.
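The exclusion rule in the note can be made concrete. The sketch below assumes a hypothetical record per run with a `jobs_executed` count and one field per signal, where `None` means the signal does not apply to that run (for example, the gateway was never started). None of these field names come from the report's tooling.

```python
def coverage_percent(runs, signal):
    """Percent of eligible runs that produced `signal`, or None if none are eligible.

    A run is eligible when it executed at least one job and the signal
    applies to it (value is True or False rather than None).
    """
    eligible = [
        run for run in runs
        if run["jobs_executed"] > 0 and run.get(signal) is not None
    ]
    if not eligible:
        return None
    covered = sum(1 for run in eligible if run[signal])
    return 100.0 * covered / len(eligible)


# Illustrative data shaped after the counts in this report; jobs_executed
# values for the executed runs are made up.
runs = (
    [{"jobs_executed": 3, "firewall": True, "mcp_gateway": True}] * 5
    # Agent job failed before the gateway started: MCP does not apply.
    + [{"jobs_executed": 3, "firewall": True, "mcp_gateway": None}]
    # Startup failures: 0 jobs, excluded from every percentage.
    + [{"jobs_executed": 0, "firewall": False, "mcp_gateway": False}] * 3
)

print(coverage_percent(runs, "firewall"))
print(coverage_percent(runs, "mcp_gateway"))
```

Returning `None` for an empty denominator avoids reporting a misleading 0% or 100% when no run was eligible.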
📋 Detailed Run Analysis
Firewall-Enabled Agentic Runs
Startup Failure Runs (No Jobs Executed)
Skipped / N/A Runs (15 total)
Event-triggered workflows (Plan Command, Daily Test Improver, Documentation Unbloat, Grumpy Code Reviewer, Security Review Agent) with 3 separate trigger batches (runs 22656647878–22656691144) were skipped because they did not match their activation criteria. These are expected and not observability gaps.
Artifacts Per Healthy Run
All 5 fully-successful agentic runs uploaded the following standard observability artifacts:
- prompt: workflow prompt (expires 1 day)
- agent-artifacts: contains mcp-logs/, sandbox/firewall/logs/ (expires 90 days)
- agent_outputs: agent output files (expires 90 days)
- agent-output: structured agent output JSON (expires 90 days)
- safe-output: safe output manifest (expires 90 days)
- threat-detection.log: threat detection scan result (expires 90 days)
- safe-output-items: safe output items manifest (expires 90 days)
🔍 Telemetry Quality Analysis
Firewall Log Quality
All 6 runs that executed the agent job had the "Print firewall logs" step complete successfully. Key indicators from the current run's environment:
- AWF version: v0.23.0
- Proxy (squid) on 172.30.0.10:3128
- AWMG version: v0.1.5
- Domains: api.github.com, api.githubcopilot.com, raw.githubusercontent.com, registry.npmjs.org
- Firewall log: /tmp/gh-aw/sandbox/firewall/logs/access.log
- API proxy log: /tmp/gh-aw/sandbox/firewall/api-proxy-logs/api-proxy.log
Direct reading of access.log is restricted (permission denied from agent container), but the "Print firewall logs" step in the agent job has host-level access and reports success for all 6 runs.
MCP Gateway Log Quality
All 5 runs that started the MCP Gateway had "Parse MCP Gateway logs for step summary" succeed:
- MCP servers: safeoutputs (Safe Outputs MCP HTTP Server), GitHub MCP Server (lockdown-mode evaluated per run)
- Logs in mcp-logs/ directories
- AWMG version: v0.1.5
For Discussion Task Miner (failure), the "Parse MCP Gateway logs for step summary" step ran and succeeded, but the gateway was never started, so the parse step likely found empty or no gateway logs.
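The Discussion Task Miner case shows that a parse step succeeding is not the same as telemetry being present. A minimal sketch of a parser that keeps the two apart, assuming only that the gateway log is a JSON Lines file (one JSON value per line); the function name and the returned status values are hypothetical, and no particular record schema is assumed:

```python
import json
from pathlib import Path


def summarize_gateway_log(path):
    """Classify a JSON Lines log as 'missing', 'empty', or 'ok'."""
    log = Path(path)
    if not log.exists():
        return {"status": "missing", "entries": 0, "malformed": 0}

    entries = 0
    malformed = 0
    with log.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # blank lines carry no telemetry
            try:
                json.loads(line)
                entries += 1
            except json.JSONDecodeError:
                malformed += 1

    status = "ok" if entries else "empty"
    return {"status": status, "entries": entries, "malformed": malformed}
```

A report step could then treat "missing" or "empty" as a telemetry gap even when parsing itself exits cleanly.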
Threat Detection Coverage
All 6 executed runs also ran the threat detection scan (secondary Copilot CLI invocation):
- threat-detection.log artifacts
- {"prompt_injection":false,"secret_leak":false,"malicious_patch":false}
Healthy Runs Summary
5 of 6 executed runs achieved full observability: AWF Firewall ✅ + MCP Gateway ✅ + threat detection ✅ + all artifacts uploaded ✅.
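The "full observability" criterion above combines four checks. A minimal sketch of that check follows. The artifact names and the threat-scan JSON shape are taken from this report, while the run record fields (`firewall_logs`, `mcp_gateway_logs`, `threat_scan_ran`, `artifacts`) are hypothetical.

```python
import json

# Artifact names as listed under "Artifacts Per Healthy Run".
EXPECTED_ARTIFACTS = {
    "prompt",
    "agent-artifacts",
    "agent_outputs",
    "agent-output",
    "safe-output",
    "threat-detection.log",
    "safe-output-items",
}


def threat_scan_clean(raw_verdict):
    """True when every flag in the threat-detection verdict is false."""
    verdict = json.loads(raw_verdict)
    return not any(verdict.values())


def full_observability(run):
    """Return (healthy, missing_artifacts) for one run record."""
    missing = sorted(EXPECTED_ARTIFACTS - set(run["artifacts"]))
    healthy = bool(
        run["firewall_logs"]
        and run["mcp_gateway_logs"]
        and run["threat_scan_ran"]
        and not missing
    )
    return healthy, missing


healthy_run = {
    "firewall_logs": True,
    "mcp_gateway_logs": True,
    "threat_scan_ran": True,
    "artifacts": list(EXPECTED_ARTIFACTS),
}
print(full_observability(healthy_run))
```

Returning the missing artifact names alongside the boolean makes a failed check actionable without a second pass over the run.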
Recommended Actions
- Investigate the 3 startup_failure runs: Architecture Diagram Generator, Agent Performance Analyzer, and Daily Secrets Analysis Agent all failed before any job executed (0 total jobs). This is likely a runner provisioning or workflow configuration issue. Review the GitHub Actions runner logs for these runs directly in the GitHub UI. If these workflows are high-priority, check for quota limits, runner availability, or recent changes to .github/workflows/*.lock.yml files.
- Investigate Discussion Task Miner failure: The agent job failed before the MCP Gateway could start, causing skipped execution of core agent steps. Review the full agent job log for run §22662039389 to identify the root cause. Once fixed, MCP telemetry will be restored for this workflow.
- Maintain current artifact retention policy: The current 90-day retention for agent-artifacts (which contains firewall and MCP logs) is appropriate for debugging. Consider adding a dedicated firewall-logs artifact with extended retention if post-incident forensics regularly require logs older than 90 days.
- Consider alerting on startup_failure: Add a monitoring rule or discussion/issue auto-creation for startup_failure conclusions to ensure these silent failures don't go unnoticed in future reports.
📊 Historical Context
This report covers only runs from 2026-03-04 (the single active day in the 7-day window). All 30 runs occurred on the same date, suggesting scheduled workflows fire on a consistent daily cadence. The 85 registered workflows represent a mature agentic workflow ecosystem. Historical trend data will be available once this report runs on multiple consecutive days.
The AWF framework version v0.23.0 and AWMG version v0.1.5 are consistently deployed across all runs, indicating a stable infrastructure baseline.
Analysis window: Last 7 days | Total runs analyzed: 30 | Agent-executed runs: 6 | Date: 2026-03-04