Merged
112 commits
db00026
chore: use alerts for integration
moolen Jan 25, 2026
2e8fa2d
docs: start milestone v1.5 Observatory
moolen Jan 29, 2026
b417a00
docs: define milestone v1.5 requirements
moolen Jan 29, 2026
184cf41
docs: create milestone v1.5 roadmap (3 phases)
moolen Jan 29, 2026
0420177
docs(24): capture phase context
moolen Jan 29, 2026
4d5521c
docs(24): research phase domain
moolen Jan 29, 2026
d82468a
docs(24): create phase plan
moolen Jan 29, 2026
49aa933
feat(24-01): create SignalAnchor types and schema
moolen Jan 29, 2026
bcee61e
feat(24-01): implement layered signal classifier
moolen Jan 29, 2026
120a084
feat(24-01): implement dashboard quality scorer
moolen Jan 29, 2026
e40c8c6
docs(24-01): complete signal types and classification plan
moolen Jan 29, 2026
1babed5
feat(24-02): implement signal extractor with multi-role support
moolen Jan 29, 2026
48eee9c
feat(24-02): implement K8s workload linker with label priority
moolen Jan 29, 2026
01b06f3
docs(24-02): complete signal extraction and workload linkage plan
moolen Jan 29, 2026
53152be
feat(24-03): add BuildSignalGraph with MERGE upsert
moolen Jan 29, 2026
210c4fb
feat(24-03): hook signal extraction into DashboardSyncer
moolen Jan 29, 2026
313d855
docs(24-03): complete signal graph integration plan
moolen Jan 29, 2026
836e0e2
test(24-04): add signal ingestion end-to-end integration test
moolen Jan 29, 2026
03cfb48
docs(24-04): complete signal ingestion integration test plan
moolen Jan 29, 2026
9513a0b
docs(24): complete Data Model & Ingestion phase
moolen Jan 29, 2026
fbdbccf
docs(25): capture phase context
moolen Jan 29, 2026
a0e62b6
docs(25): research baseline & anomaly detection phase
moolen Jan 29, 2026
e88c1ff
docs(25): create phase plan for Baseline & Anomaly Detection
moolen Jan 29, 2026
10e2d93
feat(25-01): add SignalBaseline type and RollingStats computation
moolen Jan 29, 2026
d58fde6
test(25-01): add unit tests for rolling statistics computation
moolen Jan 29, 2026
0948894
test(25-02): add failing tests for anomaly scoring
moolen Jan 29, 2026
0917225
feat(25-02): implement hybrid anomaly scoring
moolen Jan 29, 2026
f4e7b15
docs(25-01): complete SignalBaseline type plan
moolen Jan 29, 2026
f6b52df
docs(25-02): complete hybrid anomaly scorer plan
moolen Jan 29, 2026
072d715
feat(25-03): implement SignalBaseline graph storage
moolen Jan 29, 2026
b3edd5d
feat(25-03): implement BaselineCollector periodic syncer
moolen Jan 29, 2026
845526f
feat(25-04): implement BackfillService for historical baseline
moolen Jan 29, 2026
1b89ebc
docs(25-03): complete graph storage & forward collection plan
moolen Jan 29, 2026
8a32b2e
feat(25-04): implement hierarchical anomaly aggregation
moolen Jan 29, 2026
ffbaec8
docs(25-04): complete historical backfill & anomaly aggregation plan
moolen Jan 29, 2026
20d082f
feat(25-05): wire BaselineCollector into Grafana integration lifecycle
moolen Jan 29, 2026
0d18570
test(25-05): add end-to-end baseline integration tests
moolen Jan 29, 2026
25a0251
docs(25-05): complete integration test & lifecycle plan
moolen Jan 29, 2026
dfefb1f
docs(25): complete Baseline & Anomaly Detection phase
moolen Jan 29, 2026
66e3585
docs(26): capture phase context
moolen Jan 29, 2026
a4d6617
docs(26): research phase domain
moolen Jan 29, 2026
fcba270
docs(26): create phase plan
moolen Jan 29, 2026
ec9f12a
fix(26): revise plans based on checker feedback
moolen Jan 30, 2026
1cf5790
feat(26-02): implement ObservatoryInvestigateService
moolen Jan 30, 2026
067d50c
feat(26-03): implement ObservatoryEvidenceService
moolen Jan 30, 2026
6c220d1
feat(26-01): implement ObservatoryService core
moolen Jan 30, 2026
fe92661
test(26-02): add unit tests for ObservatoryInvestigateService
moolen Jan 30, 2026
4ff41ee
test(26-03): add unit tests for ObservatoryEvidenceService
moolen Jan 30, 2026
785f819
docs(26-02): complete ObservatoryInvestigateService plan
moolen Jan 30, 2026
6c0d531
docs(26-03): complete ObservatoryEvidenceService plan
moolen Jan 30, 2026
a2c7f5a
test(26-01): add unit tests for ObservatoryService
moolen Jan 30, 2026
f924b6c
docs(26-01): complete ObservatoryService core plan
moolen Jan 30, 2026
b16248a
feat(26-07): implement observatory_explain tool
moolen Jan 30, 2026
505dedc
feat(26-04): implement observatory_status tool
moolen Jan 30, 2026
973d34f
feat(26-05): implement ObservatoryScopeTool for Narrow stage
moolen Jan 30, 2026
0923435
feat(26-07): implement observatory_evidence tool
moolen Jan 30, 2026
de5f3a1
feat(26-04): implement observatory_changes tool
moolen Jan 30, 2026
f2f5b12
feat(26-05): implement ObservatorySignalsTool for workload signals
moolen Jan 30, 2026
3d994ab
test(26-05): add unit tests for Narrow stage tools
moolen Jan 30, 2026
1b0b3c7
feat(26-06): implement ObservatorySignalDetailTool
moolen Jan 30, 2026
184e6d4
test(26-04): add unit tests for Orient stage tools
moolen Jan 30, 2026
751ed56
feat(26-06): implement ObservatoryCompareTool
moolen Jan 30, 2026
31040d6
test(26-06): add unit tests for Investigate stage tools
moolen Jan 30, 2026
0f63ed0
test(26-07): add tests for observatory_explain and observatory_eviden…
moolen Jan 30, 2026
7a801a9
docs(26-04): complete Orient stage tools plan
moolen Jan 30, 2026
cf9e303
docs(26-07): complete Hypothesize and Verify stage tools plan
moolen Jan 30, 2026
3077052
docs(26-05): complete Narrow stage MCP tools plan
moolen Jan 30, 2026
43e064d
docs(26-06): complete Investigate stage tools plan
moolen Jan 30, 2026
e4e0524
feat(26-08): create RegisterObservatoryTools function
moolen Jan 30, 2026
8ba7e72
feat(26-08): wire observatory services into Grafana integration lifec…
moolen Jan 30, 2026
6eacbc5
test(26-08): create observatory integration tests
moolen Jan 30, 2026
5d3f2e8
docs(26-08): complete Tool Registration & Lifecycle plan
moolen Jan 30, 2026
0673412
docs(26): complete Observatory API & MCP Tools phase
moolen Jan 30, 2026
49df430
chore: complete v1.5 Observatory milestone
moolen Jan 30, 2026
8c53f74
fix(grafana): improve signal metric classification accuracy
moolen Jan 30, 2026
5995f29
refactor(observatory): extract multi-provider architecture
moolen Jan 30, 2026
501b011
refactor(observatory): consolidate types and implement GetCurrentValue
moolen Jan 30, 2026
a5d4ac1
feat(observatory): embed curated metrics for signal classification
moolen Jan 31, 2026
1057243
feat(grafana): add curated metrics sync for automatic SignalAnchor cr…
moolen Jan 31, 2026
29f2f28
feat(grafana): add scrape target linking for SignalAnchor to workload…
moolen Jan 31, 2026
54f2f52
feat(grafana): add signal validation job for alert-signal correlation
moolen Jan 31, 2026
59db1ef
fix: add integration test
moolen Jan 31, 2026
9090669
fix: API trailing slash handling and FalkorDB persistence improvements
moolen Jan 31, 2026
2fccc0b
fix: SignalValidationJob interface compatibility and Prometheus test
moolen Jan 31, 2026
98599fd
feat(observatory): add node type filter dropdown
moolen Feb 1, 2026
6be2161
fix(grafana): FalkorDB query compatibility for boolean and IN clauses
moolen Feb 1, 2026
a777609
fix(chart): FalkorDB graceful shutdown and persistence improvements
moolen Feb 1, 2026
7f2a879
feat(observatory): add Observatory page for signal visualization
moolen Feb 1, 2026
1132cb8
feat(grafana): link universal container metrics to all workloads
moolen Feb 1, 2026
bbe694e
feat(observatory): improve graph UX and fix workload relationship limits
moolen Feb 1, 2026
cd9ef0e
fix(grafana): prevent duplicate SignalAnchors with composite uid MERGE
moolen Feb 1, 2026
8b22903
feat(graph): optimize sync pipeline with state cache, label index, an…
moolen Feb 1, 2026
57c45ed
fix(ui): add loading and fallback states to Observatory page
moolen Feb 1, 2026
f2ca6a1
fix(ui): initialize loading state to true in useObservatoryGraph hook
moolen Feb 1, 2026
7936f01
fix(observatory): skip edges for SignalAnchors not in limited result set
moolen Feb 1, 2026
7ef8a93
feat(ui): add namespace and workload dropdown filters to Observatory
moolen Feb 1, 2026
7065db5
fix(observatory): filter SignalAnchors by connected workload namespace
moolen Feb 1, 2026
5b59136
fix(ui): prevent graph resize when sidebar expands on hover
moolen Feb 1, 2026
1f54f72
fix(ui): fix Observatory zoom behavior after fit-to-view
moolen Feb 1, 2026
c8fd3bb
feat(ui): add Observatory default node types setting and auto-fit
moolen Feb 1, 2026
2c6d254
fix(grafana): resolve datasource template variables for baseline coll…
moolen Feb 1, 2026
38945b0
fix(graph): use inline Cypher literals for batch queries to fix Falko…
moolen Feb 8, 2026
5d411d6
fix(graph): include structural edges in Phase 2 batch processing
moolen Feb 8, 2026
648dec8
feat(ui): feature gate Observatory and Integrations behind ?beta=true
moolen Feb 8, 2026
e85fb8f
fix(docs): remove Pricing nav link from documentation page
moolen Feb 8, 2026
8cd8231
fix(docs): remove "Free" from Get Started button text
moolen Feb 9, 2026
acf45be
fix(docs): link nav items and rename Integration to Incident Response
moolen Feb 9, 2026
9353e81
feat: hide inactive replicas
moolen Feb 9, 2026
ca08c81
fix(tests): align sidebar test with two-layer layout implementation
moolen Feb 9, 2026
4fdcefe
fix(tests): pin FalkorDB to v4.2.0 to fix integration test crashes
moolen Feb 9, 2026
d77e39d
fix(graph): handle NULL r.deleted on placeholder ResourceIdentity nodes
moolen Feb 9, 2026
1cfc2f1
chore: fix import
moolen Feb 9, 2026
28 changes: 28 additions & 0 deletions .planning/MILESTONES.md
@@ -1,5 +1,33 @@
# Project Milestones: Spectre MCP Plugin System

## v1.5 Observatory (Shipped: 2026-01-30)

**Delivered:** Signal intelligence layer that extracts "what matters" from dashboards—role classification, quality scoring, rolling baselines, anomaly detection, and 8 MCP tools for AI-driven incident investigation through progressive disclosure (Orient → Narrow → Investigate → Hypothesize → Verify).

**Phases completed:** 24-26 (17 plans total)

**Key accomplishments:**

- Signal anchors with 7-role taxonomy (Availability, Latency, Errors, Traffic, Saturation, Churn, Novelty) and 5-layer confidence classification (0.95 → 0)
- Dashboard quality scoring (freshness, alerting, ownership, completeness) with alert boost incentive
- Rolling baseline statistics using gonum/stat (median, P50/P90/P99, stddev) with Welford's online algorithm
- Hybrid anomaly detection (z-score + percentile) with sigmoid normalization, alert override, hierarchical MAX aggregation
- 8 Observatory MCP tools: status, changes, scope, signals, signal_detail, compare, explain, evidence
- K8s graph integration for root cause analysis with 2-hop upstream dependency traversal
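
A minimal sketch of how the hybrid score described above could be combined, offered as a reading aid rather than the code in this PR. The `Baseline` fields, the exact sigmoid form, and the linear P50-to-P99 percentile mapping are all assumptions; only the overall shape (z-score squashed to 0-1, percentile component, MAX of the two, alert override to 1.0) comes from the text.

```go
package main

import (
	"fmt"
	"math"
)

// Baseline holds rolling statistics for one signal.
// Field names are hypothetical, not taken from the PR.
type Baseline struct {
	Mean, StdDev float64
	P50, P99     float64
}

// zComponent squashes |z| into [0, 1): 0 at the mean, approaching 1 for
// large deviations. The rescaled sigmoid used here is an assumption.
func zComponent(value float64, b Baseline) float64 {
	if b.StdDev == 0 {
		return 0 // no spread recorded, so a z-score is undefined
	}
	z := math.Abs(value-b.Mean) / b.StdDev
	return 2/(1+math.Exp(-z)) - 1
}

// percentileComponent places value linearly between P50 and P99,
// clamped to [0, 1]. The linear mapping is an assumption.
func percentileComponent(value float64, b Baseline) float64 {
	if b.P99 <= b.P50 || value <= b.P50 {
		return 0
	}
	return math.Min(1, (value-b.P50)/(b.P99-b.P50))
}

// HybridScore takes the MAX of both components, so either one can flag an
// anomaly. A firing alert overrides the statistics and forces 1.0.
func HybridScore(value float64, b Baseline, alertFiring bool) float64 {
	if alertFiring {
		return 1.0
	}
	return math.Max(zComponent(value, b), percentileComponent(value, b))
}

func main() {
	b := Baseline{Mean: 10, StdDev: 2, P50: 10, P99: 20}
	fmt.Printf("at baseline:  %.3f\n", HybridScore(10, b, false))
	fmt.Printf("3 sigma away: %.3f\n", HybridScore(16, b, false))
	fmt.Printf("alert firing: %.3f\n", HybridScore(10, b, true))
}
```

Taking the MAX rather than an average means a value that is unremarkable by one measure can still be flagged by the other, which matches the "either z-score or percentile can flag anomaly" rationale recorded in the decisions table.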

**Stats:**

- 95 files changed, ~26.7k lines added
- 3 phases, 17 plans, 61 requirements
- 1 day from start to ship (2026-01-29 → 2026-01-30)
- Total: 14 Grafana MCP tools (3 metrics + 3 alerts + 8 observatory)

**Git range:** `0420177` → `0673412`

**What's next:** Cross-signal correlation (alert↔log, alert↔metric anomaly), advanced classification (ML-based), or additional integrations (Datadog, PagerDuty)

---

## v1.4 Grafana Alerts Integration (Shipped: 2026-01-23)

**Delivered:** Alert rule ingestion from Grafana with state tracking, historical analysis, and progressive disclosure MCP tools—overview with flappiness indicators, aggregated with 1h state timelines, details with full 7-day history.
152 changes: 91 additions & 61 deletions .planning/PROJECT.md
@@ -2,27 +2,39 @@

## What This Is

A Kubernetes observability platform with an MCP server for AI assistants. Provides timeline-based event exploration, graph-based reasoning (FalkorDB), and pluggable integrations (VictoriaLogs, Logz.io, Grafana). AI assistants can explore logs progressively and use Grafana dashboards as structured operational knowledge for metrics reasoning.
A Kubernetes observability platform with an MCP server for AI assistants. Provides timeline-based event exploration, graph-based reasoning (FalkorDB), and pluggable integrations (VictoriaLogs, Logz.io, Grafana). AI assistants can explore logs progressively, use Grafana dashboards as structured operational knowledge, and investigate incidents systematically through signal intelligence.

## Core Value

Enable AI assistants to understand what's happening in Kubernetes clusters through a unified MCP interface—timeline queries, graph traversal, log exploration, and metrics analysis in one server.
Enable AI assistants to understand what's happening in Kubernetes clusters through a unified MCP interface—timeline queries, graph traversal, log exploration, metrics analysis, and incident investigation in one server.

## Current State: v1.4 Shipped
## Current State: v1.5 Shipped

**No active milestone.** All planned features through v1.4 have been shipped.

**Cumulative stats:** 23 phases, 66 plans, 146 requirements, ~137k LOC (Go + TypeScript)
**Cumulative stats:** 26 phases, 83 plans, 207 requirements, ~164k LOC (Go + TypeScript)

**Available capabilities:**
- Timeline-based Kubernetes event exploration with FalkorDB graph
- Log exploration via VictoriaLogs and Logz.io with progressive disclosure
- Grafana metrics integration with dashboard sync, anomaly detection, and 3 MCP tools
- Grafana alerts integration with state tracking, flappiness analysis, and 3 MCP tools
- Observatory signal intelligence with 8 MCP tools for incident investigation

## Previous State: v1.5 Observatory (Shipped 2026-01-30)

**Shipped 2026-01-30:**
- Signal anchors with 7-role taxonomy (Availability, Latency, Errors, Traffic, Saturation, Churn, Novelty)
- 5-layer classification with confidence decay (0.95 → 0.85-0.9 → 0.7-0.8 → 0.5 → 0)
- Dashboard quality scoring (freshness, alerting, ownership, completeness) with alert boost
- Rolling baseline statistics using gonum/stat (median, P50/P90/P99, stddev)
- Hybrid anomaly detection (z-score + percentile) with sigmoid normalization, alert override
- Hierarchical MAX aggregation (signals → workloads → namespaces → clusters)
- 8 Observatory MCP tools: status, changes, scope, signals, signal_detail, compare, explain, evidence
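
The hierarchical MAX aggregation listed above (signals → workloads → namespaces → clusters) can be sketched as follows. This is an illustration under stated assumptions: the `SignalScore` and `Rollup` types and the slash-joined map keys are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"math"
)

// SignalScore is one signal's anomaly score with its location in the
// hierarchy. The type and its fields are hypothetical.
type SignalScore struct {
	Cluster, Namespace, Workload, Signal string
	Score                                float64 // anomaly score in [0, 1]
}

// Rollup holds the worst (maximum) score seen at each level.
type Rollup struct {
	Workloads, Namespaces, Clusters map[string]float64
}

// Aggregate bubbles the worst signal score up through every level.
// Because MAX is associative, each level can be updated directly from the
// signal without first finishing the level below it.
func Aggregate(scores []SignalScore) Rollup {
	r := Rollup{
		Workloads:  map[string]float64{},
		Namespaces: map[string]float64{},
		Clusters:   map[string]float64{},
	}
	for _, s := range scores {
		ns := s.Cluster + "/" + s.Namespace
		wl := ns + "/" + s.Workload
		r.Workloads[wl] = math.Max(r.Workloads[wl], s.Score)
		r.Namespaces[ns] = math.Max(r.Namespaces[ns], s.Score)
		r.Clusters[s.Cluster] = math.Max(r.Clusters[s.Cluster], s.Score)
	}
	return r
}

func main() {
	r := Aggregate([]SignalScore{
		{"prod", "shop", "cart", "latency_p99", 0.2},
		{"prod", "shop", "cart", "error_rate", 0.9},
		{"prod", "shop", "search", "cpu_saturation", 0.4},
		{"prod", "infra", "dns", "availability", 0.1},
	})
	fmt.Println(r.Workloads["prod/shop/cart"], r.Namespaces["prod/shop"], r.Clusters["prod"])
}
```

With MAX, a single badly behaving signal stays visible at the cluster level instead of being diluted by many healthy neighbours, which is the property the "worst signal bubbles up" decision relies on.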

## Previous State (v1.4 Shipped)
**Total MCP tools:** 14 Grafana tools (3 metrics + 3 alerts + 8 observatory)

<details>
<summary>v1.4 Grafana Alerts Integration (Shipped 2026-01-23)</summary>

**Shipped 2026-01-23:**
- Alert rule sync via Grafana Alerting API (incremental, version-based)
- Alert nodes in FalkorDB linked to Metrics/Services via PromQL extraction
- STATE_TRANSITION self-edges for 7-day timeline with TTL-based retention
@@ -33,11 +45,13 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- `grafana_{name}_alerts_aggregated` — specific alerts with 1h state timelines [F F N N]
- `grafana_{name}_alerts_details` — full 7-day state history with rule definition

**Cumulative stats:** 23 phases, 66 plans, 146 requirements, ~137k LOC (Go + TypeScript)
**Stats:** 4 phases, 10 plans, 22 requirements

## Previous State (v1.3 Shipped)
</details>

<details>
<summary>v1.3 Grafana Metrics Integration (Shipped 2026-01-23)</summary>

**Shipped 2026-01-23:**
- Grafana dashboard ingestion via API (both Cloud and self-hosted)
- Full semantic graph storage in FalkorDB (dashboards→panels→queries→metrics→services)
- Dashboard hierarchy (overview/drill-down/detail) via Grafana tags + config fallback
@@ -47,41 +61,47 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- Three MCP tools: metrics_overview, metrics_aggregated, metrics_details
- UI configuration form for Grafana connection (URL, API token, hierarchy mapping)

**Cumulative stats:** 19 phases, 56 plans, 124 requirements, ~132k LOC (Go + TypeScript)
**Stats:** 5 phases, 17 plans, 51 requirements

</details>

## Previous State (v1.2 Shipped)
<details>
<summary>v1.2 Logz.io Integration + Secret Management (Shipped 2026-01-22)</summary>

**Shipped 2026-01-22:**
- Logz.io as second log backend with 3 MCP tools (overview, logs, patterns)
- SecretWatcher with SharedInformerFactory for Kubernetes-native secret hot-reload
- Multi-region API support (US, EU, UK, AU, CA) with X-API-TOKEN authentication
- UI configuration form with region selector and SecretRef fields
- Helm chart documentation for Secret mounting with rotation workflow

**Cumulative stats:** 14 phases, 39 plans, 73 requirements, ~125k LOC (Go + TypeScript)
**Stats:** 5 phases, 8 plans, 21 requirements

## Previous State (v1.1 Shipped)
</details>

<details>
<summary>v1.1 Server Consolidation (Shipped 2026-01-21)</summary>

**Shipped 2026-01-21:**
- Single-port deployment with REST API, UI, and MCP on port 8080 (/v1/mcp endpoint)
- Service layer extracted: TimelineService, GraphService, MetadataService, SearchService
- MCP tools call services directly in-process (no HTTP self-calls)
- 14,676 lines of dead code removed (standalone commands and internal/agent package)
- Helm chart simplified for single-container deployment
- E2E tests validated for consolidated architecture

**Cumulative stats:** 9 phases, 31 plans, 52 requirements, ~121k LOC (Go + TypeScript)
**Stats:** 4 phases, 12 plans, 21 requirements

</details>

<details>
<summary>v1 Shipped Features (2026-01-21)</summary>
<summary>v1.0 MCP Plugin System + VictoriaLogs (Shipped 2026-01-21)</summary>

- Plugin infrastructure with factory registry, config hot-reload, lifecycle management
- REST API + React UI for integration configuration
- VictoriaLogs integration with LogsQL client and backpressure pipeline
- Log template mining using Drain algorithm with namespace-scoped storage
- Three progressive disclosure MCP tools: overview, patterns, logs

**Stats:** 5 phases, 19 plans, 31 requirements, ~17,850 LOC
**Stats:** 5 phases, 19 plans, 31 requirements

</details>

@@ -114,30 +134,31 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- ✓ Multi-region API endpoint support (US, EU, UK, AU, CA) — v1.2
- ✓ UI for Logz.io configuration (region selector, SecretRef fields) — v1.2
- ✓ Helm chart updates for secret mounting (extraVolumes example) — v1.2

### v1.3 (Shipped)

- ✓ Grafana API client for dashboard ingestion (both Cloud and self-hosted)
- ✓ FalkorDB graph schema for dashboards, panels, queries, metrics, services
- ✓ Dashboard hierarchy support (overview/drill-down/detail levels)
- ✓ PromQL parser for metric extraction (best-effort)
- ✓ Variable classification (scoping vs entity vs detail)
- ✓ Service inference from metric labels
- ✓ Anomaly detection with 7-day historical baseline
- ✓ MCP tool: metrics_overview (overview dashboards, ranked anomalies)
- ✓ MCP tool: metrics_aggregated (service/cluster focus, correlations)
- ✓ MCP tool: metrics_details (full dashboard, deep expansion)
- ✓ UI form for Grafana configuration (URL, API token, hierarchy mapping)

### v1.4 (Shipped)

- ✓ Alert rule sync via Grafana Alerting API (incremental, version-based)
- ✓ Alert nodes in FalkorDB linked to existing Metrics/Services via PromQL extraction
- ✓ Alert state timeline storage (STATE_TRANSITION edges with 7-day TTL)
- ✓ Flappiness detection with exponential scaling and historical baseline
- ✓ MCP tool: alerts_overview (firing/pending counts by severity with flappiness indicators)
- ✓ MCP tool: alerts_aggregated (specific alerts with 1h state timelines [F F N N])
- ✓ MCP tool: alerts_details (full 7-day state history with rule definition)
- ✓ Grafana API client for dashboard ingestion (both Cloud and self-hosted) — v1.3
- ✓ FalkorDB graph schema for dashboards, panels, queries, metrics, services — v1.3
- ✓ Dashboard hierarchy support (overview/drill-down/detail levels) — v1.3
- ✓ PromQL parser for metric extraction (best-effort) — v1.3
- ✓ Variable classification (scoping vs entity vs detail) — v1.3
- ✓ Service inference from metric labels — v1.3
- ✓ Anomaly detection with 7-day historical baseline — v1.3
- ✓ MCP tool: metrics_overview (overview dashboards, ranked anomalies) — v1.3
- ✓ MCP tool: metrics_aggregated (service/cluster focus, correlations) — v1.3
- ✓ MCP tool: metrics_details (full dashboard, deep expansion) — v1.3
- ✓ UI form for Grafana configuration (URL, API token, hierarchy mapping) — v1.3
- ✓ Alert rule sync via Grafana Alerting API (incremental, version-based) — v1.4
- ✓ Alert nodes in FalkorDB linked to existing Metrics/Services via PromQL extraction — v1.4
- ✓ Alert state timeline storage (STATE_TRANSITION edges with 7-day TTL) — v1.4
- ✓ Flappiness detection with exponential scaling and historical baseline — v1.4
- ✓ MCP tool: alerts_overview (firing/pending counts by severity with flappiness indicators) — v1.4
- ✓ MCP tool: alerts_aggregated (specific alerts with 1h state timelines) — v1.4
- ✓ MCP tool: alerts_details (full 7-day state history with rule definition) — v1.4
- ✓ Signal anchors linking metrics to roles to workloads — v1.5
- ✓ 7-role classification taxonomy (Availability, Latency, Errors, Traffic, Saturation, Churn, Novelty) — v1.5
- ✓ Dashboard quality scoring (freshness, alerting, ownership, completeness) — v1.5
- ✓ Rolling baseline statistics per signal (median, P50/P90/P99, stddev) — v1.5
- ✓ Hybrid anomaly detection (z-score + percentile) with alert override — v1.5
- ✓ Hierarchical anomaly aggregation (signals → workloads → namespaces → clusters) — v1.5
- ✓ 8 Observatory MCP tools for progressive disclosure incident investigation — v1.5
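
For readers unfamiliar with how a rolling mean and standard deviation can be maintained per signal without storing every sample, here is a standard-library-only sketch of Welford's online algorithm, which MILESTONES.md names alongside gonum/stat. The `welfordStats` type is hypothetical and is not the PR's `RollingStats` implementation; percentiles are out of scope for this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// welfordStats keeps a running mean and variance in O(1) memory.
// The type name and methods are hypothetical, for illustration only.
type welfordStats struct {
	n    int
	mean float64
	m2   float64 // sum of squared deviations from the current mean
}

// Add folds one observation into the running statistics.
func (w *welfordStats) Add(x float64) {
	w.n++
	delta := x - w.mean
	w.mean += delta / float64(w.n)
	w.m2 += delta * (x - w.mean) // uses the updated mean, per Welford
}

// Mean returns the running mean (0 when no samples were added).
func (w *welfordStats) Mean() float64 { return w.mean }

// Variance returns the sample variance, or 0 with fewer than two samples.
func (w *welfordStats) Variance() float64 {
	if w.n < 2 {
		return 0
	}
	return w.m2 / float64(w.n-1)
}

// StdDev returns the sample standard deviation.
func (w *welfordStats) StdDev() float64 { return math.Sqrt(w.Variance()) }

func main() {
	var w welfordStats
	for _, x := range []float64{2, 4, 4, 4, 5, 5, 7, 9} {
		w.Add(x)
	}
	fmt.Printf("mean=%.3f stddev=%.3f\n", w.Mean(), w.StdDev())
}
```

The single-pass update avoids the cancellation error of the naive sum-of-squares formula, which matters when a metric has a large mean and a small spread.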

### Out of Scope

@@ -148,6 +169,8 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- Standalone MCP server command — consolidated architecture is the deployment model
- Metric value storage — query Grafana on-demand instead of storing time-series locally
- Direct Prometheus/Mimir queries — use Grafana API as proxy for simpler auth
- ML-based role classification — keyword heuristics sufficient, ML deferred to v2
- Real-time streaming anomaly detection — polling-based for v1.5

## Context

@@ -158,29 +181,23 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- MCP tools at `internal/mcp/tools/` use services directly (no HTTP)
- Plugin system at `internal/integration/` with factory registry and lifecycle manager
- VictoriaLogs client at `internal/integration/victorialogs/`
- Grafana integration at `internal/integration/grafana/` with dashboard, metrics, alerts, and observatory
- Log processing at `internal/logprocessing/` (Drain algorithm, template storage)
- Config management at `internal/config/` with hot-reload via fsnotify
- REST API handlers at `internal/api/handlers/`
- React UI at `ui/src/pages/`
- Go 1.24+, TypeScript 5.8, React 19

**Architecture (v1.1):**
**Architecture (v1.5):**
- Single `spectre server` command serves everything on port 8080
- MCP tools call TimelineService/GraphService directly in-process
- No standalone MCP/agent commands (removed in v1.1)
- Helm chart deploys single container

**Progressive disclosure model (implemented):**
1. **Overview** — error/warning counts by namespace (QueryAggregation with level filter)
2. **Patterns** — log templates via Drain with novelty detection (compare to previous window)
3. **Logs** — raw logs with limit enforcement (max 500)

**Grafana integration architecture (v1.3 target):**
- Dashboard ingestion: Grafana API → full JSON stored, structure extracted to graph
- Graph schema: Dashboard→Panel→Query→Metric, Service inferred from labels
- Query execution: Via Grafana /api/ds/query endpoint (not direct to Prometheus)
- Variable handling: AI provides scoping variables (cluster, region) per MCP call
- Anomaly detection: Compare current metrics to 7-day rolling average (time-of-day matched)
- MCP tools call TimelineService/GraphService/ObservatoryService directly in-process
- Grafana integration provides 14 MCP tools (3 metrics + 3 alerts + 8 observatory)
- Observatory uses FalkorDB for signal anchors and baselines with TTL-based cleanup

**Progressive disclosure model:**
1. **Overview** — cluster/namespace anomaly summary (Orient stage)
2. **Scope** — namespace/workload focus with ranked signals (Narrow stage)
3. **Detail** — signal baseline, anomaly score, evidence (Investigate/Verify stages)

## Constraints

@@ -194,6 +211,7 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
- **Grafana API token**: Requires Bearer token with dashboard read permissions
- **PromQL parsing best-effort**: Complex expressions may not fully parse, extract what's possible
- **Graph storage for structure only**: FalkorDB stores dashboard structure, not metric values
- **Baseline collection rate limit**: 10 req/sec forward, 2 req/sec backfill
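
One way the two collection rates in the last constraint could be enforced is a ticker-paced loop, sketched below with only the standard library. The constants mirror the 10 req/sec and 2 req/sec figures above; `intervalFor` and `runThrottled` are hypothetical names, and the PR may well use a different limiter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Rates taken from the constraint above; the constant names are hypothetical.
const (
	forwardRate  = 10 // requests per second for forward collection
	backfillRate = 2  // requests per second for historical backfill
)

// intervalFor converts a per-second rate into the gap between requests.
// Rates below 1 are treated as 1 so the ticker interval stays positive.
func intervalFor(reqPerSec int) time.Duration {
	if reqPerSec < 1 {
		reqPerSec = 1
	}
	return time.Second / time.Duration(reqPerSec)
}

// runThrottled runs tasks one per tick and stops early if ctx is cancelled.
// It returns how many tasks were executed.
func runThrottled(ctx context.Context, reqPerSec int, tasks []func()) int {
	ticker := time.NewTicker(intervalFor(reqPerSec))
	defer ticker.Stop()
	done := 0
	for _, task := range tasks {
		select {
		case <-ctx.Done():
			return done
		case <-ticker.C:
			task()
			done++
		}
	}
	return done
}

func main() {
	tasks := make([]func(), 3)
	for i := range tasks {
		i := i
		tasks[i] = func() { fmt.Println("query signal", i) }
	}
	start := time.Now()
	n := runThrottled(context.Background(), forwardRate, tasks)
	fmt.Printf("ran %d tasks in %v (backfill interval would be %v)\n",
		n, time.Since(start).Round(time.Millisecond), intervalFor(backfillRate))
}
```

A fixed-interval ticker does not allow bursts, which is a conservative choice for backfill against a shared Grafana instance; a token bucket would be the usual alternative if short bursts are acceptable.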

## Key Decisions

@@ -232,11 +250,23 @@ Enable AI assistants to understand what's happening in Kubernetes clusters throu
| LOCF interpolation for timelines (v1.4) | Fills gaps realistically in state buckets | ✓ Good |
| Optional filter parameters (v1.4) | Maximum flexibility for AI alert queries | ✓ Good |
| 10-minute timeline buckets (v1.4) | Compact notation [F F N N], 6 buckets per hour | ✓ Good |
| Layered classification with confidence decay (v1.5) | 5 layers from hardcoded to unknown | ✓ Good |
| Quality scoring with alert boost (v1.5) | +0.2 for dashboards with alerts | ✓ Good |
| Composite key for SignalAnchor (v1.5) | metric + namespace + workload + integration | ✓ Good |
| Z-score sigmoid normalization (v1.5) | Maps unbounded to 0-1 range | ✓ Good |
| Hybrid MAX aggregation (v1.5) | Either z-score or percentile can flag anomaly | ✓ Good |
| Alert firing override (v1.5) | Human decision takes precedence, score=1.0 | ✓ Good |
| Hierarchical MAX aggregation (v1.5) | Worst signal bubbles up through hierarchy | ✓ Good |
| Progressive disclosure for incidents (v1.5) | Orient → Narrow → Investigate → Hypothesize → Verify | ✓ Good |
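
The composite-key decision in the table above (metric + namespace + workload + integration) can be illustrated with an in-memory upsert that behaves like a graph MERGE: the same key never creates a second anchor. `AnchorKey`, `AnchorStore`, and the `|` separator are hypothetical; the PR stores anchors in FalkorDB rather than in a map.

```go
package main

import (
	"fmt"
	"strings"
)

// AnchorKey lists the four parts of the composite key named in the table.
// The struct itself is hypothetical.
type AnchorKey struct {
	Metric, Namespace, Workload, Integration string
}

// UID joins the parts into one stable identifier.
// The "|" separator and the field order are assumptions.
func (k AnchorKey) UID() string {
	return strings.Join([]string{k.Integration, k.Namespace, k.Workload, k.Metric}, "|")
}

// AnchorStore is an in-memory stand-in for the graph, keyed by UID.
type AnchorStore struct {
	confidence map[string]float64
}

// NewAnchorStore returns an empty store.
func NewAnchorStore() *AnchorStore {
	return &AnchorStore{confidence: map[string]float64{}}
}

// Upsert creates the anchor if its UID is new and otherwise updates it in
// place. It reports whether a new anchor was created.
func (s *AnchorStore) Upsert(k AnchorKey, confidence float64) bool {
	uid := k.UID()
	_, exists := s.confidence[uid]
	s.confidence[uid] = confidence
	return !exists
}

// Len returns the number of distinct anchors.
func (s *AnchorStore) Len() int { return len(s.confidence) }

func main() {
	store := NewAnchorStore()
	k := AnchorKey{Metric: "http_requests_total", Namespace: "shop", Workload: "cart", Integration: "grafana-main"}
	fmt.Println(store.Upsert(k, 0.95), store.Upsert(k, 0.85), store.Len())
}
```

Including the workload and integration in the key lets the same metric name anchor separate signals for different workloads, while repeated syncs of one dashboard stay idempotent.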

## Tech Debt

- DateAdded field not persisted in integration config (uses time.Now() on each GET request)
- GET /{name} endpoint available but unused by UI (uses list endpoint instead)
- TestComputeDashboardQuality_Freshness has time-dependent failures
- Quality scoring stubs (getAlertRuleCount, getViewsLast30Days return 0)
- Dashboard metadata extraction TODOs (updated time, folder title, description)
- QueryService stub methods (FetchCurrentValue, FetchHistoricalValue use baseline fallback)

---
*Last updated: 2026-01-23 after v1.4 milestone shipped*
*Last updated: 2026-01-30 after v1.5 Observatory milestone shipped*