diff --git a/.github/workflows/docs-ci.yml b/.github/workflows/docs-ci.yml
new file mode 100644
index 0000000..1a5ba4b
--- /dev/null
+++ b/.github/workflows/docs-ci.yml
@@ -0,0 +1,23 @@
+name: CI
+
+on:
+  pull_request:
+    branches: [main]
+
+permissions:
+  contents: read
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-node@v6
+        with:
+          node-version: "20"
+          cache: npm
+
+      - run: npm ci
+
+      - run: npm run build
diff --git a/docs/advanced/parallel-fan-out.md b/docs/advanced/parallel-fan-out.md
index 6680107..58a8191 100644
--- a/docs/advanced/parallel-fan-out.md
+++ b/docs/advanced/parallel-fan-out.md
@@ -87,7 +87,7 @@ A lower-level tool for advanced patterns where the agent needs fine-grained cont
 
 ## How it works with autoscaling
 
-Fan-out naturally increases the pending task count on the agent's queue. When [KEDA-based autoscaling](/integrations/autoscaling) is enabled, this triggers pod scale-up:
+Fan-out naturally increases the pending task count on the agent's queue. When [KEDA-based autoscaling](/scaling/autoscaling) is enabled, this triggers pod scale-up:
 
 1. Agent submits 10 subtasks via `spawn_and_collect`
 2. 10 pending messages appear on the Redis Stream
@@ -100,7 +100,7 @@ No changes to the autoscaling configuration are needed. The existing pending-tas
 
 ## How it works with budgets
 
-Each subtask is a separate task execution that consumes tokens from the agent's [SwarmBudget](/advanced/budget-management). The originating agent's own token usage for the fan-out/collect cycle is minimal (tool call overhead only). The real cost is in the subtask executions, which are tracked individually.
+Each subtask is a separate task execution that consumes tokens from the agent's [SwarmBudget](/scaling/budget-management). The originating agent's own token usage for the fan-out/collect cycle is minimal (tool call overhead only). The real cost is in the subtask executions, which are tracked individually.
 
 ## Example
 
diff --git a/docs/faq.md b/docs/faq.md
index 4bfcefb..a777aa6 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -32,7 +32,7 @@ No. kubeswarm uses native Kubernetes Secrets only. No wrapper CRD.
 
 ## How does budget enforcement work?
 
-The operator tracks rolling 24h token usage per agent. When `spec.guardrails.limits.dailyTokens` is exceeded, replicas are scaled to 0 and a `BudgetExceeded` condition is set. Replicas restore automatically when the window rotates. See [Budget Management](/advanced/budget-management).
+The operator tracks rolling 24h token usage per agent. When `spec.guardrails.limits.dailyTokens` is exceeded, replicas are scaled to 0 and a `BudgetExceeded` condition is set. Replicas restore automatically when the window rotates. See [Budget Management](/scaling/budget-management).
 
 ## Can agents call other agents?
 
diff --git a/docs/features.md b/docs/features.md
index b4a93b2..2029486 100644
--- a/docs/features.md
+++ b/docs/features.md
@@ -71,7 +71,7 @@ swarm audit tree evt-abc123
 Audit events emit to a configurable sink (stdout, Redis Stream, or webhook). Opt-in at cluster, namespace, or agent level.
 
 - [Audit Trail](/observability/audit-trail) - full configuration guide, event schema, and CLI reference
-- [Budget Management](/advanced/budget-management) - per-action token tracking and cost attribution
+- [Budget Management](/scaling/budget-management) - per-action token tracking and cost attribution
 - [Custom Resources: SwarmRun](/custom-resources/) - run status field reference
 
 ---
@@ -94,7 +94,7 @@ spec:
 
 The operator creates KEDA ScaledObjects automatically. No KEDA YAML to write - just set the fields on your SwarmAgent.
 
-- [Autoscaling (KEDA)](/integrations/autoscaling) - full configuration guide and prerequisites
+- [Autoscaling (KEDA)](/scaling/autoscaling) - full configuration guide and prerequisites
 
 ---
 
@@ -167,7 +167,7 @@ spec:
 Budget alerts fire via Slack, email, or webhook before you hit the wall.
 Per-action token tracking in the audit trail lets you identify which tools and agents drive cost.
 
-- [Budget Management](/advanced/budget-management) - full configuration and enforcement modes
+- [Budget Management](/scaling/budget-management) - full configuration and enforcement modes
 - [Custom Resources: SwarmBudget](/custom-resources/) - budget field reference
 
 ---
diff --git a/docs/integrations/artifact-storage.md b/docs/integrations/artifact-storage.md
new file mode 100644
index 0000000..0e7b926
--- /dev/null
+++ b/docs/integrations/artifact-storage.md
@@ -0,0 +1,46 @@
+---
+sidebar_position: 8
+sidebar_label: "Artifact Storage"
+description: "kubeswarm artifact storage - S3 and GCS backends for storing and passing file artifacts between agent pipeline steps on Kubernetes."
+---
+
+# kubeswarm Artifact Storage - S3 and GCS for Agent Pipelines
+
+kubeswarm SwarmTeam pipelines can store and pass file artifacts between steps using S3 or GCS backends on Kubernetes.
+
+## Supported Backends
+
+| Backend | Endpoint                       | Auth                             |
+| ------- | ------------------------------ | -------------------------------- |
+| **S3**  | Any S3-compatible (AWS, MinIO) | Secret with access key           |
+| **GCS** | Google Cloud Storage           | Secret with service account JSON |
+
+## Configuration
+
+```yaml
+spec:
+  artifactStore:
+    type: s3
+    s3:
+      bucket: swarm-artifacts
+      region: us-east-1
+      endpoint: http://minio.kubeswarm-system:9000 # omit for AWS S3
+      credentialsSecret:
+        name: s3-credentials
+```
+
+## Pipeline Usage
+
+Steps declare output artifacts and reference other steps' artifacts:
+
+```yaml
+pipeline:
+  - role: analyst
+    outputArtifacts:
+      - name: report.md
+        contentType: text/markdown
+  - role: reviewer
+    dependsOn: [analyst]
+    inputArtifacts:
+      report: "{{ .steps.analyst.artifacts.report.md }}"
+```
diff --git a/docs/integrations/index.md b/docs/integrations/index.md
index b338c71..27c36c8 100644
--- a/docs/integrations/index.md
+++ b/docs/integrations/index.md
@@ -13,6 +13,6 @@ kubeswarm connects your Kubernetes agents to external services for LLM inference
 - [MCP Servers](./mcp-servers) - Model Context Protocol tool servers
 - [Vector Stores](./vector-stores) - Qdrant, Pinecone, Weaviate
 - [Notifications](./notifications) - Slack, webhooks
-- [Observability](./observability) - OpenTelemetry, Prometheus
-- [Autoscaling](./autoscaling) - KEDA
+- [Observability](/observability/overview) - OpenTelemetry, Prometheus
+- [Autoscaling](/scaling/autoscaling) - KEDA
 - [Artifact Storage](./artifact-storage) - S3, GCS
diff --git a/docs/integrations/notifications.md b/docs/integrations/notifications.md
new file mode 100644
index 0000000..4573015
--- /dev/null
+++ b/docs/integrations/notifications.md
@@ -0,0 +1,64 @@
+---
+sidebar_position: 6
+sidebar_label: "Notifications"
+description: "kubeswarm notification integrations - Slack and webhook alerts for agent budget exceeded, degraded and pipeline failure events on Kubernetes."
+---
+
+# kubeswarm Notifications - Slack and Webhook Alerts for Agents
+
+kubeswarm sends alerts via the SwarmNotify CRD when agents degrade, budgets are exceeded, or pipeline runs fail on Kubernetes.
+
+## Supported Channels
+
+| Channel     | Configuration           | Use case                    |
+| ----------- | ----------------------- | --------------------------- |
+| **Slack**   | Webhook URL from Secret | Team chat alerts            |
+| **Webhook** | Any HTTP endpoint       | PagerDuty, Opsgenie, custom |
+
+## Configuration
+
+```yaml
+apiVersion: kubeswarm.io/v1alpha1
+kind: SwarmNotify
+metadata:
+  name: ops-alerts
+spec:
+  channel:
+    type: slack
+    slack:
+      webhookUrlSecretRef:
+        name: slack-secrets
+        key: webhook-url
+  events:
+    - type: BudgetExceeded
+      template: ":warning: Budget exceeded for {{ .agent }}: {{ .totalTokens }} tokens"
+    - type: AgentDegraded
+      template: ":red_circle: Agent degraded: {{ .agent }} - {{ .reason }}"
+    - type: TeamFailed
+      template: ":x: Pipeline failed: {{ .team }} run {{ .run }}"
+    - type: TeamSucceeded
+      template: ":white_check_mark: Pipeline completed: {{ .team }}"
+  rateLimiting:
+    windowSeconds: 300
+    maxPerWindow: 5
+```
+
+## Event Types
+
+| Event            | Trigger                                         |
+| ---------------- | ----------------------------------------------- |
+| `BudgetExceeded` | Daily token limit reached, replicas scaled to 0 |
+| `AgentDegraded`  | MCP server unreachable or health check failed   |
+| `TeamFailed`     | Pipeline run reached terminal failure           |
+| `TeamSucceeded`  | Pipeline run completed successfully             |
+| `TeamTimedOut`   | Pipeline run exceeded `timeoutSeconds`          |
+
+## Referencing from Agents
+
+```yaml
+spec:
+  observability:
+    healthCheck:
+      notifyRef:
+        name: ops-alerts
+```
diff --git a/docs/observability/audit-trail.md b/docs/observability/audit-trail.md
index 9dd0d34..dc761d5 100644
--- a/docs/observability/audit-trail.md
+++ b/docs/observability/audit-trail.md
@@ -435,9 +435,9 @@ The audit trail complements - not replaces - existing observability signals.
 For full observability coverage, use the audit trail alongside OTel tracing and structured logging:
 
 - **OTel** for latency analysis and cross-service correlation
-- **Structured logging** for runtime debugging (see [Observability](/integrations/observability))
+- **Structured logging** for runtime debugging (see [Observability](/observability/overview))
 - **Audit trail** for behavior reconstruction, compliance, and cost attribution
-- **SwarmBudget** for aggregate spend limits (see [Budget Management](/advanced/budget-management))
+- **SwarmBudget** for aggregate spend limits (see [Budget Management](/scaling/budget-management))
 
 ---
 
@@ -460,7 +460,7 @@ Example: 50 agents, 10 tasks/hour each, `actions` mode, 7-day retention:
 - 500 tasks/hour * 5 events * 1.5 KB = 3.75 MB/hour
 - 3.75 MB * 168h = 630 MB + 30% headroom = ~820 MB
 
-For detailed sizing - including worked examples for verbose mode, split topologies, and the `maxDetailBytes` knob - see the [Redis in Production](/operations/redis-production#capacity-estimation) guide.
+For detailed sizing - including worked examples for verbose mode, split topologies, and the `maxDetailBytes` knob - see the [Redis in Production](/scaling/redis-production#capacity-estimation) guide.
 
 ---
 