diff --git a/lfx-platform/.claude-plugin/plugin.json b/lfx-platform/.claude-plugin/plugin.json new file mode 100644 index 0000000..98d7c48 --- /dev/null +++ b/lfx-platform/.claude-plugin/plugin.json @@ -0,0 +1,11 @@ +{ + "name": "lfx-platform", + "description": "Platform engineering tools for the LFX v2 Kubernetes platform. Provides troubleshooting and deployment workflows using Datadog, Kubernetes, AWS, GitHub, and ArgoCD.", + "version": "0.1.0", + "author": { + "name": "Linux Foundation", + "url": "https://github.com/linuxfoundation" + }, + "homepage": "https://github.com/linuxfoundation/lfx-skills", + "license": "MIT" +} diff --git a/lfx-platform/.mcp.json b/lfx-platform/.mcp.json new file mode 100644 index 0000000..29df92f --- /dev/null +++ b/lfx-platform/.mcp.json @@ -0,0 +1,21 @@ +{ + "mcpServers": { + "datadog lfx": { + "type": "connector", + "connector": "datadog" + }, + "github": { + "type": "connector", + "connector": "github" + } + }, + "_extensions": { + "note": "The Kubernetes and AWS MCP servers are distributed as Desktop Extensions (.mcpb) in the extensions/ folder. Install each .mcpb file via Claude Desktop → File → Install Extension, or run /platform-setup which will guide you through it.", + "k8s_dev": "extensions/k8s_dev.mcpb", + "k8s_stag": "extensions/k8s_stag.mcpb", + "k8s_prod": "extensions/k8s_prod.mcpb", + "AWS_lfx_dev": "extensions/aws_lfx_dev.mcpb", + "AWS_lfx_stag": "extensions/aws_lfx_stag.mcpb", + "AWS_lfx_prod": "extensions/aws_lfx_prod.mcpb" + } +} diff --git a/lfx-platform/commands/platform-deploy.md b/lfx-platform/commands/platform-deploy.md new file mode 100644 index 0000000..c925beb --- /dev/null +++ b/lfx-platform/commands/platform-deploy.md @@ -0,0 +1,424 @@ + + + +--- +description: Deploy an LFX v2 service to staging or production. Handles semver release creation on the service repo, ArgoCD version bump PR, deployment monitoring, and rollback. 
Use when an engineer wants to release a service, cut a new version, promote a build to staging or prod, or deploy a change. Trigger on phrases like "deploy to staging", "release to prod", "cut a release", "new version", "push to production", "promote to staging". +--- + +# LFX Platform Deploy + +This command walks through the full release and deployment workflow for an LFX v2 +service. It uses the `gh` CLI for GitHub operations and the Kubernetes MCP for +deployment monitoring. + +**How it works:** The Helm chart for every LFX v2 service lives in the service's +own GitHub repo under `charts/`. In dev, ArgoCD tracks `HEAD` of the service repo +and deploys automatically. For staging and production, ArgoCD points at a specific +semver git tag — deploying means tagging the service repo and updating that tag +in the ArgoCD repo. + +For full ArgoCD repo structure details, see +[../skills/platform-troubleshoot/references/argocd-structure.md](../skills/platform-troubleshoot/references/argocd-structure.md). + +--- + +## Prerequisites + +This command requires the **GitHub CLI (`gh`)** to be installed and authenticated. 
+ +### Check if you're ready + +```bash +# Verify gh is installed +gh --version + +# Verify you're authenticated and have access to the linuxfoundation org +gh auth status +gh api orgs/linuxfoundation --jq '.login' +``` + +If either command fails, set up `gh` before proceeding: + +### Installing gh + +```bash +# macOS +brew install gh + +# Linux (Debian/Ubuntu) +sudo apt install gh + +# Or download directly: https://cli.github.com +``` + +### Authenticating + +```bash +gh auth login +# Choose: GitHub.com → HTTPS → Login with a web browser +``` + +You need **write access** to the following repos for deploys to work: +- `linuxfoundation/lfx-v2-{service}` — to create tags and releases +- `linuxfoundation/lfx-v2-argocd` — to open the version-bump PR + +If `gh api orgs/linuxfoundation --jq '.login'` returns a 403 or 404, you may need +to authorize the `gh` OAuth app for the `linuxfoundation` SSO organization: + +```bash +# Open GitHub → Settings → Applications → Authorized OAuth Apps +# Find "GitHub CLI" and click "Configure SSO" → Grant linuxfoundation +open https://github.com/settings/applications +``` + +--- + +## Step 1: Identify service and target environment + +Identify the service name and target environment from context. If not clear, ask. + +| User says | Canonical env | +|---|---| +| dev / development | **dev** | +| stag / staging | **stag** | +| prod / production | **prod** | + +**If the target is dev:** stop here. + +> ArgoCD tracks the `main` branch of every service repo and deploys automatically. +> Just merge your PR to `main` — changes will be live in dev within a few minutes. +> To check sync status: look for the Application in the `argocd` namespace using +> the `mcp__k8s_dev__resources_get` tool with `kind: Application`. + +**If the target is stag or prod:** continue. + +--- + +## Step 2: Look up the service repo and current deployed version + +Find the service's GitHub repo from the ArgoCD structure reference. 
The pattern is +`linuxfoundation/lfx-v2-{service}` for most services. + +Check both pieces of information in parallel: + +**A. Latest GitHub releases on the service repo:** +```bash +gh release list --repo linuxfoundation/lfx-v2-{service} --limit 10 +``` + +**B. Currently deployed version in ArgoCD for the target environment:** + +The ArgoCD repo folder name for staging is `staging` (not `stag`); for prod it is +`prod`. Read `apps/{env-folder}/lfx-v2-applications.yaml` and find the +`targetRevision` for this service: +```bash +gh api repos/linuxfoundation/lfx-v2-argocd/contents/apps/{env-folder}/lfx-v2-applications.yaml \ + --jq '.content' | base64 -d | grep -A5 "name: {service-name}" +``` + +Also read the env-specific values file — this always sets `image.tag` for +staging/prod and must be updated in sync with `targetRevision`: +```bash +gh api repos/linuxfoundation/lfx-v2-argocd/contents/values/{env-folder}/{service-name}.yaml \ + --jq '.content' | base64 -d +``` +(This file may not exist for staging if staging hasn't been configured for this +service yet — that's OK, just note it.) + +Present the user with: +- Current version deployed in `{env}` (from `targetRevision` in the ApplicationSet) +- Available GitHub releases newer than the deployed version + +--- + +## Step 3: Determine the action + +Ask: + +> **{service}** is currently at `{current_version}` in {env}. +> Available newer releases: {list, or "none yet"}. +> +> Would you like to: +> 1. **Create a new release** from the current `main` branch +> 2. **Promote an existing release** to {env} — which version? + +If they choose option 2, skip to Step 5. +If they choose option 1, continue to Step 4. + +--- + +## Step 4: Create a new release + +### 4a. Analyse changes since last release + +```bash +LAST_TAG=$(gh release list --repo linuxfoundation/lfx-v2-{service} \ + --limit 1 --json tagName --jq '.[0].tagName') + +# How much has changed? 
+gh api "repos/linuxfoundation/lfx-v2-{service}/compare/${LAST_TAG}...main" \ + --jq '{commits: .ahead_by, files: (.files | length)}' + +# What changed? +gh api "repos/linuxfoundation/lfx-v2-{service}/compare/${LAST_TAG}...main" \ + --jq '[.commits[].commit.message] | .[:20][]' +``` + +### 4b. Recommend a version bump + +Parse `{LAST_TAG}` as semver (e.g., `1.4.2` → major=1, minor=4, patch=2). +Tags use bare semver **without** a `v` prefix (e.g., `1.4.3` not `v1.4.3`). + +- **Default: patch bump** → `1.4.3` +- **Suggest minor bump** if you see more than ~20 commits, more than ~10 files + changed, or multiple `feat:` / `add:` commit messages indicating new features +- **Never recommend a major bump** — that's the engineer's call + +Present your recommendation with a one-line rationale and ask for confirmation: + +> Based on {N commits, N files changed} since `{LAST_TAG}`, I recommend a +> **{patch|minor}** release: **`{proposed_tag}`**. Good to go, or prefer a +> different version? + +### 4c. Create the release + +Once confirmed: +```bash +gh release create {new_tag} \ + --repo linuxfoundation/lfx-v2-{service} \ + --title "{new_tag}" \ + --generate-notes \ + --target main +``` + +### 4d. Monitor CI/CD — and start Step 5 in parallel + +The release tag triggers GitHub Actions to build and push: +- Docker image: `ghcr.io/linuxfoundation/lfx-v2-{service}/{binary}:{new_tag}` +- Helm chart: packaged at `charts/lfx-v2-{service}` in the same tag + +Monitor the run: +```bash +# Find the workflow triggered by the tag push +gh run list --repo linuxfoundation/lfx-v2-{service} --limit 5 + +RUN_ID=$(gh run list --repo linuxfoundation/lfx-v2-{service} \ + --limit 1 --json databaseId --jq '.[0].databaseId') +gh run watch $RUN_ID --repo linuxfoundation/lfx-v2-{service} +``` + +Tell the user CI/CD is underway, and **immediately proceed to Step 5** to prepare +the ArgoCD PR — don't wait for CI/CD to finish first. 
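The bump heuristic in Step 4b can be sketched as a small shell helper. This is an illustrative sketch only — the thresholds come from the guidance above, and the `LAST_TAG`, `COMMITS`, and `FILES` values are hardcoded examples standing in for the `gh release list` and `gh api ... compare` output:

```shell
# Illustrative sketch of the Step 4b version-bump heuristic.
# In practice LAST_TAG/COMMITS/FILES come from the gh commands above.
LAST_TAG="1.4.2"
COMMITS=23
FILES=12

# Tags are bare semver without a "v" prefix.
IFS=. read -r MAJOR MINOR PATCH <<< "$LAST_TAG"

if [ "$COMMITS" -gt 20 ] || [ "$FILES" -gt 10 ]; then
  KIND="minor"
  PROPOSED="${MAJOR}.$((MINOR + 1)).0"          # minor bump resets patch
else
  KIND="patch"
  PROPOSED="${MAJOR}.${MINOR}.$((PATCH + 1))"
fi

echo "Recommend a ${KIND} release: ${PROPOSED}"
```

A major bump is deliberately never proposed — per the rule above, that stays the engineer's call.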
+ +--- + +## Step 5: Create the ArgoCD version bump PR + +Clone the ArgoCD repo and make the version changes. + +### 5a. Clone and branch + +```bash +git clone https://github.com/linuxfoundation/lfx-v2-argocd.git /tmp/lfx-v2-argocd +cd /tmp/lfx-v2-argocd +git checkout -b deploy/{service}/{env}/{new_tag} +``` + +### 5b. Update the ApplicationSet — the main change + +Edit `apps/{env-folder}/lfx-v2-applications.yaml` (where `{env-folder}` is +`staging` or `prod`). Find the list element for this service and update +`targetRevision`. Tags are bare semver without a `v` prefix: + +```yaml +# Before: +- name: lfx-v2-{service} + ... + targetRevision: {current_tag} # e.g., 0.4.1 + +# After: +- name: lfx-v2-{service} + ... + targetRevision: {new_tag} # e.g., 0.4.2 +``` + +### 5c. Update the values file — always required for staging/prod + +Edit `values/{env-folder}/lfx-v2-{service}.yaml` and update `image.tag` to +match. This field exists in all staging/prod values files and **must be kept in +sync with `targetRevision`**: + +```yaml +# Before: +image: + tag: {current_tag} # e.g., 0.4.1 + +# After: +image: + tag: {new_tag} # e.g., 0.4.2 +``` + +If the staging values file doesn't exist yet for this service, no change needed +here — note it in the PR. + +> **Note:** Explicit `image.tag` in values files is transitional. Once Helm chart +> version management is fully automated, only `targetRevision` will need updating. +> Until then, always update both. + +### 5d. Commit and push + +```bash +git add apps/{env-folder}/lfx-v2-applications.yaml +git add values/{env-folder}/lfx-v2-{service}.yaml # always for prod; if exists for staging +git commit -m "chore({service}): bump {env} to {new_tag}" +git push origin deploy/{service}/{env}/{new_tag} +``` + +### 5e. 
Create the PR + +```bash +gh pr create \ + --repo linuxfoundation/lfx-v2-argocd \ + --head deploy/{service}/{env}/{new_tag} \ + --base main \ + --title "chore({service}): bump {env} to {new_tag}" \ + --body "Deploys **{service}** to **{env}**. + +| | | +|---|---| +| Previous version | \`{current_tag}\` | +| New version | \`{new_tag}\` | +| Files changed | \`apps/{env-folder}/lfx-v2-applications.yaml\`, \`values/{env-folder}/lfx-v2-{service}.yaml\` | + +Auto-generated by \`/platform-deploy\`" +``` + +### 5f. Auto-merge if eligible + +The ArgoCD repo has auto-approval for PRs that only change `targetRevision` and +`image.tag` fields. Review the diff: + +```bash +gh pr diff --repo linuxfoundation/lfx-v2-argocd {pr_number} +``` + +**If the diff only touches `targetRevision` and/or `image.tag` values:** proceed +with auto-merge. +```bash +gh pr merge {pr_number} --repo linuxfoundation/lfx-v2-argocd --squash --auto +``` + +**If the diff contains any other changes:** stop and flag it: +> "This PR contains more than a version bump — it needs manual review before +> it can merge. PR: {url}" + +--- + +## Step 6: Confirm CI/CD completed before monitoring + +Before watching the cluster, confirm the CI/CD build from Step 4d succeeded: +```bash +gh run view $RUN_ID --repo linuxfoundation/lfx-v2-{service} +``` + +If it failed, surface the error immediately: +> "⚠️ CI/CD for `{new_tag}` failed — the ArgoCD PR has been created but ArgoCD +> won't be able to pull the new image or chart until the build passes. +> Check the workflow: {url}" + +--- + +## Step 7: Monitor the deployment + +Once the ArgoCD PR is merged, watch for sync and healthy status. +Use `mcp__k8s_stag__*` or `mcp__k8s_prod__*` depending on environment. + +### 7a. 
Watch ArgoCD sync + +Poll the Application resource (every ~30s): +``` +mcp__k8s_{env}__resources_get + apiVersion: argoproj.io/v1alpha1 + kind: Application + namespace: argocd + name: lfx-v2-{service} +``` + +Watch for: +- `status.sync.status: Synced` — new manifest applied +- `status.health.status: Healthy` — all pods running +- `status.summary.images` — confirm the image tag shows `{new_tag}`, not the old one +- `status.operationState.phase: Succeeded` — last sync operation completed cleanly + +### 7b. Watch pod startup + +``` +mcp__k8s_{env}__pods_list_in_namespace → {service-namespace} +mcp__k8s_{env}__events_list → {service-namespace} +``` + +Surface immediately if you see: +- `CrashLoopBackOff` — the new version is crashing +- `OOMKilled` — pod ran out of memory +- `ImagePullBackOff` — image tag doesn't exist yet (CI/CD may still be running) +- Pods stuck in `Pending` — node capacity or resource issue + +### 7c. Check startup logs + +Once new pods are Running (look for pods with a recent creation timestamp): +``` +mcp__k8s_{env}__pods_log → most recently started pod in the namespace +``` + +Scan for ERROR-level messages in the first 60 seconds. Startup errors usually +indicate misconfiguration, missing secrets, or schema migration issues. + +### 7d. Declare success or escalate + +**Success:** +> ✅ `{service}` `{new_tag}` is live in {env}. All pods healthy. + +**Issue found:** report the specific finding immediately — log lines, event details, +image tag mismatch. Don't wait for the full monitoring window. + +--- + +## Step 8: Rollback (if requested) + +To roll back, create a new PR reverting the version in the ApplicationSet (and +values file if applicable) to the previous tag. Same process as Step 5 — and +equally eligible for auto-merge if only version fields change. 
+ +```bash +git checkout -b rollback/{service}/{env}/{prev_tag} + +# Revert targetRevision to {prev_tag} in the ApplicationSet +# Revert image.tag to {prev_tag} in values file if it was set + +git commit -m "revert({service}): roll back {env} to {prev_tag}" +git push origin rollback/{service}/{env}/{prev_tag} + +gh pr create \ + --repo linuxfoundation/lfx-v2-argocd \ + --head rollback/{service}/{env}/{prev_tag} \ + --base main \ + --title "revert({service}): roll back {env} to {prev_tag}" \ + --body "Emergency rollback of {service} in {env} from {new_tag} to {prev_tag}." +``` + +Then monitor as in Step 7. + +--- + +## Staging considerations + +Staging is still being built out. If you don't find a staging ApplicationSet for +a service: +> "I don't see a staging configuration for `{service}` yet. Would you like to: +> 1. Deploy directly to production only +> 2. Add staging configuration first (requires a separate PR that will need +> manual review)" + +For services where staging exists but the engineer wants to deploy stag and prod +simultaneously, create two PRs (one per environment) and run Step 7 monitoring +for both in parallel. diff --git a/lfx-platform/commands/platform-setup.md b/lfx-platform/commands/platform-setup.md new file mode 100644 index 0000000..7213fb9 --- /dev/null +++ b/lfx-platform/commands/platform-setup.md @@ -0,0 +1,553 @@ + + + +--- +description: Set up a local machine to use the LFX platform plugin. Installs and configures the GitHub CLI, AWS CLI, AWS SSO profiles, and kubeconfig files needed to run platform-deploy and platform-troubleshoot. Safe to re-run — detects what is already configured and skips those steps. Use when an engineer says "set up my machine", "configure platform access", "I can't run platform-deploy", or "getting started with LFX platform tooling". +--- + +# LFX Platform Setup + +This command configures your local machine to use the LFX platform plugin. 
It +walks through each prerequisite in order, checks whether it is already in place, +and only performs the steps that are actually needed. + +Run it end-to-end on a new machine, or re-run it at any time to verify or repair +your setup. + +--- + +## Before starting + +Tell the user what this command will do: + +> "I'll check and configure the following on your machine: +> 1. Desktop Extensions — install the 6 `.mcpb` bundles for Kubernetes and AWS MCP servers +> 2. GitHub CLI (`gh`) — installed and authenticated with `linuxfoundation` org access +> 3. AWS CLI + `uv` — installed and configured with the three LFX SSO profiles +> 4. Kubeconfig files for dev, staging, and prod clusters +> 5. Cloud connectors — authorize Datadog and GitHub in Claude settings +> +> This will open your browser a few times for GitHub auth, AWS SSO, and connector +> authorization. Everything else is automated. Let me know when you're ready, or +> just say go." + +Wait for confirmation before running any commands. Some steps open browser windows +and the user should be sitting at their machine. + +--- + +## Step 1: Install Desktop Extensions + +The Kubernetes and AWS MCP servers are distributed as Desktop Extensions (`.mcpb` +files) in the plugin's `extensions/` folder. Each extension, once installed, +automatically registers its MCP server in Claude Desktop — no manual config editing +required. + +There are six extensions to install: + +| File | What it provides | +|---|---| +| `k8s_dev.mcpb` | Kubernetes MCP — dev cluster | +| `k8s_stag.mcpb` | Kubernetes MCP — staging cluster | +| `k8s_prod.mcpb` | Kubernetes MCP — production cluster | +| `aws_lfx_dev.mcpb` | AWS API MCP — dev account (788942260905) | +| `aws_lfx_stag.mcpb` | AWS API MCP — staging account (844790888233) | +| `aws_lfx_prod.mcpb` | AWS API MCP — production account (372256339901) | + +### 1a. 
Check which extensions are already installed + +Try a lightweight call to each MCP server: +``` +mcp__k8s_dev__namespaces_list +mcp__k8s_stag__namespaces_list +mcp__k8s_prod__namespaces_list +mcp__AWS_lfx_dev__suggest_aws_commands (prompt: "list s3 buckets") +mcp__AWS_lfx_stag__suggest_aws_commands (prompt: "list s3 buckets") +mcp__AWS_lfx_prod__suggest_aws_commands (prompt: "list s3 buckets") +``` + +Skip installation for any that respond successfully. + +### 1b. Install missing extensions + +For each extension that isn't responding, install it from the plugin's +`extensions/` folder. There are three ways to install a `.mcpb` file: + +- **Double-click** the file in Finder/Files +- **Drag and drop** the file into Claude Desktop +- **File menu**: Claude Desktop → File → Install Extension → select the file + +Tell the user which files to install and where to find them: + +> "The `extensions/` folder is in the `lfx-platform` plugin directory inside +> your `lfx-skills` repo. Install these files: {list missing extensions}. +> Double-click each one, review the permissions, and click Install." + +### 1c. Configure kubeconfig path during install + +When installing a Kubernetes extension, Claude Desktop will prompt for a +**kubeconfig file path**. The defaults are pre-filled: + +- `k8s_dev.mcpb` → `~/.kube/dev-config` +- `k8s_stag.mcpb` → `~/.kube/staging-config` +- `k8s_prod.mcpb` → `~/.kube/prod-config` + +These files will be generated in Step 4. The user can accept the defaults now +and the paths will work once Step 4 is complete. + +### 1d. Verify extensions loaded + +After installation, Claude Desktop needs to restart to load the new MCP servers. +Ask the user to restart Claude Desktop, then re-run the checks from Step 1a. + +--- + +## Step 2: Detect operating system (used throughout) + +```bash +uname -s +``` + +Store the result: +- `Darwin` → macOS. Use `brew` for package installation. +- `Linux` → Linux. 
Detect the distro:
+  ```bash
+  cat /etc/os-release | grep -E '^ID='
+  ```
+  - `ubuntu`, `debian`, `pop` → use `apt`
+  - `fedora`, `rhel`, `centos` → use `dnf`
+  - anything else → skip package manager steps and provide manual install links
+
+---
+
+## Step 3: GitHub CLI
+
+### 3a. Check if gh is installed
+
+```bash
+gh --version
+```
+
+**If installed:** print the version and skip to 3b.
+
+**If not installed:**
+
+```bash
+# macOS
+brew install gh
+
+# Linux (Debian/Ubuntu)
+sudo apt update && sudo apt install -y gh
+
+# Linux (Fedora/RHEL)
+sudo dnf install -y gh
+
+# Manual install (all platforms)
+# https://cli.github.com — download the binary for your platform
+```
+
+After installing, verify:
+```bash
+gh --version
+```
+
+### 3b. Check authentication status
+
+```bash
+gh auth status 2>&1
+```
+
+Parse the output:
+- Contains `Logged in to github.com` → authenticated. Skip to 3c.
+- Any other output or non-zero exit → not authenticated. Continue.
+
+**If not authenticated:**
+
+Tell the user:
+> "I'll open GitHub in your browser to complete authentication. This will ask
+> you to log in and authorize the GitHub CLI app."
+
+Then run:
+```bash
+gh auth login --hostname github.com --git-protocol https --web
+```
+
+This opens a browser. Wait for the user to confirm it completed, then re-run
+`gh auth status` to verify.
+
+### 3c. Check linuxfoundation SSO authorization
+
+```bash
+gh api orgs/linuxfoundation --jq '.login' 2>&1
+```
+
+- Returns `linuxfoundation` → authorized. Step 3 complete.
+- Returns `Resource not accessible` or a 403/404 → SSO authorization needed.
+
+**If SSO authorization is needed:**
+
+> "GitHub SSO is not yet authorized for the `linuxfoundation` organization.
+> I'll open the settings page — look for **GitHub CLI** under Authorized OAuth Apps,
+> click **Configure SSO**, and grant access to `linuxfoundation`."
+
+```bash
+# macOS
+open https://github.com/settings/applications
+
+# Linux
+xdg-open https://github.com/settings/applications 2>/dev/null || \
+  echo "Open this URL in your browser: https://github.com/settings/applications"
+```
+
+After the user confirms they've authorized it, re-run the check:
+```bash
+gh api orgs/linuxfoundation --jq '.login'
+```
+
+If it still fails, tell the user:
+> "It can take a minute for SSO authorization to take effect. If this keeps
+> failing, check that you're a member of the `linuxfoundation` GitHub org and
+> that your account has been granted repo access."
+
+---
+
+## Step 4: AWS CLI and uv
+
+The AWS MCP extensions require **uv** (specifically `uvx`) to run the AWS API
+MCP server. The AWS CLI is needed to configure SSO profiles and generate
+kubeconfig files.
+
+### 4a. Check if uv is installed
+
+```bash
+uv --version 2>&1
+```
+
+**If installed:** print the version and skip to 4b.
+
+**If not installed:**
+
+```bash
+# macOS and Linux (official installer)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# macOS (alternatively via brew)
+brew install uv
+```
+
+After installing, add uv to PATH if needed (the installer will print instructions),
+then verify:
+```bash
+uv --version && uvx --version
+```
+
+### 4b. Check if aws CLI is installed
+
+```bash
+aws --version 2>&1
+```
+
+**If installed:** print the version and skip to 4c.
+
+**If not installed:**
+
+```bash
+# macOS
+brew install awscli
+
+# Linux (Debian/Ubuntu)
+sudo apt update && sudo apt install -y awscli
+
+# Linux (Fedora/RHEL)
+sudo dnf install -y awscli
+```
+
+Verify:
+```bash
+aws --version
+```
+
+If the package manager version is below 2.x, the user should install from AWS
+directly instead:
+> "The version from your package manager is AWS CLI v1. AWS SSO requires v2.
+> Download from: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html"
+
+### 4c. Check AWS SSO config
+
+Check whether the `~/.aws/config` file contains the LFX SSO profiles:
+
+```bash
+grep -l "okta-lfx\|lfx-dev\|lfx-stag\|lfx-prod" ~/.aws/config 2>/dev/null && \
+  grep "sso_session\|sso_account_id" ~/.aws/config | head -20
+```
+
+**If all four blocks are present** (sso-session okta-lfx, profile lfx-dev,
+profile lfx-stag, profile lfx-prod): skip to 4d.
+
+**If missing or incomplete:** write the config.
+
+Check if `~/.aws/config` exists:
+```bash
+ls -la ~/.aws/config 2>/dev/null || echo "not found"
+```
+
+If the file doesn't exist, create the directory and file:
+```bash
+mkdir -p ~/.aws
+```
+
+Then **append** the following block to `~/.aws/config` (use `>>` to avoid
+overwriting any existing profiles the user may have for other projects):
+
+```bash
+cat >> ~/.aws/config << 'EOF'
+
+[sso-session okta-lfx]
+sso_start_url = https://lfx.awsapps.com/start
+sso_region = us-east-2
+sso_registration_scopes = sso:account:access
+
+[profile lfx-dev]
+sso_session = okta-lfx
+sso_account_id = 788942260905
+sso_role_name = PowerUser
+region = us-west-2
+output = json
+
+[profile lfx-stag]
+sso_session = okta-lfx
+sso_account_id = 844790888233
+sso_role_name = Read-Only
+region = us-west-2
+output = json
+
+[profile lfx-prod]
+sso_session = okta-lfx
+sso_account_id = 372256339901
+sso_role_name = Read-Only
+region = us-west-2
+output = json
+EOF
+```
+
+Confirm the profiles were written:
+```bash
+aws configure list-profiles | grep lfx
+```
+
+### 4d. Authenticate via AWS SSO
+
+Check if the current SSO session is still valid:
+```bash
+aws sts get-caller-identity --profile lfx-dev 2>&1
+```
+
+**If it returns an account ID:** the session is active. Skip to Step 5.
+
+**If it returns an error** (expired token, not logged in, etc.):
+
+Tell the user:
+> "I'll open the AWS SSO login page in your browser. Sign in with your Okta
+> credentials. One login covers all three LFX accounts for 12 hours."
+ +```bash +aws sso login --profile lfx-dev +``` + +This opens a browser window. Wait for the user to confirm they've signed in, +then verify: +```bash +aws sts get-caller-identity --profile lfx-dev --query 'Account' --output text +aws sts get-caller-identity --profile lfx-stag --query 'Account' --output text +aws sts get-caller-identity --profile lfx-prod --query 'Account' --output text +``` + +Expected output: +``` +788942260905 +844790888233 +372256339901 +``` + +If any profile returns an error, note which ones failed and advise: +> "The `{profile}` profile isn't working. This usually means you don't have +> access to that AWS account yet — ask in #lfx-platform to get access granted." + +--- + +## Step 5: Kubeconfig files + +Kubeconfig files are generated from the EKS cluster using your AWS profiles. +Each environment uses a separate file to keep contexts isolated. + +### 5a. Check which configs already exist + +```bash +ls -la ~/.kube/dev-config ~/.kube/staging-config ~/.kube/prod-config 2>&1 +``` + +For each file that already exists, verify it connects: +```bash +kubectl --kubeconfig ~/.kube/dev-config get nodes --no-headers 2>&1 | head -3 +kubectl --kubeconfig ~/.kube/staging-config get nodes --no-headers 2>&1 | head -3 +kubectl --kubeconfig ~/.kube/prod-config get nodes --no-headers 2>&1 | head -3 +``` + +Skip generation for any environment where `get nodes` succeeds. + +### 5b. 
Generate missing kubeconfig files
+
+For any environment where the file is missing or the connection failed:
+
+```bash
+# Create ~/.kube directory if it doesn't exist
+mkdir -p ~/.kube
+
+# Dev
+aws eks update-kubeconfig \
+  --region us-west-2 --name lfx-v2 --profile lfx-dev \
+  --kubeconfig ~/.kube/dev-config
+
+# Staging
+aws eks update-kubeconfig \
+  --region us-west-2 --name lfx-v2 --profile lfx-stag \
+  --kubeconfig ~/.kube/staging-config
+
+# Prod
+aws eks update-kubeconfig \
+  --region us-west-2 --name lfx-v2 --profile lfx-prod \
+  --kubeconfig ~/.kube/prod-config
+```
+
+Only run the commands for environments whose kubeconfig is missing or broken.
+
+### 5c. Verify connections
+
+```bash
+kubectl --kubeconfig ~/.kube/dev-config get nodes --no-headers 2>&1 | head -3
+kubectl --kubeconfig ~/.kube/staging-config get nodes --no-headers 2>&1 | head -3
+kubectl --kubeconfig ~/.kube/prod-config get nodes --no-headers 2>&1 | head -3
+```
+
+Each should return a list of nodes. If any fail with `Unauthorized` or
+`Unable to connect`, the AWS SSO token may have expired — re-run Step 4d,
+then retry.
+
+If `kubectl` itself is not installed:
+
+```bash
+# macOS
+brew install kubectl
+
+# Linux (Debian/Ubuntu)
+sudo apt update && sudo apt install -y kubectl
+
+# Or via the official binary:
+# https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
+```
+
+---
+
+## Step 6: Cloud connector authorization
+
+The platform plugin uses two types of MCP servers:
+
+**Local servers** (Kubernetes, AWS) — defined in the plugin's `.mcp.json` and
+started automatically by Claude when the plugin is loaded. These rely on the
+kubeconfig files and AWS profiles configured in Steps 4 and 5. No additional
+setup needed here.
+
+**Cloud connectors** (Datadog, GitHub) — these are OAuth-based connectors that
+must be authorized through Claude's connector settings. They don't require any
+local installation, but they do need to be connected to the right accounts.
+
+### 6a. Check if cloud connectors are already authorized
+
+Try a lightweight call to each:
+
+- Datadog: try `mcp__datadog_lfx__search_datadog_services`
+- GitHub: try `mcp__github__get_file_contents` with `owner: linuxfoundation`,
+  `repo: lfx-v2-argocd`, `path: README.md`
+
+If both respond successfully, skip to Step 7.
+
+### 6b. Connect the Datadog connector
+
+If the Datadog connector is not responding:
+
+> "Open Claude → Settings → Connectors. Find **Datadog** and click **Connect**.
+> Sign in with your Datadog credentials and authorize access to the LFX
+> organization's Datadog account. Once connected, come back here and I'll
+> verify it's working."
+
+After the user confirms, retry `mcp__datadog_lfx__search_datadog_services`.
+
+**Note:** LFX uses Datadog US1 (`datadoghq.com`). Make sure to connect the
+**Datadog** connector, not the Datadog (US3) one.
+
+### 6c. Connect the GitHub connector
+
+If the GitHub connector is not responding:
+
+> "Open Claude → Settings → Connectors. Find **GitHub** and click **Connect**.
+> Authorize the connector to access the `linuxfoundation` organization — you
+> may need to grant org access separately under GitHub Settings → Applications
+> → Authorized OAuth Apps → GitHub (Claude) → Configure SSO."
+
+After the user confirms, retry the file read from 6a.
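The local-tool portion of the health check can be verified in one pass with a small script. This is a convenience sketch mirroring the checks from Steps 3–5 — the profile and kubeconfig names match the ones configured above; extension and connector status still has to be verified via MCP tool calls:

```shell
# One-pass check of local prerequisites (tools, AWS SSO, kubeconfigs).
# Prints one OK/MISS (or AWS account ID) line per item.
REPORT=$(
  for cmd in gh aws uv kubectl; do
    command -v "$cmd" >/dev/null 2>&1 && echo "OK   $cmd" || echo "MISS $cmd"
  done
  for profile in lfx-dev lfx-stag lfx-prod; do
    aws sts get-caller-identity --profile "$profile" \
      --query Account --output text 2>/dev/null \
      || echo "MISS sso $profile"
  done
  for cfg in dev-config staging-config prod-config; do
    kubectl --kubeconfig ~/.kube/"$cfg" get nodes --no-headers >/dev/null 2>&1 \
      && echo "OK   kubeconfig $cfg" || echo "MISS kubeconfig $cfg"
  done
)
echo "$REPORT"
```

Any `MISS` line maps directly to a ⚠️ row in the summary table below.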
+ +--- + +## Step 7: Health check summary + +Once all steps are complete, print a summary: + +``` +## Platform Setup — Health Check + +| Component | Type | Status | Details | +|---------------------------------|-------------------|---------|---------| +| Extension: k8s_dev | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| Extension: k8s_stag | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| Extension: k8s_prod | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| Extension: aws_lfx_dev | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| Extension: aws_lfx_stag | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| Extension: aws_lfx_prod | Desktop Extension | ✅ / ⚠️ | Installed via .mcpb | +| uv / uvx | Local tool | ✅ / ⚠️ | v{version} — required by AWS extensions | +| GitHub CLI (gh) | Local tool | ✅ / ⚠️ | Authenticated as {username}, linuxfoundation SSO: authorized | +| AWS CLI | Local tool | ✅ / ⚠️ | v{version} | +| AWS SSO — lfx-dev | Local credential | ✅ / ⚠️ | Account 788942260905 | +| AWS SSO — lfx-stag | Local credential | ✅ / ⚠️ | Account 844790888233 | +| AWS SSO — lfx-prod | Local credential | ✅ / ⚠️ | Account 372256339901 | +| Kubeconfig — dev | Local file | ✅ / ⚠️ | {N} nodes reachable | +| Kubeconfig — staging | Local file | ✅ / ⚠️ | {N} nodes reachable | +| Kubeconfig — prod | Local file | ✅ / ⚠️ | {N} nodes reachable | +| MCP: k8s dev | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: k8s stag | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: k8s prod | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: AWS lfx dev | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: AWS lfx stag | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: AWS lfx prod | via Extension | ✅ / ⚠️ | Responding to tool calls | +| MCP: Datadog (cloud connector) | Cloud connector | ✅ / ⚠️ | Authorized via Claude settings | +| MCP: GitHub (cloud connector) | Cloud connector | ✅ / ⚠️ | Authorized via Claude 
settings | +``` + +Use ✅ for each component that passed, ⚠️ for anything that needs attention. +If everything is green, tell the user: + +> "You're all set. You can now use `/platform-deploy` to release services and +> the `platform-troubleshoot` skill for incident investigation." + +If there are ⚠️ items, summarize what's left and what to do about each one. + +--- + +## Notes on re-running + +This command is safe to run at any time: +- It checks before installing — nothing gets reinstalled unnecessarily +- AWS config is appended, not overwritten — existing profiles for other projects + are preserved +- Kubeconfig files are regenerated only if missing or not connecting +- AWS SSO tokens expire after 12 hours — re-run Step 4d (or just + `aws sso login --profile lfx-dev`) to refresh diff --git a/lfx-platform/skills/platform-troubleshoot/SKILL.md b/lfx-platform/skills/platform-troubleshoot/SKILL.md new file mode 100644 index 0000000..d5dde91 --- /dev/null +++ b/lfx-platform/skills/platform-troubleshoot/SKILL.md @@ -0,0 +1,368 @@ +--- +name: lfx-platform-troubleshoot +description: > + Structured troubleshooting guide for LFX platform engineering. Use this skill + whenever someone is investigating a service incident, unexpected behavior, crash, + performance problem, or deployment issue in any LFX environment (dev, staging, prod). + This skill orchestrates Datadog, Kubernetes, and AWS MCPs in the correct environment, + enforces safe investigation practices, and links findings to infrastructure-as-code + remediation paths. Trigger on phrases like: "service is down", "pods crashing", + "high error rate", "deployment issue", "something is broken in [env]", "help me + debug", "logs showing errors", or any request to investigate a live LFX service. +allowed-tools: AskUserQuestion, Bash, Read, Glob, Grep, mcp__github__* +--- + + + + +# LFX Platform Troubleshooting + +You are a structured troubleshooting partner for the LFX platform engineering team. 
+Your job is to help engineers identify the root cause of issues in live environments +using Datadog, Kubernetes, and AWS — always staying within the correct environment +boundary, and always tying findings back to actionable remediation paths. + +--- + +## Step 0: Confirm Environment and Service — MANDATORY BEFORE ANY TOOL CALLS + +**Never query any external data source until you have confirmed both of these.** + +### Environment + +Identify which environment is being discussed. Users may say: + +| What they say | Canonical environment | MCP suffix to use | Datadog `env` tag | +|---|---|---|---| +| dev, development, local-cluster | **dev** | `_dev` | `env:development` | +| stage, staging, stag | **stag** | `_stag` | `env:staging` | +| prod, production | **prod** | `_prod` | `env:production` | + +Datadog uses **full** environment names (`development`, `staging`, `production`) — +not the short forms. Always use the Datadog tag from the table above when querying +logs, metrics, or traces. + +If the environment is ambiguous or unspecified, **stop and ask**: + +> "Which environment are you troubleshooting — dev, staging, or prod?" + +Do not infer the environment from context clues alone. A wrong environment produces +misleading data and leads to bad troubleshooting. When in doubt, ask. + +### Service Name + +Ask the user to name the service they are troubleshooting. Then verify the name +against actual data in Datadog before proceeding further. Look for a **clear, unambiguous +match** — if the name the user gave could match multiple services, or matches nothing, +**stop and ask for clarification** rather than guessing. + +> "I'm not finding a clear match for '[name]' in [env]. Could you double-check the +> service name, or share a Datadog service URL or Kubernetes resource name?" + +Err on the side of caution. Pulling data from the wrong service wastes time and +can be actively misleading. 
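A verification query can be sketched in the same style as the tool-call examples later in this skill. The parameter names here are illustrative, not the Datadog MCP's actual schema; use whatever arguments the tool really exposes:

```
mcp__datadog_lfx__search_datadog_services
  query: "committee"        # the user's approximate service name
# Proceed only if exactly one result (e.g. lfx-v2-committee-service)
# exists in the confirmed environment; otherwise stop and ask.
```

The point is the check itself, not the syntax: one unambiguous match in the confirmed environment before any further tool calls.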
+ +--- + +## Environment-to-MCP Mapping + +Once the environment is confirmed, use only the MCP tools for that environment. +Cross-environment queries corrupt the investigation. + +| Environment | Kubernetes MCP | AWS MCP | Datadog filter | +|---|---|---|---| +| dev | `mcp__k8s_dev__*` | `mcp__AWS_lfx_dev__*` | `env:development` | +| stag | `mcp__k8s_stag__*` | `mcp__AWS_lfx_stag__*` | `env:staging` | +| prod | `mcp__k8s_prod__*` | `mcp__AWS_lfx_prod__*` | `env:production` | + +Datadog is a single MCP (`mcp__datadog_lfx__*`) that spans all environments — always +include the `env` tag in every query. + +The GitHub connector (`mcp__github__*`) is environment-agnostic — it reads from +source repos and is safe to use regardless of which environment you're investigating. + +See [references/environment-mapping.md](references/environment-mapping.md) +for the full service-to-namespace mapping, AWS account IDs, and kubeconfig setup. +See [references/argocd-structure.md](references/argocd-structure.md) for ArgoCD repo +layout, file paths, and how versions are controlled. + +--- + +## Investigation Flow + +Work through these layers in order. Don't jump ahead — each layer informs whether +and how to use the next one. + +### Layer 1: Datadog — Start Here + +Datadog is your primary data source. It aggregates service metrics, logs, traces, +and infrastructure data from Kubernetes and AWS in one place. Start here before +touching the Kubernetes or AWS MCPs. + +**1a. Service health snapshot** + +Pull a current picture of the service: +- Is it running? Healthy? Crashing or restarting? +- What is the current error rate and latency? +- When did the issue begin? Is there a clear inflection point? +- When was it last deployed or restarted? + +Relevant Datadog tools: `search_datadog_services`, `get_datadog_metric`, +`search_datadog_monitors`, `search_datadog_events` + +**1b. 
Service logs and traces** + +Get workload-specific evidence: +- Are there error patterns in logs at or after the inflection point? +- Do traces show increased latency, timeouts, or downstream failures? +- Are errors concentrated in a specific endpoint, consumer, or operation? + +Relevant tools: `search_datadog_logs`, `analyze_datadog_logs`, `search_datadog_spans`, +`get_datadog_trace` + +**1c. Cluster context (zoom out — for context only)** + +Check whether anything changed at the cluster level around the same time: +- Are other services also showing elevated errors or restarts? +- Were there node-level events (scaling, evictions, restarts)? +- Is there evidence of resource pressure (CPU, memory) at the node level? + +This context helps you understand *whether* the issue is workload-specific or +platform-wide. It should not become the answer to the problem by itself. If you +find a cluster-level event (e.g., a node restart, an autoscaling event), that is +a hypothesis — not a conclusion. You must then find workload-specific evidence +to support or refute it (e.g., pod restart logs, OOM events for that specific service, +errors in logs timed to match the event). + +Relevant tools: `search_datadog_hosts`, `get_datadog_metric` (node-level), +`search_datadog_events` + +--- + +### Layer 2: GitHub Connector — Deployment Context + +Use the GitHub connector when you need to understand *what was deployed* and +*how it was configured* at the source level. This is particularly useful when: + +- Behavior changed after a recent deploy and you want to understand what changed +- A pod is running but behaving unexpectedly — check chart defaults vs. env overrides +- You need to correlate a running image tag with the code and config at that commit +- You want to inspect Helm values to understand expected secrets, env vars, or resource limits + +The GitHub connector is read-only and environment-agnostic — use it freely. + +**2a. 
Check what version is deployed (ArgoCD config)** + +Read the ApplicationSet to find `targetRevision` for the service in the target env: + +``` +mcp__github__get_file_contents + owner: linuxfoundation + repo: lfx-v2-argocd + path: apps/{env-folder}/lfx-v2-applications.yaml # env-folder: staging or prod +``` + +Read the env-specific values file to see what `image.tag` and other overrides +are set for this environment: + +``` +mcp__github__get_file_contents + owner: linuxfoundation + repo: lfx-v2-argocd + path: values/{env-folder}/lfx-v2-{service}.yaml +``` + +Also check global values for baseline configuration: + +``` +mcp__github__get_file_contents + owner: linuxfoundation + repo: lfx-v2-argocd + path: values/global/lfx-v2-{service}.yaml +``` + +The `env-folder` in the ArgoCD repo is `staging` (not `stag`) and `prod`. + +**2b. Check Helm chart defaults (service repo)** + +Read the chart's default values to understand what the service expects from its +environment — env vars, secret references, resource limits, replica counts: + +``` +mcp__github__get_file_contents + owner: linuxfoundation + repo: lfx-v2-{service} + path: charts/lfx-v2-{service}/values.yaml + ref: {targetRevision} # pin to the deployed tag, not HEAD +``` + +If you can see the running image tag from Datadog or K8s, use that as `ref` to +read the chart exactly as it was deployed. + +**2c. Browse recent commits or releases** + +If you're trying to pinpoint when a regression was introduced: + +``` +mcp__github__list_commits + owner: linuxfoundation + repo: lfx-v2-{service} + sha: main + per_page: 20 +``` + +Or check the release history to map image tags to what changed: + +``` +mcp__github__list_releases + owner: linuxfoundation + repo: lfx-v2-{service} +``` + +**What to look for:** +- Does the `targetRevision` in ArgoCD match what Datadog/K8s shows running? + If not, ArgoCD may not have synced yet — or a previous deploy didn't complete. 
+- Does the values file contain the expected secrets, env vars, and connection strings?
+  A missing or wrong value here is a common cause of startup failures.
+- Did a recent commit change resource limits, secret names, or startup configuration?
+
+---
+
+### Layer 3: Kubernetes MCP — Zoom In
+
+Use the Kubernetes MCP when Datadog has identified something specific that you
+want to examine in more detail, or when Datadog data alone is insufficient to
+understand what is happening at the pod level. If you've read the Helm chart
+values in Layer 2, use them as context here — e.g., confirm the running pod's
+env vars match what the chart expects.
+
+Good reasons to reach for the Kubernetes MCP:
+- Datadog shows the service restarting but you want to see the termination reason
+  and current pod status
+- You want live resource usage for a specific pod (not just aggregated metrics)
+- You want to check configuration: environment variables, mounted secrets, replica counts
+- You want to see Kubernetes events for a specific resource (CrashLoopBackOff, OOMKilled, etc.)
+
+Relevant tools: `mcp__k8s_{env}__pods_list_in_namespace`, `mcp__k8s_{env}__pods_get`,
+`mcp__k8s_{env}__pods_log`, `mcp__k8s_{env}__pods_top`, `mcp__k8s_{env}__events_list`,
+`mcp__k8s_{env}__resources_get`
+
+**Both the Kubernetes MCP and AWS MCP are read-only.** They are for observation,
+not for making changes. Attempting to write through them will fail.
+
+See [references/kubernetes.md](references/kubernetes.md) for namespace conventions,
+important resource types, and how services are organized across environments.
+
+---
+
+### Layer 4: AWS MCP — Zoom Out
+
+Use the AWS MCP when findings point toward an underlying infrastructure issue
+that Kubernetes alone cannot explain. This is not a routine first step — reach
+for it when you have a specific hypothesis about an AWS-level failure.
+
+Good reasons to reach for the AWS MCP:
+- Logs show database connection failures → check RDS/Aurora status and parameter groups
+- Logs show Secrets Manager access errors → check secret existence and IAM policies
+- Service is timing out on external calls → check security groups, VPC routing
+- EKS node issues suggest a cluster-level problem → check EKS control plane events
+- S3 or OpenSearch errors in logs → check bucket/domain status and access policies
+
+Relevant tools: `mcp__AWS_lfx_{env}__call_aws` (with appropriate service API calls)
+
+See [references/aws.md](references/aws.md) for the AWS account structure, relevant
+services per environment, and common infrastructure patterns.
+
+---
+
+## Suggesting Remediation
+
+### All environments: explain the fix
+
+For any root cause you identify, provide:
+1. **What is happening** — a clear, specific explanation grounded in the evidence
+2. **Why it is happening** — the probable cause, supported by the data you found
+3. **How to fix it** — with environment-appropriate guidance (see below)
+
+### Dev: hands-on changes are acceptable
+
+In development, engineers can make targeted changes directly. When suggesting a
+change:
+- Provide the `kubectl` command or AWS CLI command to test the fix
+- **Also provide the corresponding IaC change** that would make this permanent
+
+This matters because our infrastructure as code will regularly reconcile dev,
+and any manual changes that are not backed by an IaC commit will be wiped out.
+Manual changes in dev are for testing a hypothesis — the real fix always lives
+in IaC.
+
+For larger or more structural changes (e.g., changing resource limits, modifying
+service topology, updating Helm values), skip the manual step entirely and point
+directly to the IaC repos. The larger the change, the more important it is to
+do it right the first time through code.
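As an illustrative sketch of that pairing for a small dev-only fix (the service name, file path, and values shape are assumptions based on conventional Helm chart layout; verify against the actual chart before using):

```yaml
# Hypothesis test in dev (will be reconciled away by ArgoCD on its next sync):
#   kubectl --kubeconfig ~/.kube/dev-config -n committee-service \
#     set resources deployment/lfx-v2-committee-service --limits=memory=512Mi
#
# Permanent form of the same change, committed to lfx-v2-argocd in
# values/dev/lfx-v2-committee-service.yaml (illustrative values file):
resources:
  limits:
    memory: 512Mi   # raised once the manual test confirmed the hypothesis
```

The manual command proves the fix; the committed values change is what actually keeps it.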
+
+See [references/iac-repos.md](references/iac-repos.md) for the relevant infrastructure
+repos, their structure, and how to navigate them.
+
+### Staging and prod: IaC only
+
+In staging and production, most engineers do not have permissions to make direct
+changes to Kubernetes or AWS resources. All changes go through infrastructure as
+code. When suggesting a remediation for staging or prod:
+- Do **not** provide direct `kubectl` or AWS CLI commands as the primary fix path
+- Do provide the exact IaC change needed and which repo/file it lives in
+- Explain what the change does and why, so the engineer can write a PR with confidence
+
+---
+
+## Investigation Output Format
+
+After completing your investigation, summarize your findings in this structure:
+
+```
+## Troubleshooting Summary
+
+**Service:** [service name]
+**Environment:** [dev | stag | prod]
+**Investigated at:** [timestamp]
+
+### Current Status
+[One paragraph: is the service healthy? What is the current error rate/behavior?]
+
+### Timeline
+[Key events in chronological order — deployments, restarts, metric changes, errors]
+
+### Root Cause (or Most Likely Hypothesis)
+[Specific finding backed by evidence. If uncertain, say so and explain what evidence
+supports each hypothesis.]
+
+### Evidence
+[Bullet list of specific data points: log lines, metric values, event timestamps,
+pod states. Each point should be traceable back to a specific tool call.]
+ +### Recommended Fix +[What to change, and how — see environment guidance above] + +### IaC Path +[Which repo, file, and approximate change is needed to make this permanent] + +### What We Ruled Out +[Optional but useful: things that looked relevant but weren't, and why] +``` + +--- + +## Reference Files + +As more environment details are added, read the relevant reference file when you +need specifics about that layer: + +| Reference | What it covers | +|---|---| +| [references/environment-mapping.md](references/environment-mapping.md) | Datadog env tags, service-to-namespace mapping, AWS account IDs, kubeconfig setup, SSO config | +| [references/argocd-structure.md](references/argocd-structure.md) | ArgoCD repo layout, file paths, service-to-ApplicationSet mapping, image tag conventions | +| [references/kubernetes.md](references/kubernetes.md) | Full namespace layout, available tools, investigation sequence, making changes | +| [references/iac-repos.md](references/iac-repos.md) | IaC repos (opentofu, argocd, datadog, secrets), when to use each, dev vs stag/prod fix paths | +| [references/aws.md](references/aws.md) | AWS services in LFX, investigation patterns — expand as needed | +| [references/datadog.md](references/datadog.md) | Datadog tool reference, metrics, dashboards — expand as needed | diff --git a/lfx-platform/skills/platform-troubleshoot/references/argocd-structure.md b/lfx-platform/skills/platform-troubleshoot/references/argocd-structure.md new file mode 100644 index 0000000..3d0eea6 --- /dev/null +++ b/lfx-platform/skills/platform-troubleshoot/references/argocd-structure.md @@ -0,0 +1,169 @@ + + + +# ArgoCD Structure + +This file documents how the LFX platform is deployed through ArgoCD, including +the exact repo layout, how versions are controlled, and what to change when +deploying a new release. + +--- + +## Architecture Overview + +ArgoCD uses an **app-of-apps** pattern. 
The root `app-of-apps` Application in each +cluster points at the ArgoCD repo and generates all other Applications. + +Most LFX v2 services are managed by a single **ApplicationSet** named +`lfx-v2-applications`. Each service is a list element in that ApplicationSet. + +**Key repos involved:** + +| Repo | Role | +|---|---| +| `linuxfoundation/lfx-v2-argocd` | ApplicationSet definitions + all values files | +| `linuxfoundation/lfx-v2-{service}` | Helm chart lives here under `charts/` | + +The Helm chart for each service lives **inside the service's own repo** at +`charts/{service-name}/` — not in the ArgoCD repo. This means releasing a new +version is a single operation: tag the service repo, and everything (Docker image ++ Helm chart) comes from the same commit. + +--- + +## How Dev Auto-Deploys + +In dev, every service has `targetRevision: HEAD`. ArgoCD continuously polls the +service repo's main branch and deploys changes automatically. There is no manual +deploy step for dev — merge to main is all that's needed. + +--- + +## How Staging and Production Deploys Work + +For staging and production, the ApplicationSet list element for each service has +`targetRevision` set to a specific semver tag (e.g., `1.5.0`). To deploy a new +version, you change that tag in the ArgoCD repo. + +The values files in the ArgoCD repo also set `image.tag` explicitly for all +staging/production services — this field **must** be updated alongside +`targetRevision`. Tags use bare semver without a `v` prefix (e.g., `1.5.0` not +`v1.5.0`). This is a transitional state: once Helm chart version management is +fully automated, `image.tag` will be removed and only `targetRevision` will need +updating. 
+ +--- + +## ArgoCD Repo File Paths + +### ApplicationSet files (one per environment, defines all services) + +``` +apps/dev/lfx-v2-applications.yaml +apps/staging/lfx-v2-applications.yaml # may not exist yet for some services +apps/prod/lfx-v2-applications.yaml +``` + +Each file contains a `spec.generators[0].list.elements` array. Each element looks like: + +```yaml +- name: lfx-v2-committee-service + namespace: committee-service + path: charts/lfx-v2-committee-service + repoURL: https://github.com/linuxfoundation/lfx-v2-committee-service + targetRevision: 1.5.0 # ← this is what you change for a deploy (bare semver, no v prefix) +``` + +In dev, `targetRevision: HEAD`. + +### Values files (environment-specific overrides) + +``` +values/global/{service-name}.yaml # shared across all environments +values/dev/{service-name}.yaml # dev overrides (sets image.tag: development) +values/staging/{service-name}.yaml # staging overrides +values/prod/{service-name}.yaml # production overrides +``` + +Example: `values/dev/lfx-v2-committee-service.yaml` sets `image.tag: development`, +which is why dev uses the `:development` Docker tag. + +For staging/production, the values file sets `image.tag` explicitly using bare +semver (e.g., `image.tag: 1.5.0`). **Always update this alongside `targetRevision`** +when deploying — both must match for the deployment to use the correct image. 
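Concretely, promoting a service to staging touches two files in `lfx-v2-argocd`; a sketch of the paired edit (service name and version numbers are illustrative):

```yaml
# apps/staging/lfx-v2-applications.yaml — the service's list element:
- name: lfx-v2-committee-service
  namespace: committee-service
  path: charts/lfx-v2-committee-service
  repoURL: https://github.com/linuxfoundation/lfx-v2-committee-service
  targetRevision: 1.6.0   # bumped from 1.5.0 (bare semver, no v prefix)

# values/staging/lfx-v2-committee-service.yaml — must move in lockstep:
image:
  tag: 1.6.0
```

A PR that bumps `targetRevision` without `image.tag` (or vice versa) deploys a mismatched chart and image, so review both hunks together.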
+ +--- + +## Service-to-ApplicationSet Mapping + +All of these services are entries in `lfx-v2-applications`: + +| ArgoCD Application name | Service repo | Helm chart path | +|---|---|---| +| `lfx-v2-auth-service` | `linuxfoundation/lfx-v2-auth-service` | `charts/lfx-v2-auth-service` | +| `lfx-v2-committee-service` | `linuxfoundation/lfx-v2-committee-service` | `charts/lfx-v2-committee-service` | +| `lfx-v2-mailing-list-service` | `linuxfoundation/lfx-v2-mailing-list-service` | `charts/lfx-v2-mailing-list-service` | +| `lfx-v2-meeting-service` | `linuxfoundation/lfx-v2-meeting-service` | `charts/lfx-v2-meeting-service` | +| `lfx-v2-member-service` | `linuxfoundation/lfx-v2-member-service` | `charts/lfx-v2-member-service` | +| `lfx-v2-project-service` | `linuxfoundation/lfx-v2-project-service` | `charts/lfx-v2-project-service` | +| `lfx-v2-query-service` | `linuxfoundation/lfx-v2-query-service` | `charts/lfx-v2-query-service` | +| `lfx-v2-survey-service` | `linuxfoundation/lfx-v2-survey-service` | `charts/lfx-v2-survey-service` | +| `lfx-v2-voting-service` | `linuxfoundation/lfx-v2-voting-service` | `charts/lfx-v2-voting-service` | +| `lfx-v2-ui` | `linuxfoundation/lfx-v2-ui` | `charts/lfx-v2-ui` | +| `lfx-v1-sync-helper` | `linuxfoundation/lfx-v1-sync-helper` | `charts/lfx-v1-sync-helper` | +| `lfx-changelog` | `linuxfoundation/lfx-changelog` | `charts/lfx-changelog` | +| `lfx-platform` | `linuxfoundation/lfx-v2-helm` | `charts/lfx-platform` | + +--- + +## Individually Defined Applications (not in the ApplicationSet) + +These are defined as standalone Application files in the ArgoCD repo: + +| ArgoCD Application name | Notes | +|---|---| +| `app-of-apps` | Root app — do not modify directly | +| `identity-cookie-helper` | Part of `lfx-platform` group | +| `lfit-litellm` | Internal LF tooling, not an LFX v2 service | +| `lfx-mcp` | MCP server | + +--- + +## Monitoring a Deployment + +To check if ArgoCD has synced after merging a version bump PR, read the Application 
+resource directly from the cluster: + +``` +# Using K8s MCP: +mcp__k8s_{env}__resources_get + apiVersion: argoproj.io/v1alpha1 + kind: Application + namespace: argocd + name: {service-name} # e.g., lfx-v2-committee-service +``` + +Look for: +- `status.sync.status: Synced` — ArgoCD has applied the new manifest +- `status.health.status: Healthy` — all resources are healthy +- `status.summary.images` — shows the current image tag running; confirm it matches the new version +- `status.operationState.phase: Succeeded` — last sync operation completed + +--- + +## Image Tag Pattern + +``` +ghcr.io/linuxfoundation/{repo-name}/{binary-name}:{tag} +``` + +Examples: +- Dev: `ghcr.io/linuxfoundation/lfx-v2-committee-service/committee-api:development` +- Versioned: `ghcr.io/linuxfoundation/lfx-v2-committee-service/committee-api:1.5.0` + +Note: production tags use bare semver without a `v` prefix. + +The binary name within the image path (e.g., `committee-api`) is defined by the +service's CI/CD and Helm chart — it's not always predictable from the repo name. +The current image can always be found at `status.summary.images` in the ArgoCD +Application resource. diff --git a/lfx-platform/skills/platform-troubleshoot/references/aws.md b/lfx-platform/skills/platform-troubleshoot/references/aws.md new file mode 100644 index 0000000..be0cd1f --- /dev/null +++ b/lfx-platform/skills/platform-troubleshoot/references/aws.md @@ -0,0 +1,71 @@ + + + +# AWS Reference + +This file documents AWS-specific patterns for troubleshooting LFX services. + +--- + +## When to Use the AWS MCP + +The AWS MCP is a zoom-out tool — not a starting point. Reach for it when: +- Logs or traces point to a specific AWS service failure (database, Secrets Manager, S3, OpenSearch) +- You suspect EKS-level issues (control plane, node groups) +- There's evidence of IAM/permission problems +- Network-level issues suggest VPC, security group, or routing problems + +Do not reach for AWS as a general first step. 
Start with Datadog, escalate to
+Kubernetes, then AWS if the evidence points there.
+
+---
+
+## Environment-to-MCP Mapping
+
+| Environment | MCP prefix |
+|---|---|
+| dev | `mcp__AWS_lfx_dev__` |
+| stag | `mcp__AWS_lfx_stag__` |
+| prod | `mcp__AWS_lfx_prod__` |
+
+Both the AWS and Kubernetes MCPs are **read-only**. Use them for investigation only.
+
+---
+
+## Key AWS Services in LFX
+
+**TODO:** Fill in the specific AWS services used by the LFX platform.
+
+| Service | What it does in LFX | How to investigate |
+|---|---|---|
+| EKS | Kubernetes cluster hosting | Check control plane logs, node group status |
+| RDS / Aurora | | Connection errors, slow queries, failover events |
+| OpenSearch | Search index (query service) | Cluster health, index status, shard errors |
+| Secrets Manager | Service credentials | Secret existence, version, IAM access |
+| S3 | | Access errors, lifecycle policies |
+| | | |
+
+---
+
+## AWS Account Structure
+
+**TODO:** Fill in any cross-account patterns. Account IDs and roles below match
+[environment-mapping.md](environment-mapping.md); all accounts run in `us-west-2`.
+
+| Environment | Account | Notes |
+|---|---|---|
+| dev | `788942260905` | PowerUser role via SSO profile `lfx-dev` |
+| stag | `844790888233` | Read-Only role via SSO profile `lfx-stag` |
+| prod | `372256339901` | Read-Only role via SSO profile `lfx-prod` |
+
+---
+
+## Common Investigation Patterns
+
+**TODO:** Document common AWS investigation patterns specific to LFX.
+
+Examples to fill in:
+- How to check RDS instance status and recent events
+- How to verify Secrets Manager secret exists and is accessible
+- How to check EKS node group status
+- How to review security group rules for connectivity issues
+- How to check OpenSearch cluster health
diff --git a/lfx-platform/skills/platform-troubleshoot/references/datadog.md b/lfx-platform/skills/platform-troubleshoot/references/datadog.md
new file mode 100644
index 0000000..56245ce
--- /dev/null
+++ b/lfx-platform/skills/platform-troubleshoot/references/datadog.md
@@ -0,0 +1,71 @@
+
+
+
+# Datadog Reference
+
+This file documents Datadog-specific patterns for troubleshooting LFX services.
+ +--- + +## Key Datadog Tools + +| Tool | When to use | +|---|---| +| `search_datadog_services` | Find a service by name, verify it exists in an environment | +| `search_datadog_monitors` | Check if any alerts are firing for the service or environment | +| `search_datadog_events` | Find deployment events, restarts, and cluster-level changes | +| `get_datadog_metric` | Pull specific metric values (error rate, latency, CPU, memory) | +| `search_datadog_logs` | Query logs for a specific service in a time window | +| `analyze_datadog_logs` | Let Datadog analyze a log query for patterns and anomalies | +| `search_datadog_spans` | Find distributed trace spans for a service or operation | +| `get_datadog_trace` | Get a specific trace by ID for deep inspection | +| `search_datadog_hosts` | Find hosts/nodes and their status | +| `get_datadog_metric_context` | Get surrounding metric context for a specific time | + +--- + +## Service Naming + +**TODO:** Document how LFX services are named in Datadog. + +Questions to answer: +- Do service names in Datadog match Kubernetes deployment names exactly? +- Is there a prefix or suffix convention (e.g., `lfx-v2-committee-service`)? +- Are there separate entries for different components (API, worker, etc.)? + +--- + +## Key Metrics to Check + +**TODO:** Fill in the standard metrics for LFX services. + +Suggested starting points: +- Error rate: `` +- Request latency (p99): `` +- Pod restart count: `` +- CPU utilization: `` +- Memory utilization: `` + +--- + +## Important Dashboards + +**TODO:** Add links to key Datadog dashboards used by the platform team. + +| Dashboard | URL | When to use | +|---|---|---| +| Cluster overview | | Node health, resource pressure | +| Service health | | Per-service error rates and latency | +| | | | + +--- + +## Log Query Patterns + +**TODO:** Document common log query patterns used in LFX. 
+ +Example patterns to document: +- Find all errors for a service in a time window +- Find pod restart events +- Find failed NATS message processing +- Find database connection errors diff --git a/lfx-platform/skills/platform-troubleshoot/references/environment-mapping.md b/lfx-platform/skills/platform-troubleshoot/references/environment-mapping.md new file mode 100644 index 0000000..1fd24f5 --- /dev/null +++ b/lfx-platform/skills/platform-troubleshoot/references/environment-mapping.md @@ -0,0 +1,150 @@ + + + +# Environment Mapping + +This file defines how the three LFX environments (dev, stag, prod) map to specific +tool configurations across Datadog, Kubernetes, and AWS. + +--- + +## MCP Tool Name Reference + +MCP server names from `claude_desktop_config.json` are translated to tool prefixes +by replacing spaces with underscores and prepending `mcp__`. + +| Environment | Kubernetes tools | AWS tools | +|---|---|---| +| dev | `mcp__k8s_dev__*` | `mcp__AWS_lfx_dev__*` | +| stag | `mcp__k8s_stag__*` | `mcp__AWS_lfx_stag__*` | +| prod | `mcp__k8s_prod__*` | `mcp__AWS_lfx_prod__*` | + +Datadog spans all environments: `mcp__datadog_lfx__*`. Always filter by `env` tag. + +--- + +## Datadog Environment Tags + +Datadog uses full environment names. Always include the `env` tag in every query. + +| User says | Canonical env | Datadog `env` tag | +|---|---|---| +| dev, development | dev | `env:development` | +| stag, staging | stag | `env:staging` | +| prod, production | prod | `env:production` | + +**Verified:** `env:development` confirmed live. `env:staging` and `env:production` +follow the same full-name convention. + +Host naming also follows this pattern: `{instance-id}-lfx-v2-{environment}` +(e.g., `i-0f96d19334698dacb-lfx-v2-development`). + +--- + +## Datadog Service Names + +Services are named `lfx-v2-{service}` in Datadog. 
Full list of known LFX v2 services: + +| Datadog service name | Kubernetes namespace | Notes | +|---|---|---| +| `lfx-v2-auth-service` | `auth-service` | | +| `lfx-v2-committee-service` | `committee-service` | | +| `lfx-v2-fga-sync` | `lfx` | Shared platform component | +| `lfx-v2-indexer-service` | `lfx` | Shared platform component | +| `lfx-v2-mailing-list-service` | `mailing-list-service` | | +| `lfx-v2-meeting-service` | `meeting-service` | | +| `lfx-v2-member-service` | `member-service` | | +| `lfx-v2-project-service` | `project-service` | | +| `lfx-v2-query-service` | `query-service` | | +| `lfx-v2-ui` | `ui` | PR previews in `ui-pr-{number}` namespaces | +| `lfx-v2-access-check` | `lfx` | Shared platform component | +| `lfx-platform-heimdall.lfx` | `lfx` | Shared platform component | +| `lfx-platform-openfga` | `lfx` | Shared platform component | +| `lfx-changelog` | `changelog` | | +| `lfx-mcp-server` | `mcp-server` | | +| `lfx-v1-sync-helper` | `v1-sync-helper` | | + +**Namespace naming pattern for v2 services:** the Kubernetes namespace matches the +service component name (e.g., `lfx-v2-committee-service` → namespace `committee-service`). +Shared platform components (fga-sync, indexer, access-check, heimdall, openfga) live +in the `lfx` namespace. + +--- + +## Kubernetes Clusters and Kubeconfig Files + +EKS cluster name is `lfx-v2` in every account. 
+ +| Environment | Kubeconfig path | EKS cluster | AWS profile | +|---|---|---|---| +| dev | `~/.kube/dev-config` | `lfx-v2` | `lfx-dev` | +| stag | `~/.kube/staging-config` | `lfx-v2` | `lfx-stag` | +| prod | `~/.kube/prod-config` | `lfx-v2` | `lfx-prod` | + +**Generating kubeconfig files for a new machine** (run after AWS SSO login): + +```bash +aws eks update-kubeconfig \ + --region us-west-2 --name lfx-v2 --profile lfx-dev \ + --kubeconfig ~/.kube/dev-config + +aws eks update-kubeconfig \ + --region us-west-2 --name lfx-v2 --profile lfx-stag \ + --kubeconfig ~/.kube/staging-config + +aws eks update-kubeconfig \ + --region us-west-2 --name lfx-v2 --profile lfx-prod \ + --kubeconfig ~/.kube/prod-config +``` + +Each file is generated with a single cluster and the default context set automatically. + +--- + +## AWS Accounts + +All LFX accounts are in `us-west-2`. + +| Environment | MCP server | AWS SSO profile | Account ID | Role | +|---|---|---|---|---| +| dev | `AWS lfx dev` | `lfx-dev` | `788942260905` | PowerUser | +| stag | `AWS lfx stag` | `lfx-stag` | `844790888233` | Read-Only | +| prod | `AWS lfx prod` | `lfx-prod` | `372256339901` | Read-Only | + +All AWS MCPs run with `READ_OPERATIONS_ONLY=true`. + +**AWS SSO setup** — see [aws-sso-setup.md](aws-sso-setup.md) for the full walkthrough. +The short version: run `aws configure sso` with SSO start URL `https://lfx.awsapps.com/start` +and session name `okta-lfx`. 
Or copy the example config block below directly into +`~/.aws/config`: + +```ini +[sso-session okta-lfx] +sso_start_url = https://lfx.awsapps.com/start +sso_region = us-east-2 +sso_registration_scopes = sso:account:access + +[profile lfx-dev] +sso_session = okta-lfx +sso_account_id = 788942260905 +sso_role_name = PowerUser +region = us-west-2 +output = json + +[profile lfx-stag] +sso_session = okta-lfx +sso_account_id = 844790888233 +sso_role_name = Read-Only +region = us-west-2 +output = json + +[profile lfx-prod] +sso_session = okta-lfx +sso_account_id = 372256339901 +sso_role_name = Read-Only +region = us-west-2 +output = json +``` + +Then authenticate: `aws sso login --profile lfx-dev` (one login covers all three +profiles for 12 hours). diff --git a/lfx-platform/skills/platform-troubleshoot/references/iac-repos.md b/lfx-platform/skills/platform-troubleshoot/references/iac-repos.md new file mode 100644 index 0000000..9926e32 --- /dev/null +++ b/lfx-platform/skills/platform-troubleshoot/references/iac-repos.md @@ -0,0 +1,82 @@ + + + +# Infrastructure as Code Repos + +Changes to the LFX platform that are not committed to one of these repos will be +reconciled away — particularly in dev, which is continuously managed by ArgoCD. +When suggesting a fix, always provide the corresponding IaC change alongside any +manual test step. 
+ +--- + +## Repos + +| Repo | What it controls | When to use it for fixes | +|---|---|---| +| [lfx-v2-opentofu](https://github.com/linuxfoundation/lfx-v2-opentofu) | AWS infrastructure — EKS clusters, RDS, OpenSearch, S3, IAM, networking | AWS-level issues: resource limits, database config, security groups, secret references | +| [lfx-v2-argocd](https://github.com/linuxfoundation/lfx-v2-argocd) | Kubernetes workload deployments — Helm values, replica counts, environment variables, resource limits | Most Kubernetes-level service config changes | +| [lfx-monitoring-terraform](https://github.com/linuxfoundation/lfx-monitoring-terraform) | Datadog dashboards, monitors, and alerting | Adding/fixing monitors, dashboards, alert thresholds | +| [lfx-secrets-management](https://github.com/linuxfoundation/lfx-secrets-management) | AWS Secrets Manager definitions and External Secrets sync config | Adding new secrets, rotating references, fixing secret access | + +--- + +## Repo Details + +### lfx-v2-argocd — Most Common for Service Issues + +This is where most service-level configuration lives: environment variables, +resource requests/limits, replica counts, Helm chart values per environment. +When a fix involves changing how a service runs in Kubernetes, this is usually +the right repo. + +ArgoCD continuously reconciles the cluster against this repo. Any manual `kubectl` +changes in dev that are not reflected here will be overwritten on the next sync. + +### lfx-v2-opentofu — AWS Infrastructure + +Use this when the issue is at the AWS layer: EKS node group configuration, +RDS parameter groups, OpenSearch domain settings, IAM roles and policies, +VPC/security group rules, or S3 bucket configuration. + +### lfx-secrets-management — Secrets + +When troubleshooting secret access issues, check this repo. The `secrets/lfx` +folder contains the LFX-specific secret definitions. 
Note that this repo also
contains non-LFX resources — stay within `secrets/lfx` when making LFX changes.

### lfx-monitoring-terraform — Datadog

Datadog monitors and dashboards are defined here as Terraform. LFX v2 standards
for monitor structure are still being established — check with the team before
adding new monitors to ensure consistency.

---

## Applying a Fix: Dev vs Stag/Prod

### In dev — test first, then commit

```
1. Identify the fix (e.g., wrong env var, insufficient memory limit)
2. Apply manually with kubectl for immediate testing:
   kubectl --kubeconfig ~/.kube/dev-config -n <namespace> \
     set env deployment/<deployment> KEY=VALUE
3. Verify the fix resolved the issue
4. Commit the equivalent change to lfx-v2-argocd (or opentofu for AWS changes)
5. Open a PR — ArgoCD will deploy from the merged change
```

For larger changes (topology, multiple services, new infrastructure), skip the
manual step entirely and go straight to an IaC PR to avoid inconsistency.

### In stag/prod — IaC only

Direct changes require elevated permissions most engineers don't have. The path is:

```
1. Identify the fix from investigation
2. Open a PR against the appropriate IaC repo
3. After review and merge, ArgoCD promotes the change
4. Verify in Datadog after deployment
```
diff --git a/lfx-platform/skills/platform-troubleshoot/references/kubernetes.md b/lfx-platform/skills/platform-troubleshoot/references/kubernetes.md
new file mode 100644
index 0000000..314a733
--- /dev/null
+++ b/lfx-platform/skills/platform-troubleshoot/references/kubernetes.md
@@ -0,0 +1,138 @@

# Kubernetes Reference

---

## Environment-to-MCP Mapping

| Environment | MCP prefix | Kubeconfig |
|---|---|---|
| dev | `mcp__k8s_dev__` | `~/.kube/dev-config` |
| stag | `mcp__k8s_stag__` | `~/.kube/staging-config` |
| prod | `mcp__k8s_prod__` | `~/.kube/prod-config` |

All Kubernetes MCPs run with `--read-only`. Write operations will fail.
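When dropping to `kubectl` for anything the read-only MCP cannot do, a small helper keeps the environment-to-kubeconfig mapping from the table above consistent. This is an illustrative sketch, not part of the plugin; the paths come from the table, while the function name is ours:

```shell
# Resolve an environment name to its kubeconfig path (mapping from the table above).
kubeconfig_for() {
  case "$1" in
    dev)  echo "$HOME/.kube/dev-config" ;;
    stag) echo "$HOME/.kube/staging-config" ;;
    prod) echo "$HOME/.kube/prod-config" ;;
    *)    echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

kubeconfig_for stag
```

Used, for example, as `kubectl --kubeconfig "$(kubeconfig_for dev)" get pods -n lfx`.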
+ +--- + +## Available Tools + +| Tool | When to use | +|---|---| +| `namespaces_list` | First step when you don't already know the service's namespace | +| `pods_list_in_namespace` | List pods in a namespace — check status and restart counts | +| `pods_get` | Detailed pod info: conditions, events, termination reason | +| `pods_log` | Get logs from a specific pod or container | +| `pods_top` | Live CPU/memory usage per pod | +| `events_list` | Kubernetes events — OOMKilled, CrashLoopBackOff, scheduling failures | +| `resources_get` | Inspect a specific resource (Deployment, ConfigMap, HPA, etc.) | +| `resources_list` | List resources of a type in a namespace | +| `nodes_top` | Node-level CPU/memory — use when checking resource pressure | +| `nodes_stats_summary` | Detailed node stats | +| `configuration_view` | View the kubeconfig being used (useful to confirm environment) | + +--- + +## Namespace Layout + +### Service Namespaces (one namespace per service) + +Each lfx-v2 service has its own namespace. The namespace name matches the service +component name — `lfx-v2-` prefix is dropped. 
+ +| Namespace | Datadog service | Notes | +|---|---|---| +| `auth-service` | `lfx-v2-auth-service` | | +| `committee-service` | `lfx-v2-committee-service` | | +| `mailing-list-service` | `lfx-v2-mailing-list-service` | | +| `meeting-service` | `lfx-v2-meeting-service` | | +| `member-service` | `lfx-v2-member-service` | | +| `project-service` | `lfx-v2-project-service` | | +| `query-service` | `lfx-v2-query-service` | | +| `survey-service` | `lfx-v2-survey-service` | | +| `voting-service` | `lfx-v2-voting-service` | | +| `ui` | `lfx-v2-ui` | Main UI deployment | +| `ui-pr-{number}` | `lfx-v2-ui` | PR preview deployments | +| `changelog` | `lfx-changelog` | | +| `mcp-server` | `lfx-mcp-server` | | +| `v1-sync-helper` | `lfx-v1-sync-helper` | | +| `intercom-auth` | — | | + +### The `lfx` Namespace — Shared Platform Components + +The `lfx` namespace hosts shared infrastructure components that support all services. +It is **not** the first place to look when troubleshooting a specific service issue, +but is commonly checked when investigating platform-wide problems or when a service +shows symptoms that suggest a shared dependency is failing. + +Components in `lfx`: Heimdall (auth gateway), OpenFGA (authorization), fga-sync, +indexer-service, access-check. + +### Infrastructure Namespaces + +These are platform-level components. Rarely the starting point for service +troubleshooting, but relevant when the issue points to infrastructure. + +| Namespace | Purpose | +|---|---| +| `argocd` | GitOps — manages deployments from IaC repos | +| `cert-manager` | TLS certificate management | +| `datadog` | Datadog agent (metrics/logs/traces collection) | +| `external-secrets` | Syncs secrets from AWS Secrets Manager | +| `traefik` | Ingress controller | +| `reloader` | Watches ConfigMaps/Secrets and restarts pods on change | +| `kube-system` | Kubernetes system components | + +### Other Namespaces + +`backstage`, `pcc`, `lfit-litellm` — internal tooling, not LFX platform services. 
+ +### Default Namespace + +Nothing should be deployed here. If you see pods in `default`, that is unexpected. + +--- + +## Finding a Service's Namespace + +If the user gives you a service name but not a namespace, the fastest path is: + +1. Check the service-to-namespace table above +2. If not listed, run `namespaces_list` and look for a name matching the service + +Do not assume a service is in a namespace without verifying — when in doubt, +list namespaces first. + +--- + +## Common Investigation Sequence + +``` +1. namespaces_list → confirm namespace exists +2. pods_list_in_namespace → check pod status and restart count +3. events_list (namespace scope) → look for OOMKilled, CrashLoopBackOff, etc. +4. pods_log → get recent log output from crashing pod +5. pods_get → termination reason, resource requests/limits +6. pods_top → live resource usage (if memory/CPU suspected) +7. resources_get Deployment → replica count, image version, env vars +``` + +--- + +## Making Changes in Dev + +The Kubernetes MCP is read-only. To test a fix directly in dev, use `kubectl` +with the appropriate kubeconfig: + +```bash +kubectl --kubeconfig ~/.kube/dev-config -n +``` + +Always pair any manual change with an IaC commit — dev is reconciled continuously +and manual changes will be overwritten. See [iac-repos.md](iac-repos.md) for +where to make the permanent change. + +For staging and production, direct changes are not permitted. All changes must +go through ArgoCD via an IaC pull request.