Agent install reports success but doesn't always persist (intermittent silent partial failure) #1801

@chubes4

Summary

wp datamachine agent install <bundle> sometimes returns success: true with a fully populated summary (agent_id, pipelines_imported, flows_imported, no conflicts, no runtime_drift), but the agent does not appear in agent list afterwards. A subsequent install call with no changes to the bundle or arguments succeeds and persists. The behavior is intermittent and silently destructive: operators trust the success message and only later discover that nothing was created.

Repro

Site: intelligence-chubes4 (Studio), Data Machine head includes the transcript fix from #1797 and DMC head includes PRs #265#268.

Bundle: a freshly authored target-agnostic wc-static-site-agent bundle. Bundle JSON validates (jq empty), all manifest references resolve.

Sequence:

  1. Delete prior agent + pipeline + files cleanly:
    • flow delete <id> → success
    • pipeline delete <id> --force → success
    • agent delete wc-static-site-agent --delete-files --yes → success
  2. Re-install:
    • wp datamachine agent install <bundle> --owner=admin --yes --format=json
  3. Reported output:
    {
      "success": true,
      "message": "Agent \"wc-static-site-agent\" imported successfully (ID: 8, 1 pipeline(s), 1 flow(s)).",
      "summary": {
        "agent_slug": "wc-static-site-agent",
        "bundle_version": "2",
        "files": 1,
        "pipelines": 1,
        "flows": 1,
        "agent_id": 8,
        "pipelines_imported": 1,
        "flows_imported": 1,
        "conflicts": [],
        "runtime_drift": []
      }
    }
  4. Immediately after: wp datamachine agent list does not include wc-static-site-agent. wp datamachine agent installed does not include it. Pipelines and flows for it are absent.
  5. Second invocation of the identical install command (same bundle, same args) returns success with a different agent_id (11 instead of 8 or 10) and this time the agent is actually present in agent list and agent installed.

I observed this pattern at least twice in a row before the third call persisted. The agent_id keeps incrementing across attempts, suggesting the agent row is briefly written then rolled back without surfacing the failure.
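
To make the pattern above easier to capture, here is a minimal sketch of a retry loop that logs the agent_id reported by each attempt and whether the agent is actually listed afterwards. The function name repro_install is hypothetical, the install flags are copied from the repro, and the check assumes that wp datamachine agent list prints the agent slug somewhere in its default output (not verified against the plugin).

```shell
#!/bin/sh
# Hypothetical helper: repeat the install until the agent is actually listed,
# logging the agent_id each attempt reports.
# Assumption: `wp datamachine agent list` includes the slug in its output.

repro_install() {
    bundle="$1"      # path to the bundle
    slug="$2"        # slug the bundle is expected to create
    max="${3:-5}"    # maximum number of attempts
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        out=$(wp datamachine agent install "$bundle" --owner=admin --yes --format=json) ||
            echo "attempt=$attempt install exited non-zero" >&2

        # Pull the numeric agent_id out of the JSON summary (empty if absent).
        id=$(printf '%s\n' "$out" |
            sed -n 's/.*"agent_id": *\([0-9][0-9]*\).*/\1/p' | head -n 1)

        # Ignore the reported success flag; only the list output counts.
        if wp datamachine agent list | grep -qF -- "$slug"; then
            echo "attempt=$attempt agent_id=${id:-unknown} persisted=yes"
            return 0
        fi
        echo "attempt=$attempt agent_id=${id:-unknown} persisted=no"
        attempt=$((attempt + 1))
    done
    return 1
}
```

Running repro_install <bundle> wc-static-site-agent after the clean delete in step 1 would give one log line per attempt, which should show whether agent_id gaps line up with the non-persisted installs.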

A separate but possibly related path: when wp datamachine agent upgrade is re-run against an agent whose live pipeline/flow have been edited via flow update/pipeline update, upgrade fails with the following error instead of staging PendingActions for the modified artifacts:

    Agent slug "wc-static-site-agent" already exists. Use --slug=<new-slug> to rename on import.

The diff command correctly identifies the artifacts as local_modified and needing approval, but the upgrade entrypoint refuses to proceed.

Expected

  • agent install either persists everything its summary claims or returns success: false with a clear error.
  • agent upgrade against local_modified artifacts stages PendingActions instead of rejecting the install with a misleading "already exists" message.
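
Until the install path enforces the first expectation itself, an operator-side guard can approximate it from outside. The sketch below is a workaround under stated assumptions, not the proposed fix: install_and_verify is a hypothetical wrapper name, the install flags are the ones from the repro, and the check assumes wp datamachine agent list prints the agent slug in its default output.

```shell
#!/bin/sh
# Hypothetical wrapper: treat an install as successful only if the payload
# claims success AND the agent is listed afterwards.
# Assumption: `wp datamachine agent list` includes the slug in its output.

install_and_verify() {
    bundle="$1"   # path to the bundle
    slug="$2"     # slug the bundle is expected to create

    out=$(wp datamachine agent install "$bundle" --owner=admin --yes --format=json) || {
        echo "install command exited non-zero" >&2
        return 1
    }

    # The payload must at least claim success (with or without a space).
    case "$out" in
        *'"success": true'* | *'"success":true'*) ;;
        *) echo "install reported failure" >&2; return 1 ;;
    esac

    # Do not trust the payload alone: confirm the agent is actually listed.
    if wp datamachine agent list | grep -qF -- "$slug"; then
        echo "persisted: $slug"
    else
        echo "install reported success but $slug is not in agent list" >&2
        return 2
    fi
}
```

The distinct return code 2 separates the silent partial failure described here from an ordinary install error, so scripts can log or retry that case specifically.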

Observed

  • Intermittent: identical inputs produce different outcomes (persisted vs not) across runs.
  • Silent: success payload is identical in both cases.
  • Misleading: upgrade rejects with a slug-collision error even when the package is genuinely already installed and the right answer is to stage approvals.

Impact

  • Operators install bundles, see success, run flows, and only realize the agent doesn't exist when downstream commands fail with Agent "..." not found.
  • Recovery requires ad-hoc retries with no understanding of why the second attempt worked. There's no diagnostic to explain the failure.
  • For agents whose live config was edited (legitimate operator action), there's no clean upgrade path; the workaround is delete + reinstall, which is destructive.

Suggested investigation

  1. Check whether the install path runs inside a transaction that can roll back without surfacing the rollback as a failure (silent exception, deferred constraint check, transient SQLite lock under Studio's SQLite-backed runtime).
  2. Verify agent delete --delete-files actually clears the bundle registry row plus any related transient entries, so a subsequent install isn't classified incorrectly as an upgrade.
  3. Make agent upgrade automatically stage PendingActions for local_modified artifacts when the same bundle is being reinstalled by an authenticated operator with --yes.

Site/runtime context

  • Studio site: intelligence-chubes4
  • Data Machine plugin header: 0.104.0
  • Data Machine Code plugin header: 0.30.1
  • SQLite-backed (Studio default)
