Skip to content

Fix silent agent install rollback and misleading upgrade rejection (#1801)#1806

Merged
chubes4 merged 1 commit intomainfrom
fix/1801-agent-install-rollback
May 6, 2026
Merged

Fix silent agent install rollback and misleading upgrade rejection (#1801)#1806
chubes4 merged 1 commit intomainfrom
fix/1801-agent-install-rollback

Conversation

@chubes4
Copy link
Copy Markdown
Member

@chubes4 chubes4 commented May 6, 2026

Fixes #1801.

Summary

  • Eliminates the silent partial-success path in AgentBundler::import(): any failure after the
    agent row is claimed now returns success: false with a typed error_code and rolls back every
    row this call inserted.
  • Makes agent upgrade against local_modified artifacts succeed and surface conflicts instead of
    rejecting with "Agent slug already exists" when the bundle's artifacts lack portable_slug.
  • Defensively wipes any datamachine_bundle_artifacts rows for an agent on
    datamachine_agent_deleted so a subsequent install is classified as a fresh install, not a
    stale upgrade.

Root cause

import() did the following in sequence with no transaction and no post-write verification:

  1. agents_repo->update_agent() or create_if_missing() to claim agent_id.
  2. pipelines_repo->create_pipeline() / update_pipeline() per pipeline.
  3. flows_repo->create_flow() / update_flow() per flow.
  4. agents_repo->update_agent() again to write the artifact registry into agent_config.
  5. Return success: true with the populated summary.

Most sub-step return values were dropped on the floor. Under SQLite contention on Studio (the
repro environment in the issue), step 4 silently rolled back without surfacing an error. The
in-memory bundler thought everything wrote; a fresh SELECT showed no agent row. Because the
return value still claimed agent_id: 8 (or 10, 11, …), operators saw success but agent list
came up empty. The next install bumped the auto-increment without persisting either, until eventually
the engine accepted the writes and the agent finally appeared.

A separate but related shape: agent upgrade of an agent whose live pipeline/flow had been edited
(legit operator action) errored with "Agent slug ... already exists. Use --slug= to
rename on import." The upgrade entrypoint passed straight through to import(), which hit the
install-collision guard whenever --slug was supplied or the bundle's artifacts lacked
portable_slug (which made bundle_has_portable_artifacts() return false). Operators had no
clean upgrade path; the workaround was destructive delete + reinstall.

Changes

  • inc/Core/Agents/AgentBundler.php

    • Wrap the post-claim mutation block in START TRANSACTION / COMMIT with a try/catch \Throwable that runs ROLLBACK and a manual cleanup of rows this call inserted.
    • Throw \RuntimeException on every previously-silent sub-step failure (update_agent → false,
      create_pipeline → 0, update_flow → false, etc.).
    • Re-fetch the agent row after the final write and verify the artifact registry actually
      persisted before declaring success. SQLite under Studio has been observed silently dropping
      these mutations; the verify-then-commit step closes the silent-rollback door.
    • Add is_upgrade option. When set, the slug-collision guard treats an existing row as the
      upgrade target instead of erroring. The bundle-slug mismatch guard still fires so we don't
      overwrite an unrelated agent that happens to share the slug.
    • Typed errors: install_invalid_bundle, install_slug_collision, install_bundle_slug_mismatch,
      install_post_claim_failure. The error message on install_slug_collision now points
      operators at agent upgrade as the right next step.
    • Two new action seams (datamachine_bundle_import_post_claim_started,
      datamachine_bundle_import_pre_commit) so tests can throw inside the critical section without
      relying on a live SQLite race.
  • inc/Cli/Commands/AgentBundleCommand.php

    • upgrade() passes is_upgrade => true to import() so the install-collision guard does not
      fire on legitimate upgrades against local_modified artifacts.
  • inc/Core/Database/BundleArtifacts/InstalledBundleArtifacts.php

    • New register() method hooks datamachine_agent_deleted to wipe any tracked artifact rows
      for the deleted agent_id. The importer does not write to that table today, but extensions
      can — and a stale row would mis-classify a fresh install as an upgrade against a
      non-existent agent.
  • data-machine.php

    • Wire InstalledBundleArtifacts::register() alongside the other bundle-layer registrations.
  • tests/Unit/Core/Agents/AgentBundlerImportTest.php (new)

    • test_post_claim_failure_rolls_back_and_reports_typed_error — fault-injects via
      datamachine_bundle_import_pre_commit, asserts response is success: false with
      error_code: install_post_claim_failure and that no agent / pipeline / flow rows remain.
    • test_upgrade_against_existing_agent_does_not_error_on_slug_collision — installs a clean
      bundle, mutates the live pipeline, re-imports with is_upgrade => true, asserts success and
      that the existing agent_id is reused.
    • test_agent_delete_clears_bundle_artifact_registry — inserts an InstalledBundleArtifacts
      row, fires datamachine_agent_deleted, asserts the row is gone.

Reproduction notes

The repro in the issue is two pieces:

  1. Repeatedly agent install <bundle> — observe success: true with a populated agent_id while
    agent list doesn't show the agent. After this change, that scenario either persists fully or
    returns success: false with error_code: install_post_claim_failure. The new test exercises
    the failure mode deterministically by injecting a throw inside the critical section instead of
    waiting for the SQLite race.

  2. After editing live pipeline/flow rows for an installed agent, re-run agent upgrade <bundle>.
    Previously: "Agent slug ... already exists." After this change: success, with conflicts surfaced
    so the CLI can stage them as PendingActions.

Verified behavior

  • php tests/agent-bundle-portable-update-smoke.php — passes.
  • php tests/agent-bundle-format-smoke.php — passes.
  • php tests/agent-bundle-runtime-drift-smoke.php — passes.
  • php tests/import-agent-ability-smoke.php — passes.
  • php tests/export-agent-ability-smoke.php — passes.
  • php -l clean across all touched files; phpcs -p reports no errors on changed files.
  • The pre-existing failure in tests/agent-bundle-upgrade-planner-smoke.php (workspace scope error
    inside the planner's PendingActionStore path) is unchanged from main and unrelated to this
    change — verified by git stash-ing the diff and rerunning.
  • Unit tests under tests/Unit/Core/Agents/AgentBundlerImportTest.php are runnable via the
    existing WP_UnitTestCase harness; they require a live MySQL test DB so they execute in CI /
    via composer test.

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Sonnet 4.5)
  • Used for: Code authoring, test authoring, and PR copy. Chris reviewed the diff, the typed
    error semantics, and the rollback contract; the AI did not run the live install repro on Studio.

…1801)

The bundler's import path used to claim an agent_id, mutate pipelines/flows/agent_config across
multiple unwrapped wpdb writes, and only check sub-step results for the agent insert. Under SQLite
contention on Studio that produced "success: true" responses with a populated agent_id summary while
the underlying rows were silently rolled back, leaving operators with a `Agent "..." not found` on
the very next CLI call.

This change wraps the post-claim work in a transaction with try/catch + manual rollback, throws on
any sub-step write failure, re-fetches the agent row after the final update_agent and verifies the
artifact registry actually persisted, and surfaces typed errors (`install_post_claim_failure`,
`install_invalid_bundle`, `install_slug_collision`, `install_bundle_slug_mismatch`) instead of bare
strings. The manual_rollback() path is the safety net for engines that silently no-op ROLLBACK; it
deletes only rows this call inserted, never pre-existing ones.

`agent upgrade` now passes `is_upgrade => true` so an existing agent row is treated as the upgrade
target instead of returning "Agent slug already exists. Use --slug=<new-slug> to rename on import."
when a bundle's artifacts lack `portable_slug` and the live pipelines/flows have been edited
(`local_modified`). Conflicts come back through `result.conflicts` and the CLI hands them to the
existing PendingActions staging path, matching what `agent diff` already reports.

Defensive cleanup: `InstalledBundleArtifacts::register()` listens on `datamachine_agent_deleted`
and wipes any tracked rows for the deleted agent_id. The importer does not write to that table
today, but extensions can — and a stale row would mis-classify a fresh install as an upgrade
against a non-existent agent.

Also adds two test seam actions, `datamachine_bundle_import_post_claim_started` and
`datamachine_bundle_import_pre_commit`, so tests can throw inside the critical section without
needing a live SQLite race to reproduce the bug.

Tests: tests/Unit/Core/Agents/AgentBundlerImportTest.php covers all three regression shapes:
silent partial success → rollback + typed error, upgrade-against-local-modified → success with
conflicts, agent delete → bundle artifact registry cleared.
@homeboy-ci
Copy link
Copy Markdown
Contributor

homeboy-ci Bot commented May 6, 2026

Homeboy Results — data-machine

Lint

lint — failed

  • other — 279 finding(s)
  • formatting — 6 finding(s)
  • Total: 285 finding(s)

ℹ️ Auto-fix: homeboy lint data-machine --path /home/runner/work/data-machine/data-machine --changed-since e2fb295 --fix (or homeboy refactor data-machine --path /home/runner/work/data-machine/data-machine --changed-since e2fb295 --from lint --write)
ℹ️ Some issues may require manual fixes
ℹ️ Full options: homeboy docs commands/lint
Deep dive: homeboy lint data-machine --changed-since e2fb295

Test

test — failed

  • 1 failed out of 847 total
  • 3 skipped
  • DataMachine\Tests\Unit\Abilities\ImageGenerationPromptRefinementTest::test_refine_prompt_includes_post_context_when_provided

ℹ️ To run specific tests: homeboy test data-machine -- --filter=TestName
ℹ️ Auto-fix lint issues: homeboy refactor data-machine --from lint --write
ℹ️ Collect coverage: homeboy test data-machine --coverage
ℹ️ Save test baseline: homeboy test data-machine --baseline
ℹ️ Analyze failures: homeboy test data-machine --analyze
ℹ️ Pass args to test runner: homeboy test -- [args]
ℹ️ Full options: homeboy docs commands/test
Deep dive: homeboy test data-machine --changed-since e2fb295

Audit

audit — passed

  • test_coverage — 329 finding(s)
  • dead_code — 180 finding(s)
  • requested_detectors — 66 finding(s)
  • intra-method-duplication — 65 finding(s)
  • parallel-implementation — 36 finding(s)
  • repeated_literal_shape — 13 finding(s)
  • dead_guard — 12 finding(s)
  • field_patterns — 5 finding(s)
  • structural — 5 finding(s)
  • Abilities — 4 finding(s)
  • Total: 730 finding(s)

Deep dive: homeboy audit data-machine --changed-since e2fb295

Tooling versions
  • Homeboy CLI: homeboy 0.157.1+17f602bf
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: 221b9c8
  • Action: Extra-Chill/homeboy-action@v2

@chubes4 chubes4 merged commit 630e4fc into main May 6, 2026
1 of 3 checks passed
@chubes4 chubes4 deleted the fix/1801-agent-install-rollback branch May 6, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent install reports success but doesn't always persist (intermittent silent partial failure)

1 participant