
Add native ComfyUI provider for image and video generation#29

Open
martimramos wants to merge 3 commits into calesthio:main from martimramos:comfyui-adapter

Conversation


@martimramos martimramos commented Apr 16, 2026

Summary

  • Adds comfyui_image and comfyui_video tools that delegate GPU work to a running ComfyUI server via its REST API
  • Includes a shared client (tools/_comfyui/client.py) and 3 bundled workflow templates (FLUX 2 txt2img, WAN 2.2 i2v 4-step, WAN 2.2 t2v 4-step)
  • Model discovery via ComfyUI's /object_info endpoint — tools check that required models are installed before generating, and give actionable error messages when they're missing
  • Actionable COMFYUI_SERVER_URL configuration guidance when server isn't reachable
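The submit/poll/download loop the shared client implements can be sketched against ComfyUI's stock HTTP API (`POST /prompt` returning a `prompt_id`, `GET /history/{prompt_id}` for results). The function names and the fixed output-node argument below are illustrative, not the shipped `client.py` interface:

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8188"  # normally resolved from COMFYUI_SERVER_URL

def submit(workflow: dict) -> str:
    """POST the workflow graph to /prompt and return the queued prompt_id."""
    body = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(f"{SERVER}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["prompt_id"]

def extract_outputs(history_entry: dict, node_id: str) -> list:
    """Pull the file records for one output node from a /history entry.
    Image nodes report under "images"; some video nodes use other keys."""
    return history_entry.get("outputs", {}).get(node_id, {}).get("images", [])

def wait_for_result(prompt_id: str, node_id: str, timeout: float = 600.0) -> list:
    """Poll /history until the prompt appears there, then return its outputs."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{SERVER}/history/{prompt_id}") as resp:
            history = json.loads(resp.read())
        if prompt_id in history:
            return extract_outputs(history[prompt_id], node_id)
        time.sleep(2.0)
    raise TimeoutError(f"ComfyUI did not finish prompt {prompt_id} in {timeout}s")
```

Each returned record carries `filename`/`subfolder`/`type` fields that can be fed to `GET /view` to download the actual file.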

Why

OpenMontage's existing local GPU tools use HuggingFace diffusers directly. This breaks on hardware where the PyTorch ecosystem hasn't caught up — notably NVIDIA Blackwell / DGX Spark (aarch64, CUDA 13.0, sm_121) where there are no stable PyTorch wheels. ComfyUI already solves these compatibility issues and NVIDIA ships official optimized containers for it.

This adapter lets OpenMontage delegate GPU generation to ComfyUI, avoiding the need to install PyTorch/diffusers directly. Same models, better hardware portability.

What's included

| Component | File | Lines |
| --- | --- | --- |
| Shared REST client | `tools/_comfyui/client.py` | ~180 |
| Image generation | `tools/graphics/comfyui_image.py` | ~140 |
| Video generation (t2v + i2v) | `tools/video/comfyui_video.py` | ~190 |
| FLUX 2 Dev workflow | `tools/_comfyui/workflows/flux2-txt2img.json` | |
| WAN 2.2 I2V 4-step workflow | `tools/_comfyui/workflows/wan22-i2v-4step.json` | |
| WAN 2.2 T2V 4-step workflow | `tools/_comfyui/workflows/wan22-t2v-4step.json` | |
| Contract tests | `tests/contracts/test_comfyui_tools.py` | ~200 |
| Design document | `docs/comfyui-adapter-plan.md` | |

Zero changes to existing files — tools auto-register via the existing discovery mechanism. Selectors pick them up via capability match.
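The `/object_info` preflight described in the summary can be sketched as pure helpers over the endpoint's payload, where a loader node's model field exposes its installed choices as the first element of the input spec. The loader/field pairs each tool would declare are assumptions for illustration, not the shipped requirement lists:

```python
def installed_models(object_info: dict, loader: str, field: str) -> set:
    """Read the dropdown choices for a loader node's model field from the
    /object_info payload, e.g. CheckpointLoaderSimple's ckpt_name list."""
    try:
        return set(object_info[loader]["input"]["required"][field][0])
    except (KeyError, IndexError, TypeError):
        return set()

def missing_models(object_info: dict, required: dict) -> list:
    """Return the required model files the server does not report as installed.
    `required` maps (loader_class, field_name) -> set of expected filenames."""
    missing = []
    for (loader, field), names in required.items():
        have = installed_models(object_info, loader, field)
        missing.extend(sorted(names - have))
    return missing
```

Running this check on `execute()` is what lets the tools fail with a concrete "download these files" message instead of a mid-generation node error.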

What's NOT included (and why)

Music generation (comfyui_music) — We explored this with ACE-Step 3.5B. The model runs fine in ComfyUI, but the node interface isn't standardized. Different custom node packs use different class names (AceStepModelLoader vs native TextEncodeAceStepAudio), so a bundled workflow would break for most users. Documented in the plan doc as an open question. Happy to discuss approaches — maybe a workflow_json override-only pattern, or waiting for node convergence.

Tested on

  • NVIDIA DGX Spark (GB10 Blackwell, aarch64, CUDA 13.0, 128GB unified memory)
  • ComfyUI 0.19.1 on NGC PyTorch 25.10 container
  • FLUX 2 Dev NVFP4 image generation (~115s per 1024x1024)
  • WAN 2.2 14B I2V with LightX2V 4-step LoRA (~360s per 5s clip)
  • Full end-to-end test via Claude Code orchestration (preflight → image → i2v → output)

Test plan

  • 45 contract tests pass (no ComfyUI server required)
  • Full existing test suite passes (264 passed, 6 skipped)
  • Live image generation test (FLUX 2)
  • Live i2v video generation test (WAN 2.2)
  • Model discovery against running server
  • Error paths: server down, missing models, wrong URL
  • Test on consumer GPU (RTX 3090/4090, x86)

🤖 Generated with Claude Code

martimramos and others added 3 commits April 16, 2026 23:59
…ration

Adds three new BaseTool providers that delegate GPU work to a running
ComfyUI server via its REST API.  This avoids the need to install
PyTorch/diffusers directly, which is critical on hardware where the
ecosystem hasn't caught up (e.g. NVIDIA Blackwell / DGX Spark, aarch64
+ CUDA 13.0).

New files:
- tools/_comfyui/client.py — shared REST client (submit/poll/download)
- tools/_comfyui/workflows/ — 4 bundled workflow templates
- tools/graphics/comfyui_image.py — FLUX 2 Dev NVFP4 text-to-image
- tools/video/comfyui_video.py — WAN 2.2 14B t2v + i2v (4-step LightX2V)
- tools/audio/comfyui_music.py — ACE-Step 3.5B music generation
- tests/contracts/test_comfyui_tools.py — 41 contract tests
- docs/comfyui-adapter-plan.md — design document

Zero changes to existing tools, selectors, registry, or pipelines.
Tools are auto-discovered and selectors pick them up via capability match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Client queries ComfyUI /object_info to discover installed models
  (checkpoints, diffusion models, VAE, CLIP, LoRAs)
- Each tool declares its required models and checks them on execute()
- get_status() returns DEGRADED when server is up but models are missing
- Clear error messages tell the user exactly which models to download
- When COMFYUI_SERVER_URL is not set, error message tells the user to
  configure it in .env instead of silently failing on localhost:8188
- 8 new tests covering URL config, error messages, and model requirements

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed comfyui_music and its workflow. The ACE-Step model runs in
ComfyUI but the node class names differ across custom node packs
(AceStepModelLoader vs native TextEncodeAceStepAudio, etc.), so a
bundled workflow would break for most users.

Documented the reasoning in the plan doc and listed it as an open
question for future work. Users with ACE-Step working can still use
the workflow_json override on any tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@martimramos martimramos requested a review from calesthio as a code owner April 16, 2026 23:35
@calesthio
Owner

This is a fantastic initiative and, directionally, I think it would be a very strong addition to OpenMontage.

The biggest win here is not just "another provider", but a much better local backend abstraction. Using ComfyUI as the execution layer makes a lot of sense for hardware portability, especially for setups where direct diffusers / PyTorch support is rough or lagging. It also fits the existing image/video selector architecture well, so the overall shape of the proposal feels right.

That said, after a technical + governance pass, I think there are a few issues worth addressing before merge:

  1. The workflow_json override is currently advertised as fully custom / drop-in, but the implementation still hardcodes output nodes.

    • comfyui_image still downloads from node 13
    • comfyui_video still forces the custom workflow path through output node 16
    • This means arbitrary community workflows will fail unless they happen to use the bundled node IDs.
    • I think this needs an explicit output_node input (or equivalent contract) if custom workflows are meant to be first-class.
  2. comfyui_video.get_status() overstates availability when only one mode is actually usable.

    • Right now the tool returns AVAILABLE as long as either T2V or I2V models exist.
    • In practice that means selector routing can surface ComfyUI for an image_to_video request even when only T2V is installed, and the failure only happens at execution time.
    • I'd recommend either reporting this more precisely or filtering/ranking by operation-specific readiness.
  3. The new tools currently publish agent_skills = [].

    • In OpenMontage this is more than metadata: selectors propagate those skills into the agent context, and AGENT_GUIDE.md explicitly expects Layer 3 skills to be read before generation tools are used.
    • So even if the provider works technically, this weakens the prompt/governance path compared with the other generation providers.
    • I think these tools should expose at least one relevant skill so they participate properly in the existing agent flow.
  4. Provenance becomes misleading when workflow_json is used.

    • The tools still report fixed model names (flux2-dev-nvfp4, wan2.2-14b-fp8-4step) even if the caller supplies a completely different custom workflow.
    • Since OpenMontage uses this metadata downstream in manifests / publishing / auditability, it would be better to either surface custom workflow provenance explicitly or mark the model/workflow as user-supplied.
  5. The RFC / doc currently drifts a bit from the repo reality.

    • It references music_selector, which doesn't exist in the current codebase.
    • It also describes some workflow/output behavior that doesn't fully match the shipped implementation.
    • I don't think that blocks the image/video adapter itself, but I would tighten the doc so it matches the actual OpenMontage architecture.
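For point 1, one possible shape of the stricter override contract is an explicit `output_node` input, with save-node auto-detection as a fallback so simple community workflows still work without it. The node class names and the function below are a proposal, not code from this PR:

```python
# Save-style node classes commonly used as workflow outputs; an assumption,
# not an exhaustive list (VHS_VideoCombine comes from a custom node pack).
DEFAULT_OUTPUT_CLASSES = {"SaveImage", "SaveAnimatedWEBP", "VHS_VideoCombine"}

def resolve_output_node(workflow: dict, output_node=None) -> str:
    """Pick the node id to download results from: an explicit output_node wins,
    otherwise require exactly one save-style node instead of assuming the
    bundled ids (13 for image, 16 for video)."""
    if output_node is not None:
        if output_node not in workflow:
            raise ValueError(f"output_node {output_node!r} not present in workflow_json")
        return output_node
    candidates = [nid for nid, node in workflow.items()
                  if node.get("class_type") in DEFAULT_OUTPUT_CLASSES]
    if len(candidates) != 1:
        raise ValueError("custom workflow_json needs an explicit output_node "
                         f"(found {len(candidates)} save-style nodes)")
    return candidates[0]
```

This keeps the bundled workflows working unchanged while making arbitrary `workflow_json` payloads first-class rather than accidentally coupled to the bundled node ids.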

On the open questions, my current take would be:

  • Workflow versioning: keep blessed workflows in-repo for reproducibility, but add a stricter override contract (workflow_json/workflow_path + output_node + optional provenance fields).
  • Async generation: polling is fine for merge; websocket progress can come later as a UX improvement.
  • Multi-server: worth supporting later via per-capability env vars like image/video server URLs, but not necessary for the first merge.
  • Music generation: I would revisit the doc language here. ComfyUI's ACE-Step story looks more mature now than the RFC currently suggests, so I'd frame this less as "wait indefinitely" and more as "follow up once we decide the OpenMontage music-selection integration shape."
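The multi-server idea could be as small as an env-var resolution chain that falls back to the shared URL. The `COMFYUI_<CAPABILITY>_SERVER_URL` naming is hypothetical, not something this PR or the codebase defines:

```python
import os

def server_url(capability: str) -> str:
    """Resolve the ComfyUI endpoint for one capability (e.g. "image", "video"),
    preferring a capability-specific variable over the shared COMFYUI_SERVER_URL."""
    specific = os.environ.get(f"COMFYUI_{capability.upper()}_SERVER_URL")
    if specific:
        return specific
    shared = os.environ.get("COMFYUI_SERVER_URL")
    if shared:
        return shared
    raise RuntimeError(
        "Set COMFYUI_SERVER_URL in .env (e.g. http://localhost:8188), or "
        f"COMFYUI_{capability.upper()}_SERVER_URL to route {capability} separately")
```

The fallback order means a single-server setup needs no new configuration, which is why this can safely wait until after the first merge.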

Overall: I'm very supportive of the direction. If the custom workflow contract, partial-availability reporting, and agent-skill/provenance integration are tightened up, I think this would be a genuinely valuable addition to the project.

