
Releases: FearL0rd/ComfyUI-ParallelAnything

Fix LoRA

23 Feb 13:54


BREAKING CHANGE: Replaces model.forward with parallel_forward and stores the original in
_original_forward. The model's patch_model() must be called before setup when using LoRA.
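A minimal sketch of the forward swap described above. The names mirror the release note; the surrounding device-dispatch plumbing is omitted, so this is an illustration of the pattern, not the actual implementation.

```python
# Hypothetical sketch of the forward-swap pattern: the original forward
# is stashed on the model before being replaced, so fallback paths
# (and later restoration) can still reach it.

class Model:
    def forward(self, x):
        return x * 2

def setup_parallel(model):
    # Stash the original bound method once; never overwrite an existing stash.
    if not hasattr(model, "_original_forward"):
        model._original_forward = model.forward

    def parallel_forward(x):
        # Real code would dispatch work across devices here; this sketch
        # simply delegates to the stored original.
        return model._original_forward(x)

    model.forward = parallel_forward
    return model

m = setup_parallel(Model())
```

Because LoRA patching rewrites the weights the original forward reads, patch_model() has to run before the swap so _original_forward captures the patched behavior.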

Implement Model Parallelism (Pipeline) for batch_size=1 & Fix Multi-GPU Device Mismatches

16 Feb 14:17
36f1d67


Feature (Pipeline Parallelism): implemented logic to handle batch_size=1 by splitting transformer blocks across GPUs.

Introduced ParallelBlock wrapper to manage execution across devices.

Data now flows in a pipeline sequence (e.g., GPU0 → GPU1 → ...) and automatically returns to the Lead Device for final processing.

Maintains original Data Parallel (batch splitting) behavior for batch_size > 1.
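The pipeline flow above can be sketched as follows. Device handoffs are simulated with plain strings so the control flow is visible; the real `ParallelBlock` moves torch tensors between GPUs.

```python
# Hypothetical sketch of pipeline parallelism for batch_size=1:
# transformer blocks are partitioned across devices and run in sequence,
# the activation hopping from one device to the next, then returning to
# the lead device for final processing.

class ParallelBlock:
    """Wraps one block together with the device it was assigned to."""
    def __init__(self, fn, device):
        self.fn, self.device = fn, device

    def __call__(self, hidden):
        # Real code would call hidden.to(self.device) before executing.
        return self.fn(hidden), self.device

def run_pipeline(blocks, hidden, lead_device="cuda:0"):
    hops = [lead_device]
    for block in blocks:
        hidden, device = block(hidden)
        hops.append(device)        # activation now lives on this device
    hops.append(lead_device)       # returned to the lead for final processing
    return hidden, hops

blocks = [ParallelBlock(lambda h: h + 1, "cuda:0"),
          ParallelBlock(lambda h: h * 3, "cuda:1")]
out, hops = run_pipeline(blocks, 1)
```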

Fix (Device Mismatch): Resolved RuntimeError: Expected all tensors to be on the same device.

Updated ParallelBlock to recursively identify and move all input arguments (tensors, nested lists/tuples, dicts, and dataclasses) to the target device, rather than just the first argument.

Ensured auxiliary inputs (like timesteps and context) are correctly transferred alongside the main hidden states.

Fix (Syntax): Fixed a SyntaxError in clone_dataclass_or_object where a try block was missing its corresponding except.

Refactor: Improved tensor movement logic to be "deep" and recursive, preventing crashes when models use complex input structures.
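The "deep" recursive movement can be sketched like this. `FakeTensor` is a stand-in so the recursion is runnable without torch; the real code moves torch tensors.

```python
import dataclasses

# Hypothetical sketch of deep, recursive device movement: anything with
# a .to() method is moved; lists, tuples, dicts, and dataclasses are
# rebuilt with their contents moved; everything else passes through.

def move_to_device(obj, device):
    if hasattr(obj, "to"):                      # tensor-like leaf
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.replace(obj, **{
            f.name: move_to_device(getattr(obj, f.name), device)
            for f in dataclasses.fields(obj)
        })
    return obj                                   # non-tensor passthrough

class FakeTensor:  # stand-in for torch.Tensor in this sketch
    def __init__(self, device="cpu"): self.device = device
    def to(self, device): return FakeTensor(device)

moved = move_to_device({"ts": [FakeTensor(), FakeTensor()], "steps": 4}, "cuda:1")
```

This is what lets auxiliary inputs such as timesteps and context ride along with the hidden states instead of being left on the wrong device.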

Bug Fixes

12 Feb 20:16


Bug fixes and performance improvements

Bug Fixes

12 Feb 20:10


Bug fixes and performance improvements

Adds `is_float8_dtype()` helper to detect any FP8 format (e4m3fn, e5m2, e4m3fnuz, e5m2fnuz) and automatically convert to FP16 when targeting devices without native FP8 support (< SM90).

07 Feb 17:56


This fixes compatibility with models like LTX-Video, Wan, and others that may use different FP8 encoding schemes (particularly e5m2) on mixed GPU setups.

Changes:

  • Add is_float8_dtype() to detect all PyTorch FP8 variants
  • Update clone_module_simple() to handle FP8 in layer reconstruction and generic fallback
  • Update safe_model_clone() to post-process all FP8 variants during incremental loading
  • Update move_to_device() to convert FP8 inputs at runtime during parallel inference
  • Ensure consistent FP8→FP16 conversion across weights, buffers, and activation tensors

Preserves FP8 performance on Hopper/Ada (SM90+) while ensuring compatibility with Ampere and older architectures that only support FP16/FP32 computation.

Fixes: Support for LTX-Video, Wan, and other FP8 models using e5m2 or alternative FP8 formats on mixed-GPU setups
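The dtype gate can be sketched as follows. The real helper compares against the `torch.float8_*` dtypes; this stand-in matches by name so the decision logic is visible without torch.

```python
# Hypothetical sketch: detect any FP8 variant and pick a safe compute
# dtype per device. Native FP8 kernels require Hopper (SM90) or newer;
# older architectures fall back to FP16 so weights, buffers, and
# activations all agree.

FP8_NAMES = {"float8_e4m3fn", "float8_e5m2",
             "float8_e4m3fnuz", "float8_e5m2fnuz"}

def is_float8_dtype(dtype) -> bool:
    return str(dtype).split(".")[-1] in FP8_NAMES

def compute_dtype(dtype, sm_version: int) -> str:
    if is_float8_dtype(dtype) and sm_version < 90:
        return "float16"                 # no native FP8: convert
    return str(dtype).split(".")[-1]     # keep FP8 on SM90+
```

The same check is applied at every boundary listed above (layer reconstruction, incremental loading, runtime input movement), which is what keeps the FP8→FP16 conversion consistent across weights, buffers, and activations.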

Multiple Fixes and WAN Support

06 Feb 21:10


Fixed infinite recursion: the worker now checks `hasattr(replica, '_original_forward') and replica is target_model` and calls the original forward method when the replica is the original model, preventing parallel_forward from calling itself recursively.

Device reference safety: Converted devices_ref and weights_ref to tuples to prevent accidental modification after assignment.

Explicit identity comparison: Used replica is target_model to detect when we're using the original model vs a clone.

Consistent fallback handling: Both the single-device fallback and the OOM fallback now check for _original_forward to use the correct method.

Better logging: Added explicit print at the end showing which devices are actually being used.

This ensures that when CUDA:0 is the original device and CUDA:1 is a clone, both are used correctly without the original device causing a recursive loop.
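The guard described above can be sketched with hypothetical names mirroring the note. The key point: on the original model, .forward has already been swapped to the parallel dispatcher, so the worker must bypass it.

```python
# Hypothetical sketch of the recursion guard: when the replica for a
# device IS the original model, calling replica.forward would re-enter
# the parallel dispatcher; the worker must use _original_forward.

def worker(replica, target_model, x):
    if hasattr(replica, "_original_forward") and replica is target_model:
        fwd = replica._original_forward   # original device: bypass the swap
    else:
        fwd = replica.forward             # clone: its forward was never swapped
    return fwd(x)

class M:
    def forward(self, x): return x + 1

original, clone = M(), M()
original._original_forward = original.forward
# The swapped dispatcher on the original routes back through the worker:
original.forward = lambda x: worker(original, original, x)
```

With the guard in place, calling original.forward() terminates in one hop instead of looping, while the clone still uses its own untouched forward.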

fix: Resolve first-run CUDA OOM and optimize VRAM usage during multi-GPU cloning

06 Feb 17:30


Critical memory optimization for Parallel Anything node to prevent CUDA out-of-memory
errors on first execution and improve stability across multi-GPU setups.

Problem:

  • First-run OOM errors occurred due to CUDA memory fragmentation and holding
    duplicate model copies (CPU + GPU) simultaneously during cloning
  • No handling for devices that failed during cloning due to insufficient VRAM
  • Original model restoration caused temporary 2x VRAM usage spikes

Solution:

  • Skip cloning when target device matches source device (use reference)
  • Incremental state_dict loading to minimize peak memory usage
  • Aggressive ComfyUI model cache unloading before cloning operations
  • Progressive cleanup between device clones with synchronized CUDA operations
  • Graceful degradation: skip OOM devices and redistribute workload

Technical Changes:

  • Added aggressive_cleanup() helper for forced GC and CUDA cache clearing
  • Modified safe_model_clone() with incremental parameter loading and
    early-exit if model already on target device
  • Updated setup_parallel() to use reference counting for original device
  • Implemented successful_devices tracking for OOM fallback handling
  • Added periodic torch.cuda.empty_cache() during large model transfers
  • Enhanced error handling with device-level try/except blocks
  • Fixed state dict parsing for nested modules in incremental loading

Memory optimizations:

  • Unload ComfyUI models (unload_all_models) before cloning
  • Delete state_dict entries immediately after transfer
  • Force CPU transition only when necessary (not when reusing original)
  • Synchronize CUDA devices before/after memory operations

Breaking Changes: None
Backwards Compatible: Yes
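Incremental state_dict loading as described can be sketched like this. The helper and its callback are hypothetical; real code moves torch parameters per key and calls torch.cuda.empty_cache() periodically.

```python
# Hypothetical sketch of incremental state_dict loading: keys are
# transferred one at a time and deleted from the source dict right away,
# so full CPU and GPU copies of the model never coexist.

def incremental_load(state_dict, assign, empty_cache_every=64):
    for i, key in enumerate(list(state_dict)):
        assign(key, state_dict[key])   # move this tensor to the target device
        del state_dict[key]            # drop the CPU copy immediately
        if (i + 1) % empty_cache_every == 0:
            pass  # real code calls torch.cuda.empty_cache() here

loaded = {}
src = {"block.0.weight": [1.0], "block.0.bias": [0.5]}
incremental_load(src, lambda k, v: loaded.__setitem__(k, v))
```

Deleting each entry as it transfers is what eliminates the temporary 2x VRAM/RAM spike mentioned in the problem statement.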

Added VRAM purge toggles to INPUT_TYPES:

04 Feb 22:25
6a8a99f


Added VRAM purge toggles to INPUT_TYPES:

purge_cache: Clears CUDA cache during cleanup (default: True)
purge_models: Aggressively unloads all models from VRAM using comfy.model_management.unload_all_models() (default: False)
Updated setup_parallel method signature to accept purge_cache and purge_models parameters

Store purge preferences on the model: target_model._parallel_purge_cache and target_model._parallel_purge_models

Enhanced cleanup_parallel_model function to:

Read the purge flags from the model attributes
Conditionally run comfy.model_management.unload_all_models() if purge_models is True
Conditionally clear CUDA cache and soft empty cache based on purge_cache setting
Multi-GPU cache clearing (iterates through all CUDA devices)
The purge settings persist with the model and are applied when the parallel processing resources are cleaned up (either when the node is re-run with new settings or when the model is garbage collected).
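A sketch of the toggle wiring, following ComfyUI node conventions. The defaults come from the note above; the class shape and the `actions` list (standing in for the actual CUDA/ComfyUI calls) are assumptions.

```python
# Hypothetical sketch: the node exposes two boolean toggles, stores them
# on the model, and cleanup reads them back later.

class ParallelAnything:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "purge_cache":  ("BOOLEAN", {"default": True}),
            "purge_models": ("BOOLEAN", {"default": False}),
        }}

    def setup_parallel(self, target_model, purge_cache=True, purge_models=False):
        # Persist the preferences on the model itself.
        target_model._parallel_purge_cache = purge_cache
        target_model._parallel_purge_models = purge_models
        return target_model

def cleanup_parallel_model(model, actions):
    # Read the persisted preferences back off the model.
    if getattr(model, "_parallel_purge_models", False):
        actions.append("unload_all_models")  # comfy.model_management.unload_all_models()
    if getattr(model, "_parallel_purge_cache", True):
        actions.append("empty_cache")        # per-CUDA-device cache clear
    return actions

class _Model: pass
m = ParallelAnything().setup_parallel(_Model(), purge_models=True)
```

Storing the flags on the model is what lets them apply at cleanup time, long after the node's own execution has finished.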

Bug Fixes and improvement

04 Feb 15:46
6362d3c


  • Fixed Memory Leaks: Added explicit cleanup of partial replicas if cloning fails mid-process
  • Better Error Handling: Exception handling in worker() now properly returns exceptions rather than raising, allowing cleanup of other threads
  • Thread Safety: Removed unused imports and ensured proper synchronization points for CUDA/XPU devices
  • Robust Splitting: Fixed split_batch to handle mixed lists of tensors and non-tensors correctly
  • Cleanup Safety: cleanup_parallel_model now handles cases where the model was partially initialized
  • VRAM Management: Added soft_empty_cache calls between cloning operations to prevent OOM when cloning to multiple devices
  • Load Management: Added automatic load balancing across devices to improve performance
  • Attribute Handling: Improved clone_module_simple to better handle missing attributes in layer reconstruction
  • Config Extraction: Added serialization check to extract_model_config to prevent storing non-serializable objects
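The mixed-list handling in split_batch can be sketched as follows: anything with a batch dimension is split per device, and everything else is replicated unchanged. This is a hypothetical simplification (the real code splits torch tensors along dim 0).

```python
# Hypothetical sketch of split_batch for mixed lists: sliceable items
# with enough entries are split across devices; scalars and other
# non-tensors are passed through to every chunk as-is.

def split_batch(args, n_devices):
    chunks = [[] for _ in range(n_devices)]
    for arg in args:
        if hasattr(arg, "__len__") and len(arg) >= n_devices:
            step = len(arg) // n_devices
            for i in range(n_devices):
                # Last device takes any remainder from uneven splits.
                hi = len(arg) if i == n_devices - 1 else (i + 1) * step
                chunks[i].append(arg[i * step:hi])
        else:
            for i in range(n_devices):
                chunks[i].append(arg)   # non-tensor: replicate as-is
    return chunks

chunks = split_batch([[1, 2, 3, 4], 7.5], 2)
```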

Bug Fixes / Improvements / Optimizations

03 Feb 21:36
8a3085d


Critical Bug Fixes:

  • Memory Leak: weakref.finalize now uses weakref.ref(target_model) instead of a direct reference
  • Race Condition: Worker threads now return exceptions instead of raising them, allowing proper aggregation and cleanup
  • Batch Validation: get_batch_size() now validates consistent batch dimensions across tensor lists
  • Double Device Transfer: Removed .to(device) from split_kwargs(); the transfer now happens only in worker()
  • Attribute Handling: Cleaner separation of in_features vs in_channels in layer cloning

Improvements Added:

  • CUDA Streams: Each GPU now uses its own CUDA stream for true parallel execution
  • Auto VRAM Balance: New auto_vram_balance parameter adjusts splits based on available VRAM (70% user preference / 30% availability)
  • Gradient Checkpointing: Automatically disabled on replicas to save VRAM
  • Accelerate Hooks: Clears _hf_hook and hooks attributes for compatibility with accelerate offloading
  • Enhanced Cache Clearing: Added rope_cache and freqs_cis_cache to FLUX cleanup
  • Thread Safety: Proper executor context management and result validation

Performance Optimizations:

  • Non-blocking transfers where safe
  • Synchronization only on CUDA/XPU devices
  • Batch size auto-adjustment prevents OOM on uneven GPU setups
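The 70/30 blend behind auto_vram_balance works out, with hypothetical numbers, like this:

```python
# Hypothetical sketch of the auto VRAM balance blend: each device's
# final split weight is 70% the user's preference and 30% its share of
# currently free VRAM, renormalized to sum to 1.

def balance(user_weights, free_vram):
    total_vram = sum(free_vram)
    blended = [0.7 * u + 0.3 * (v / total_vram)
               for u, v in zip(user_weights, free_vram)]
    s = sum(blended)
    return [b / s for b in blended]

# Equal user preference, but GPU1 has three times the free VRAM:
w = balance([0.5, 0.5], [6.0, 18.0])   # roughly [0.425, 0.575]
```

Leaning 70% on the user's split keeps behavior predictable, while the 30% availability term nudges work away from nearly-full GPUs to prevent OOM on uneven setups.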