RTX 5080 (Blackwell) + 16GB VRAM: Kimodo fails with CUDA OOM due to LLM fallback loading on GPU even when CPU mode is requested #27

@kavikode

Description

Environment:

  • OS: Ubuntu 24.04
  • GPU: NVIDIA GeForce RTX 5080 (16 GB VRAM)
  • Driver: 590.48.01
  • CUDA (driver): 13.1
  • PyTorch: 2.11.0+cu126
  • Kimodo: latest (pip install kimodo[all])

Summary:

Kimodo consistently fails with CUDA out-of-memory errors on a 16GB GPU due to the LLM (Meta-Llama-3-8B-Instruct) being loaded onto GPU memory during fallback, even when CPU execution is explicitly requested.


Steps to Reproduce:

  1. Activate the environment:

     conda activate kimodo

  2. Attempt to force CPU usage:

     export KIMODO_TEXT_ENCODER_DEVICE=cpu
     export TRANSFORMERS_DEVICE=cpu
     export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

  3. Run:

     kimodo_demo

Observed Behavior:

  • The text encoder service is not running, so Kimodo falls back to local LLM2Vec:

      Text encoder service is unreachable → falling back to local LLM2Vec encoder

  • Despite the CPU settings, the fallback loads Llama onto the GPU:

      ~14.7 GiB allocated by PyTorch

  • This leaves insufficient VRAM for the motion model:

      CUDA out of memory. Tried to allocate 20.00 MiB

  • Final result: the motion model fails to load.

Expected Behavior:

  • When KIMODO_TEXT_ENCODER_DEVICE=cpu is set, the fallback LLM should remain entirely on CPU,
    OR
  • Kimodo should fail early with a clear message requiring the kimodo_textencoder service, instead of silently falling back.
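As a sketch of the first option: the fallback loader could consult the same environment variable before placing the model. (The variable name comes from this report; `resolve_encoder_device` is a hypothetical helper, not Kimodo's actual API.)

```python
import os

def resolve_encoder_device(default: str = "cuda") -> str:
    """Hypothetical helper: pick the device for the fallback LLM2Vec
    encoder, honoring KIMODO_TEXT_ENCODER_DEVICE if it is set."""
    device = os.environ.get("KIMODO_TEXT_ENCODER_DEVICE", default).strip().lower()
    if device not in ("cpu", "cuda"):
        raise ValueError(f"Unsupported KIMODO_TEXT_ENCODER_DEVICE: {device!r}")
    return device

# With the env var from the repro steps set, the fallback stays on CPU:
os.environ["KIMODO_TEXT_ENCODER_DEVICE"] = "cpu"
print(resolve_encoder_device())  # cpu
```

If the resolved device is "cpu", the loader would then pass that through to wherever the Llama weights are materialized, rather than defaulting to CUDA.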

Additional Notes:

  • Running the text encoder as a separate service resolves the issue:

      kimodo_textencoder   # Terminal 1
      kimodo_demo          # Terminal 2

  • However, this requirement is neither enforced nor clearly documented, leading to confusing OOM failures.
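A possible single-GPU variant of that workaround (untested on my side): hide the GPU from the text-encoder process only, via the standard CUDA_VISIBLE_DEVICES mechanism, so the Llama fallback loads on CPU while kimodo_demo still sees the full 16 GB.

```shell
# Terminal 1: an empty CUDA_VISIBLE_DEVICES makes CUDA report no devices
# to this process, forcing the encoder onto CPU (encoder speed on CPU
# is untested):
#   CUDA_VISIBLE_DEVICES="" kimodo_textencoder
# Terminal 2: the demo still sees the GPU normally:
#   kimodo_demo
#
# The mechanism itself is easy to verify: the child process sees an
# empty device list.
CUDA_VISIBLE_DEVICES="" sh -c 'echo "visible devices: [${CUDA_VISIBLE_DEVICES}]"'
# prints: visible devices: []
```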

  • This issue is especially problematic on GPUs with 16GB VRAM, where:

    • Llama 8B consumes ~14–15GB
    • Motion model requires additional memory
    • Combined load exceeds capacity
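That breakdown can be sanity-checked from the parameter count alone (8B parameters assumed from the model name; this counts weights only, so activations and the KV cache come on top):

```python
# Rough VRAM footprint of the fallback LLM, from parameter count alone.
PARAMS = 8e9                      # Meta-Llama-3-8B-Instruct

fp16_gib = PARAMS * 2 / 2**30     # 2 bytes/param in fp16/bf16
int4_gib = PARAMS * 0.5 / 2**30   # 0.5 bytes/param if 4-bit quantized

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")  # ~14.9 GiB, close to the ~14.7 GiB observed
print(f"int4 weights: ~{int4_gib:.1f} GiB")  # ~3.7 GiB, leaving room for the motion model
```

This is why a quantized fallback (improvement 3 below) would make 16 GB cards viable even without the separate service.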

Suggested Improvements:

  1. Respect KIMODO_TEXT_ENCODER_DEVICE=cpu for fallback path
  2. Add explicit warning or error if text encoder service is not running
  3. Provide a lightweight / quantized LLM fallback option
  4. Document GPU memory requirements clearly (≥24GB recommended)

Impact:

This prevents Kimodo from running on otherwise capable GPUs (e.g., RTX 5080 16GB), even though the motion model itself would fit if the LLM were isolated.


Question:

Is there a recommended way to:

  • force the LLM fallback to CPU reliably,
  • use a smaller / quantized text encoder, or
  • apply any other workaround?

Thanks for the excellent work on Kimodo — this is a very promising framework for human motion generation.
