fix: offloading model layers to gpu and using flash_attn for gpu #75

Open
HeisenberG2575 wants to merge 1 commit into neuphonic:main from HeisenberG2575:main

Conversation

@HeisenberG2575

Context:

The _load_backbone function loads the neutts-air backbone for inference. While it works as intended for CPU inference, GPU inference is slowed down due to faults in the code.

Problem:

Since backbone_device is used to move the model to the appropriate device, and "gpu" is not a valid string for .to(), users resort to the standard "cuda" or strings prefixed with "cuda", such as "cuda:0". The current conditions

n_gpu_layers=-1 if backbone_device == "gpu" else 0
and
flash_attn=True if backbone_device == "gpu" else False
do not have the intended effect of offloading the model to the GPU and enabling flash attention, because the string is only compared against "gpu".
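
For instance, a quick interactive check (the values shown are what Python evaluates these expressions to) shows that a "cuda:0" device string never triggers the GPU branch:

    >>> backbone_device = "cuda:0"
    >>> -1 if backbone_device == "gpu" else 0
    0
    >>> True if backbone_device == "gpu" else False
    False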

Fix:

  • Check the string for "cuda" and "gpu" using Python's built-in startswith string method, so that passing "cuda" as backbone_device leads to the intended behaviour
  • Checking for "gpu" alongside "cuda" ensures the change is not breaking for existing users (see the sketch after this list)
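
A minimal sketch of the revised check, assuming the backbone is loaded through llama-cpp-python's Llama API (consistent with the n_gpu_layers, flash_attn, and GGUF references above); the function and argument names other than those two flags are illustrative, not the exact ones in the repository:

    from llama_cpp import Llama

    def _load_backbone(backbone_repo_id: str, backbone_device: str) -> Llama:
        # Treat legacy "gpu" and any "cuda"/"cuda:N" string as GPU devices.
        use_gpu = backbone_device.startswith(("cuda", "gpu"))
        return Llama.from_pretrained(
            repo_id=backbone_repo_id,
            filename="*.gguf",                   # GGUF backbone weights (illustrative pattern)
            n_gpu_layers=-1 if use_gpu else 0,   # offload all layers when on GPU
            flash_attn=use_gpu,                  # enable flash attention only on GPU
        )

With this check, both backbone_device="cuda:0" and the existing backbone_device="gpu" offload all layers, while "cpu" keeps the previous behaviour.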

TODOs:

  • Include other device strings that may be used for GPU inference, and support flash_attn and n_gpu_layers with GGUF for those devices as well
