
Add support for inference with vLLM or other OpenAI-compatible server #77

Open

AbrahamSanders wants to merge 1 commit into neuphonic:main from AbrahamSanders:main

Conversation

@AbrahamSanders

Addresses #76 by adding support for inference where the backend is an OpenAI client. This allows vLLM or any other OpenAI-compatible server to be used, since neuphonic/neutts-air is a standard Qwen2ForCausalLM backbone that most inference engines today support out of the box.
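For illustration, here is a minimal sketch (not the code in this PR; the prompt below is hypothetical, since neutts-air assembles the real one from the reference codes and input text) of what an OpenAI-compatible backend boils down to:

from openai import OpenAI

# Point the standard openai client at the local vLLM server.
# "empty" is a placeholder key; vLLM does not require a real one by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# The backbone is a plain causal LM, so a text completion request works.
stream = client.completions.create(
    model="neuphonic/neutts-air",
    prompt="<prompt assembled from ref codes and input text>",
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    # Each chunk carries newly generated speech-code tokens as text.
    print(chunk.choices[0].text, end="", flush=True)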

First run vLLM to serve the desired backbone model:

vllm serve neuphonic/neutts-air

Then run the streaming example script:

python -m examples.vllm_streaming_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_codes samples/dave.pt \
  --ref_text samples/dave.txt \
  --backbone neuphonic/neutts-air \
  --vllm_url http://localhost:8000/v1 \
  --vllm_api_key empty

AbrahamSanders mentioned this pull request on Nov 30, 2025
@ashavish

ashavish commented Dec 7, 2025

@AbrahamSanders Tried running this. It seems to generate repeated "!" characters and finally errors out like this: ValueError: zero-size array to reduction operation minimum which has no identity.
I am running vLLM on an Nvidia T4 machine.

@ashavish

ashavish commented Dec 9, 2025

This got sorted. It works. Just needed to change the dtype to float32 on the T4.

@AbrahamSanders
Author

> This got sorted. It works. Just needed to change the dtype to float32 on the T4.

@ashavish glad to hear it works for you. Just to clarify: you mean you needed to serve the model in full precision with vLLM on your T4 machine, otherwise it generated gibberish?

@ashavish

ashavish commented Dec 11, 2025

> > This got sorted. It works. Just needed to change the dtype to float32 on the T4.
>
> @ashavish glad to hear it works for you. Just to clarify: you mean you needed to serve the model in full precision with vLLM on your T4 machine, otherwise it generated gibberish?

Yes. On a T4, since it doesn't support bfloat16, vLLM automatically converts it to float16, which causes issues in generation. I was repeatedly getting a "!" character in the output. When loading the model directly using HF I didn't face this issue, possibly because I hadn't specified the dtype and it loads in float32 by default.
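For anyone else on a pre-Ampere GPU like the T4, forcing the dtype at serve time should avoid the silent bfloat16-to-float16 fallback (--dtype is a standard vLLM flag; float32 shown here to match what worked above):

vllm serve neuphonic/neutts-air --dtype float32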
