
Add support for inference with vLLM or other OpenAI-compatible server #77

Open

AbrahamSanders wants to merge 1 commit into neuphonic:main from AbrahamSanders:main

Conversation

@AbrahamSanders

Addresses #76 by adding support for inference where the backend is an OpenAI client. This allows vLLM or any other OpenAI-compatible server to be used, since neuphonic/neutts-air is a standard Qwen2ForCausalLM backbone that most inference engines today support out of the box.
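For illustration, here is a minimal sketch (not the code in this PR; the prompt below is hypothetical, since neutts-air assembles the real one from the reference codes and input text) of what an OpenAI-compatible backend boils down to:

from openai import OpenAI

# Point the standard openai client at the local vLLM server.
# "empty" is a placeholder key; vLLM does not require a real one by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# The backbone is a plain causal LM, so a text completion request works.
stream = client.completions.create(
    model="neuphonic/neutts-air",
    prompt="<prompt assembled from ref codes and input text>",
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    # Each chunk carries newly generated speech-code tokens as text.
    print(chunk.choices[0].text, end="", flush=True)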

First run vLLM to serve the desired backbone model:

vllm serve neuphonic/neutts-air

Then run the streaming example script:

python -m examples.vllm_streaming_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_codes samples/dave.pt \
  --ref_text samples/dave.txt \
  --backbone neuphonic/neutts-air \
  --vllm_url http://localhost:8000/v1 \
  --vllm_api_key empty

AbrahamSanders mentioned this pull request on Nov 30, 2025
@ashavish

ashavish commented Dec 7, 2025

@AbrahamSanders Tried running this. It seems to generate repeated "!" characters and finally errors out like this: ValueError: zero-size array to reduction operation minimum which has no identity.
I am running vLLM on an Nvidia T4 machine.

@ashavish

ashavish commented Dec 9, 2025

This got sorted. It works. Just needed to change the dtype to float32 on the T4.

@AbrahamSanders
Author

> This got sorted. It works. Just needed to change the dtype to float32 on the T4.

@ashavish glad to hear it works for you. Just to clarify: you mean you needed to serve the model in full precision with vLLM on your T4 machine, otherwise it generated gibberish?

@ashavish

ashavish commented Dec 11, 2025

> > This got sorted. It works. Just needed to change the dtype to float32 on the T4.
>
> @ashavish glad to hear it works for you. Just to clarify: you mean you needed to serve the model in full precision with vLLM on your T4 machine, otherwise it generated gibberish?

Yes. On a T4, since it doesn't support bfloat16, vLLM automatically converts it to float16, which causes issues in generation. I was repeatedly getting a "!" character in the output. When loading the model directly using HF I didn't face this issue, possibly because I hadn't specified the dtype and it loads in float32 by default.
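For anyone else on a pre-Ampere GPU like the T4, forcing the dtype at serve time should avoid the silent bfloat16-to-float16 fallback (--dtype is a standard vLLM flag; float32 shown here to match what worked above):

vllm serve neuphonic/neutts-air --dtype float32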
