Add support for inference with vLLM or other OpenAI-compatible server #77
AbrahamSanders wants to merge 1 commit into neuphonic:main
Conversation
@AbrahamSanders Tried running this. It seems to generate repeated "!" and finally errors out like this: ValueError: zero-size array to reduction operation minimum which has no identity.
This got sorted. It works. I just needed to change the dtype to float32 on a T4.
@ashavish glad to hear it works for you. Just to clarify: you mean you needed to serve the model in full precision on vLLM on your T4 machine, otherwise it generated gibberish?
Yes. On a T4, since it doesn't support bfloat16, vLLM automatically converts it to float16, which causes issues in generation. I was repeatedly getting a "!" character in the output. When loading the model directly using HF I didn't face this issue, possibly because I hadn't specified the dtype and it loads in float32 by default.
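(The exact serve command used above isn't shown in the thread; a minimal sketch of forcing full precision on a T4, assuming vLLM's standard --dtype flag, would be:

vllm serve neuphonic/neutts-air --dtype float32
)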
Addresses #76 by adding support for inference where the backend is an OpenAI client. This allows vLLM or any other OpenAI-compatible server to be used, since neuphonic/neutts-air is a standard Qwen2ForCausalLM backbone supported out of the box by most inference engines today.

First, run vLLM to serve the desired backbone model:
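The serve command itself is not included in this excerpt; a minimal sketch, assuming the default OpenAI-compatible endpoint on port 8000, would be:

vllm serve neuphonic/neutts-air --port 8000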
Then run the streaming example script:
python -m examples.vllm_streaming_example \
    --input_text "My name is Dave, and um, I'm from London" \
    --ref_codes samples/dave.pt \
    --ref_text samples/dave.txt \
    --backbone neuphonic/neutts-air \
    --vllm_url http://localhost:8000/v1 \
    --vllm_api_key empty
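For context, the example script reaches the server through the standard OpenAI client interface. A minimal sketch of that pattern (illustrative only; the real script builds the TTS prompt from the reference codes and text before calling the server):

from openai import OpenAI

# Point the OpenAI client at the local vLLM endpoint; vLLM ignores the API key,
# but the client requires one, so any placeholder such as "empty" works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Stream a completion from the served backbone. The prompt below is a stand-in;
# the actual example script constructs it from the reference codes and input text.
stream = client.completions.create(
    model="neuphonic/neutts-air",
    prompt="<prompt built from reference codes and input text>",
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)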