Commit 2eaa3a5

Feature: OpenAI-compatible endpoint for text generation (#1395)
* Instructions using OpenAI-style remote endpoint
* README for OpenAI-style remote endpoint
* Adding remote textgen service, OpenAI standard
* Code and test for OpenAI-style endpoint
* Clarified instructions in README_endpoint_openai.md
* Commented out stop_containers at beginning
* Add a small code comment for clarity
* Fix the curl to the text gen service so it doesn't need a key
* Modify unit test since vLLM 0.8.3 changed the Docker file paths
* Cleaned up comments
* Adding a suitable vLLM block-size for CPU
* Allow text-generation service.py to work with OpenAI-compatible endpoints that do not allow null or None as input, e.g. OpenRouter
* Updated README: fixed small typos and made the example curl easier to paste
* Updated test_llms_textgen_endpoint_openai.sh
* Uncomment build_vllm_image
* Fix the WORKPATH
* Generalize OpeaTextGenService to be usable with other OpenAI-compatible endpoints beyond TGI and vLLM, e.g. openrouter.ai
* Add testing for both OpenAI API chat completions and regular completions
* Added logging import
* Go back to relative path for ChatTemplate
* Fixed two-argument error and omit language arg for chat completions
* Fix unit tests
* Revert and simplify
* Fix string interpolation bug
* More logger f-strings to fix
* Revert to old unit test
* Fix test_llms_text-generation_service_vllm_on_intel_hpu.sh: the path of the Docker files used to build the image from vllm-fork changed recently
* Pin supported transformers version 4.45.2 for Gaudi 1.20.1 and use a separate requirements_hpu.txt for building Dockerfile.intel_hpu_phi4
* Update llama-index-core requirements to align with recent PRs
* Revert path back to Dockerfile.hpu
* Pin numpy version range to be compatible with transformers and torch
* Added logging if vllm-gaudi-server fails
* Try omitting transformers and numpy to help HPU CI unit tests by not overwriting dependencies from the Gaudi container
* Add more logging to text-generation_service_vllm_on_intel_hpu and pin transformers and numpy
* Refactored ALLOWED_CHATCOMPLETION_ARGS and ALLOWED_COMPLETION_ARGS
* Trying dependencies that are known to work with Gaudi 1.20.1
* Revert back to main HPU test and text gen HPU Dockerfile
* [pre-commit.ci] auto fixes from pre-commit.com hooks, applied throughout (see https://pre-commit.ci)

---------

Signed-off-by: Ed Lee <16417837+edlee123@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: xiguiw <111278656+xiguiw@users.noreply.github.com>
Co-authored-by: Liang Lv <liang1.lv@intel.com>
Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
Co-authored-by: Rachel R <rroumeliotis@gmail.com>
1 parent 3240c96 commit 2eaa3a5

File tree

5 files changed: +348 additions, −56 deletions

comps/cores/proto/api_protocol.py

Lines changed: 37 additions & 0 deletions
```diff
@@ -1016,5 +1016,42 @@ class FineTuningJobCheckpoint(BaseModel):
     """The step number that the checkpoint was created at."""


+# Args allowed in openai-like chat completions API calls in OpeaTextGenService
+ALLOWED_CHATCOMPLETION_ARGS = (
+    "model",
+    "messages",
+    "frequency_penalty",
+    "max_tokens",
+    "n",
+    "presence_penalty",
+    "response_format",
+    "seed",
+    "stop",
+    "stream",
+    "stream_options",
+    "temperature",
+    "top_p",
+    "user",
+)
+
+# Args allowed in openai-like regular completion API calls in OpeaTextGenService
+ALLOWED_COMPLETION_ARGS = (
+    "model",
+    "prompt",
+    "echo",
+    "frequency_penalty",
+    "max_tokens",
+    "n",
+    "presence_penalty",
+    "seed",
+    "stop",
+    "stream",
+    "suffix",
+    "temperature",
+    "top_p",
+    "user",
+)
+
+
 class RouteEndpointDoc(BaseModel):
     url: str = Field(..., description="URL of the chosen inference endpoint")
```
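These tuples are allow-lists consumed by the filtering logic added in service.py below. As a standalone sketch of the intended behavior (the function here is a copy for illustration, not the class method itself), filtering drops both non-listed keys and None-valued defaults, since some OpenAI-compatible providers such as OpenRouter reject null parameters:

```python
# Sketch: applying an allow-list like ALLOWED_CHATCOMPLETION_ARGS to a request
# dict before calling an OpenAI-compatible endpoint. The tuple is copied from
# the diff above; filter_api_params mirrors the helper added in service.py.
ALLOWED_CHATCOMPLETION_ARGS = (
    "model", "messages", "frequency_penalty", "max_tokens", "n",
    "presence_penalty", "response_format", "seed", "stop", "stream",
    "stream_options", "temperature", "top_p", "user",
)


def filter_api_params(input_params: dict, allowed_args: tuple) -> dict:
    """Keep only allow-listed, non-None arguments."""
    return {k: input_params[k] for k in allowed_args if k in input_params and input_params[k] is not None}


params = {
    "model": "google/gemma-3-1b-it:free",
    "messages": [{"role": "user", "content": "Hi"}],
    "temperature": None,   # None default: dropped
    "logit_bias": {},      # not allow-listed: dropped
}
print(filter_api_params(params, ALLOWED_CHATCOMPLETION_ARGS))
# → {'model': 'google/gemma-3-1b-it:free', 'messages': [{'role': 'user', 'content': 'Hi'}]}
```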

comps/llms/deployment/docker_compose/compose_text-generation.yaml

Lines changed: 9 additions & 0 deletions
```diff
@@ -118,6 +118,15 @@ services:
       LLM_COMPONENT_NAME: ${LLM_COMPONENT_NAME:-OpeaTextGenPredictionguard}
       PREDICTIONGUARD_API_KEY: ${PREDICTIONGUARD_API_KEY}

+  textgen-service-endpoint-openai:
+    extends: textgen
+    container_name: textgen-service-endpoint-openai
+    environment:
+      LLM_COMPONENT_NAME: OpeaTextGenService
+      LLM_ENDPOINT: ${LLM_ENDPOINT} # an endpoint that uses the OpenAI API style, e.g. https://openrouter.ai/api
+      OPENAI_API_KEY: ${OPENAI_API_KEY} # the key associated with the endpoint
+      LLM_MODEL_ID: ${LLM_MODEL_ID:-google/gemma-3-1b-it:free}
+
   textgen-native-gaudi:
     extends: textgen-gaudi
     container_name: textgen-native-gaudi
```
README_endpoint_openai.md (new file)

Lines changed: 70 additions & 0 deletions
# Introduction

This OPEA text generation service can connect to any OpenAI-compatible API endpoint, including local deployments (such as vLLM or TGI) and remote services (such as OpenRouter.ai).

## 1 Prepare the TextGen Docker Image

```bash
# Build the microservice Docker image
git clone https://github.com/opea-project/GenAIComps
cd GenAIComps

docker build \
  --no-cache \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  -t opea/llm-textgen:latest \
  -f comps/llms/src/text-generation/Dockerfile .
```

## 2 Set Up Environment Variables

The key environment variable is `LLM_ENDPOINT`, which specifies the URL of the OpenAI-compatible API. This can be a local address (e.g., for vLLM or TGI) or a remote address.

```bash
export host_ip=$(hostname -I | awk '{print $1}')
export LLM_MODEL_ID=""   # e.g. "google/gemma-3-1b-it:free"
export LLM_ENDPOINT=""   # e.g. "http://localhost:8000" (local vLLM) or "https://openrouter.ai/api" (be sure to omit the /v1 suffix)
export OPENAI_API_KEY=""
```

## 3 Run the TextGen Service

```bash
export service_name="textgen-service-endpoint-openai"
docker compose -f comps/llms/deployment/docker_compose/compose_text-generation.yaml up ${service_name} -d
```

To observe logs:

```bash
docker logs textgen-service-endpoint-openai
```

## 4 Test the Service

You can first test the remote or local endpoint directly with `curl`. For example, with a service like OpenRouter:

```bash
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "'${LLM_MODEL_ID}'",
    "messages": [
      {
        "role": "user",
        "content": "Tell me a joke?"
      }
    ]
  }'
```

Then test the OPEA text generation service that wraps the endpoint:

```bash
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"'${LLM_MODEL_ID}'","messages":[{"role":"user","content":"Tell me a joke?"}]}'
```
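The same request the final curl example sends can be built in Python with the standard library only. This is a hedged sketch: it assumes the service from step 3 is listening on port 9000 (as in the README), and the actual network call is left commented so the snippet runs without a live service:

```python
# Sketch: building the wrapped-service request in Python instead of curl.
# Assumes the textgen service is on http://localhost:9000 (from step 3);
# uncomment the last two lines to actually send the request.
import json
import os
import urllib.request

payload = {
    "model": os.environ.get("LLM_MODEL_ID", "google/gemma-3-1b-it:free"),
    "messages": [{"role": "user", "content": "Tell me a joke?"}],
}
req = urllib.request.Request(
    "http://localhost:9000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```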

comps/llms/src/text-generation/integrations/service.py

Lines changed: 56 additions & 56 deletions
```diff
@@ -2,7 +2,9 @@
 # SPDX-License-Identifier: Apache-2.0

 import asyncio
+import logging
 import os
+from pprint import pformat
 from typing import Union

 from fastapi.responses import StreamingResponse
@@ -11,12 +13,18 @@
 from comps import CustomLogger, LLMParamsDoc, OpeaComponent, OpeaComponentRegistry, SearchedDoc, ServiceType
 from comps.cores.mega.utils import ConfigError, get_access_token, load_model_configs
-from comps.cores.proto.api_protocol import ChatCompletionRequest
+from comps.cores.proto.api_protocol import ALLOWED_CHATCOMPLETION_ARGS, ALLOWED_COMPLETION_ARGS, ChatCompletionRequest

 from .template import ChatTemplate

 logger = CustomLogger("opea_llm")
-logflag = os.getenv("LOGFLAG", False)
+
+# Configure advanced logging based on the LOGFLAG environment variable
+logflag = os.getenv("LOGFLAG", "False").lower() in ("true", "1", "yes")
+if logflag:
+    logger.logger.setLevel(logging.DEBUG)
+else:
+    logger.logger.setLevel(logging.INFO)

 # Environment variables
 MODEL_NAME = os.getenv("LLM_MODEL_ID")
@@ -96,36 +104,39 @@ async def send_simple_request():
     def align_input(
         self, input: Union[LLMParamsDoc, ChatCompletionRequest, SearchedDoc], prompt_template, input_variables
     ):
+        """Aligns different input types to a standardized chat completion format.
+
+        Args:
+            input: SearchedDoc, LLMParamsDoc, or ChatCompletionRequest
+            prompt_template: Optional template for formatting prompts
+            input_variables: Variables expected by the prompt template
+        """
         if isinstance(input, SearchedDoc):
-            if logflag:
-                logger.info("[ SearchedDoc ] input from retriever microservice")
+            logger.debug(f"Processing SearchedDoc input from retriever microservice:\n{pformat(vars(input), indent=2)}")
             prompt = input.initial_query
             if input.retrieved_docs:
                 docs = [doc.text for doc in input.retrieved_docs]
-                if logflag:
-                    logger.info(f"[ SearchedDoc ] combined retrieved docs: {docs}")
+                logger.debug(f"Retrieved documents:\n{pformat(docs, indent=2)}")
                 prompt = ChatTemplate.generate_rag_prompt(input.initial_query, docs, MODEL_NAME)
+                logger.debug(f"Generated RAG prompt:\n{prompt}")

-            ## use default ChatCompletionRequest parameters
-            new_input = ChatCompletionRequest(messages=prompt)
+            # Convert to ChatCompletionRequest with default parameters
+            new_input = ChatCompletionRequest(messages=prompt)
+            logger.debug(f"Final converted input:\n{pformat(vars(new_input), indent=2)}")

-            if logflag:
-                logger.info(f"[ SearchedDoc ] final input: {new_input}")
-
-            return prompt, new_input
+            return prompt, new_input

         elif isinstance(input, LLMParamsDoc):
-            if logflag:
-                logger.info("[ LLMParamsDoc ] input from rerank microservice")
+            logger.debug(f"Processing LLMParamsDoc input from rerank microservice:\n{pformat(vars(input), indent=2)}")
             prompt = input.query
             if prompt_template:
                 if sorted(input_variables) == ["context", "question"]:
                     prompt = prompt_template.format(question=input.query, context="\n".join(input.documents))
                 elif input_variables == ["question"]:
                     prompt = prompt_template.format(question=input.query)
                 else:
-                    logger.info(
-                        f"[ LLMParamsDoc ] {prompt_template} not used, we only support 2 input variables ['question', 'context']"
+                    logger.warning(
+                        f"Prompt template not used - unsupported variables. Template: {prompt_template}\nOnly ['question', 'context'] or ['question'] are supported"
                     )
             else:
                 if input.documents:
@@ -145,8 +156,7 @@ def align_input(
             return prompt, new_input

         else:
-            if logflag:
-                logger.info("[ ChatCompletionRequest ] input in opea format")
+            logger.debug(f"Processing ChatCompletionRequest input:\n{pformat(vars(input), indent=2)}")

             prompt = input.messages
             if prompt_template:
@@ -179,8 +189,7 @@ async def invoke(self, input: Union[LLMParamsDoc, ChatCompletionRequest, Searche
             input_variables = prompt_template.input_variables

         if isinstance(input, ChatCompletionRequest) and not isinstance(input.messages, str):
-            if logflag:
-                logger.info("[ ChatCompletionRequest ] input in opea format")
+            logger.debug("[ ChatCompletionRequest ] input in opea format")

             if input.messages[0]["role"] == "system":
                 if "{context}" in input.messages[0]["content"]:
@@ -200,22 +209,11 @@ async def invoke(self, input: Union[LLMParamsDoc, ChatCompletionRequest, Searche

                 input.messages.insert(0, {"role": "system", "content": system_prompt})

-            chat_completion = await self.client.chat.completions.create(
-                model=MODEL_NAME,
-                messages=input.messages,
-                frequency_penalty=input.frequency_penalty,
-                max_tokens=input.max_tokens,
-                n=input.n,
-                presence_penalty=input.presence_penalty,
-                response_format=input.response_format,
-                seed=input.seed,
-                stop=input.stop,
-                stream=input.stream,
-                stream_options=input.stream_options,
-                temperature=input.temperature,
-                top_p=input.top_p,
-                user=input.user,
-            )
+            # Create input params directly from input object attributes
+            input_params = {**vars(input), "model": MODEL_NAME}
+            filtered_params = self._filter_api_params(input_params, ALLOWED_CHATCOMPLETION_ARGS)
+            logger.debug(f"Filtered chat completion parameters:\n{pformat(filtered_params, indent=2)}")
+            chat_completion = await self.client.chat.completions.create(**filtered_params)
             """TODO need validate following parameters for vllm
                 logit_bias=input.logit_bias,
                 logprobs=input.logprobs,
@@ -226,22 +224,10 @@ async def invoke(self, input: Union[LLMParamsDoc, ChatCompletionRequest, Searche
                 parallel_tool_calls=input.parallel_tool_calls,"""
         else:
             prompt, input = self.align_input(input, prompt_template, input_variables)
-            chat_completion = await self.client.completions.create(
-                model=MODEL_NAME,
-                prompt=prompt,
-                echo=input.echo,
-                frequency_penalty=input.frequency_penalty,
-                max_tokens=input.max_tokens,
-                n=input.n,
-                presence_penalty=input.presence_penalty,
-                seed=input.seed,
-                stop=input.stop,
-                stream=input.stream,
-                suffix=input.suffix,
-                temperature=input.temperature,
-                top_p=input.top_p,
-                user=input.user,
-            )
+            input_params = {**vars(input), "model": MODEL_NAME, "prompt": prompt}
+            filtered_params = self._filter_api_params(input_params, ALLOWED_COMPLETION_ARGS)
+            logger.debug(f"Filtered completion parameters:\n{pformat(filtered_params, indent=2)}")
+            chat_completion = await self.client.completions.create(**filtered_params)
             """TODO need validate following parameters for vllm
                 best_of=input.best_of,
                 logit_bias=input.logit_bias,
@@ -251,15 +237,29 @@ async def invoke(self, input: Union[LLMParamsDoc, ChatCompletionRequest, Searche

         async def stream_generator():
             async for c in chat_completion:
-                if logflag:
-                    logger.info(c)
+                logger.debug(c)
                 chunk = c.model_dump_json()
                 if chunk not in ["<|im_end|>", "<|endoftext|>"]:
                     yield f"data: {chunk}\n\n"
             yield "data: [DONE]\n\n"

         return StreamingResponse(stream_generator(), media_type="text/event-stream")
         else:
-            if logflag:
-                logger.info(chat_completion)
+            logger.debug(chat_completion)
             return chat_completion
+
+    def _filter_api_params(self, input_params: dict, allowed_args: tuple) -> dict:
+        """Filters input parameters to only include allowed non-None arguments.
+
+        Only allow allowed args, and filter out None-valued default arguments, because
+        some OpenAI-like APIs, e.g. OpenRouter.ai, disallow None parameters.
+        Works for both chat completion and regular completion API calls.
+
+        Args:
+            input_params: Dictionary of input parameters
+            allowed_args: Tuple of allowed argument names
+
+        Returns:
+            Filtered dictionary containing only allowed non-None arguments
+        """
+        return {arg: input_params[arg] for arg in allowed_args if arg in input_params and input_params[arg] is not None}
```
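Besides the parameter filtering, the diff also rewrites the `LOGFLAG` check. The old `os.getenv("LOGFLAG", False)` is subtly wrong: any non-empty string from the environment is truthy, so even `LOGFLAG=false` would enable verbose logging. A small standalone illustration of the fix (standard library only; variable names here are for illustration):

```python
import os

# Simulate a user disabling the flag the "obvious" way.
os.environ["LOGFLAG"] = "false"

# Old check: getenv returns the string "false", and bool("false") is True.
old_style = bool(os.getenv("LOGFLAG", False))

# New check from the diff: parse the string explicitly against known truthy values.
new_style = os.getenv("LOGFLAG", "False").lower() in ("true", "1", "yes")

print(old_style, new_style)  # → True False
```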
