Align streaming response with OpenAI API and remove double latency #45

stuartleeks wants to merge 1 commit into main
Conversation
stuartleeks commented on Jul 4, 2024
- Fix differences between the streaming response and the OpenAI API content/format
- Avoid adding latency on the overall response for streaming, as each chunk already has added latency
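For reference, a sketch of the wire format the streaming response is being aligned with: the OpenAI API streams chat completions as server-sent events, one `data:` line per chunk, terminated by `data: [DONE]` (the values below are illustrative, not actual output):

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-35-turbo","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-35-turbo","choices":[{"index":0,"delta":{"content":" hello"},"finish_reason":null}]}

data: [DONE]
```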
stuartleeks force-pushed from 4333e09 to 75c1c49
lucashuet93 left a comment
My apologies for missing this one!
> The `aoai-simulator.latency.full` metric measures the full latency of the simulator. This is the time taken to process a request _including_ any added latency.
>
> NOTE: Added latency for streaming requests is not included in this metric.
I had to read this a couple of times to understand what was being said. That might be just me.
Is it saying...?
- "For streaming requests, the added latency is not included in this metric" and if so, what does this metric show for streaming requests?
- That this metric is meaningless for streaming requests, and should be ignored?
- That this metric is not reported for streaming requests
OK, I'll re-word. It's option 2.
```python
response_id = "chatcmpl-" + nanoid.non_secure_generate(size=29)
words = generated_content.split(" ")
# determine the per-token latency to use in seconds from config
per_token_latency_s = context.config.latency.open_ai_chat_completions.get_value() / 1000
```
dumb question: what's the 1000 doing here?
It's converting milliseconds to seconds (the config value is in milliseconds).
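For context, a minimal sketch of how a per-chunk delay in seconds might be used in the streaming generator (names here are illustrative, not the simulator's exact code):

```python
import asyncio


async def send_words(words: list[str], per_token_latency_s: float):
    for word in words:
        # asyncio.sleep takes seconds, hence the /1000 conversion from
        # the millisecond value stored in config
        await asyncio.sleep(per_token_latency_s)
        yield word
```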
```python
async def send_words():
    # Send preamble chunks
    chunk_string = json.dumps(
```
What's the thinking behind this being an inline function?
At this point it's making create_chat_completion_response quite long.
Also, each of the "yielded blocks of JSON" appears to be mostly static, with a small number of dynamic values.
More of a question than a comment, but have you considered refactoring these "JSON generators" into a set of methods, and then orchestrating them (calling them and yielding the results) rather than doing it all inline? Something like the sketch below.
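For illustration, roughly the shape I mean (helper names are hypothetical; `response_id`, `model`, and `words` are captured from the enclosing function, as in the PR):

```python
import json


def build_preamble_chunk(response_id: str, model: str) -> str:
    # Mostly static JSON; only the id and model vary
    chunk = {
        "id": response_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}],
    }
    return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n"


def build_content_chunk(response_id: str, model: str, content: str) -> str:
    chunk = {
        "id": response_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": content}, "finish_reason": None}],
    }
    return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n"


async def send_words():
    # Orchestrate: call the builders and yield their results
    yield build_preamble_chunk(response_id, model)
    for word in words:
        yield build_content_chunk(response_id, model, word)
```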
```python
        },
        separators=(",", ":"),
    )
    yield "data: " + chunk_string + "\n"
```
I think an f-string would be neater here.
```diff
- yield "data: " + chunk_string + "\n"
+ yield f"data: {chunk_string}\n"
```
Other code (including pre-existing) uses the same pattern, so feel free to globally ignore or accept.
| "violence": {"filtered": False, "severity": "safe"}, | ||
| }, | ||
| }, | ||
| "delta": {"content": space + word}, |
I was just wondering if you could simply add the space to the end (i.e. `word + space`) and avoid having to set it each time around the loop with `space = " "`. I guess the real API prepends spaces, right?
Yeah, the actual service prepends spaces.
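For completeness, the prepend pattern being discussed is roughly this (simplified):

```python
space = ""
for word in words:
    # First token has no leading space; every subsequent token is
    # prepended with one, matching the real API's behaviour
    content = space + word
    space = " "
    # ...build and yield the chunk with "delta": {"content": content}...
```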