Align streaming response with OpenAI API and remove double latency #45

stuartleeks wants to merge 1 commit into main
Conversation
stuartleeks commented on Jul 4, 2024
- Fix differences between the streaming response and the OpenAI API content/format
- Avoid adding latency on the overall response for streaming, as each chunk already has added latency
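For reference, a sketch of the wire format the streaming response is being aligned with: the OpenAI API streams chat completions as server-sent events, one `data:` line per chunk, terminated by `data: [DONE]` (the values below are illustrative, not actual output):

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-35-turbo","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","model":"gpt-35-turbo","choices":[{"index":0,"delta":{"content":" hello"},"finish_reason":null}]}

data: [DONE]
```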
stuartleeks force-pushed from 4333e09 to 75c1c49
lucashuet93 left a comment
My apologies for missing this one!
> The `aoai-simulator.latency.full` metric measures the full latency of the simulator. This is the time taken to process a request _including_ any added latency.
>
> NOTE: Added latency for streaming requests is not included in this metric.
I had to read this a couple of times to understand what was being said. That might be just me.
Is it saying...?
- "For streaming requests, the added latency is not included in this metric" and if so, what does this metric show for streaming requests?
- That this metric is meaningless for streaming requests, and should be ignored?
- That this metric is not reported for streaming requests
OK, I'll re-word. It's option 2.
```python
response_id = "chatcmpl-" + nanoid.non_secure_generate(size=29)
words = generated_content.split(" ")
# determine the per-token latency to use in seconds from config
per_token_latency_s = context.config.latency.open_ai_chat_completions.get_value() / 1000
```
dumb question: what's the 1000 doing here?
It's converting milliseconds to seconds (the config value is in milliseconds).
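For context, a minimal sketch of how a per-chunk delay in seconds might be used in the streaming generator (names here are illustrative, not the simulator's exact code):

```python
import asyncio


async def send_words(words: list[str], per_token_latency_s: float):
    for word in words:
        # asyncio.sleep takes seconds, hence the /1000 conversion from
        # the millisecond value stored in config
        await asyncio.sleep(per_token_latency_s)
        yield word
```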
```python
async def send_words():
    # Send preamble chunks
    chunk_string = json.dumps(
```
What's the thinking behind this being an inline function?
At this point it's making create_chat_completion_response quite long.
Also, each of the "yielded blocks of JSON" appears to be mostly static, with a small number of dynamic values.
More of a question than a comment, but have you considered refactoring these "JSON generators" into a set of methods, and then orchestrating them (calling them and yielding the results) rather than doing it all inline? Something like the sketch below.
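For illustration, roughly the shape I mean (helper names are hypothetical; `response_id`, `model`, and `words` are captured from the enclosing function, as in the PR):

```python
import json


def build_preamble_chunk(response_id: str, model: str) -> str:
    # Mostly static JSON; only the id and model vary
    chunk = {
        "id": response_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}],
    }
    return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n"


def build_content_chunk(response_id: str, model: str, content: str) -> str:
    chunk = {
        "id": response_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": content}, "finish_reason": None}],
    }
    return "data: " + json.dumps(chunk, separators=(",", ":")) + "\n"


async def send_words():
    # Orchestrate: call the builders and yield their results
    yield build_preamble_chunk(response_id, model)
    for word in words:
        yield build_content_chunk(response_id, model, word)
```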
```python
        },
        separators=(",", ":"),
    )
    yield "data: " + chunk_string + "\n"
```
I think an f-string would be neater here.
```diff
- yield "data: " + chunk_string + "\n"
+ yield f"data: {chunk_string}\n"
```
Other code (including pre-existing) uses the same pattern, so feel free to globally ignore or accept.
| "violence": {"filtered": False, "severity": "safe"}, | ||
| }, | ||
| }, | ||
| "delta": {"content": space + word}, |
I was just wondering if you could simply add the space to the end (i.e. `word + space`) and avoid having to set it each time around the loop with `space = " "`. I guess the real API prepends spaces, right?
Yeah, the actual service prepends spaces.
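For completeness, the prepend pattern being discussed is roughly this (simplified):

```python
space = ""
for word in words:
    # First token has no leading space; every subsequent token is
    # prepended with one, matching the real API's behaviour
    content = space + word
    space = " "
    # ...build and yield the chunk with "delta": {"content": content}...
```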