This repository was archived by the owner on Aug 30, 2024. It is now read-only.

Align streaming response with OpenAI API and remove double latency #45

Open

stuartleeks wants to merge 1 commit into main from stuart/streaming

Conversation

@stuartleeks
Owner

  • Fix differences between streaming response and OpenAI API content/format
  • Avoid adding latency to the response for streaming, as each chunk already has added latency

@stuartleeks stuartleeks requested a review from lucashuet93 July 4, 2024 16:04
Collaborator

@lucashuet93 lucashuet93 left a comment


My apologies for missing this one!

Comment thread on docs/metrics.md

The `aoai-simulator.latency.full` metric measures the full latency of the simulator. This is the time taken to process a request _including_ any added latency.

NOTE: Added latency for streaming requests is not included in this metric.
Contributor


I had to read this a couple of times to understand what was being said. That might be just me.

Is it saying...?

  1. "For streaming requests, the added latency is not included in this metric" (and if so, what does this metric show for streaming requests?)
  2. That this metric is meaningless for streaming requests and should be ignored?
  3. That this metric is not reported for streaming requests?

Owner Author


OK, I'll re-word. It's option 2.

response_id = "chatcmpl-" + nanoid.non_secure_generate(size=29)
words = generated_content.split(" ")
# determine the per-token latency to use in seconds from config
per_token_latency_s = context.config.latency.open_ai_chat_completions.get_value() / 1000
Contributor


Dumb question: what's the 1000 doing here?

Owner Author

@stuartleeks stuartleeks Aug 14, 2024


It's converting milliseconds to seconds.
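The conversion can be sketched as follows (the value here is illustrative, not taken from the simulator's actual config):

```python
# The configured latency is expressed in milliseconds, but asyncio.sleep
# takes seconds, so divide by 1000 before sleeping between streamed chunks.
latency_ms = 1500  # hypothetical configured per-token latency in milliseconds
per_token_latency_s = latency_ms / 1000
print(per_token_latency_s)  # 1.5
```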

async def send_words():

# Send preamble chunks
chunk_string = json.dumps(
Contributor


What's the thinking behind this being an inline function?
At this point it's making create_chat_completion_response quite long.
Also, each of the "yielded blocks of JSON" appear to be mostly static, with a small number of dynamic values.

More of a question than a comment, but have you considered refactoring these "JSON generators" into a set of methods, and then orchestrating them (calling them and yielding the results) rather than doing it all inline?
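The refactor being proposed might look something like this minimal sketch. The function and field names here are hypothetical placeholders, not the simulator's actual code: each "JSON generator" becomes a small helper that fills in the few dynamic values, and an async generator orchestrates them.

```python
import asyncio
import json


def build_preamble_chunk(response_id: str) -> str:
    # Mostly static JSON with a small number of dynamic values injected.
    payload = {"id": response_id, "object": "chat.completion.chunk", "choices": []}
    return "data: " + json.dumps(payload, separators=(",", ":")) + "\n"


def build_content_chunk(response_id: str, content: str) -> str:
    payload = {
        "id": response_id,
        "object": "chat.completion.chunk",
        "choices": [{"delta": {"content": content}}],
    }
    return "data: " + json.dumps(payload, separators=(",", ":")) + "\n"


async def send_words(response_id: str, words: list[str]):
    # Orchestrate the chunk builders instead of inlining all the JSON,
    # keeping the top-level response function short.
    yield build_preamble_chunk(response_id)
    for word in words:
        yield build_content_chunk(response_id, word)
```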

},
separators=(",", ":"),
)
yield "data: " + chunk_string + "\n"
Contributor


I think an f-string would be neater here.

Suggested change
yield "data: " + chunk_string + "\n"
yield f"data: {chunk_string}\n"

Contributor


Other code (including pre-existing code) uses the same pattern, so feel free to globally ignore, or accept.

"violence": {"filtered": False, "severity": "safe"},
},
},
"delta": {"content": space + word},
Contributor


I was just wondering if you could simply add the space to the end (i.e. word + space) and avoid having to set it each time around the loop with space = " ". I guess the real API prepends spaces, right?

Owner Author


Yeah, the actual service prepends spaces.
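The prepend-a-space pattern under discussion can be sketched in isolation like this (a standalone illustration, not the PR's exact code): the first word is emitted as-is, and every subsequent word gets a leading space, so joining the streamed deltas reproduces the original text.

```python
def stream_words(generated_content: str):
    # Mimic the real API: prepend a space to every word after the first,
    # so "hello world" streams as "hello" then " world".
    space = ""
    for word in generated_content.split(" "):
        yield space + word
        space = " "


print(list(stream_words("hello streaming world")))  # ['hello', ' streaming', ' world']
```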
