
Redo: https://github.com/hack-r/realtime-hud/pull/3 #4

@hack-r


#3

That was a bad PR. The client disconnects immediately. Did you follow the docs I shared with you? Ensure that you are using realtime-1.5 and that you use the correct syntax.

Make sure you surface or log any OpenAI errors. Understand that although this model is optimized for speech-to-speech, we are more interested in its high speed and frequent updates, and will mostly use it as image-to-text (we show it screenshots and it produces notes).
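To make the two requirements above concrete, here is a hedged sketch (not tested against this repo): `logIfError` surfaces server `error` events instead of swallowing them, and `buildScreenshotTurn` builds an image-to-text user message. The helper names and prompt text are illustrative; the event shapes follow the GA Realtime docs pasted below.

```javascript
// Surface server `error` events instead of silently dropping them.
function logIfError(event) {
  if (event.type === "error") {
    // Print the full payload so failures are never silent.
    console.error("OpenAI Realtime error:", JSON.stringify(event, null, 2));
    return true;
  }
  return false;
}

// Image-to-text turn: a user message carrying a screenshot plus a text prompt.
function buildScreenshotTurn(base64Png) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_text", text: "Produce notes on this screenshot." },
        { type: "input_image", image_url: `data:image/png;base64,${base64Png}` },
      ],
    },
  };
}

// Against an open WebSocket `ws`, usage would look like:
//   ws.on("message", (raw) => logIfError(JSON.parse(raw.toString())));
//   ws.send(JSON.stringify(buildScreenshotTurn(png)));
//   ws.send(JSON.stringify({ type: "response.create" }));
```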

Realtime API

Build low-latency, multimodal LLM applications with the Realtime API.

The OpenAI Realtime API enables low-latency communication with models that natively support speech-to-speech interactions as well as multimodal inputs (audio, images, and text) and outputs (audio and text). These APIs can also be used for realtime audio transcription.
Voice agents

One of the most common use cases for the Realtime API is building voice agents for speech-to-speech model interactions in the browser. Our recommended starting point for these types of applications is the Agents SDK for TypeScript, which uses a WebRTC connection to the Realtime model in the browser, and WebSocket when used on the server.

import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent);

// Automatically connects your microphone and audio output
await session.connect({
  apiKey: "",
});

Voice Agent Quickstart

Follow the voice agent quickstart to build Realtime agents in the browser.

To use the Realtime API directly outside the context of voice agents, check out the other connection options below.
Connection methods

While building voice agents with the Agents SDK is the fastest path to one specific type of application, the Realtime API provides an entire suite of flexible tools for a variety of use cases.

There are three primary supported interfaces for the Realtime API:
WebRTC connection

Ideal for browser and client-side interactions with a Realtime model.
WebSocket connection

Ideal for middle tier server-side applications with consistent low-latency network connections.
SIP connection

Ideal for VoIP telephony connections.

Depending on how you’d like to connect to a Realtime model, check out one of the connection guides above to get started. You’ll learn how to initialize a Realtime session, and how to interact with a Realtime model using client and server events.
API Usage

Once connected to a realtime model using one of the methods above, learn how to interact with the model in these usage guides.

[Prompting guide](https://developers.openai.com/api/docs/guides/realtime-models-prompting): learn tips and best practices for prompting and steering Realtime models.
[Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations): Learn about the Realtime session lifecycle and the key events that happen during a conversation.
[Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls): Learn how you can control a Realtime session on the server to call tools and implement guardrails.
[Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs): Learn how to monitor and optimize your usage of the Realtime API.
[Realtime audio transcription](https://developers.openai.com/api/docs/guides/realtime-transcription): Transcribe audio streams in real time over a WebSocket connection.

Beta to GA migration

There are a few key differences between the interfaces in the Realtime beta API and the recently released GA API. Expand the topics below for more information about migrating from the beta interface to GA.
Beta header

For REST API requests, WebSocket connections, and other interfaces with the Realtime API, beta users had to include the following header with each request:

OpenAI-Beta: realtime=v1

This header should be removed for requests to the GA interface. To retain the behavior of the beta API, continue to include the header.
Generating ephemeral API keys

In the beta interface, there were multiple endpoints for generating ephemeral keys for either Realtime sessions or transcription sessions. In the GA interface, there is only one REST API endpoint used to generate keys - POST /v1/realtime/client_secrets.

To create a session and receive a client secret you can use to initialize a WebRTC or WebSocket connection on a client, you can request one like this using the appropriate session configuration:

const sessionConfig = JSON.stringify({
  session: {
    type: "realtime",
    model: "gpt-realtime",
    audio: {
      output: { voice: "marin" },
    },
  },
});

const response = await fetch(
  "https://api.openai.com/v1/realtime/client_secrets",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: sessionConfig,
  }
);

const data = await response.json();
console.log(data.value); // e.g. ek_68af296e8e408191a1120ab6383263c2

These tokens can safely be used in client environments like browsers and mobile applications.
New URL for WebRTC SDP data

When initializing a WebRTC session in the browser, the URL for obtaining remote session information via SDP is now /v1/realtime/calls:

const baseUrl = "https://api.openai.com/v1/realtime/calls";
const model = "gpt-realtime";
const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: "Bearer YOUR_EPHEMERAL_KEY_HERE",
    "Content-Type": "application/sdp",
  },
});

const sdp = await sdpResponse.text();
const answer = { type: "answer", sdp };
await pc.setRemoteDescription(answer);

New event names and shapes
New conversation item events

For response.output_item, the API has always had both .added and .done events, but for conversation-level items the API previously had only .created, which by convention is emitted at the start, when the item is added.

We have added .added and .done events to give developers better ergonomics when receiving items that need some loading time (such as MCP tool listings, or input audio transcriptions if these were to be modeled as items in the future).

Current event shape for conversation items added:

{
  "event_id": "event_1920",
  "type": "conversation.item.created",
  "previous_item_id": "msg_002",
  "item": Item
}

New events to replace the above:

{
  "event_id": "event_1920",
  "type": "conversation.item.added",
  "previous_item_id": "msg_002",
  "item": Item
}

{
  "event_id": "event_1920",
  "type": "conversation.item.done",
  "previous_item_id": "msg_002",
  "item": Item
}

Input and output item changes
All Items

Realtime API sets an object=realtime.item param on all items in the GA interface.
Function Call Output

status : Realtime now accepts a no-op status field for the function call output item param. This aligns with the Responses API implementation.
Message

Assistant Message Content

The type properties of output assistant messages now align with the Responses API:

type=text → type=output_text (no change to text field name)
type=audio → type=output_audio (no change to audio field name)
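A hedged sketch of a small shim for replaying beta-era assistant messages into the GA API, applying only the two renames above. `migrateAssistantContent` is an illustrative helper name, not part of any SDK.

```javascript
// Beta → GA rename map for assistant message content types.
const GA_CONTENT_TYPES = { text: "output_text", audio: "output_audio" };

function migrateAssistantContent(content) {
  return content.map((part) =>
    part.type in GA_CONTENT_TYPES
      ? { ...part, type: GA_CONTENT_TYPES[part.type] } // rename type only
      : part // field names (`text`, `audio`) and GA-era parts are unchanged
  );
}
```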

Managing costs

Understanding and managing token costs with the Realtime API.

This document describes how Realtime API billing works and offers strategies for optimizing costs. Costs are accrued as input and output tokens of different modalities: text, audio, and image. Token costs vary per model, with prices listed on the model pages (e.g. for gpt-realtime and gpt-realtime-mini).

Conversational Realtime API sessions are a series of turns, where the user adds input that triggers a Response to produce the model output. The server maintains a Conversation, which is a list of Items that form the input for the next turn. When a Response is returned the output is automatically added to the Conversation.
Per-Response costs

Realtime API costs are accrued when a Response is created, and are charged based on the number of input and output tokens (except for input transcription costs, see below). There is no cost currently for network bandwidth or connections. A Response can be created manually or automatically if voice activity detection (VAD) is turned on. VAD effectively filters out empty input audio, so empty audio does not count as input tokens unless the client manually adds it as conversation input.

The entire conversation is sent to the model for each Response. The output from a turn will be added as Items to the server Conversation and become the input to subsequent turns, thus turns later in the session will be more expensive.

Text token costs can be estimated using our tokenization tools. Audio tokens in user messages are 1 token per 100 ms of audio, while audio tokens in assistant messages are 1 token per 50 ms of audio. Note that token counts include special tokens in addition to the content of a message, which surface as small variations in these counts; for example, a user message with 10 text tokens of content may count as 12 tokens.
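A rough estimator for the audio rates above (user: 1 token per 100 ms, assistant: 1 token per 50 ms). Real counts also include a few special tokens, so treat this as a lower bound; the function name is illustrative.

```javascript
// Estimate audio tokens from duration in milliseconds and message role.
function estimateAudioTokens(ms, role) {
  const msPerToken = role === "user" ? 100 : 50;
  return Math.ceil(ms / msPerToken);
}

// Two seconds of user audio ≈ 20 tokens; the same duration of assistant
// output audio ≈ 40 tokens.
```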
Example

Here’s a simple example to illustrate token costs over a multi-turn Realtime API session.

For the first turn in the conversation we’ve added 100 tokens of instructions, a user message of 20 audio tokens (for example added by VAD based on the user speaking), for a total of 120 input tokens. Creating a Response generates an assistant output message (20 audio, 10 text tokens).

Then we create a second turn with another user audio message. What will the tokens for turn 2 look like? The Conversation at this point includes the initial instructions, the first user message, the assistant output message from the first turn, plus the second user message (25 audio tokens). This turn will have 110 text tokens (100 instruction + 10 assistant text) and 65 audio tokens (20 + 20 + 25) for input, plus the output tokens of another assistant output message.

[Figure: tokens on successive conversation turns]

The messages from the first turn are likely to be cached for turn 2, which reduces the input cost. See below for more information on caching.

The tokens used for a Response can be read from the response.done event, which looks like the following.

{
  "type": "response.done",
  "response": {
    ...
    "usage": {
      "total_tokens": 253,
      "input_tokens": 132,
      "output_tokens": 121,
      "input_token_details": {
        "text_tokens": 119,
        "audio_tokens": 13,
        "image_tokens": 0,
        "cached_tokens": 64,
        "cached_tokens_details": {
          "text_tokens": 64,
          "audio_tokens": 0,
          "image_tokens": 0
        }
      },
      "output_token_details": {
        "text_tokens": 30,
        "audio_tokens": 91
      }
    }
  }
}
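As a hedged sketch, session-level usage can be accumulated from `response.done` events with the shape shown above. `makeUsageTracker` is an illustrative name, not an SDK function.

```javascript
// Accumulate token usage across a session from response.done events.
function makeUsageTracker() {
  const totals = { input_tokens: 0, output_tokens: 0, cached_tokens: 0 };
  return {
    onEvent(event) {
      if (event.type !== "response.done") return; // ignore all other events
      const usage = event.response.usage;
      totals.input_tokens += usage.input_tokens;
      totals.output_tokens += usage.output_tokens;
      totals.cached_tokens += usage.input_token_details?.cached_tokens ?? 0;
    },
    totals: () => ({ ...totals }), // return a copy, not the live object
  };
}
```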

Input transcription costs

Aside from conversational Responses, the Realtime API bills for input transcriptions, if enabled. Input transcription uses a different model than the speech-to-speech model, such as whisper-1 or gpt-4o-transcribe, and is therefore billed at a different rate. Transcription is performed when audio is written to the input audio buffer and then committed, either manually or by VAD.

Input transcription token counts can be read from the conversation.item.input_audio_transcription.completed event, as in the following example.

{
  "type": "conversation.item.input_audio_transcription.completed",
  ...
  "transcript": "Hi, can you hear me?",
  "usage": {
    "type": "tokens",
    "total_tokens": 26,
    "input_tokens": 17,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 17
    },
    "output_tokens": 9
  }
}

Caching

Realtime API supports prompt caching, which is applied automatically and can dramatically reduce the costs of input tokens during multi-turn sessions. Caching applies when the input tokens of a Response match tokens from a previous Response, though this is best-effort and not guaranteed.

The best strategy for maximizing cache rate is to keep a session’s history static. Removing or changing content in the conversation will “bust” the cache from the point of the change onward — the input no longer matches as much as before. Note that instructions and tool definitions are at the beginning of a conversation, so changing these mid-session will reduce the cache rate for subsequent turns.
Truncation

When the number of tokens in a conversation exceeds the model’s input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will be dropped from the Response input. For example, a 32k (32,768) context model with 4,096 max output tokens can only include 28,672 tokens in the context before truncation occurs.

Clients can set a smaller token window than the model’s maximum, which is a good way to control token usage and cost. This is controlled with the token_limits.post_instructions configuration (if you configure truncation with a retention_ratio type as shown below). As the name indicates, this controls the maximum number of input tokens for a Response, except for the instruction tokens. Setting post_instructions to 1,000 means that items over the 1,000 input token limit will not be sent to the model for a Response.

Truncation busts the cache near the beginning of the conversation, and if truncation occurs on every turn the cache rate will be very low. To mitigate this, clients can configure truncation to drop more messages than strictly necessary, which extends the headroom before another truncation is needed. This is controlled with the session.truncation.retention_ratio setting. The server defaults to a value of 1.0, meaning truncation removes only the items necessary. A value of 0.8 means a truncation would retain 80% of the maximum, dropping an additional 20%.

If you’re attempting to reduce Realtime API cost per session (for a given model), we recommend limiting the number of tokens and setting a retention_ratio less than 1, as in the following example. Remember that there is a tradeoff here: lower cost, but less model memory for a given turn.

{
  "event": "session.update",
  "session": {
    "truncation": {
      "type": "retention_ratio",
      "retention_ratio": 0.8,
      "token_limits": {
        "post_instructions": 8000
      }
    }
  }
}

Truncation can also be completely disabled, as shown below. When disabled an error will be returned if the Conversation is too long to create a Response. This may be useful if you intend to manage the Conversation size manually.

{
  "event": "session.update",
  "session": {
    "truncation": "disabled"
  }
}

Other optimization strategies
Using a mini model

The Realtime speech-to-speech models come in a “normal” size and a mini size, which is significantly cheaper. The tradeoff tends to be intelligence related to instruction following and function calling, which will not be as strong in the mini model. We recommend first testing applications with the larger model, refining your application and prompt, then attempting to optimize costs with the mini model.
Editing the Conversation

While truncation will occur automatically on the server, another cost management strategy is to manually edit the Conversation. A principle of the API is to allow full client control of the server-side Conversation, allowing the client to add and remove items at will.

{
  "type": "conversation.item.delete",
  "item_id": "item_CCXLecNJVIVR2HUy3ABLj"
}

Clearing out old messages is a good way to reduce input token sizes and cost. This might remove important content, but a common strategy is to replace these old messages with a summary. Items can be deleted from the Conversation with a conversation.item.delete message as above, and can be added with a conversation.item.create message.
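The summarization strategy above can be sketched as follows: delete the old items and insert one summary message in their place. The helper name, ids, and summary wording are illustrative; the event shapes follow the docs above, and the summary itself would come from elsewhere (e.g. a separate text-model call).

```javascript
// Build the client events that replace old items with a single summary item.
function buildSummaryEdit(oldItemIds, summaryText) {
  const deletes = oldItemIds.map((id) => ({
    type: "conversation.item.delete",
    item_id: id,
  }));
  const create = {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_text", text: `Summary of earlier turns: ${summaryText}` },
      ],
    },
  };
  return [...deletes, create]; // send each event over the session, in order
}
```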
Estimating costs

Given the complexity in Realtime API token usage it can be difficult to estimate your costs ahead of time. A good approach is to use the Realtime Playground with your intended prompts and functions, and measure the token usage over a sample session. The token usage for a session can be found under the Logs tab in the Realtime Playground next to the session id.
