
Commit aa7a222

mdrxy and ccurme authored
oss(py): update openai chat page (#1450)
Closes #455
Closes #479
Closes #532

Co-authored-by: ccurme <chester.curme@gmail.com>
1 parent 5941c83 commit aa7a222

File tree

1 file changed (+107 lines, -0 lines)

src/oss/python/integrations/chat/openai.mdx

Lines changed: 107 additions & 0 deletions
@@ -89,6 +89,14 @@ llm = ChatOpenAI(
 
 See the @[`ChatOpenAI`] API Reference for the full set of available model parameters.
 
+<Note>
+**Token parameter deprecation**
+
+OpenAI deprecated `max_tokens` in favor of `max_completion_tokens` in September 2024. While `max_tokens` is still supported for backwards compatibility, it's automatically converted to `max_completion_tokens` internally.
+</Note>
+
+---
+
 ## Invocation
 
 ```python
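
For illustration, a minimal sketch of the legacy-parameter handling described in the note above, assuming `langchain_openai` is installed and `OPENAI_API_KEY` is set; the model name and token limit are arbitrary placeholders:

```python
from langchain_openai import ChatOpenAI

# Legacy parameter: still accepted, and converted to `max_completion_tokens`
# in the underlying API request per the note above.
llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=256)

response = llm.invoke("Say hello in French.")
print(response.content)
```

Recent `langchain-openai` releases may also accept `max_completion_tokens` directly; the @[`ChatOpenAI`] API Reference lists the parameters your version supports.
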
@@ -115,6 +123,8 @@ print(ai_msg.text)
 J'adore la programmation.
 ```
 
+---
+
 ## Streaming usage metadata
 
 OpenAI's Chat Completions API does not stream token usage statistics by default (see API reference [here](https://platform.openai.com/docs/api-reference/completions/create#completions-create-stream_options)).
@@ -127,6 +137,8 @@ from langchain_openai import ChatOpenAI
 llm = ChatOpenAI(model="gpt-4.1-mini", stream_usage=True) # [!code highlight]
 ```
 
+---
+
 ## Using with Azure OpenAI
 
 <Info>
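
A rough sketch of how the `stream_usage` flag shown above surfaces token counts while streaming; the prompt is an arbitrary placeholder:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", stream_usage=True)

full = None
for chunk in llm.stream("Translate 'I love programming.' to French."):
    # Message chunks support `+`, so they can be merged as they arrive.
    full = chunk if full is None else full + chunk

# With stream_usage=True, the merged message carries usage_metadata.
print(full.usage_metadata)
```
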
@@ -222,6 +234,8 @@ When using an async callable for the API key, you must use async methods (`ainvo
 
 </Accordion>
 
+---
+
 ## Tool calling
 
 OpenAI has a [tool calling](https://platform.openai.com/docs/guides/function-calling) (we use "tool calling" and "function calling" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. tool-calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.
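
As a hedged sketch of the tool-calling flow described above (the tool, model, and prompt are made-up placeholders):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b


llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([multiply])

ai_msg = llm_with_tools.invoke("What is 6 times 7?")
# The model returns structured tool calls rather than plain text.
print(ai_msg.tool_calls)
```
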
@@ -463,6 +477,8 @@ Name: do_math
 
 </Accordion>
 
+---
+
 ## Responses API
 
 <Info>
@@ -1066,6 +1082,16 @@ for block in response.content_blocks:
 The user is asking about 3 raised to the power of 3. That's a pretty simple calculation! I know that 3^3 equals 27, so I can say, "3 to the power of 3 equals 27." I might also include a quick explanation that it's 3 multiplied by itself three times: 3 × 3 × 3 = 27. So, the answer is definitely 27.
 ```
 
+<Tip>
+**Troubleshooting: Empty responses from reasoning models**
+
+If you're getting empty responses from reasoning models like `gpt-5-nano`, this is likely due to restrictive token limits. The model uses tokens for internal reasoning and may not have any left for the final output.
+
+Ensure `max_tokens` is set to `None` or increase the token limit to allow sufficient tokens for both reasoning and output generation.
+</Tip>
+
+---
+
 ## Fine-tuning
 
 You can call fine-tuned OpenAI models by passing in your corresponding `modelName` parameter.
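
To illustrate the troubleshooting tip added above, a minimal sketch assuming `gpt-5-nano` is available on your account; the prompt is arbitrary:

```python
from langchain_openai import ChatOpenAI

# A small cap can be consumed entirely by hidden reasoning tokens,
# leaving an empty visible answer. Leaving the limit unset avoids that.
llm = ChatOpenAI(model="gpt-5-nano", max_tokens=None)

response = llm.invoke("What is 3 to the power of 3?")
print(response.content)
```
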
@@ -1084,6 +1110,8 @@ fine_tuned_model.invoke(messages)
 AIMessage(content="J'adore la programmation.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 31, 'total_tokens': 39}, 'model_name': 'ft:gpt-3.5-turbo-0613:langchain::7qTVM5AR', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-0f39b30e-c56e-4f3b-af99-5c948c984146-0', usage_metadata={'input_tokens': 31, 'output_tokens': 8, 'total_tokens': 39})
 ```
 
+---
+
 ## Multimodal Inputs (images, PDFs, audio)
 
 OpenAI has models that support multimodal inputs. You can pass in images, PDFs, or audio to these models. For more information on how to do this in LangChain, head to the [multimodal inputs](/oss/langchain/messages#multimodal) docs.
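
A minimal sketch of passing an image using the OpenAI-style content block format; the image URL is a placeholder, and the linked multimodal docs cover PDFs and audio as well:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        # Placeholder URL; a base64 data URL also works here.
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
}

response = llm.invoke([message])
print(response.content)
```
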
@@ -1196,6 +1224,8 @@ content_block = {
 ```
 </Accordion>
 
+---
+
 ## Predicted output
 
 <Info>
@@ -1268,6 +1298,7 @@ public class User
 ```
 
 Note that currently predictions are billed as additional tokens and may increase your usage and costs in exchange for this reduced latency.
+---
 
 ## Audio Generation (Preview)
 
@@ -1326,6 +1357,82 @@ history = [
 second_output_message = llm.invoke(history)
 ```
 
+---
+
+## Prompt caching
+
+OpenAI's [prompt caching](https://platform.openai.com/docs/guides/prompt-caching) feature automatically caches prompts longer than 1024 tokens to reduce costs and improve response times. This feature is enabled for all recent models (`gpt-4o` and newer).
+
+### Manual caching
+
+You can use the `prompt_cache_key` parameter to influence OpenAI's caching and optimize cache hit rates:
+
+```python
+from langchain_openai import ChatOpenAI
+
+llm = ChatOpenAI(model="gpt-4o")
+
+# Use a cache key for repeated prompts
+messages = [
+    {"role": "system", "content": "You are a helpful assistant that translates English to French."},
+    {"role": "user", "content": "I love programming."},
+]
+
+response = llm.invoke(
+    messages,
+    prompt_cache_key="translation-assistant-v1"
+)
+
+# Check cache usage
+cache_read_tokens = response.usage_metadata["input_token_details"]["cache_read"]
+print(f"Cached tokens used: {cache_read_tokens}")
+```
+
+<Warning>
+Cache hits require the prompt prefix to match exactly
+</Warning>
+
+### Cache key strategies
+
+You can use different cache key strategies based on your application's needs:
+
+```python
+# Static cache keys for consistent prompt templates
+customer_response = llm.invoke(
+    messages,
+    prompt_cache_key="customer-support-v1"
+)
+
+support_response = llm.invoke(
+    messages,
+    prompt_cache_key="internal-support-v1"
+)
+
+# Dynamic cache keys based on context
+user_type = "premium"
+cache_key = f"assistant-{user_type}-v1"
+response = llm.invoke(messages, prompt_cache_key=cache_key)
+```
+
+### Model-level caching
+
+You can also set a default cache key at the model level using `model_kwargs`:
+
+```python
+llm = ChatOpenAI(
+    model="gpt-4o-mini",
+    model_kwargs={"prompt_cache_key": "default-cache-v1"}
+)
+
+# Uses default cache key
+response1 = llm.invoke(messages)
+
+# Override with specific cache key
+response2 = llm.invoke(messages, prompt_cache_key="override-cache-v1")
+```
+
+---
+
 ## Flex processing
 
 OpenAI offers a variety of [service tiers](https://platform.openai.com/docs/guides/flex-processing). The "flex" tier offers cheaper pricing for requests, with the trade-off that responses may take longer and resources might not always be available. This approach is best suited for non-critical tasks, including model testing, data enhancement, or jobs that can be run asynchronously.
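
A hedged sketch of opting into the flex tier, assuming a `langchain-openai` release that exposes a `service_tier` parameter; `o4-mini` stands in for any flex-eligible model:

```python
from langchain_openai import ChatOpenAI

# "flex" trades latency and availability for lower cost, so it suits
# non-critical or batch-style workloads.
llm = ChatOpenAI(
    model="o4-mini",
    service_tier="flex",
)

response = llm.invoke("Summarize the benefits of asynchronous batch jobs.")
print(response.content)
```
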
