Enable valid OpenAI `response_format` specification by liamcripwell · Pull Request #1069 · argilla-io/distilabel

liamcripwell · 2024-11-22T13:56:47Z

The current version of the OpenAI API expects response_format to be passed as an object containing a "type" attribute, e.g. {"type": "<type>"}. However, distilabel enforces a string representation for it, which leads to either an error or a silent failure.

E.g. when using a TextGeneration task under the existing codebase:

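# Import paths assumed for distilabel 1.4.x; adjust for your installed version.
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_gen = TextGeneration(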
    llm=OpenAILLM(
        model="gpt-4o",
        generation_kwargs={
            "response_format": "json"
        },
    )
)
text_gen.load()

output = next(
    text_gen.process(
        [{"instruction": "Convert this info to a JSON: John Smith is 30 years old."}]
    )
)

The OpenAI API will fail and yield BadRequestError: Error code: 400 - {'error': {'message': "Invalid type for 'response_format': expected an object, but got a string instead.", 'type': 'invalid_request_error', 'param': 'response_format', 'code': 'invalid_type'}}.

The same happens when directly calling generation from the LLM:

llm = OpenAILLM(
    model="gpt-4o",
)

llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

Presumably the same happens for requests to the Batch API, which ultimately leads to AssertionError: No output file ID was found in the batch.

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=2,  # poll for results every 2 seconds
)

llm.load()
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)
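
For reference, here is a minimal sketch of what one line of a Batch API input file might look like when response_format is sent as an object. The helper name `build_batch_line` and the `custom_id` value are illustrative assumptions, not distilabel's code. The envelope fields (`custom_id`, `method`, `url`, `body`) follow OpenAI's documented batch input format.

```python
import json


def build_batch_line(custom_id: str, model: str, messages: list, response_format: dict) -> str:
    """Serialize one chat-completion request as a single JSONL line (hypothetical helper)."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": messages,
            # An object with a "type" key, not a bare string such as "json".
            "response_format": response_format,
        },
    }
    return json.dumps(request)


line = build_batch_line(
    custom_id="request-0",  # illustrative ID
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}
    ],
    response_format={"type": "json_object"},
)
print(line)
```

If the body instead carried "response_format": "json", each request in the batch would likely be rejected in the same way as the synchronous call, which would be consistent with the batch ending without an output file.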

This PR simply wraps the string representation of the specified response_format inside the object expected by OpenAI.
I have also added the same value checking that is done in agenerate() to offline_batch_generate().
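
A rough sketch of the idea, not the actual diff: the helper name `wrap_response_format` is hypothetical, and the mapping of "json" to OpenAI's "json_object" type is an assumption about how the short names translate.

```python
from typing import Dict

# Assumed mapping from the short names accepted by distilabel to the
# "type" values the OpenAI API understands.
_RESPONSE_FORMAT_TYPES = {"text": "text", "json": "json_object"}


def wrap_response_format(response_format: str) -> Dict[str, str]:
    """Validate the short name and wrap it in the object form OpenAI expects."""
    if response_format not in _RESPONSE_FORMAT_TYPES:
        raise ValueError(
            f"Invalid response format '{response_format}'. "
            f"Must be one of {sorted(_RESPONSE_FORMAT_TYPES)}."
        )
    return {"type": _RESPONSE_FORMAT_TYPES[response_format]}


print(wrap_response_format("json"))
print(wrap_response_format("text"))
```

Running the same check in both the online and the offline batch path means an unsupported value fails early with a ValueError rather than surfacing later as an API error or a missing output file.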

plaguss · 2024-11-25T08:08:54Z

Hi @liamcripwell, thanks for the PR! This bug was found and is already fixed in develop (see the updated agenerate method). The next release will include this fix.

liamcripwell · 2024-11-25T09:46:42Z

Hi @plaguss, great to hear it's already been fixed. Sorry, I didn't notice this change in develop.

However, I still think the docstring for agenerate should be updated further, because it still says that response_format must be either "text" or "json". This is no longer true: the method now only accepts a dictionary and will fail the pydantic validation if a string is provided.

wrap specified str in obj

04c1065

liamcripwell wants to merge 1 commit into argilla-io:develop from liamcripwell:openai_response_format
