Enable valid OpenAI response_format specification #1069

Open
liamcripwell wants to merge 1 commit into argilla-io:develop from liamcripwell:openai_response_format

@liamcripwell

The current version of the OpenAI API expects response_format to be passed as an object containing a "type" attribute, e.g. {"type": "<type>"}. However, distilabel enforces a string representation for this parameter, which leads to either an error or a silent failure.

E.g. when using a TextGeneration task under the existing codebase:

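# Import paths assumed for the distilabel version this PR targets.
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_gen = TextGeneration(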
    llm=OpenAILLM(
        model="gpt-4o",
        generation_kwargs={
            "response_format": "json"
        },
    )
)
text_gen.load()

output = next(
    text_gen.process(
        [{"instruction": "Convert this info to a JSON: John Smith is 30 years old."}]
    )
)

The OpenAI API rejects the request with BadRequestError: Error code: 400 - {'error': {'message': "Invalid type for 'response_format': expected an object, but got a string instead.", 'type': 'invalid_request_error', 'param': 'response_format', 'code': 'invalid_type'}}.

The same happens when directly calling generation from the LLM:

llm = OpenAILLM(
    model="gpt-4o",
)

llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

Presumably the same happens for requests to the Batch API, which ultimately leads to AssertionError: No output file ID was found in the batch.

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=2,  # poll for results every 2 seconds
)

llm.load()
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

This PR simply wraps the string representation of the specified response_format inside the object expected by OpenAI.
I have also added to offline_batch_generate() the same value checking that is done in agenerate().
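
For illustration, the wrapping described above could look roughly like the sketch below. The helper name `normalize_response_format` and the `"json"` to `"json_object"` alias are assumptions made for this example and are not taken from the PR diff or from distilabel's code; `"text"` and `"json_object"` are type names the OpenAI API accepts.

```python
from typing import Dict, Union

# Hypothetical alias table: maps a shorthand string to the type name
# the OpenAI API expects. This mapping is an assumption for the sketch.
_ALIASES = {"json": "json_object"}


def normalize_response_format(value: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Return response_format as an object with a "type" key.

    Hypothetical helper for illustration only; not distilabel's implementation.
    """
    if isinstance(value, dict):
        # Already an object: only check that the required key is present.
        if "type" not in value:
            raise ValueError("response_format object must contain a 'type' key")
        return value
    if isinstance(value, str):
        # Wrap the string in the object shape, applying the alias if one exists.
        return {"type": _ALIASES.get(value, value)}
    raise TypeError(f"unsupported response_format: {type(value).__name__}")


if __name__ == "__main__":
    print(normalize_response_format("json"))
    print(normalize_response_format({"type": "text"}))
```

Accepting both strings and dictionaries keeps existing callers working while always sending the object form to the API.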

@plaguss
Contributor

plaguss commented Nov 25, 2024

Hi @liamcripwell, thanks for the PR! This bug was found and is already fixed in develop: updated agenerate method. The next release will include this fix.

@liamcripwell
Author

Hi @plaguss, great to hear it's already been fixed. Sorry, I didn't notice this change in develop.

However, I still think the docstring for agenerate should be updated further, because it still says that response_format must be either "text" or "json". This is no longer true: the method now only accepts a dictionary and will fail the pydantic validation if a string is provided.
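
A minimal sketch of the validation behaviour described above, assuming pydantic v2 is installed. `generate_stub` is a hypothetical stand-in with a dictionary-typed parameter; it is not distilabel's agenerate, and its signature is an assumption for the example.

```python
from typing import Dict, Optional

from pydantic import ValidationError, validate_call


@validate_call
def generate_stub(response_format: Optional[Dict[str, str]] = None) -> Optional[Dict[str, str]]:
    # Hypothetical stand-in: returns the validated argument unchanged.
    return response_format


if __name__ == "__main__":
    # A dictionary passes validation.
    print(generate_stub(response_format={"type": "json_object"}))

    # A plain string is rejected before the function body runs.
    try:
        generate_stub(response_format="json")
    except ValidationError as err:
        print(f"rejected with {len(err.errors())} validation error(s)")
```

A docstring that still advertises "text" or "json" would lead callers straight into this validation error.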
