OpenAI Batch Processing Library

Python library for creating and managing OpenAI Structured Outputs batch API calls.

Makes it easy to extract structured data from large datasets using OpenAI's batch API.

Key Features

  • 🎯 Structured Outputs Only: Built specifically for OpenAI's Structured Outputs API in batch mode
  • πŸ”§ Schema Fix: Automatically handles the additionalProperties: false requirement - a common gotcha when working with Structured Outputs
  • πŸš€ Simple batch creation and management
  • πŸ’° Built-in cost tracking and estimation
  • πŸ“Š Progress monitoring and status checking
  • πŸ”„ Automatic file handling (input, output, errors)
  • πŸ›‘οΈ Input validation and error handling
  • πŸ“ Pydantic model support for structured outputs

What is Structured Outputs?

Structured Outputs is an OpenAI API feature that guarantees model responses conform to a JSON Schema you supply, making it reliable to extract structured data from text. This library makes it easy to process large volumes of text in batch mode while automatically handling the schema requirements.
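For context on what the batch input file contains: each line of the JSONL file is one chat-completions request. The sketch below builds such a line by hand, following the shape of OpenAI's Batch API request format (field names per OpenAI's docs; the library writes these lines for you):

```python
import json

# One line of a Structured Outputs batch input file (JSONL).
request = {
    "custom_id": "1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "Extract the event information."},
            {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
        ],
        # Structured Outputs: constrain the response to a strict JSON Schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "CalendarEvent",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "date": {"type": "string"},
                        "participants": {"type": "array", "items": {"type": "string"}},
                    },
                    "required": ["name", "date", "participants"],
                    "additionalProperties": False,
                },
            },
        },
    },
}
line = json.dumps(request)
```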

Installation

From PyPI (recommended)

pip install openai-so-batch

From source

git clone https://github.com/ollieglass/openai-so-batch.git
cd openai-so-batch
pip install -e .

Quick Start

Basic Structured Outputs Usage

from pydantic import BaseModel
from openai_so_batch import Batch, Costs

# Define your response model (this will be converted to JSON Schema)
class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

# Create a batch
batch = Batch(
    input_file="batch-input.jl",
    output_file="batch-output.jl",
    error_file="batch-errors.jl",
    job_name="calendar-extract",
)

# Add tasks to the batch
examples = [
    "Alice and Bob are going to a science fair on Friday.",
    "Jane booked a meeting with Max and Omar next Tuesday at 2 pm.",
]

for i, sentence in enumerate(examples, 1):
    batch.add_task(
        id=i,
        model="gpt-4o-mini",
        system_prompt="Extract the event information.",
        user_prompt=sentence,
        response_model=CalendarEvent  # The library automatically handles schema conversion
    )

# Upload the batch
batch.upload()
print(f"Batch ID: {batch.batch_id}")

# Check status and download results
status = batch.get_status()
print(f"Status: {status}")

if status == "completed":
    batch.download()

Cost Tracking

from openai_so_batch import Costs

# Calculate costs for a model
costs = Costs(model="gpt-4o-mini")

# Estimate input costs
input_cost = costs.input_cost("batch-input.jl")
print(f"Input cost: ${input_cost:.4f}")

# Calculate actual output costs
output_cost = costs.output_cost("batch-output.jl")
print(f"Output cost: ${output_cost:.4f}")
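The underlying arithmetic is simple: tokens times the per-million-token rate. The rate below is illustrative only, not current pricing; check OpenAI's pricing page (and note that batch requests are billed at a discount to synchronous ones):

```python
def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars for a token count at a per-million-token rate."""
    return tokens * price_per_million / 1_000_000

# 250k tokens at an illustrative $0.15 per million tokens.
estimate = token_cost(250_000, 0.15)
print(f"${estimate:.4f}")  # → $0.0375
```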

Retrieving Existing Batches

# Retrieve an existing batch by ID
batch = Batch(
    input_file=None,
    output_file="batch-output.jl",
    error_file="batch-errors.jl",
    job_name="calendar-extract",
    batch_id="batch_6890b93c276c819091452db39758b32a"
)

status = batch.get_status()
print(f"Status: {status}")

if status == "completed":
    batch.download()

API Reference

Batch Class

The main class for managing Structured Outputs batch operations.

Constructor

Batch(
    input_file: str,
    output_file: str,
    error_file: str,
    job_name: str,
    batch_id: Optional[str] = None
)

Parameters:

  • input_file: Path to the input JSONL file
  • output_file: Path where output will be saved
  • error_file: Path where errors will be saved
  • job_name: Name identifier for the batch job
  • batch_id: Optional batch ID for retrieving existing batches

Methods

  • add_task(id, model, system_prompt, user_prompt, response_model): Add a Structured Outputs task to the batch. The response_model should be a Pydantic model that will be converted to JSON Schema with additionalProperties: false automatically applied.
  • upload(): Upload the batch to OpenAI
  • get_status(): Get the current status of the batch
  • download(): Download results and errors
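A common pattern is to poll get_status() until the batch finishes, then download. The sketch below shows that loop; FakeBatch is a stand-in for a real Batch instance so the example runs standalone, and the terminal status names are drawn from OpenAI's Batch API (verify against current docs):

```python
import time

class FakeBatch:
    """Stand-in for openai_so_batch.Batch so this sketch runs standalone."""
    def __init__(self):
        self._statuses = iter(["validating", "in_progress", "completed"])
    def get_status(self):
        return next(self._statuses)
    def download(self):
        return "downloaded"

def wait_for_batch(batch, poll_seconds: float = 0) -> str:
    # Poll until the batch reaches a terminal state, then fetch results.
    while True:
        status = batch.get_status()
        if status == "completed":
            return batch.download()
        if status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"batch ended with status {status!r}")
        time.sleep(poll_seconds)

result = wait_for_batch(FakeBatch())
```

In real use, pass a poll interval of minutes rather than seconds; batches can take up to 24 hours to complete.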

Costs Class

Utility class for cost tracking and estimation.

Constructor

Costs(model: str)

Parameters:

  • model: OpenAI model name (e.g., "gpt-4o-mini", "gpt-4o", "o3")

Methods

  • input_cost(filename): Calculate input token costs
  • output_cost(filename): Calculate output token costs
  • input_tokens(filename): Count input tokens
  • output_tokens(filename): Count output tokens

Supported Models

The library supports the following OpenAI models with cost tracking:

  • gpt-4o-mini
  • gpt-4o
  • o3
  • o3-mini
  • o4-mini

Environment Setup

Make sure you have your OpenAI API key set in your environment:

export OPENAI_API_KEY="your-api-key-here"

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any issues or have questions, please open an issue on GitHub.

Changelog

0.1.1

  • Fixed description on PyPI

0.1.0

  • Initial release
  • Structured Outputs batch processing functionality
  • Automatic handling of additionalProperties: false schema requirement
  • Cost tracking and estimation
  • Pydantic model support for structured outputs
