An automated benchmark for evaluating how effectively foundational LLMs can integrate Temporal into existing codebases.
The benchmark generates two types of scores for each model:
- Language Score: For each programming language, individual test case scores (0-2) are summed and normalized to a 0-100 scale
- Aggregate Score: The total score across all languages, also normalized to 0-100
Important: Scores are a function of the specific test set used. As the test set expands, the benchmark will be versioned accordingly. Scores across different benchmark versions are not comparable since the underlying tests have changed. However, within any given benchmark version, scores provide a reliable basis for comparing performance across different models.
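As a concrete illustration of the scoring math described above, here is a minimal sketch; the function names and data shapes are illustrative assumptions, not the benchmark's actual implementation:

```python
# Minimal sketch of the scoring described above. Function names and data
# shapes are illustrative assumptions, not the benchmark's actual code.
def language_score(case_scores: list[int]) -> float:
    """Sum per-test-case scores (each 0-2) and normalize to a 0-100 scale."""
    max_possible = 2 * len(case_scores)
    return 100 * sum(case_scores) / max_possible

def aggregate_score(scores_by_language: dict[str, list[int]]) -> float:
    """Normalize the total across all languages' test cases to 0-100."""
    all_scores = [s for scores in scores_by_language.values() for s in scores]
    return 100 * sum(all_scores) / (2 * len(all_scores))

# Example: two languages with three test cases each
print(language_score([2, 1, 2]))                                # 83.33...
print(aggregate_score({"python": [2, 1, 2], "go": [0, 2, 1]}))  # 66.66...
```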
This project uses a Python virtual environment to manage dependencies. Follow these steps to set up your development environment:
```bash
# Create a virtual environment named 'temporal-bench'
python -m venv temporal-bench

# Activate the virtual environment
# On macOS/Linux:
source temporal-bench/bin/activate
```

When you're done working on the project, deactivate the virtual environment:

```bash
deactivate
```

To capture your project's dependencies in a requirements file:

```bash
# After installing packages with pip, generate requirements.txt
pip freeze > requirements.txt
```

To install all required packages from the requirements file:

```bash
# Make sure your virtual environment is activated first
pip install -r requirements.txt
```

When adding new Python packages:

```bash
# Install the package
pip install package-name

# Update requirements.txt
pip freeze > requirements.txt
```

For detailed information about the benchmark strategy, architecture, and implementation, see the Product Requirements Document.
- Clone the repository
- Create and activate the Python virtual environment (see above)
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up your `.env` file with API keys for the LLM services (see the sketch below)
- Run the benchmark:
  ```bash
  python main.py
  ```

Results will be generated in the `results/` directory.
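The `.env` step above assumes the benchmark reads provider credentials from environment variables. The sketch below shows one common way such keys are loaded; the variable names and the use of `python-dotenv` are assumptions for illustration, so check the project's configuration for the keys it actually expects:

```python
# Illustrative sketch only: the key names and the python-dotenv dependency
# are assumptions, not necessarily what main.py uses.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read KEY=value pairs from .env into the process environment

# Hypothetical key names for two LLM providers
openai_key = os.environ.get("OPENAI_API_KEY")
anthropic_key = os.environ.get("ANTHROPIC_API_KEY")

if not (openai_key and anthropic_key):
    raise SystemExit("Missing API keys; check your .env file.")
```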