Home
jembie edited this page Oct 17, 2025 · 6 revisions
This work was conducted as part of the Kather Lab summer project program. Throughout the summer, I have designed and implemented a small pipeline focusing on experimental evaluation of language models in medical contexts by comparing textual and numerical scoring.
We recommend using uv for this project; nonetheless, we also provide a `requirements.txt` file.
```sh
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Update uv
uv self update
```

However, pip can also be used (we will not provide further setup details, but if you installed uv via pip, you can simply run e.g. `python3 -m uv sync` and achieve similar results):

```sh
pip install uv
```

- First, clone the repository
```sh
git clone git@github.com:jembie/judgely.git
cd judgely/
```

- Then install the required dependencies
```sh
# Recommended through uv
uv sync

# Alternatively via pip
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Before running `main.py`, make sure to create and set the required configurations:
- Create a `.env` file in the project root containing the following variables:

```sh
BASE_URL = "<url>"
API_KEY = "<key>"

# Example:
# BASE_URL = "http://192.168.1.1/v1/"
# API_KEY = "sk-MY_API_KEY"
```

- Create the `data/csv/` folder, where all the data sets will be located. Each `.csv` file should have the following columns:
```
qtype,Question,Answer

# Example
qtype,Question,Answer
1,"What is GitHub?","GitHub is [...]"
1,"Who owns GitHub?","It is owned by [...]"
```

- Within `main.py`, modify the necessary string values in the script as indicated by comments (e.g., model names or the number of questions to sample).
```python
def run_queries(
    iterations: int = 5,
    questions: int = 10,
    judge_model: str = "Llama-4-Maverick-17B-128E-Instruct-FP8",
    jury_model: str = "Qwen3-235B-A22B-Instruct-2507-FP8",
    **llm_params,
):
```

Once configured, you can execute the script with:
```sh
uv run main.py
```
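To illustrate how the `.env` variables above can be consumed, here is a minimal stdlib-only sketch. It is an assumption, not the actual loading code in `main.py` (which may well use a library such as python-dotenv); the helper name `load_env` is hypothetical.

```python
import tempfile

def load_env(path: str) -> dict[str, str]:
    """Minimal .env parser: reads KEY = "value" lines, skipping blanks and '#' comments."""
    env: dict[str, str] = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

# Demo with a throwaway file standing in for the project's .env
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write('BASE_URL = "http://192.168.1.1/v1/"\n# comment\nAPI_KEY = "sk-MY_API_KEY"\n')
    path = fh.name

cfg = load_env(path)
print(cfg)  # {'BASE_URL': 'http://192.168.1.1/v1/', 'API_KEY': 'sk-MY_API_KEY'}
```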
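Similarly, sampling questions from a `data/csv/` file (the `questions` parameter of `run_queries` above) can be sketched with the stdlib `csv` module. This is an illustrative assumption about the data flow, not the implementation in `main.py`; the helper `sample_questions` is hypothetical.

```python
import csv
import io
import random

# Inline stand-in for a file under data/csv/, matching the described column format
SAMPLE = '''qtype,Question,Answer
1,"What is GitHub?","GitHub is [...]"
1,"Who owns GitHub?","It is owned by [...]"
'''

def sample_questions(fh, n: int) -> list[dict]:
    """Read rows with qtype/Question/Answer columns and randomly sample up to n of them."""
    rows = list(csv.DictReader(fh))
    return random.sample(rows, min(n, len(rows)))

picked = sample_questions(io.StringIO(SAMPLE), n=10)
print(len(picked))  # 2 (only two rows available)
```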