jembie edited this page Oct 17, 2025 · 6 revisions

This work was conducted as part of the Kather Lab summer project program. Over the summer, I designed and implemented a small pipeline for the experimental evaluation of language models in medical contexts, comparing textual and numerical scoring.

Setup

We recommend using uv for this project, but we also provide a requirements.txt file.

Installing/Updating uv

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Update uv
uv self update

uv can also be installed via pip. We do not provide further setup details for this route, but if you installed uv this way, you can run e.g. python3 -m uv sync to achieve similar results:

pip install uv

Installation

  1. First, clone the repository
git clone git@github.com:jembie/judgely.git
cd judgely/
  2. Then install the required dependencies
# Recommended through uv
uv sync

# Alternatively via pip
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

Before running main.py, set up the required configuration:

  1. Create a .env file in the project root containing the following variables:
BASE_URL = "<url>"
API_KEY = "<key>"

# Example:
# BASE_URL = "http://192.168.1.1/v1/"
# API_KEY = "sk-MY_API_KEY"
  2. Create the data/csv/ folder, which holds all datasets. Each .csv file should have the following columns:
qtype,Question,Answer

# Example
qtype,Question,Answer
1,"What is GitHub?","GitHub is [...]"
1,"Who owns GitHub?","It is owned by [...]"
  3. Within main.py, modify the necessary string values in the script as indicated by the comments (e.g., model names or the number of questions to sample).
def run_queries(
    iterations: int = 5,
    questions: int = 10,
    judge_model: str = "Llama-4-Maverick-17B-128E-Instruct-FP8",
    jury_model: str = "Qwen3-235B-A22B-Instruct-2507-FP8",
    **llm_params,
):
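
To illustrate how the parameters in this signature behave, here is a minimal sketch. The function body is a hypothetical stand-in (the real one lives in main.py), and `temperature` is only an assumed example of an extra keyword; which keywords the pipeline actually accepts or forwards to the model is not shown here.

```python
def run_queries(
    iterations: int = 5,
    questions: int = 10,
    judge_model: str = "Llama-4-Maverick-17B-128E-Instruct-FP8",
    jury_model: str = "Qwen3-235B-A22B-Instruct-2507-FP8",
    **llm_params,
):
    """Hypothetical stand-in that only echoes its arguments back."""
    return {
        "iterations": iterations,
        "questions": questions,
        "judge_model": judge_model,
        "jury_model": jury_model,
        "llm_params": llm_params,
    }


# With no arguments, the defaults from the signature apply.
defaults = run_queries()

# Named parameters are overridden by keyword. Any other keyword
# (here `temperature`, an assumed example) is collected into llm_params.
custom = run_queries(iterations=2, questions=3, temperature=0.2)

print(defaults["judge_model"])
print(custom["llm_params"])
```

Editing the default values in main.py and passing keywords at the call site have the same effect in this sketch, so either approach can be used to change models or sample sizes.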

Once configured, you can execute the script with:

uv run main.py
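
If the first run fails, the configuration from the steps above can be sanity-checked with a small script. The following is a sketch using only the standard library, not part of the repository: `read_env` and `check_setup` are hypothetical helper names, and the `.env` parsing is deliberately simplistic (plain `KEY = "value"` lines only).

```python
import csv
from pathlib import Path

REQUIRED_KEYS = ["API_KEY", "BASE_URL"]
REQUIRED_COLUMNS = ["qtype", "Question", "Answer"]


def read_env(path):
    """Parse simple KEY = "value" lines; blank lines and # comments are skipped."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def check_setup(root="."):
    """Return a list of problems found; an empty list means every check passed."""
    root = Path(root)
    problems = []

    env_file = root / ".env"
    if not env_file.is_file():
        problems.append(".env file is missing")
    else:
        env = read_env(env_file)
        for key in REQUIRED_KEYS:
            if not env.get(key):
                problems.append(f"{key} is not set in .env")

    csv_dir = root / "data" / "csv"
    csv_files = sorted(csv_dir.glob("*.csv")) if csv_dir.is_dir() else []
    if not csv_files:
        problems.append("no .csv files found in data/csv/")
    for csv_file in csv_files:
        # Only the header row is inspected, not the data rows.
        with csv_file.open(newline="", encoding="utf-8") as handle:
            header = [name.strip() for name in next(csv.reader(handle), [])]
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            problems.append(f"{csv_file.name}: missing columns {missing}")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("Problem:", problem)
```

Run from the project root, the script prints one line per problem and nothing when the .env file and the CSV headers look as described.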
