Scripts and prompts for working with the Expert-Annotated Clinical Encounters dataset, a benchmark for evaluating AI-generated medical documentation.
The dataset contains 513 clinical encounter cases with expert-authored evaluation rubrics. Each case includes a conversational transcript, longitudinal patient chart history, point-in-time encounter context, and three evaluation rubrics (two clinician-authored, one LLM-drafted).
Dataset: PhysioNet (link to be updated upon publication)
```
dataset-usage/
├── scripts/
│   ├── load_data.py    # Load and query cases and rubrics from the CSV files
│   └── score_note.py   # Interactively score a clinical note against a rubric
└── prompts/
    ├── llm_rubric_generation.md   # Prompt template for generating LLM rubrics
    └── note_scoring.md            # Prompt template for automated note scoring
```
Download `cases.csv` and `rubrics.csv` from PhysioNet and place them in a local directory (e.g., `./data/`).
Look up a case and its associated rubrics:
```
python scripts/load_data.py --data-dir ./data --case-id 457
```

Output:

```
Case 457: patients_sleep_issues_and_dating_life_76746416-b
Provenance: real_world
Specialty: psychiatry
Encounter type: follow-up, medication-review
Acuity: moderate
Rubrics (3):
  Rubric 9424 (clinician): 5 criteria, total weight 100.0
    [30] Reward for capturing all key patient symptoms and concerns...
    ...
  Rubric 9451 (clinician): 5 criteria, total weight 100.0
    ...
  Rubric 3744 (llm): 5 criteria, total weight 100.0
    ...
```
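If you prefer to query the CSV directly rather than going through `load_data.py`, a case lookup needs only the standard library. This is a minimal sketch, assuming the `case_id` column documented in the cases.csv schema below; the `load_case` helper name is illustrative, not part of the repo:

```python
import csv

def load_case(path, case_id):
    """Return the cases.csv row whose case_id matches, or None if absent."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["case_id"] == str(case_id):
                return row
    return None
```

Each row comes back as a dict keyed by the column names, so fields like `specialty` or `transcript` can be read directly.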
Generate a clinical note using your system, save it as a text file, then score it interactively against a rubric:
```
python scripts/score_note.py --data-dir ./data --rubric-id 9424 --note my_generated_note.txt
```

The tool displays each criterion, prompts you to mark it as met or not met, then computes the weighted score.
The `prompts/llm_rubric_generation.md` file contains the exact prompt template used to generate the LLM-drafted rubrics in the dataset. You can use it with any LLM to create rubrics for new cases or to experiment with different rubric generation strategies.
The `prompts/note_scoring.md` file contains the prompt template used for automated scoring of notes against rubrics. This is useful for building your own scoring pipeline at scale.
| Field | Description |
|---|---|
| `case_id` | Unique case identifier |
| `name` | Descriptive slug |
| `provenance` | `real_world` or `synthetic` |
| `specialty` | Clinical specialty tag |
| `encounter_type` | Encounter type tag |
| `encounter_length` | `short`, `medium`, or `long` |
| `problem_count` | `single-problem` or `multi-problem` |
| `acuity` | `low`, `moderate`, or `high` |
| `demographics` | Patient demographics |
| `transcript` | JSON array of conversational turns |
| `current_medications` | Current medications (RxNorm coded) |
| `current_conditions` | Current conditions (ICD-10 coded) |
| `condition_history` | Condition history (ICD-10 coded) |
| `current_allergies` | Current allergies |
| `family_history` | Family history (ICD-10 coded) |
| `surgery_history` | Surgery history (SNOMED coded) |
| `current_goals` | Current patient goals |
| `staged_commands` | Pre-existing note context |
| `clinician_rubric_ids` | Comma-separated rubric IDs |
| `llm_rubric_id` | LLM rubric ID |
| Field | Description |
|---|---|
| `rubric_id` | Groups criteria belonging to the same rubric |
| `case_id` | Foreign key to `cases.csv` |
| `author_type` | `clinician` or `llm` |
| `criterion_index` | Zero-based index within the rubric |
| `criterion` | Natural-language documentation requirement |
| `weight` | Clinical importance weight |
To reconstruct a full rubric, group rows by `rubric_id` and order by `criterion_index`. Each case has exactly 3 rubrics (2 clinician-authored, 1 LLM-drafted).
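The grouping step can be done with the standard library alone. This sketch assumes the rubrics.csv columns documented above (`rubric_id`, `criterion_index`); the `load_rubrics` name is illustrative:

```python
import csv
from collections import defaultdict

def load_rubrics(path):
    """Map rubric_id -> list of criterion rows, ordered by criterion_index."""
    rubrics = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rubrics[row["rubric_id"]].append(row)
    # CSV row order is not guaranteed to follow criterion_index, so sort explicitly.
    for criteria in rubrics.values():
        criteria.sort(key=lambda r: int(r["criterion_index"]))
    return dict(rubrics)
```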
For a given clinical note and rubric:
- Reconstruct the rubric by grouping rows by `rubric_id` and ordering by `criterion_index`.
- Evaluate each criterion as met or not met.
- Compute the score as the sum of weights for met criteria divided by the total rubric weight, yielding a 0-100% normalized score.
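The final step is a weighted ratio. A minimal sketch, assuming each criterion row carries the `criterion_index` and `weight` columns described above (the `weighted_score` helper is illustrative, not part of the repo's scripts):

```python
def weighted_score(criteria, met_indices):
    """Return the 0-100% score: met weight over total weight.

    criteria:    list of rubric rows (dicts with criterion_index and weight)
    met_indices: set of criterion_index values judged as met
    """
    total = sum(float(c["weight"]) for c in criteria)
    met = sum(float(c["weight"]) for c in criteria
              if int(c["criterion_index"]) in met_indices)
    return 100.0 * met / total if total else 0.0
```

With the example rubric above (total weight 100.0), meeting only a 30-weight criterion yields a score of 30%.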
Python 3.10+ (standard library only, no external dependencies).
If you use this dataset in your research, please cite:
[Citation to be added upon publication]
Scripts in this repository are released under the MIT License. The dataset itself is subject to the PhysioNet data use agreement.