Scripts and prompts for working with the Expert-Annotated Clinical Encounters dataset, a benchmark for evaluating AI-generated medical documentation.
The dataset contains 513 clinical encounter cases with expert-authored evaluation rubrics. Each case includes a conversational transcript, longitudinal patient chart history, point-in-time encounter context, and three evaluation rubrics (two clinician-authored, one LLM-drafted).
Dataset: PhysioNet (link to be updated upon publication)
```
dataset-usage/
├── scripts/
│   ├── load_data.py    # Load and query cases and rubrics from the CSV files
│   └── score_note.py   # Interactively score a clinical note against a rubric
└── prompts/
    ├── llm_rubric_generation.md   # Prompt template for generating LLM rubrics
    └── note_scoring.md            # Prompt template for automated note scoring
```
Download `cases.csv` and `rubrics.csv` from PhysioNet and place them in a local directory (e.g., `./data/`).
Look up a case and its associated rubrics:
```
python scripts/load_data.py --data-dir ./data --case-id 457
```

Output:

```
Case 457: patients_sleep_issues_and_dating_life_76746416-b
Provenance: real_world
Specialty: psychiatry
Encounter type: follow-up, medication-review
Acuity: moderate
Rubrics (3):
  Rubric 9424 (clinician): 5 criteria, total weight 100.0
    [30] Reward for capturing all key patient symptoms and concerns...
    ...
  Rubric 9451 (clinician): 5 criteria, total weight 100.0
    ...
  Rubric 3744 (llm): 5 criteria, total weight 100.0
    ...
```
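If you prefer to query the CSV directly rather than going through `load_data.py`, a case lookup needs only the standard library. This is a minimal sketch, assuming the `case_id` column documented in the cases.csv schema below; the `load_case` helper name is illustrative, not part of the repo:

```python
import csv

def load_case(path, case_id):
    """Return the cases.csv row whose case_id matches, or None if absent."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["case_id"] == str(case_id):
                return row
    return None
```

Each row comes back as a dict keyed by the column names, so fields like `specialty` or `transcript` can be read directly.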
Generate a clinical note using your system, save it as a text file, then score it interactively against a rubric:
```
python scripts/score_note.py --data-dir ./data --rubric-id 9424 --note my_generated_note.txt
```

The tool displays each criterion, prompts you to mark it as met or not met, then computes the weighted score.
The `prompts/llm_rubric_generation.md` file contains the exact prompt template used to generate the LLM-drafted rubrics in the dataset. You can use it with any LLM to create rubrics for new cases or to experiment with different rubric generation strategies.
The `prompts/note_scoring.md` file contains the prompt template used for automated scoring of notes against rubrics. This is useful for building your own scoring pipeline at scale.
| Field | Description |
|---|---|
| `case_id` | Unique case identifier |
| `name` | Descriptive slug |
| `provenance` | `real_world` or `synthetic` |
| `specialty` | Clinical specialty tag |
| `encounter_type` | Encounter type tag |
| `encounter_length` | `short`, `medium`, or `long` |
| `problem_count` | `single-problem` or `multi-problem` |
| `acuity` | `low`, `moderate`, or `high` |
| `demographics` | Patient demographics |
| `transcript` | JSON array of conversational turns |
| `current_medications` | Current medications (RxNorm coded) |
| `current_conditions` | Current conditions (ICD-10 coded) |
| `condition_history` | Condition history (ICD-10 coded) |
| `current_allergies` | Current allergies |
| `family_history` | Family history (ICD-10 coded) |
| `surgery_history` | Surgery history (SNOMED coded) |
| `current_goals` | Current patient goals |
| `staged_commands` | Pre-existing note context |
| `clinician_rubric_ids` | Comma-separated rubric IDs |
| `llm_rubric_id` | LLM rubric ID |
| Field | Description |
|---|---|
| `rubric_id` | Groups criteria belonging to the same rubric |
| `case_id` | Foreign key to `cases.csv` |
| `author_type` | `clinician` or `llm` |
| `criterion_index` | Zero-based index within the rubric |
| `criterion` | Natural-language documentation requirement |
| `weight` | Clinical importance weight |
To reconstruct a full rubric, group rows by `rubric_id` and order by `criterion_index`. Each case has exactly 3 rubrics (2 clinician-authored, 1 LLM-drafted).
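The grouping step can be done with the standard library alone. This sketch assumes the rubrics.csv columns documented above (`rubric_id`, `criterion_index`); the `load_rubrics` name is illustrative:

```python
import csv
from collections import defaultdict

def load_rubrics(path):
    """Map rubric_id -> list of criterion rows, ordered by criterion_index."""
    rubrics = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rubrics[row["rubric_id"]].append(row)
    # CSV row order is not guaranteed to follow criterion_index, so sort explicitly.
    for criteria in rubrics.values():
        criteria.sort(key=lambda r: int(r["criterion_index"]))
    return dict(rubrics)
```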
For a given clinical note and rubric:
- Reconstruct the rubric by grouping rows by `rubric_id` and ordering by `criterion_index`.
- Evaluate each criterion as met or not met.
- Compute the score as the sum of weights for met criteria divided by the total rubric weight, yielding a 0-100% normalized score.
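The final step is a weighted ratio. A minimal sketch, assuming each criterion row carries the `criterion_index` and `weight` columns described above (the `weighted_score` helper is illustrative, not part of the repo's scripts):

```python
def weighted_score(criteria, met_indices):
    """Return the 0-100% score: met weight over total weight.

    criteria:    list of rubric rows (dicts with criterion_index and weight)
    met_indices: set of criterion_index values judged as met
    """
    total = sum(float(c["weight"]) for c in criteria)
    met = sum(float(c["weight"]) for c in criteria
              if int(c["criterion_index"]) in met_indices)
    return 100.0 * met / total if total else 0.0
```

With the example rubric above (total weight 100.0), meeting only a 30-weight criterion yields a score of 30%.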
Python 3.10+ (standard library only, no external dependencies).
If you use this dataset in your research, please cite:
[Citation to be added upon publication]
Scripts in this repository are released under the MIT License. The dataset itself is subject to the PhysioNet data use agreement.