NER training pipeline for detecting minor children, gender indications, and biomedical entities in text. Three model backends are provided: GLiNER (transformer-based, retains zero-shot generalization), SpanMarker (BERT span classifier, higher accuracy on fixed labels), and spaCy (lighter, faster, fixed-label only).
| Label | Sources |
|---|---|
MedicalCondition |
MACCROBAT, Corona2 |
ClinicalProcedure |
MACCROBAT |
ClinicalEvent |
MACCROBAT |
MinorChild |
MACCROBAT (Age < 18 + pronoun injection), mydata.csv |
GenderIndication |
mydata.csv |
BratStandoff-format clinical case reports. Download and extract the zip — the expected layout is:
9764942/
MACCROBAT2018/ *.txt + *.ann
MACCROBAT2020/ *.txt + *.ann ← identical to 2018, deduplicated automatically
Medical NER dataset in JLiNER export format.
Source: https://www.kaggle.com/datasets/finalepoch/medical-ner/data
Place Corona2.json next to the training scripts.
Internal annotation file with columns:
ori_review— source textminor_col— semicolon-separated spans marking minor children (including pronouns)gender_col— semicolon-separated spans marking gender indicationsmedical_col— rows with this column populated are excluded from CSV training to avoid false negative gradients on medical spans
# GLiNER backend:
pip install -r requirements-gliner.txt
# SpanMarker backend:
pip install -r requirements-spanmarker.txt
# spaCy backend:
pip install -r requirements-spacy.txtFine-tunes a GLiNER model on all three data sources in a single pass. The model retains its zero-shot generalization capabilities for entity types outside the training label set.
python train_gliner.py [options]| Argument | Default | Description |
|---|---|---|
--model |
EmergentMethods/gliner_medium_news-v2.1 |
HuggingFace model ID |
--epochs |
10 |
Number of training epochs |
--batch-size |
8 |
Batch size |
--lr |
3e-5 |
Learning rate |
--output-dir |
gliner_finetuned |
Output directory |
--seed |
42 |
Random seed |
--val-split |
0.1 |
Fraction held out for validation |
--data-dir |
9764942/ |
Directory containing MACCROBAT subdirs |
--corona |
Corona2.json |
Path to Corona2.json |
--csv |
mydata.csv |
Path to mydata.csv |
--minor-oversample |
2 |
Extra copies of MinorChild CSV examples to counter class imbalance |
Outputs:
gliner_finetuned/best/— checkpoint with highest validation F1gliner_finetuned/final/— last epoch checkpointgliner_finetuned/checkpoint-epoch-N/— per-epoch checkpointsgliner_finetuned/train.json/eval.json— serialized split data
Fine-tunes a SpanMarker span classifier on all three data sources. Uses direct BIO supervision rather than contrastive learning, which typically yields higher F1 on fixed label sets. Automatically saves the best checkpoint by eval_overall_f1.
python train_spanmarker.py [options]| Argument | Default | Description |
|---|---|---|
--model |
microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext |
HuggingFace encoder model ID |
--epochs |
5 |
Number of training epochs |
--batch-size |
8 |
Batch size |
--lr |
5e-5 |
Learning rate |
--output-dir |
spanmarker_finetuned |
Output directory |
--seed |
42 |
Random seed |
--val-split |
0.1 |
Fraction held out for validation |
--data-dir |
9764942/ |
Directory containing MACCROBAT subdirs |
--corona |
Corona2.json |
Path to Corona2.json |
--csv |
mydata.csv |
Path to mydata.csv |
--minor-oversample |
2 |
Extra copies of MinorChild CSV examples to counter class imbalance |
--entity-max-length |
8 |
Max tokens per entity span |
--model-max-length |
256 |
Max input sequence length |
Outputs:
spanmarker_finetuned/best/— checkpoint with highesteval_overall_f1spanmarker_finetuned/final/— last epoch checkpoint
Trains a spaCy NER model on all three data sources in a single pass. The output model recognizes only the five labels above — base model generalist NER (PERSON, ORG, etc.) is not retained.
python train_spacy.py [options]| Argument | Default | Description |
|---|---|---|
--model |
en_core_sci_lg |
spaCy base model (requires scispacy) |
--epochs |
10 |
Number of training epochs |
--batch-size |
8 |
Batch size |
--dropout |
0.2 |
Dropout rate |
--output-dir |
spacy_model |
Output directory |
--seed |
42 |
Random seed |
--val-split |
0.1 |
Fraction held out for validation |
--data-dir |
9764942/ |
Directory containing MACCROBAT subdirs |
--corona |
Corona2.json |
Path to Corona2.json |
--csv |
mydata.csv |
Path to mydata.csv |
--minor-oversample |
2 |
Extra copies of MinorChild CSV examples to counter class imbalance |
--use-gpu |
off | Enable GPU (CUDA or Apple MPS) |
Outputs:
spacy_model/best/— checkpoint with highest validation F1spacy_model/final/— last epoch checkpointspacy_model/checkpoint-epoch-N/— per-epoch checkpoints
Both scripts train on MACCROBAT + Corona2 + mydata.csv in one combined pass. Splitting into two rounds risked the model learning that pronouns are not MinorChild during round 1 (clinical data only), then having to unlearn that in round 2 — a harder optimization problem.
Clinical case reports are single-patient documents. When a document contains an Age < 18 annotation, all third-person singular pronouns (he, she, his, her, him) and noun phrases (the child, the patient, the boy, etc.) are automatically labeled MinorChild. Plural pronouns (they, them, their) are excluded as they typically refer to the medical team or family members.
MACCROBAT annotates age and sex as two separate spans — e.g., "16-year-old" (label Age) and "boy" (label Sex). mydata.csv, by contrast, labels the full noun phrase as a single MinorChild span (e.g., "4 year old son"). To align span boundaries across datasets, parse_ann absorbs an immediately adjacent Sex span into the MinorChild span when the two are separated by pure whitespace (≤ 2 characters). A comma, newline, or any non-whitespace character between them prevents the merge.
MACCROBAT2018 and MACCROBAT2020 contain identical files. Only MACCROBAT2018 is loaded to avoid training on duplicate documents.
mydata.csv rows with a non-empty medical_col are excluded from training. Including them while only annotating minor_col/gender_col would produce false negative gradients on medical spans, contradicting MACCROBAT/Corona2 supervision.
MedicalCondition and ClinicalProcedure appear many times per clinical document, while MinorChild is rare. --minor-oversample (default 2) duplicates MinorChild-containing CSV examples before training to partially compensate.
| GLiNER | SpanMarker | spaCy | |
|---|---|---|---|
| Architecture | Transformer span model | BERT span classifier | CNN / Transformer token classifier |
| Generalist NER after fine-tuning | Retained | Lost | Lost |
| Inference speed | Medium | Medium | Fast |
| F1 on fixed labels | ~0.60 | 0.801 | 0.738 |
| Implicit/subtle entity detection | Better | Better | Weaker |
spaCy per-label (en_core_sci_lg, F1=0.738 P=0.746 R=0.729):
| Label | F1 |
|---|---|
| MinorChild | 0.933 |
| GenderIndication | 0.923 |
| ClinicalEvent | 0.708 |
| ClinicalProcedure | 0.673 |
| MedicalCondition | 0.635 |