Curated NER training data for LatinCy Latin language models.
This dataset is currently in preparation and will be available shortly.
- Entity types: PERSON, LOC, NORP
- Sources: Universal Dependencies, biblical texts, Latin primers, Tesserae, and other annotated corpora
- Format: spaCy-compatible JSON singles with character-offset span annotations
- Splits: train, dev, and a held-out, document-level test set
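
Because the dataset is not yet released, the exact record schema is unknown. As a rough illustration of what character-offset span annotations look like, the sketch below assumes a hypothetical record with a `text` field and a `spans` list of `start`/`end`/`label` entries; these field names and the example sentence are assumptions for illustration, not the published format. It checks that each span falls inside the text and uses one of the three entity types listed above.

```python
import json

# Entity types listed in this dataset card.
ALLOWED_LABELS = {"PERSON", "LOC", "NORP"}

# Hypothetical record: the field names ("text", "spans", "start", "end",
# "label") are assumed for illustration and may differ in the release.
record_line = json.dumps({
    "text": "Caesar in Galliam contendit.",
    "spans": [
        {"start": 0, "end": 6, "label": "PERSON"},
        {"start": 10, "end": 17, "label": "LOC"},
    ],
})


def extract_entities(line):
    """Parse one JSON record and return (surface_text, label) pairs.

    Raises ValueError if a span uses an unknown label or if its
    character offsets fall outside the text.
    """
    record = json.loads(line)
    text = record["text"]
    entities = []
    for span in record["spans"]:
        start, end, label = span["start"], span["end"], span["label"]
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        # Offsets are treated as half-open [start, end), like Python slices.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {start}-{end}")
        entities.append((text[start:end], label))
    return entities


entities = extract_entities(record_line)
print(entities)
```

In a spaCy pipeline, offsets of this kind are typically turned into token-aligned spans with `Doc.char_span(start, end, label=...)`, which returns `None` when the offsets do not line up with token boundaries.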