OCR Corpora Early 2025 Publication Thread #3

@ctschroeder

OCR publication thread for early 2025

Do not close this issue until all checkboxes below are complete or have been rescheduled:

List of corpora:

In Processed OCR folder (need chapter divisions, sentence splitting, and full automatic NLP processing, like the Bible corpora)

  • Giron Legendes (11 docs)
    • chapter divisions added/checked
    • metadata updated

All documents above need to be moved to https://github.com/CopticScriptorium/auto-corpora

In GitDox

  • apocalypse.paul (2)
    - [ ] corpus name needed
    - [ ] other metadata updated
    - possible error in data: the translation on p. 1043 begins with folio 24a, but the OCR Coptic begins in the middle of folio 6a (p. 533)

  • pscyril.alexandria

    • On Mary: still in XML mode (auto-tagging?)
    • entities and identities
  • pscyril.jerusalem

    • on the cross
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
    • on Mary
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
  • psepiphanius on Mary

    • translation span
    • entities and identities
    • metadata
  • pschrysostom

    • translation span
    • entities and identities
    • metadata
  • pscelestinus

  • pstimothy.alex

  • psote.psoi

  • timothy.discourse
