
@RaphaelaHeil

The current ALTO-based text indexing assumes that transcriptions are recorded at the word level, e.g.:

<TextLine ID="TextLine1" HEIGHT="37" WIDTH="218" HPOS="1557" VPOS="297">
  <String ID="String1" CONTENT="The" HEIGHT="4" WIDTH="47" HPOS="1557" VPOS="318"/>
  <SP ID="SP1" WIDTH="25" HPOS="1604" VPOS="297"/>
  <String ID="String2" CONTENT="first" HEIGHT="37" WIDTH="23" HPOS="1654" VPOS="297"/>
  <SP ID="SP2" WIDTH="49" HPOS="1677" VPOS="297"/>
  <String ID="String3" CONTENT="block." HEIGHT="3" WIDTH="49" HPOS="1726" VPOS="319"/>
</TextLine>

Depending on the OCR/HTR tool, this is not always the case, and transcriptions may instead be recorded at the line level, e.g.:

<TextLine ID="TextLine1" HEIGHT="37" WIDTH="218" HPOS="1557" VPOS="297">
  <String ID="String1" CONTENT="The first block" HEIGHT="37" WIDTH="218" HPOS="1557" VPOS="297"/>
</TextLine>

In the latter case, the search only populates the "hits" portion of the response, while "resources" (annotations) are left empty. As a result, search result highlighting does not work in Mirador/UV.

This PR introduces a quick workaround to enable search result highlighting for line-level annotations. Each text line is tokenised to create new TextWords, one per token, each of which retains the original line's coordinates and dimensions. This allows the rest of the search to function properly, with a hit highlighting the whole line rather than the individual word.
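
To illustrate the idea, here is a minimal Python sketch of the tokenisation step. It is not the code from this PR: the function name `tokenise_text_line`, the dictionary keys, and the whitespace-based splitting are assumptions made for illustration, and the sketch assumes un-namespaced ALTO elements as in the samples above.

```python
import xml.etree.ElementTree as ET


def tokenise_text_line(text_line_xml):
    """Split every String in a TextLine into one entry per token.

    Hypothetical helper: each entry reuses the coordinates and
    dimensions of the enclosing TextLine, not a per-word box.
    """
    line = ET.fromstring(text_line_xml)
    # Bounding box of the whole line, shared by all generated words.
    box = {
        "hpos": float(line.get("HPOS")),
        "vpos": float(line.get("VPOS")),
        "width": float(line.get("WIDTH")),
        "height": float(line.get("HEIGHT")),
    }
    words = []
    for string in line.iter("String"):
        # Assumption: plain whitespace splitting is good enough here.
        for token in string.get("CONTENT", "").split():
            words.append({"content": token, **box})
    return words


sample = """
<TextLine ID="TextLine1" HEIGHT="37" WIDTH="218" HPOS="1557" VPOS="297">
  <String ID="String1" CONTENT="The first block" HEIGHT="37" WIDTH="218" HPOS="1557" VPOS="297"/>
</TextLine>
"""
words = tokenise_text_line(sample)
```

For the line-level sample above, this yields three entries ("The", "first", "block"), all sharing the line's bounding box.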

Additionally, this PR sets the textGranularity for annotation pages to line if more than one word is present in a TextWord's content. This is a cosmetic change rather than a functional requirement.
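
The granularity check could look roughly like the following Python sketch. The function name `text_granularity` is hypothetical, and falling back to `word` when no multi-word content is found is an assumption, not something stated in this PR.

```python
def text_granularity(contents):
    """Return "line" if any content string holds more than one word.

    Hypothetical helper; the "word" fallback is an assumption.
    """
    has_multi_word = any(len(content.split()) > 1 for content in contents)
    return "line" if has_multi_word else "word"


word_level = text_granularity(["The", "first", "block."])
line_level = text_granularity(["The first block"])
```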
