
Post-process PageXMLs to improve their region reading order, including updating the internal line order


CARomein/PageXML_RegionandLineRecalculation

 
 


PageXML Reading Order Recalculation

This is a simple rule-based script for recalculating the region and line reading order of a PageXML file. It is meant to post-process the results of layout recognition by Transkribus or Loghi. More specifically, it is modelled to correctly order one- or two-page scans that contain marginalia.

This script was originally developed for Het Utrechts Archief within the context of an internship.

Fork and Extension

This repository is a fork of the original ReadingOrderRecalculation tool developed by cconzen. Whereas the original script recalculates the reading order of text regions within PageXML files, this extended version additionally recalculates the reading order of text lines within each individual region. This enhancement ensures that not only the sequence of regions follows the correct reading order, but that the line-level order within those regions is also properly established.

The line-level recalculation applies similar geometric logic to that used for regions, ensuring that marginalia and other layout features are correctly ordered at the most granular level of the document structure.

This extension was developed by C.A. Romein for use within the project on the Resoluties van de Staten van Overijssel.

What It Does

  • Extract Features: Parses XML files to extract image information (height, width) as well as text regions, text lines, and their coordinates.
  • Calculate Reading Order: Uses extracted features to calculate both the region reading order and the line reading order within each region.
  • Update Files: Saves the new reading order into the PageXMLs at both region and line level.
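As an illustration of the feature-extraction step, here is a minimal sketch using only the standard library. It assumes the usual PageXML layout (a `Page` element with `imageWidth` and `imageHeight` attributes, and `TextRegion` and `TextLine` elements that each carry a `Coords` child with a `points` attribute). The function names `extract_features` and `bbox_from_points` and the sample document are hypothetical and are not taken from the script itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.jpg" imageWidth="2000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,700 100,700"/>
      <TextLine id="r1l1"><Coords points="110,120 880,120 880,160 110,160"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def bbox_from_points(points):
    """Turn an 'x1,y1 x2,y2 ...' string into (left, top, right, bottom)."""
    xy = [tuple(int(float(v)) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)


def coords_bbox(element):
    """Bounding box of the Coords child of a region or line element."""
    coords = next(c for c in element if local(c.tag) == "Coords")
    return bbox_from_points(coords.get("points"))


def extract_features(xml_text):
    """Collect image size, regions, and lines from one PageXML string."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local(el.tag) == "Page")
    features = {
        "width": int(page.get("imageWidth")),
        "height": int(page.get("imageHeight")),
        "regions": [],
    }
    for region in page.iter():
        if local(region.tag) != "TextRegion":
            continue
        lines = [
            {"id": line.get("id"), "bbox": coords_bbox(line)}
            for line in region
            if local(line.tag) == "TextLine"
        ]
        features["regions"].append(
            {"id": region.get("id"), "bbox": coords_bbox(region), "lines": lines}
        )
    return features


features = extract_features(SAMPLE)
```

Matching on the local tag name rather than a fixed namespace is a design choice of this sketch, so that files with different PageXML schema versions can be read by the same code.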

Requirements

You only need numpy and pandas in addition to some standard Python libraries. You can install the required dependencies using pip:

pip install numpy pandas

Usage

Batch Reading Order Recalculation of PageXML files

The script processes all XML files located in a directory. To execute it, install all dependencies first and then run the following:

python reorder_update_wlines.py example_folder/page --overwrite

As arguments, specify the base directory containing the PageXML files (here example_folder/page), and add --overwrite if you wish to overwrite the existing files.
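For readers who want to see how such a command-line interface can be wired up, here is a hedged sketch with `argparse` and `pathlib`. Only the positional base directory and the `--overwrite` flag come from the usage above. The function names `parse_args` and `find_pagexml_files` are hypothetical, and the assumption that files are matched by a `*.xml` pattern directly inside the directory is mine, not confirmed by the script.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a base directory and an optional --overwrite flag."""
    parser = argparse.ArgumentParser(description="Recalculate PageXML reading order.")
    parser.add_argument("base_dir", type=Path,
                        help="directory containing the PageXML files")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite the existing files")
    return parser.parse_args(argv)


def find_pagexml_files(base_dir):
    """Return all .xml files directly inside base_dir, sorted by name (assumed pattern)."""
    return sorted(Path(base_dir).glob("*.xml"))


# Equivalent to: python reorder_update_wlines.py example_folder/page --overwrite
args = parse_args(["example_folder/page", "--overwrite"])
```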

How It Works

The script uses simple logic based on the geometric properties of the regions, lines, and page.

Region-Level Reading Order

Given this sample layout of a scan:

  1. Determine orientation (landscape = two pages, portrait = one page) based on the image's height and width. Depending on the orientation, the bookfold location is estimated:
    • at the horizontal centre of the scan for landscape orientation
    • at the left edge (x = 0) for portrait orientation

  2. Each region is assigned 0 (left page) or 1 (right page), depending on where its own horizontal centre is located.
  3. The regions are ordered:
    • Left page to right page
      • Top to bottom
        • Left to right

  4. The script then uses this initial order to iterate through all regions, comparing each box with the one immediately following it in the ranking. It checks whether the following box might be a marginalium by testing whether both boxes are on the same page and whether the candidate is vertically contained within the current box.

  5. It is then confirmed that the candidate is located to the left or right of the current box. In the example it is considered to be to the left. The check compares the left edges for the left condition (and vice versa), so overlapping boxes are handled correctly.

  6. If all these conditions apply, the ranks/indices of the two boxes are swapped.

  7. If a swap occurs, the loop breaks and restarts with the new order. This is repeated until a full pass completes without a swap, at which point the final reading order has been reached.
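The steps above can be sketched as follows. This is an illustrative reconstruction from the description, not the script's own code. Regions are assumed to be dictionaries with an `id` and a `bbox` of `(left, top, right, bottom)`, and all function names are hypothetical. The sketch covers only the left-hand marginalium case used in the example; how the script treats a candidate on the right is not reproduced here.

```python
def bookfold_x(width, height):
    """Landscape scans hold two pages (fold in the middle); portrait scans hold one (fold at x = 0)."""
    return width / 2 if width > height else 0


def page_of(region, fold_x):
    """0 for the left page, 1 for the right page, judged by the region's horizontal centre."""
    left, _, right, _ = region["bbox"]
    return 0 if (left + right) / 2 < fold_x else 1


def is_left_marginalium(current, candidate, fold_x):
    """True if the candidate shares the page, is vertically contained in, and starts left of the current box."""
    cur_left, cur_top, _, cur_bottom = current["bbox"]
    cand_left, cand_top, _, cand_bottom = candidate["bbox"]
    same_page = page_of(current, fold_x) == page_of(candidate, fold_x)
    vertically_contained = cur_top <= cand_top and cand_bottom <= cur_bottom
    left_of = cand_left < cur_left  # compares left edges, so overlapping boxes still qualify
    return same_page and vertically_contained and left_of


def order_regions(regions, width, height):
    """Initial sort (page, top, left), then swap adjacent pairs until a full pass makes no swap."""
    fold_x = bookfold_x(width, height)
    order = sorted(
        regions,
        key=lambda r: (page_of(r, fold_x), r["bbox"][1], r["bbox"][0]),
    )
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            if is_left_marginalium(order[i], order[i + 1], fold_x):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
                break  # restart the pass with the new order
    return order


# Hypothetical two-page scan: main text "a" with a left margin note "m", and "b" on the right page.
regions = [
    {"id": "a", "bbox": (300, 100, 950, 1400)},
    {"id": "b", "bbox": (1050, 100, 1900, 1400)},
    {"id": "m", "bbox": (50, 500, 280, 700)},
]
result = [r["id"] for r in order_regions(regions, width=2000, height=1500)]
print(result)  # ['m', 'a', 'b']
```

Because each swap in this sketch moves a box with a strictly smaller left edge in front of its neighbour, the loop cannot swap the same pair back and therefore terminates.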

Line-Level Reading Order

After establishing the correct region order, the script applies a similar process to the text lines within each region. For each region, the lines are sorted using the same geometric principles (top to bottom, left to right), with special handling for marginalia that may appear alongside the main text body. This ensures that the reading order is coherent not only between regions, but also within each individual region.
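A minimal sketch of the basic line sort within one region is shown below. It assumes each line is a dictionary with an `id` and a `bbox` of `(left, top, right, bottom)`; the function name `order_lines` is hypothetical. The special handling for marginalia is not detailed in this description and is therefore left out of the sketch.

```python
def order_lines(lines):
    """Sort lines top to bottom, then left to right, and return (new_index, line_id) pairs."""
    ordered = sorted(lines, key=lambda line: (line["bbox"][1], line["bbox"][0]))
    return [(index, line["id"]) for index, line in enumerate(ordered)]


# Hypothetical lines of one region; "l3" starts at the same height as "l2" but further left.
lines = [
    {"id": "l2", "bbox": (110, 200, 880, 240)},
    {"id": "l1", "bbox": (110, 120, 880, 160)},
    {"id": "l3", "bbox": (20, 200, 100, 240)},
]
line_order = order_lines(lines)
print(line_order)  # [(0, 'l1'), (1, 'l3'), (2, 'l2')]
```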

Visualisation

You can visualise the calculated reading order path by running the visualisation script with your base directory as its argument:

python visualise.py example_folder

Here are some side-by-side comparisons of input image and visualised result:

These can be found in the example_folder. The scans were processed using Loghi.

Acknowledgements

This work builds upon the original ReadingOrderRecalculation tool by cconzen, which provides region-level reading order correction for PageXML files.

The line-level extensions were developed within the context of the HAICu project (https://haicu.science) on the Resoluties van de Staten van Overijssel, funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105]. The development of the line-level extensions was assisted by Claude (Anthropic) for code implementation and documentation.

License

This project is licensed under the MIT License - see the LICENSE file for details.
