This is a simple rule-based script for recalculating the region and line reading order of a PageXML file. It is meant to post-process the results of layout recognition performed with Transkribus or Loghi. More specifically, it is modelled to correctly order one- or two-page scans that contain marginalia.
This script was originally developed for Het Utrechts Archief within the context of an internship.
This repository is a fork of the original ReadingOrderRecalculation tool developed by cconzen. Whereas the original script recalculates the reading order of text regions within PageXML files, this extended version additionally recalculates the reading order of text lines within each individual region. This enhancement ensures that not only the sequence of regions follows the correct reading order, but that the line-level order within those regions is also properly established.
The line-level recalculation applies similar geometric logic to that used for regions, ensuring that marginalia and other layout features are correctly ordered at the most granular level of the document structure.
This extension was developed by C.A.Romein for use within the project on the Resoluties van de Staten van Overijssel.
- Extract Features: Parses XML files to extract image information (height, width) as well as text regions, text lines, and their coordinates.
- Calculate Reading Order: Uses extracted features to calculate both the region reading order and the line reading order within each region.
- Update Files: Saves the new reading order into the PageXMLs at both region and line level.
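The feature extraction step can be illustrated with a minimal sketch. It is not the script's actual code: the function names `bounding_box` and `extract_features` and the returned dictionary layout are assumptions made for this example, and it uses only `xml.etree.ElementTree` from the standard library. It relies on the standard PageXML layout, where `Page` carries `imageWidth` and `imageHeight` and each `TextRegion` or `TextLine` has a `Coords` child with a `points` attribute.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written PageXML fragment used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.jpg" imageWidth="2000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,700 100,700"/>
      <TextLine id="r1l1">
        <Coords points="110,120 890,120 890,180 110,180"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def bounding_box(element, ns):
    """Return (x_min, y_min, x_max, y_max) from the element's own Coords polygon."""
    points = element.find("pc:Coords", ns).get("points")
    xy = [tuple(int(float(v)) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)


def extract_features(xml_text):
    """Hypothetical helper: collect image size, regions and lines from one PageXML."""
    root = ET.fromstring(xml_text)
    # Read the namespace from the root tag so other PageXML schema versions also work.
    # Assumes the document is namespaced, as PageXML files normally are.
    ns = {"pc": root.tag[1:].split("}")[0]}
    page = root.find("pc:Page", ns)
    regions = []
    # Only top-level regions are visited; nested regions are ignored in this sketch.
    for region in page.findall("pc:TextRegion", ns):
        lines = [
            {"id": line.get("id"), "bbox": bounding_box(line, ns)}
            for line in region.findall("pc:TextLine", ns)
        ]
        regions.append(
            {"id": region.get("id"), "bbox": bounding_box(region, ns), "lines": lines}
        )
    return {
        "width": int(page.get("imageWidth")),
        "height": int(page.get("imageHeight")),
        "regions": regions,
    }


features = extract_features(SAMPLE)
print(features["width"], features["height"], features["regions"][0]["bbox"])
```

Reducing each polygon to its bounding box keeps the later geometric comparisons simple, at the cost of some precision for strongly skewed regions.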
You only need numpy and pandas in addition to some standard Python libraries. You can install the required dependencies using pip:
pip install numpy pandas
The code processes all XML files located in a directory. To run the script, first install the dependencies, then run the following:
python reorder_update_wlines.py example_folder/page --overwrite
As arguments, specify the base directory containing the PageXML files (here example_folder/page), and add --overwrite if you wish to overwrite the existing file.
The script uses simple logic based on the geometric properties of the regions, lines, and page.
Given this sample layout of a scan:
- Determine orientation (landscape = two pages, portrait = one page) based on the image's height and width. Depending on the orientation, the bookfold location is estimated:
- at the horizontal centre of the scan for landscape orientation
- at the left edge (x = 0) for portrait orientation
- Each region is assigned 0 (left page) or 1 (right page), depending on which side of the estimated bookfold its horizontal centre lies.
- The regions are ordered:
- Left page to right page
- Top to bottom
- Left to right
- The script then uses this initial order to iterate through all regions, comparing each box with the one immediately following it in the ranking. It checks whether the following box might be a marginalium by first testing whether both boxes are on the same page, and then whether the candidate is vertically contained within the current box.
- It is then confirmed that the candidate lies to the left or right of the current box. In the sample layout it lies to the left. The left condition compares the left edges of the two boxes (and the right condition the right edges), so overlapping boxes are handled correctly.
- If all these conditions apply, the ranks/indices of the two boxes are swapped.
- If a swap occurs, the loop breaks and restarts with the new order. This repeats until a full pass completes without a swap, at which point the final reading order has been reached.
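The steps above can be sketched in a few lines of plain Python. This is an illustrative reconstruction from the description, not the script's own code: the `Box` class and all function names are hypothetical, square images are treated as portrait, and only the left-hand marginalia case is implemented (a candidate to the right of the current box is assumed to be in the correct position already).

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of a text region (hypothetical helper)."""
    id: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def bookfold_x(width, height):
    """Landscape: fold at the horizontal centre. Portrait: fold at x = 0."""
    return width / 2 if width > height else 0


def page_side(box, fold):
    """0 = left page, 1 = right page, judged by the box's horizontal centre."""
    return 0 if (box.x_min + box.x_max) / 2 < fold else 1


def is_left_marginal(current, candidate, fold):
    """True if candidate shares the page, is vertically contained, and starts further left."""
    same_page = page_side(current, fold) == page_side(candidate, fold)
    contained = current.y_min <= candidate.y_min and candidate.y_max <= current.y_max
    left_of = candidate.x_min < current.x_min  # compare left edges only
    return same_page and contained and left_of


def reading_order(boxes, width, height):
    fold = bookfold_x(width, height)
    # Initial order: page, then top to bottom, then left to right.
    order = sorted(boxes, key=lambda b: (page_side(b, fold), b.y_min, b.x_min))
    swapped = True
    while swapped:  # repeat until a full pass makes no swap
        swapped = False
        for i in range(len(order) - 1):
            if is_left_marginal(order[i], order[i + 1], fold):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
                break  # restart the pass with the new order
    return [b.id for b in order]


boxes = [
    Box("right_main", 1100, 100, 1700, 1400),
    Box("left_note", 50, 500, 250, 700),
    Box("right_note", 1750, 300, 1950, 500),
    Box("left_main", 300, 100, 900, 1400),
]
print(reading_order(boxes, width=2000, height=1500))
# ['left_note', 'left_main', 'right_main', 'right_note']
```

In this example the initial sort places `left_main` before `left_note` because its top edge is higher; the swap pass then moves the note in front of the main block it sits beside. Each swap moves a box with a smaller left edge forward, so the loop cannot cycle.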
After establishing the correct region order, the script applies a similar process to the text lines within each region. For each region, the lines are sorted using the same geometric principles (top to bottom, left to right), with special handling for marginalia that may appear alongside the main text body. This ensures that the reading order is coherent not only between regions, but also within each individual region.
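As a rough illustration of the line-level step, the sketch below sorts the `TextLine` elements of one region top to bottom, then left to right, and writes the result back. Several things here are assumptions rather than facts about the script: the helper names are hypothetical, the special handling for marginalia is left out, and the order is stored both as element order and in the Transkribus-style `custom="readingOrder {index:N;}"` attribute, which may differ from what the script writes.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Hand-written region whose two lines are stored in the wrong order.
REGION_XML = f"""<TextRegion xmlns="{NS}" id="r1">
  <Coords points="100,100 900,100 900,700 100,700"/>
  <TextLine id="l2" custom="readingOrder {{index:0;}}">
    <Coords points="110,300 890,300 890,360 110,360"/>
  </TextLine>
  <TextLine id="l1" custom="readingOrder {{index:1;}} structure {{type:heading;}}">
    <Coords points="110,120 890,120 890,180 110,180"/>
  </TextLine>
</TextRegion>"""


def top_left(element):
    """Sort key: (top edge, left edge) of the element's Coords polygon."""
    points = element.find(f"{{{NS}}}Coords").get("points")
    xy = [tuple(float(v) for v in pair.split(",")) for pair in points.split()]
    return min(y for _, y in xy), min(x for x, _ in xy)


def set_reading_index(element, index):
    """Replace any readingOrder entry in `custom`, keeping other entries intact."""
    custom = element.get("custom", "")
    custom = re.sub(r"readingOrder\s*\{[^}]*\}\s*", "", custom).strip()
    element.set("custom", f"readingOrder {{index:{index};}} {custom}".strip())


def reorder_lines(region):
    """Hypothetical helper: reorder TextLine children in place and return their ids."""
    line_tag = f"{{{NS}}}TextLine"
    children = list(region)
    lines = [child for child in children if child.tag == line_tag]
    if not lines:
        return []
    insert_at = children.index(lines[0])  # keep lines where the first one was
    for line in lines:
        region.remove(line)
    ordered = sorted(lines, key=top_left)
    for offset, line in enumerate(ordered):
        set_reading_index(line, offset)
        region.insert(insert_at + offset, line)
    return [line.get("id") for line in ordered]


region = ET.fromstring(REGION_XML)
print(reorder_lines(region))
```

Updating both the element order and the attribute keeps the file consistent for tools that read either one.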
You can visualise the calculated reading order path by running the visualisation script with your base directory as its argument:
python visualise.py example_folder
Here are some side-by-side comparisons of input image and visualised result:
(These can be found in the example_folder; the scans were processed using Loghi.)
This work builds upon the original ReadingOrderRecalculation tool by cconzen, which provides region-level reading order correction for PageXML files.
The line-level extensions were developed within the context of the [HAICu-project](https://haicu.science) on the Resoluties van de Staten van Overijssel, funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105]. The development of the line-level extensions was assisted by Claude (Anthropic) for code implementation and documentation.
This project is licensed under the MIT License - see the LICENSE file for details.





