HTRVX - pronounced Ashterux - allows for quality control of XML using XSD schema validation, Segmonto validation and other verifications.
Simply run `pip install htrvx`.
The basic way to run the script is `htrvx PATHTOFILES --format FORMAT`, e.g. `htrvx ./tests/test_data/page/*.xml --format page`.
Each verification is opt-in: you must explicitly enable every check you want to run.
- `--segmonto` will check for Segmonto compliance.
    - You can use your own vocabulary or a restricted Segmonto vocabulary by using `--zone ZONENAME` and `--line LINENAME`, such as `htrvx [...] --line DefaultLine --line HeadingLine --zone MainZone`.
    - You can use `--allow-untagged` with either `line`, `zone` or `both` so that lines and/or zones without a type are allowed. To limit the number of such lines or zones, combine it with `--max-untagged-zones N` or `--max-untagged-lines N`, where N is the number of allowed occurrences.
- `--xsd` will check if the data are compliant with XML Schemas.
- `--check-empty` will check if regions have no lines or if lines have no text.
    - `--check-empty` can be refined with `--raise-empty` to throw an error if empty elements are found; otherwise they are simply reported.
- `--check-image` checks the image link in the XML. Links are resolved relative to the XML file, i.e. if the XML file `./data/element.xml` points to `file.jpeg`, the file `./data/file.jpeg` is expected to exist.
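The relative resolution of image links can be illustrated with a short Python sketch. This is not HTRVX's own code: the helper names `expected_image_path` and `image_exists` are hypothetical and only mirror the rule described for `--check-image`, namely that the link is looked up in the directory containing the XML file.

```python
from pathlib import Path


def expected_image_path(xml_path: str, image_link: str) -> Path:
    """Resolve an image link relative to the directory of the XML file.

    Hypothetical helper, for illustration only (not part of HTRVX).
    """
    return Path(xml_path).parent / image_link


def image_exists(xml_path: str, image_link: str) -> bool:
    """Return True if the linked image file exists next to the XML file."""
    return expected_image_path(xml_path, image_link).is_file()


# The XML file ./data/element.xml pointing to file.jpeg
# leads to a lookup of ./data/file.jpeg.
print(expected_image_path("./data/element.xml", "file.jpeg"))
```

Moving an XML file without its image, or storing images in a sibling directory, would therefore make this kind of check fail unless the link itself contains the relative path.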
Other parameters mainly have to do with verbosity: `--verbose` displays details about errors, and `--group` groups errors by type instead of showing one line per error.
| Parameters | Default | Function |
|---|---|---|
| -v, --verbose | False | Prints more information |
| -f, --format [alto,page] | alto | Format of files |
| -s, --segmonto | False | Apply Segmonto Zoning verification |
| -e, --check-empty | False | Check for empty lines or empty zones |
| -r, --raise-empty | False | Fail (instead of only warning) if empty lines or empty zones are found |
| -x, --xsd | False | Apply XSD Schema verification |
| -g, --group | False | Group error types (reduce verbosity) |
| -i, --check-image | False | Check if the image link in the XML points to the right path |
| -l, --verbose-level | zen | Level of details and amount of color shown in the logs (see below). |
| --zone TEXT | None | Provide a custom zone to control zone types instead of Segmonto |
| --line TEXT | None | Provide a custom line to control Line types instead of Segmonto |
- `minimal`: shows only failing tests, no details.
- `low`: shows only failing tests and their details, such as which lines fail in a file.
- `zen` (default): shows all tests and their details, but displays only one color (red for errors).
- `all`: shows everything.
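When HTRVX is called from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper `build_htrvx_command` that is not part of HTRVX; it only uses flags listed in the table and places them before the file paths.

```python
from typing import List, Sequence


def build_htrvx_command(
    files: Sequence[str],
    fmt: str = "alto",
    segmonto: bool = False,
    xsd: bool = False,
    check_empty: bool = False,
    raise_empty: bool = False,
    check_image: bool = False,
    group: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for the htrvx command line.

    Hypothetical helper, for illustration only: every check is opt-in,
    so a flag is added only when the matching argument is True.
    """
    cmd = ["htrvx", "--format", fmt]
    flags = [
        ("--segmonto", segmonto),
        ("--xsd", xsd),
        ("--check-empty", check_empty),
        ("--raise-empty", raise_empty),
        ("--check-image", check_image),
        ("--group", group),
        ("--verbose", verbose),
    ]
    cmd.extend(flag for flag, enabled in flags if enabled)
    cmd.extend(files)
    return cmd


print(build_htrvx_command(["./data/element.xml"], fmt="page", segmonto=True))
```

The resulting list can be passed to `subprocess.run`, provided `htrvx` is installed and available on the `PATH`.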
If you want to add this to your GitHub repository as a continuous integration workflow, add a file `htrvx.yml` in the `.github/workflows` directory of your repository.
    # This workflow will install Python dependencies, run tests and lint with a single version of Python
    # For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

    name: HTRVX

    on: [push, pull_request] # You can edit this of course !

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
        - uses: actions/checkout@v2
        - name: Set up Python 3.8
          uses: actions/setup-python@v2
          with:
            python-version: 3.8
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install htrvx
        - name: Run HTRVX
          run: |
            htrvx --verbose --group --format alto --segmonto --xsd --check-empty --raise-empty UNIX/Path/to/**/your/*.xml
Logo by Alix Chagué.
