
Feat/flex line dirs #142


Merged
merged 20 commits on Apr 24, 2025


@mikegerber mikegerber commented Apr 22, 2025

This adds more flexibility w.r.t. evaluating directories of line texts.


  • Test dinglehopper

    • Check support for "plain-encoding"
  • Test dinglehopper-line-dirs

    • Check support for "plain-encoding"
  • Test dinglehopper-extract

    • Check support for "plain-encoding"
  • Test dinglehopper-summarize

  • Test ocrd-dinglehopper

    • Check support for "plain-encoding"
  • Update docs w.r.t. this feature

    • dinglehopper-line-dirs --help
    • README.md
  • Review Unexpected UTF-8 problems #123

@mikegerber mikegerber self-assigned this Apr 22, 2025
@mikegerber mikegerber marked this pull request as draft April 22, 2025 11:27
@mikegerber

Needs more testing, converting to draft for now.

@mikegerber mikegerber force-pushed the feat/flex-line-dirs branch from 9405df7 to a70260c Compare April 22, 2025 11:57
@mikegerber

I've added a checklist above to go through the various CLIs and test them. Because this also adds support for specifying a plain text encoding, I've added that to the checklist as well.

@mikegerber

dinglehopper CLI works fine.

@mikegerber

Ha, dinglehopper-extract doesn't have --plain-encoding yet (but complains about auto-detecting) → Needs a fix.

@mikegerber

> Ha, dinglehopper-extract doesn't have --plain-encoding yet (but complains about auto-detecting) → Needs a fix.

Fixed in 14a4bc5.

@mikegerber

dinglehopper-summarize does not read any input files, so it's not relevant here. Should be tested in #112.

@mikegerber

dinglehopper-line-dirs behaves as expected!

@mikegerber

In a manual test, ocrd-dinglehopper also correctly warns about auto-detecting the plain text encoding, and it has an option to give an explicit encoding.

❯ ocrd-dinglehopper -P plain_encoding utf-8 -I GT-TXT,OCR-D-OCR-TESS -O DINGLEHOPPER-FROM-TXT2

I don't see how to record the plain text encoding in the METS file. That could be an improvement over this. Maybe @bertsky has an idea?

(I see comparing to txt GT as useful in some cases, e.g. when working with corpora where only the text is available but no PAGE/ALTO.)

@mikegerber

The help text of dinglehopper-line-dirs looks ready.

```
❯ dinglehopper-line-dirs --help
Usage: dinglehopper-line-dirs [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the GT line text directory against the OCR line text directory.

  This assumes that the GT line text directory contains textfiles with a
  common suffix like ".gt.txt", and the OCR line text directory contains
  textfiles with a common suffix like ".some-ocr.txt". The text files also
  need to be paired, i.e. the GT filename "line001.gt.txt" needs to match a
  filename "line001.some-ocr.txt" in the OCR lines directory.

  GT and OCR directories may contain line text files in matching
  subdirectories, e.g. "GT/goethe_faust/line1.gt.txt" and
  "OCR/goethe_faust/line1.pred.txt".

  GT and OCR directories can also be the same directory, but in this case you
  need to give --gt-suffix and --ocr-suffix explicitly.

  The GT and OCR directories are usually ground truth line texts and the
  results of an OCR software, but you may use dinglehopper to compare two OCR
  results. In that case, use --no-metrics to disable the then meaningless
  metrics and also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character error
  rate (CER) and the word error rate (WER).

  It is recommended to specify the encoding of the text files, for example
  with --plain-encoding utf-8. If this option is not given, we try to auto-
  detect it.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --gt-suffix TEXT          Suffix of GT line text files
  --ocr-suffix TEXT         Suffix of OCR line text files
  --plain-encoding TEXT     Encoding (e.g. "utf-8") of plain text files
  --help                    Show this message and exit.
  ```
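The last paragraph of the help text recommends an explicit encoding and mentions auto-detection as the fallback. A minimal sketch of that pattern, using only the standard library, might look like the following. The helper name `read_plain_text` and the two-step guess (UTF-8, then Latin-1) are assumptions made for illustration; they are not taken from dinglehopper's code, whose detection logic is not shown in this thread.

```python
import warnings
from pathlib import Path
from typing import Optional, Union


def read_plain_text(path: Union[str, Path], encoding: Optional[str] = None) -> str:
    """Hypothetical helper: decode a line text file, preferring an explicit encoding."""
    data = Path(path).read_bytes()

    if encoding is not None:
        # Explicit encoding: decode strictly so a wrong choice fails loudly.
        return data.decode(encoding)

    # No encoding given: warn, then guess. This two-step guess is an assumption
    # of the sketch and far cruder than a real detector.
    warnings.warn(f"No encoding given for {path}; guessing. Consider specifying one.")
    try:
        return data.decode("utf-8-sig")  # UTF-8, with or without a BOM
    except UnicodeDecodeError:
        return data.decode("latin-1")  # never fails, but may be the wrong guess
```

The point of the sketch is the asymmetry: with an explicit encoding a mismatch raises an error, whereas a guess can silently produce wrong characters, which then show up as spurious errors in the comparison.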

@mikegerber

README.md now points to dinglehopper-line-dirs --help for the special case of mixed GT/OCR line text directories.
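For readers who want to see what suffix-based pairing amounts to, including the mixed case where GT and OCR files share one directory, here is a rough sketch. It follows the rules stated in the help text (common suffixes, matching subdirectories), but the function name `pair_line_files` and its behavior of silently skipping unpaired files are assumptions of this example, not dinglehopper's implementation.

```python
from pathlib import Path
from typing import Iterator, Tuple, Union


def pair_line_files(
    gt_dir: Union[str, Path],
    ocr_dir: Union[str, Path],
    gt_suffix: str,
    ocr_suffix: str,
) -> Iterator[Tuple[Path, Path]]:
    """Hypothetical helper: yield (GT file, OCR file) pairs matched by name."""
    gt_dir, ocr_dir = Path(gt_dir), Path(ocr_dir)
    # rglob also descends into subdirectories such as "goethe_faust/".
    for gt_path in sorted(gt_dir.rglob("*" + gt_suffix)):
        if not gt_path.is_file():
            continue
        rel = gt_path.relative_to(gt_dir)
        stem = rel.name[: -len(gt_suffix)]  # "line1.gt.txt" -> "line1"
        ocr_path = ocr_dir / rel.parent / (stem + ocr_suffix)
        if ocr_path.is_file():
            yield gt_path, ocr_path
        # Unpaired GT files are skipped here; that choice is an assumption.
```

With both arguments pointing at the same directory, `pair_line_files("lines", "lines", ".gt.txt", ".pred.txt")` would pair `lines/goethe_faust/line1.gt.txt` with `lines/goethe_faust/line1.pred.txt`. The two suffixes have to be distinguishable for this to work, since a bare `.txt` would match both kinds of file.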

@mikegerber

I've added a test for plain text files with BOM.
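The test itself is not shown in this thread. As background on why a BOM matters for text comparison, the following standalone sketch (standard library only, with a made-up file name and the hypothetical function `demo_bom_decoding`) shows that Python's `utf-8` codec keeps the BOM as a leading U+FEFF character, while `utf-8-sig` strips it.

```python
import codecs
import tempfile
from pathlib import Path
from typing import Tuple


def demo_bom_decoding() -> Tuple[str, str]:
    """Write a UTF-8 file with a BOM and decode it with two different codecs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "line001.gt.txt"  # illustrative file name
        path.write_bytes(codecs.BOM_UTF8 + "Faust".encode("utf-8"))
        kept = path.read_text(encoding="utf-8")  # BOM survives as U+FEFF
        stripped = path.read_text(encoding="utf-8-sig")  # BOM is removed
    return kept, stripped


kept, stripped = demo_bom_decoding()
assert kept == "\ufeffFaust"
assert stripped == "Faust"
```

A leftover U+FEFF at the start of a line text would count as an extra character when compared against text without it, so a test covering BOM input guards against that kind of spurious difference.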

@mikegerber mikegerber marked this pull request as ready for review April 24, 2025 14:48
@mikegerber mikegerber merged commit d7814db into master Apr 24, 2025
12 checks passed