long description of short hand-written text (signatures); heading after image #94

@mayeulk

Hello, and thanks for this software!
This issue affects the OCR on the pdf document at https://www.irnc.org/IRNC/Textes/2797
(also attached below) which is a (published) open letter (to the French Ministry of Justice).

Lettre à M. Eric Dupont-Moretti.pdf

The markdown output of Chandra OCR2 contains the following:

Le lendemain matin, 10 juillet 2020, nous étions auditionnés à la gendarmerie d'Is-sur-Tille au sujet des faits d' « *intrusion en réunion dans l'enceinte d'une installation civile abritant des matières nucléaires, en l'espèce être entré (s) dans l'enceinte du CEA Valduc malgré la présence de panneaux indiquant qu'il s'agissait d'une propriété privée* », citation indiquée téléphoniquement aux gendarmes par Mme Caroline Locks, Substitut du Procureur de la République près le Tribunal judiciaire de Dijon.

A handwritten signature in black ink, appearing to be initials and a surname, possibly 'JB' followed by a stylized 'X' and 'JME'.## La procédure

N'ayant pu obtenir communication du dossier malgré ses demandes répétées, notre avocate, Me Dominique Clémang, a sollicité à l'audience correctionnelle du 12 février à 14 h le report du procès.

The important part for this issue is the paragraph starting with "A handwritten signature in black ink..." and ending with "## La procédure"
Namely:
A handwritten signature in black ink, appearing to be initials and a surname, possibly 'JB' followed by a stylized 'X' and 'JME'.## La procédure

There are two issues here:

  • In a French text, replacing a signature with a long description in English does not match the audience's needs. I cannot find a way to post-process (and remove) this, because some of my documents do contain printed (not handwritten) English text embedded in images (hence the need for OCR). Here, I would expect to get either some gibberish text, or the image linked as a separate asset with a link in the markdown, or perhaps the description in French. This would be possible if I could find a way to say "do not process handwritten text" (my first choice) or "if you add a text description, write it in French". Using the docker server image, I tried modifying prompts.py on the client side (the side calling ./chandra --method vllm /input/folder), but it had no effect (should I rebuild the server image?). Image descriptions (in English) are frequent with Chandra OCR2, but I can usually remove them because they appear in a caption-style markdown format for an image... except for a signature.
  • The second issue: when an image is directly followed by a heading, the image description (or the image markdown link) often gets glued to the heading (as here: ## La procédure). This is easy to fix in post-processing if the source text contains no literal hash sign (#), but could become tricky otherwise.
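For the second issue, a minimal post-processing sketch of the fix I have in mind is shown here. It is not part of Chandra; the function name `split_glued_headings` and the rule itself are my own assumptions. It only splits a heading marker when it follows closing punctuation directly and is itself followed by a space, so something like `#94` or `C#` inside a sentence is left alone. A literal `. # ` in the source text would still be split wrongly.

```python
import re

# Heuristic (an assumption, not Chandra behaviour): a heading marker of 1 to 6
# hash signs plus a space, glued directly after closing punctuation.
GLUED_HEADING = re.compile(r"(?<=[.!?:;)\]»])(#{1,6} )(?=\S)")


def split_glued_headings(markdown: str) -> str:
    """Insert a blank line before a heading glued to the end of a sentence."""
    return GLUED_HEADING.sub(r"\n\n\1", markdown)


if __name__ == "__main__":
    sample = "a stylized 'X' and 'JME'.## La procédure"
    print(split_glued_headings(sample))
```

Headings that already start a line are untouched, because the lookbehind requires a punctuation character immediately before the hash signs.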

For the first issue, I would have expected the JSON to help, but it does not (it reports no images at all):

{
  "file_name": "Lettre \u00e0 M. Eric Dupont-Moretti.pdf",
  "num_pages": 3,
  "total_token_count": 3168,
  "total_chunks": 37,
  "total_images": 0,
  "pages": [
    {
      "page_num": 0,
      "page_box": [0, 0, 1588, 2245],
      "token_count": 1098,
      "num_chunks": 14,
      "num_images": 0
    },
    {
      "page_num": 1,
      "page_box": [0, 0, 1588, 2245],
      "token_count": 1184,
      "num_chunks": 9,
      "num_images": 0
    },
    {
      "page_num": 2,
      "page_box": [0, 0, 1588, 2245],
      "token_count": 886,
      "num_chunks": 14,
      "num_images": 0
    }
  ]
}

My current workaround is to read all of the English output myself (since some documents contain English text as images in tables or "text boxes"). I also search for "handwritten" and "signature" (but the latter is the same word in French: "signature"); I fear false negatives.
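To narrow down what I have to read, a rough sketch of a language heuristic could look like the following. It is entirely my own assumption (the word lists, the `min_hits` threshold and the name `flag_english_paragraphs` are hypothetical): it flags paragraphs where common English function words outnumber French ones. It cannot tell a generated description from legitimate English text in the document, so it only produces a shortlist for manual review.

```python
import re

# Small, hand-picked word lists (assumptions; extend them for real use).
EN_WORDS = {"the", "of", "and", "in", "by", "with", "to",
            "appearing", "followed", "possibly"}
FR_WORDS = {"le", "la", "les", "de", "des", "du", "et", "un", "une",
            "au", "à", "en", "que", "nous"}


def flag_english_paragraphs(markdown: str, min_hits: int = 3) -> list[str]:
    """Return paragraphs that look more English than French."""
    flagged = []
    for para in re.split(r"\n\s*\n", markdown):
        words = re.findall(r"[a-zà-ÿ]+", para.lower())
        en = sum(w in EN_WORDS for w in words)
        fr = sum(w in FR_WORDS for w in words)
        # Require a few English hits so short fragments are not flagged.
        if en >= min_hits and en > fr:
            flagged.append(para)
    return flagged


if __name__ == "__main__":
    text = (
        "A handwritten signature in black ink, appearing to be initials "
        "and a surname, possibly 'JB' followed by a stylized 'X' and 'JME'."
        "\n\n"
        "N'ayant pu obtenir communication du dossier malgré ses demandes "
        "répétées, notre avocate a sollicité à l'audience le report du procès."
    )
    for para in flag_english_paragraphs(text):
        print(para)
```

Counting function words rather than searching for "handwritten" or "signature" avoids depending on the exact wording of the description, though very short descriptions may still slip through.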
