Add sectioner code #1

@marcverhagen

Run this on a TarsqiDocument and have it add docelement tags.

Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:

  • Short lines with only a few words, not ending in a period, potentially starting with a number or some other indicator that something is a header.

  • Empty lines, which almost always indicate that the text above and below are not in the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.

  • Certain XML tags when available.
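
As a rough starting point, the first two generic heuristics could be sketched as below. This is not existing ttk code: the function names, the six-word threshold and the numbering pattern are all assumptions made for illustration, and the XML tag heuristic is left out.

```python
import re

# Hypothetical threshold; needs tuning against real data.
MAX_HEADER_WORDS = 6

# Optional leading indicator such as "1.", "2)" or "IV." (an assumed pattern).
LEADING_MARKER = re.compile(r"^\s*(\d+|[IVXLC]+)[.)]\s+")


def looks_like_header(line, max_words=MAX_HEADER_WORDS):
    """Return True for a short, non-empty line that does not end in a period."""
    stripped = line.strip()
    if not stripped:
        return False
    # Drop a leading numbering marker before counting words.
    body = LEADING_MARKER.sub("", stripped)
    if not body:
        return False
    return len(body.split()) <= max_words and not body.endswith(".")


def split_on_empty_lines(text):
    """Split text into blocks of consecutive non-empty lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the current block.
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks
```

Each block returned by `split_on_empty_lines` could then be checked line by line with `looks_like_header` before anything is handed to the sentence splitter.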

The Thyme data have many variations of the short line heuristic. There are enumerations:

Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
Singer - Hospital Summary
Admission Date: 05-Dec-2005  Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui

The Stanford splitter puts these lines in a single sentence, and a bad one that explodes the parse, so at a minimum we should not recognize them as one sentence.

Section headers are also a variation of the short line heuristic:

[end section id="20104"]

These are also usually separated by whitespace, but not always, so adding the brackets to the heuristics for Thyme may be good.
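
A minimal sketch of a bracket check for such lines follows. Only the `[end section id="..."]` form appears in the example above; the matching `start` form, the pattern name and the function name are assumptions.

```python
import re

# Matches the marker shown above; the "start" variant is an assumption.
SECTION_MARKER = re.compile(r'^\s*\[(start|end) section id="([^"]+)"\]\s*$')


def parse_section_marker(line):
    """Return (kind, section_id) for a bracketed marker line, else None."""
    match = SECTION_MARKER.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Because the pattern is anchored to the whole line, it would fire even when the marker is not surrounded by empty lines.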

Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that this may be a section. This needs to be explored further.
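
One way to start exploring this is a simple pattern for an all-caps prefix, sketched below. The names are hypothetical, and requiring at least two letters per word (to skip a sentence-initial "A" or "I") is an assumption rather than something observed in the data.

```python
import re

# Two or more consecutive all-caps words at the start of a line.
# The two-letter minimum per word is an assumption.
ALL_CAPS_PREFIX = re.compile(r"^\s*((?:[A-Z]{2,}\s+)+[A-Z]{2,})\b")


def all_caps_prefix(line):
    """Return the leading run of all-caps words, or None if there is none."""
    match = ALL_CAPS_PREFIX.match(line)
    return match.group(1) if match else None
```

The check ignores line length, so it also applies to long lines where the header text runs straight into the section body.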
