Description
Run this on a TarsqiDocument and have it add docelement tags.
Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:
- Short lines with only a few words that do not end in a period, possibly starting with a number or some other indicator that the line is a header.
- Empty lines, which almost always indicate that the text above and below are not in the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.
- Certain XML tags, when available.
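The first two generic heuristics could be sketched roughly as follows. This is only an illustration, not code from ttk: the function names, the `MAX_HEADER_WORDS` threshold, and the choice to return character offsets are all assumptions, and the "starts with a number" cue is not modeled.

```python
# Hypothetical threshold; the description only says "a few words".
MAX_HEADER_WORDS = 6


def looks_like_header(line, max_words=MAX_HEADER_WORDS):
    """Return True for a short, non-empty line that does not end in a period."""
    stripped = line.strip()
    if not stripped or stripped.endswith('.'):
        return False
    return len(stripped.split()) <= max_words


def split_blocks(text):
    """Return (start, end) character offsets of blocks separated by empty lines."""
    blocks = []
    start = None
    end = None
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                start = offset
            # The block ends at the last character before the line break.
            end = offset + len(line.rstrip('\r\n'))
        elif start is not None:
            # A blank line closes the current block.
            blocks.append((start, end))
            start = None
        offset += len(line)
    if start is not None:
        blocks.append((start, end))
    return blocks
```

Offsets are returned rather than strings because the goal is to add tags to an existing document, which would need positions in the source text.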
The Thyme data have many variations of the short line heuristic. There are enumerations:
Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
Singer - Hospital Summary
Admission Date: 05-Dec-2005 Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui
The Stanford splitter puts these lines into a single sentence, and a bad one that explodes the parse, so at a minimum we should not recognize them as one sentence.
Section headers are also a variation of the short line heuristic:
[end section id="20104"]
These are usually also separated by whitespace, but not always, so adding the brackets to the heuristics for Thyme may be good.
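A bracket-based check for these lines might look like the sketch below. Only the `[end section id="..."]` form appears in the example above; the matching `start` form, the pattern name, and the function name are assumptions made for illustration.

```python
import re

# Assumed shape: a whole line of the form [end section id="..."].
# The "start" alternative is a guess; only "end" is shown in the example.
SECTION_MARKER = re.compile(r'^\s*\[(start|end) section id="([^"]*)"\]\s*$')


def parse_section_marker(line):
    """Return (kind, section_id) for a bracketed section line, else None."""
    match = SECTION_MARKER.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Matching on the brackets means the marker is recognized even when it is not set off by surrounding whitespace.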
Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that this may be a section. This needs to be explored further.
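As a starting point for that exploration, the ALL CAPS cue could be tested with a regular expression like the one below. This is a sketch: "a couple of words" is interpreted here as two or more words of at least two capital letters each, which is an assumption, as are the names used.

```python
import re

# Two or more consecutive all-caps words at the start of a line.
# The minimum of two words and two letters per word is an assumption.
ALL_CAPS_START = re.compile(r'^\s*[A-Z]{2,}(?:\s+[A-Z]{2,})+\b')


def leading_caps_run(line):
    """Return the run of all-caps words that starts the line, or None."""
    match = ALL_CAPS_START.match(line)
    if match is None:
        return None
    return match.group(0).strip()
```

A single capitalized abbreviation at the start of a line, such as "MRI showed", does not match, since at least two all-caps words are required.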