Add sectioner code

Run this on a TarsqiDocument and have it add docelement tags.

Focus on some fairly generic heuristics for splitting and a few specific ones for the Thyme data. Generic heuristics to use:

- Short lines with only a few words, not ending in a period, potentially starting with a number or some other indicator that something is a header.

- Empty lines, which almost always indicate that the text above and below are not in the same sentence. This is similar to what the current document structure parser in https://github.com/tarsqi/ttk does.

- Certain XML tags when available.
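
As a rough illustration of the first two generic heuristics, here is a minimal sketch that works on plain strings. It does not use the TarsqiDocument API; all function names and the word threshold are hypothetical and would need tuning on real data.

```python
import re

# Hypothetical names and thresholds, chosen for illustration only.
MAX_HEADER_WORDS = 6
NUMBERED = re.compile(r'^\s*\d+[.)]?\s')

def looks_like_header(line, max_words=MAX_HEADER_WORDS):
    """True for a short, non-empty line that does not end in a period."""
    stripped = line.strip()
    if not stripped:
        return False
    return len(stripped.split()) <= max_words and not stripped.endswith('.')

def starts_with_number(line):
    """True if the line opens with a number such as '1.' or '2)'."""
    return NUMBERED.match(line) is not None

def split_on_empty_lines(text):
    """Split text into blocks at runs of one or more empty lines."""
    return [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
```

The header check and the number check are kept separate so that a leading number can be used as extra evidence rather than as a requirement.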

The Thyme data have many variations of the short line heuristic. There are enumerations:

```
Height=150.00 cm,
Weight=39.40 kg,
Height=59.06 [in_i],
```

```
Singer - Hospital Summary
Admission Date: 05-Dec-2005  Dismissal Date: 07-Dec-2005
Contributing Author: Kxixcaj Q. Oarvui
```

The Stanford splitter puts each of these blocks into a single sentence, and a bad one that explodes the parse, so we should at least not treat such a block as one sentence.
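
One possible way to keep such blocks away from the sentence splitter is to treat every line as its own segment when all lines in a block are short. The sketch below assumes blocks have already been separated on empty lines; `segment_block` and the `max_words` threshold are hypothetical.

```python
def segment_block(block, max_words=8):
    """Return one segment per line if every line is short, else one joined segment.

    A block of short lines (an enumeration or a header block) is split
    line by line; anything else is joined and left for the sentence splitter.
    """
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    if lines and all(len(line.split()) <= max_words for line in lines):
        return lines
    return [' '.join(lines)]
```

With the two examples above, each block would come out as three separate segments instead of one long pseudo-sentence.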

Section headers are also a variation of the short line heuristic:

```
[end section id="20104"]
```

These are also usually separated by whitespace, but not always, so adding the brackets to the heuristics for Thyme may be good.
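
A bracket check could be as simple as a regular expression over single lines. This is only a sketch: the pattern assumes that a marker fills the whole line, and `is_bracket_marker` is a hypothetical name.

```python
import re

# Matches a line that consists solely of one [...] group (assumption:
# markers are never followed by other text on the same line).
BRACKET_LINE = re.compile(r'^\s*\[[^\[\]\n]+\]\s*$')

def is_bracket_marker(line):
    """True if the whole line is a single bracketed marker."""
    return BRACKET_LINE.match(line) is not None
```

Lines that merely contain brackets, such as `Height=59.06 [in_i],`, are not matched because the pattern is anchored at both ends.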

Also, in Thyme, a couple of consecutive words in ALL CAPS at the beginning of a line (even a long line) are a strong indication that a section may start there. This needs to be explored further.
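
For exploring the ALL CAPS idea, a first approximation might look like the sketch below. The pattern requires at least two consecutive words of two or more capital letters at the start of the line; both that definition and the name `caps_prefix` are assumptions, and the test strings in the comments are made up rather than taken from the Thyme data.

```python
import re

# Two or more consecutive all-caps words at the start of a line.
CAPS_PREFIX = re.compile(r'^\s*([A-Z]{2,}(?:\s+[A-Z]{2,})+)\b')

def caps_prefix(line):
    """Return the leading run of all-caps words, or None if there is none."""
    match = CAPS_PREFIX.match(line)
    return match.group(1) if match else None

# Made-up examples:
#   caps_prefix("HISTORY OF PRESENT ILLNESS: The patient ...")  gives "HISTORY OF PRESENT ILLNESS"
#   caps_prefix("The patient was seen.")                        gives None
```

Returning the matched prefix rather than a boolean makes it easy to collect the candidate section names and inspect how reliable the heuristic is.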
