[strip-sdh] Text incorrectly stripped as speaker name #39

@gizeto

Description

Before:

00:20:11,320 --> 00:20:15,399
One: If he's secretly got
a thing for big women,

After:

00:20:11,320 --> 00:20:15,399
If he's secretly got
a thing for big women,

I have a similar script for stripping SDH, and came up with the idea of counting how often each potential SDH feature occurs, then stripping a feature only when its count meets a certain threshold.

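# Count each candidate SDH feature across the whole subtitle text
# (texts is assumed to be a single String).
sdh_features = {
  brackets: texts.count("["),
  parentheses: texts.count("("),
  # Lines that start with a capitalized label followed by a colon, e.g. "John: "
  speaker_labels: texts.scan(/^-?\s*\[?\p{Lu}[\p{L}\s\.]+\]?:\s*/).size,
  # Same, but the label is entirely upper case, e.g. "JOHN: "
  speaker_labels_upper_case: texts.scan(/^-?\s*\[?\p{Lu}[\p{Lu}\s\.]+\]?:\s*/).size,
  # Lines made up solely of ASCII capitals and spaces
  all_caps: texts.scan(/^[A-Z ]+$/).size
}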
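
To make the threshold idea concrete, here is a minimal sketch of how the speaker-label count could gate the stripping step. The names `SPEAKER_LABEL`, `MIN_SPEAKER_LABELS` and `strip_speaker_labels` are hypothetical, and the cutoff of 3 is only an assumption that would need tuning on real files. The pattern is adapted from `speaker_labels` above, with whitespace limited to spaces and tabs so that a match cannot run across a line break when it is used for deletion.

```ruby
# Adapted from the speaker_labels pattern; only spaces and tabs are allowed
# as whitespace, so a single match always stays on one line.
SPEAKER_LABEL = /^-?[ \t]*\[?\p{Lu}[\p{L} \t.]+\]?:[ \t]*/

# Hypothetical cutoff: fewer candidate labels than this are treated as
# ordinary dialogue and left alone.
MIN_SPEAKER_LABELS = 3

# Strips speaker labels only when enough of them occur in the whole text
# to suggest the file really uses SDH-style labels.
def strip_speaker_labels(texts, min_count: MIN_SPEAKER_LABELS)
  return texts if texts.scan(SPEAKER_LABEL).size < min_count

  texts.gsub(SPEAKER_LABEL, "")
end

dialogue = "One: If he's secretly got\na thing for big women,\n"
sdh      = "JOHN: Hi.\nMARY: Hello.\nJOHN: How are you?\n"

strip_speaker_labels(dialogue) # => unchanged, only one candidate label
strip_speaker_labels(sdh)      # => "Hi.\nHello.\nHow are you?\n"
```

With a check like this, the "One:" line from the example above would survive, because a single label-like match does not reach the cutoff.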

Thoughts on this?
