Before:
00:20:11,320 --> 00:20:15,399
One: If he's secretly got
a thing for big women,
After:
00:20:11,320 --> 00:20:15,399
If he's secretly got
a thing for big women,
I have a similar script for stripping SDH, and came up with the idea of counting the occurrences of each potential SDH feature and stripping a feature only when its count meets a certain threshold.
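# texts is the subtitle text as a single String (one cue line per line)
sdh_features = {
  brackets: texts.count("["),        # opening square brackets
  parentheses: texts.count("("),     # opening parentheses
  # "Name:" at line start, optional leading dash or bracket, mixed case
  speaker_labels: texts.scan(/^-?\s*\[?\p{Lu}[\p{L}\s\.]+\]?:\s*/).size,
  # same, but the label must be entirely upper case
  speaker_labels_upper_case: texts.scan(/^-?\s*\[?\p{Lu}[\p{Lu}\s\.]+\]?:\s*/).size,
  # whole lines of upper-case letters and spaces (ASCII only)
  all_caps: texts.scan(/^[A-Z ]+$/).size
}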
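To make the threshold part concrete, here is a minimal sketch of how it could work. The names `SDH_RULES`, `count_sdh_features` and `strip_sdh`, the threshold of 3, and the simplified patterns are all assumptions for illustration, not what my script actually uses. Each feature is counted first and stripped only if the count reaches its threshold, so a lone "One:" like in the example above would be left alone.

```ruby
# Hypothetical rule table: each feature pairs a pattern with the minimum
# number of matches required before that feature is stripped.
# The threshold of 3 is an arbitrary placeholder.
SDH_RULES = {
  brackets:       { pattern: /\[[^\]\n]*\][ \t]*/, threshold: 3 },
  parentheses:    { pattern: /\([^)\n]*\)[ \t]*/, threshold: 3 },
  # Group 1 captures an optional leading dash so it can be kept.
  speaker_labels: { pattern: /^(-?[ \t]*)\p{Lu}[\p{L} .]*:[ \t]*/, threshold: 3 }
}.freeze

# Count how many times each feature occurs in the subtitle text.
def count_sdh_features(texts)
  SDH_RULES.transform_values { |rule| texts.scan(rule[:pattern]).size }
end

# Strip a feature only when its count reaches the threshold.
def strip_sdh(texts)
  counts = count_sdh_features(texts)
  SDH_RULES.reduce(texts) do |result, (name, rule)|
    next result if counts[name] < rule[:threshold]
    # Keep whatever group 1 captured (the leading dash), drop the rest.
    result.gsub(rule[:pattern]) { Regexp.last_match(1).to_s }
  end
end

single = "One: If he's secretly got\na thing for big women,"
puts strip_sdh(single) # one label only, below threshold, text unchanged

labelled = "JOHN: Hi.\n- MARY: Hello.\nJOHN: Bye."
puts strip_sdh(labelled) # three labels, threshold met, labels removed
```

The thresholds would probably need to scale with the length of the file rather than being fixed numbers.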
Thoughts on this?