Commit c37670d

Updated crfcut.py
crfcut did not reliably split sentences on terminal punctuation, most commonly '.' (full stop), which should be treated as the end of a sentence. Modified the function so that it splits on terminal punctuation ('!', '.', '?') and no longer emits empty strings as sentences.
1 parent fa0a2ca commit c37670d

1 file changed, 11 additions and 1 deletion

pythainlp/tokenize/crfcut.py

Lines changed: 11 additions & 1 deletion
@@ -198,12 +198,22 @@ def segment(text: str) -> List[str]:
     feat = extract_features(toks)
     labs = _tagger.tag(feat)
     labs[-1] = "E" # make sure it cuts the last sentence
+
+    #To ensure splitting of sentences using Terminal Punctuation
+    for idx, _ in enumerate(toks):
+        if(toks[idx].strip().endswith(('!', '.', '?'))):
+            labs[idx] = "E"
+
+        #Spaces or empty strings would no longer be treated as end of the sentence.
+        elif(toks[idx].strip() == ""):
+            labs[idx] = "I"
 
     sentences = []
     sentence = ""
     for i, w in enumerate(toks):
         sentence = sentence + w
-        if labs[i] == "E":
+        #Constraining empty strings to get added, to avoid any sort of unusual behaviour due to empty strings.
+        if labs[i] == "E" and sentence != '':
             sentences.append(sentence)
             sentence = ""
 
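
The label overrides in this hunk can be tried in isolation, without pythainlp or the CRF tagger. The following is a minimal sketch, not the library's code: `split_sentences` and `TERMINAL_PUNCTUATION` are hypothetical names introduced here, and the tagger output is assumed to arrive as a plain list of "I"/"E" labels, one per token.

```python
from typing import List

# Same punctuation set the diff checks; the constant name is hypothetical.
TERMINAL_PUNCTUATION = ("!", ".", "?")


def split_sentences(toks: List[str], labs: List[str]) -> List[str]:
    """Apply the hunk's label overrides, then join tokens into sentences.

    toks and labs are assumed to have equal length, with labs holding
    "I" (inside a sentence) or "E" (end of a sentence) for each token.
    """
    if not toks:
        return []

    labs = list(labs)  # work on a copy so the caller's list is untouched
    labs[-1] = "E"     # as in the original code: cut after the last token

    for idx, tok in enumerate(toks):
        stripped = tok.strip()
        if stripped.endswith(TERMINAL_PUNCTUATION):
            labs[idx] = "E"  # terminal punctuation always ends a sentence
        elif stripped == "":
            labs[idx] = "I"  # whitespace-only tokens never end a sentence

    sentences = []
    sentence = ""
    for tok, lab in zip(toks, labs):
        sentence += tok
        if lab == "E" and sentence != "":
            sentences.append(sentence)
            sentence = ""
    return sentences


toks = ["Hello", " ", "world", ".", " ", "Bye"]
labs = ["I", "E", "I", "I", "I", "I"]  # the tagger wrongly cut at the space
print(split_sentences(toks, labs))     # ['Hello world.', ' Bye']
```

One edge case in this sketch: because the whitespace override runs after `labs[-1] = "E"`, input that ends in a whitespace-only token loses its final cut, so `split_sentences(["Hello", " "], ["I", "E"])` returns an empty list. The diff applies its overrides in the same order, so the same case may be worth checking against the real function.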

0 commit comments