Add option to TextSplitter to return individual sentences. Adding general SaT model support.#408
Conversation
jrobble
left a comment
There was a problem hiding this comment.
@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)
a discussion (no related file):
Mention SaT here:
```python
def split_input_text(self, text: str, from_lang: Optional[str],
                     from_lang_confidence: Optional[float]) -> SplitTextResult:
    """
    Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
    Each chunk will contain one or more complete sentences as reported
    by the (WtP or spaCy) sentence splitter.
    """
```
Mention SaT here:
```python
class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """
```
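The chunking behavior those docstrings describe can be sketched as a greedy sentence-packing loop. This is only an illustrative sketch, not the component's actual code; `max_chars` stands in for `TranslationClient.DETECT_MAX_CHARS`, and the sentence list is assumed to come from the WtP/SaT/spaCy splitter:

```python
from typing import List

def chunk_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = (current + " " + sentence) if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single oversized sentence still becomes its own chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The key property is that sentences are never split mid-sentence; only chunk boundaries move.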
a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.
python/AzureTranslation/README.md line 108 at r1 (raw file):
More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of model names [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
This link only lists SaT. Also include a link to an older release for WTP models.
python/AzureTranslation/README.md line 112 at r1 (raw file):
Review list of languages supported by SaT/WtP [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
I think this link is only for SaT. Also include a link to an older release for WTP model languages.
Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini, it takes the sample Portuguese text and breaks it down into individual words. Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
…HOLD (#409)
* Validate timestamps.
Co-authored-by: jrobble <jrobble@mitre.org>
hhuangMITRE
left a comment
There was a problem hiding this comment.
Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Mention SaT here:
```python
def split_input_text(self, text: str, from_lang: Optional[str],
                     from_lang_confidence: Optional[float]) -> SplitTextResult:
    """
    Splits up the given text in to chunks that are under TranslationClient.DETECT_MAX_CHARS.
    Each chunk will contain one or more complete sentences as reported
    by the (WtP or spaCy) sentence splitter.
    """
```
Mention SaT here:
```python
class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """
```
Done.
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.
Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:
```python
pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do nosso desanuviado sol, ou a face desassombrada da lua no firmamento peninsular, onde não tem, como a de Londres--_a romper a custo um plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_, irremediavelmente lugubres e soturnos, como se a cada momento saíssem das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados inglezes. Como se enganam ou como pretendem enganar-nos! É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a indelevel e accentuada expressão de beatitude, que transluz no rosto illuminado dos homens de além da Mancha, os quaes parece caminharem entre nós, envolvidos em densa atmosphera de perenne contentamento, satisfeitos do mundo, satisfeitos dos homens e, muito especialmente, satisfeitos de si. """
```
and breaks it down into individual words:
```
#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face
```
Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.
This is mainly due to the text splitter not recognizing newlines, which we've fixed in the most recent update.
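The fix described above can be outlined as: treat newlines as hard breaks first, then run sentence segmentation on each paragraph. This is only an illustrative sketch, with a naive regex standing in for the WtP/SaT model, not the actual nlp_text_splitter code:

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a WtP/SaT model: split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

def split_with_newlines(text: str) -> List[str]:
    """Treat newlines as hard sentence breaks, then sentence-split each paragraph.

    Without the paragraph pass, a title on its own line (e.g. "Dracula") can be
    merged into the neighboring text and segmented badly, as seen in the logs above.
    """
    sentences: List[str] = []
    for paragraph in text.split('\n'):
        if paragraph.strip():
            sentences.extend(split_sentences(paragraph))
    return sentences
```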
python/AzureTranslation/README.md line 108 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This link only lists SaT. Also include a link to an older release for WTP models.
Done!
python/AzureTranslation/README.md line 112 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
I think this link is only for SaT. Also include a link to an older release for WTP model languages.
Done!
…sentence-mode-sat-model-update
There was a problem hiding this comment.
I'm still not done reviewing this, but I wanted to at least submit some feedback before we next talked.
@gonzalezjo reviewed 10 files and made 8 comments.
Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on hhuangMITRE and jrobble).
python/AzureTranslation/LICENSE line 27 at r4 (raw file):
The WtP, "Where the Point", and SaT, "Segment any Text" sentence segmentation library falls under the MIT License:
It probably isn't super important to change "library" to "libraries" here, but I'll just note that maybe it should be plural? I am not a lawyer or even remotely legally qualified to evaluate license changes outside this comment so would want someone else to confirm that this is fine.
python/NllbTranslation/LICENSE (raw file):
I cannot be responsible for vetting anything legal. But LGTM if it looks good to someone who can be responsible.
python/NllbTranslation/README.md line 112 at r4 (raw file):
- When active, the `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT` is replaced by a more aggressive `DIFFICULT_LANGUAGE_TOKEN_LIMIT`.
- `DIFFICULT_LANGUAGE_TOKEN_LIMIT`: Token size for translation chunks when processing languages specified in `PROCESS_DIFFICULT_LANGUAGES`.
Hmm. Off the top of my head, I don't have any suggestions for how this could/should be written, and I hate to leave criticism that isn't very constructive, so I apologize. But I don't think "token size" is the right terminology, and I'm not sure this really articulates what `PROCESS_DIFFICULT_LANGUAGES` (and the count of 50 tokens) actually does, which I think we should try to document better. Its behavior seems counterintuitive to me, and maybe buggy? I elaborate on this more in another comment, but we can discuss this separately over a call if you want.
(See later discussion in nllb_translation_component for further questions/comments.)
python/NllbTranslation/README.md line 139 at r4 (raw file):
component will always default to CPU processing.
- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WTP models require a language code.
Maybe use "language identifier" since that's the terminology we use elsewhere and a natural followup question when reading this is "what are the language codes?"
python/NllbTranslation/README.md line 144 at r4 (raw file):
# Language Identifiers
The following are the ISO 639-3 and ISO 15924 codes, and their corresponding languages which Nllb can translate.
Super nitpicky (sorry 😓), but if you're going to change this file again: Nllb -> NLLB. Not blocking, though.
python/NllbTranslation/nllb_component/nllb_translation_component.py line 215 at r4 (raw file):
```python
# Difficult-language override (optional): clamp token hard limit if configured.
if config.use_token_length and _is_difficult_language(config.translate_from_language, difficult_set):
    diff_limit = config.difficult_language_token_limit
```
I don't know that there's a bug here, and if there is, it doesn't seem to be caught by any tests, but I think that's because we don't test this case. So:
- (unless I'm crazy): Could you please add a test for this?
- (again, unless I'm crazy): I think this is buggy or our spec is wrong? The behavior here is very counterintuitive to me, but:
  - We say (in both the README and this folder's associated descriptor.json) that we override `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT`. But it seems like we're clamping the hard limit? I made a really dumb local test and I can reproducibly cause this to "lose" text, but I may be using the API wrong.
  - The descriptor.json also says we force sentence-by-sentence splitting for difficult languages. I may be missing something (sorry if so!) but I'm not able to find where that happens.
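To make the soft-vs-hard distinction concrete, here is a hedged sketch (toy whitespace tokenizer and hypothetical function names, not the component's implementation) of why a hard token cut can mangle sentences while a soft limit only prefers to break between them:

```python
from typing import List

def tokenize(text: str) -> List[str]:
    # Toy whitespace tokenizer standing in for the NLLB tokenizer.
    return text.split()

def hard_limit_chunks(text: str, limit: int) -> List[str]:
    """A hard limit cuts every `limit` tokens, even mid-sentence.

    With very low limits this produces the dropped words and odd splits
    described in the review discussion.
    """
    tokens = tokenize(text)
    return [' '.join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]

def soft_limit_chunks(sentences: List[str], soft_limit: int) -> List[str]:
    """A soft limit prefers to break between sentences, never inside one."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(tokenize(sentence))
        if current and count + n > soft_limit:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Under the soft limit, a sentence longer than the limit still travels whole, which trades limit precision for translation quality.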
python/NllbTranslation/nllb_component/nllb_utils.py line 438 at r4 (raw file):
```python
    'zul': 'zul_Latn'  # Zulu
}
```
Nice. Extremely satisfying. I'm not 100% sure it needs to be part of this PR instead of a follow-on simplification PR, but it doesn't break anything and it's awesome seeing this much code simplified away.
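The mapping in question pairs ISO 639-3 codes with NLLB's `language_Script` identifiers. As a hedged illustration of how such a lookup could be combined with an explicit ISO 15924 script (the dict fragment and helper below are hypothetical, not the actual `nllb_utils` code):

```python
from typing import Optional

# Hypothetical fragment of an ISO 639-3 -> NLLB identifier lookup.
NLLB_CODES = {
    'spa': 'spa_Latn',  # Spanish
    'zul': 'zul_Latn',  # Zulu
}

def to_nllb_code(iso639_3: str, iso15924: Optional[str] = None) -> str:
    """Resolve an NLLB language identifier; an explicit script overrides the default."""
    if iso15924:
        return f'{iso639_3}_{iso15924}'
    return NLLB_CODES[iso639_3]
```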
python/NllbTranslation/plugin-files/descriptor/descriptor.json line 139 at r4 (raw file):
```json
{
    "name": "PROCESS_DIFFICULT_LANGUAGES",
    "description": "Comma-separated list of languages that should force sentence-by-sentence splitting and reduce the hard token limit. Default includes 'arabic'.",
```
"and reduce the hard token limit" is unclear verbiage. See the earlier discussion in nllb_translation_component for further questions. (edit: oops, Reviewable does not put comments in chronological order)
gonzalezjo
left a comment
There was a problem hiding this comment.
I've just learned how to publish comments on reviewable without marking a file as reviewed. Cool. I'm not done with these files but I'll leave what I have so far.
@gonzalezjo made 3 comments.
Reviewable status: 9 of 12 files reviewed, 14 unresolved discussions (waiting on hhuangMITRE and jrobble).
python/NllbTranslation/tests/test_nllb_translation.py line 518 at r4 (raw file):
```python
test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'SENTENCE'
test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'GUESS'
pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable to pour joy into the soul and send to the countenances the reflection of them; They imagine themselves fatally haunted by spleen, hopelessly gloomy and sullen, as if at every moment they were emerging from the underground galleries of a pit-coal mine, Our British allies. How they deceive themselves or how they intend to deceive us! Is this an illusion or bad faith, against which there is much to be lamented in vain the indelevel and accentuated expression of beatitude, which shines through the illuminated faces of the men from beyond the Channel, who seem to walk among us, wrapped in a dense atmosphere of perennial contentment, satisfied with the world, satisfied with men and, very especially, satisfied with themselves? Yes , please ."
```
Please leave a short comment somewhere explaining the spacing around the comma and period in "Yes , please ." If that someone is anyone like me, it's the first thing that'll raise the eyebrows of someone scanning the tests.
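For what it's worth, spacing like that typically appears when word-level tokens are rejoined with plain spaces, so punctuation stays a separate token. A minimal illustration (this is an assumption about the cause, not the component's actual code):

```python
# Joining token-level output with spaces reproduces the "Yes , please ." spacing,
# because "," and "." are separate tokens rather than being attached to words.
tokens = ["Yes", ",", "please", "."]
joined = " ".join(tokens)
print(joined)  # Yes , please .
```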
python/NllbTranslation/tests/test_nllb_translation.py line 521 at r4 (raw file):
```python
job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track)
result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job)
self.assertEqual(pt_text_translation, result_props["TRANSLATION"])
```
This is technically dead code as-is :) Because result_props wasn't updated in the line above (only result_track was updated), and because pt_text_translation has the same value as it did in the prior test, this assertion is a no-op.
But there's some good news: the test passes even after fixing that.
I think this should demonstrate the behavior that perplexed me regarding difficult language token limits: the last words appear to get chopped off here. I think it'd be good to try to get a test with this code included, and to try to get it to pass.

```python
base_job_props: dict[str, str] = dict(self.defaultProps)
base_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'spa'
base_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn'
base_job_props['TARGET_LANGUAGE'] = 'fra'
base_job_props['TARGET_SCRIPT'] = 'Latn'
base_job_props['USE_NLLB_TOKEN_LENGTH'] = 'TRUE'
base_job_props['NLLB_TRANSLATION_TOKEN_LIMIT'] = '200'
base_job_props['DIFFICULT_LANGUAGE_TOKEN_LIMIT'] = '50'

text = (
    'Para la cena, o más bien para la comida nocturna, tomé pollo preparado '
    'de algún modo con pimiento rojo, que estaba muy sabroso, pero me dio '
    'mucha sed.'
)

normal_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
normal_props = dict(base_job_props)
normal_props['PROCESS_DIFFICULT_LANGUAGES'] = 'disabled'
normal_job = mpf.GenericJob('Test Generic', 'test.pdf', normal_props, {}, normal_ff_track)
normal_result = self.component.get_detections_from_generic(normal_job)[0]
normal_translation = normal_result.detection_properties["TRANSLATION"]

source_token_count = len(self.component._tokenizer(text)["input_ids"])
normal_target_token_count = len(self.component._tokenizer(normal_translation)["input_ids"])

difficult_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
difficult_props = dict(base_job_props)
difficult_props['PROCESS_DIFFICULT_LANGUAGES'] = 'spa'
difficult_job = mpf.GenericJob('Test Generic', 'test.pdf', difficult_props, {}, difficult_ff_track)
difficult_result = self.component.get_detections_from_generic(difficult_job)[0]
difficult_translation = difficult_result.detection_properties["TRANSLATION"]

self.assertLessEqual(source_token_count, 50)
self.assertGreater(normal_target_token_count, 50)
self.assertEqual(normal_translation, difficult_translation)
```

Relatedly, the use of a hard limit instead of a soft limit means if you use (perhaps stupidly...?) low difficult language token limits (e.g. setting it to 3) you'll see output with a lot of dropped words and oddities in which words are split mid-word, spaces are inserted randomly, etc. It's an easy way to see the problem and test fixes, even if, obviously, people should not be using a limit of 3 in the real world.
Issues:
Related PRs:
Summary:
This PR updates the NLLB translation component to take advantage of the new nlp_text_splitter capabilities developed in PR #93. The original component could only split text using a hard character limit.
The PR version can now divide text by tokens (which aligns better with NLLB's token limits), apply a preferred soft token limit, and propagate sentence-splitting and newline-handling options.
It also improves text segmentation behavior by using an active token limit with soft and hard cutoffs rather than a hard character limit. Finally, it adds optional difficult-language handling (currently defaulting to Arabic) and changes the default sentence segmentation model from wtp-bert-mini to sat-3l-sm.
Reasons for the changes and default settings, based on brief testing of the NLLB capability:
Analysis of NLLB Results
From `NLLB Token Length Investigation.xlsx`, the team investigated how the NLLB translation capability handles text translations for small to very large chunks of text. Overall, our findings are:
- Most languages have a breaking point around 120-140 tokens. Translations past this point start to lose or forget parts of the text being translated.
- Likewise, most languages benefit from additional context clues and from sections of text being submitted together, which helps provide sufficient translation context. _Dracula_ was actually mistranslated in one test when the word was presented on its own, without additional newlines or whitespace to indicate that it is the title of the story.
- Arabic, out of the languages tested, performed more poorly than the other languages. This may be due to a mismatch in the submission text (it could be a different variant of Arabic), so more testing is warranted.
- As a precaution, we also tested Hebrew, which did not display the same number of issues as Arabic and rendered correctly in most text displays (no special directional characters were present).