
Add option to TextSplitter to return individual sentences. Adding general SaT model support.#408

Open
hhuangMITRE wants to merge 28 commits into develop from
feature/nlp-text-splitter-sentence-mode-sat-model-update

Conversation

Contributor

@hhuangMITRE hhuangMITRE commented Sep 23, 2025

Issues:

Related PRs:

Summary:

This PR updates the NLLB translation component to take advantage of the new nlp_text_splitter capabilities developed in PR #93. The original component could only split text using a hard character limit.

The PR version can now divide text by tokens (which aligns better with NLLB's token limits), apply a preferred soft token limit, and propagate sentence-splitting and newline-handling options.

It also improves text segmentation by using an active token limit with soft and hard cutoffs rather than a hard character limit. Finally, it adds optional difficult-language handling (currently for Arabic) and changes the default sentence segmentation model from wtp-bert-mini to sat-3l-sm.
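The soft/hard cutoff behavior can be sketched roughly as follows. This is a minimal illustration using a whitespace "tokenizer"; the function name and defaults are ours for illustration, not the component's real API:

```python
def split_by_token_limits(sentences, soft_limit=130, hard_limit=180,
                          tokenize=str.split):
    """Greedily pack sentences into chunks: prefer to close a chunk at a
    sentence boundary once the soft limit would be exceeded, and force a
    mid-sentence break only when a single sentence exceeds the hard limit."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Oversized sentence: flush what we have, then hard-split it.
        while len(tokens) > hard_limit:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(tokens[:hard_limit]))
            tokens = tokens[hard_limit:]
        # Soft limit: close the current chunk at this sentence boundary.
        if current and count + len(tokens) > soft_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        if tokens:
            current.append(" ".join(tokens))
            count += len(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The key difference from a hard character limit is that the soft limit only ever breaks at sentence boundaries, so no text is dropped mid-sentence unless a single sentence exceeds the hard limit.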

Reasons for changes and default settings based on brief testing of NLLB capability:

Analysis of NLLB Results

Using NLLB Token Length Investigation.xlsx, the team investigated how the NLLB translation capability handles translations of small to very large chunks of text.

Overall our findings are:

  1. Most languages have a breaking point occurring around 120-140 tokens. Translations after this point start to lose or forget parts of the text being translated.

  2. Likewise, most languages also benefit from related sections of text being submitted together, which gives the model sufficient context clues for translation.

    • For instance, the term Dracula was actually mistranslated in one test when the word was presented on its own and not with additional newlines or whitespace to indicate it is a title for the story.
    • Overall, it seems combining a few short sentences together, with a limit around 130 tokens, ensures effective translation accuracy rates.
  3. Of the languages tested, Arabic performed the worst. This may be due to a mismatch in the submission text (it could be a different variant of Arabic), so more testing is warranted.

    • In the meantime it was observed that Arabic translations improved greatly with smaller chunks of text. Thus we added a separate translation soft limit for difficult languages (Arabic).
  4. As a precaution, we also tested Hebrew, which did not seem to display the same number of issues as Arabic.

    • We also examined the Hebrew and Arabic inputs for reversed character directions that might confuse the translator. Instead, we found that the text was simply rendered differently
      in most text displays, and that no special directional characters were present.
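Points 3 and 4 amount to choosing a smaller soft limit per language. A toy sketch of that selection (the defaults mirror the ~130-token finding and the 50-token difficult-language limit mentioned in this PR, but the function itself is illustrative, not the component's code):

```python
def soft_token_limit(language: str,
                     difficult_languages=frozenset({"arabic"}),
                     default_limit: int = 130,
                     difficult_limit: int = 50) -> int:
    """Return the per-language soft token limit: ~130 tokens works well for
    most languages, while flagged "difficult" languages (Arabic, in our
    testing) translate better with smaller chunks."""
    if language.lower() in difficult_languages:
        return difficult_limit
    return default_limit
```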


@hhuangMITRE hhuangMITRE requested a review from jrobble September 23, 2025 20:25
@hhuangMITRE hhuangMITRE self-assigned this Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Adding SaT model support. Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025
Member

@jrobble jrobble left a comment


@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


a discussion (no related file):
Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text into chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.


python/AzureTranslation/README.md line 108 at r1 (raw file):

  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of
  model names
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The

This link only lists SaT. Also include a link to an older release for WTP models.


python/AzureTranslation/README.md line 112 at r1 (raw file):

  Review list of languages supported by SaT/WtP
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).

I think this link is only for SaT. Also include a link to an older release for WTP model languages.

Member

jrobble commented Sep 26, 2025

Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face 

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

Contributor Author

@hhuangMITRE hhuangMITRE left a comment


Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text into chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

Done.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.

Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face 

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

This is mainly due to the text splitter not recognizing newlines, which we've fixed in the most recent update.
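For reference, the gist of a newline fix like this can be sketched as follows. This is a hedged illustration, not the component's actual code (the real behavior is controlled by the SENTENCE_SPLITTER_NEWLINE_BEHAVIOR property): blank lines are kept as paragraph breaks, while single newlines inside a paragraph are assumed to be line wrapping and are joined before the sentence model runs.

```python
import re

def normalize_wrapped_text(text: str) -> list[str]:
    """Split on blank lines (paragraph breaks), then collapse the remaining
    single newlines, assumed to be line wrapping, into spaces."""
    paragraphs = re.split(r"\n\s*\n+", text.strip())
    return [re.sub(r"\s*\n\s*", " ", p) for p in paragraphs]
```

On the Portuguese sample above, this would rejoin each hard-wrapped paragraph into one string, so the splitter sees sentences instead of individual wrapped lines.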


python/AzureTranslation/README.md line 108 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This link only lists SaT. Also include a link to an older release for WTP models.

Done!


python/AzureTranslation/README.md line 112 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think this link is only for SaT. Also include a link to an older release for WTP model languages.

Done!

@jrobble jrobble moved this from To do to In Progress in OpenMPF: Development Jan 26, 2026
@hhuangMITRE hhuangMITRE requested a review from gonzalezjo March 3, 2026 17:43

@gonzalezjo gonzalezjo left a comment


I'm still not done reviewing this, but I wanted to at least submit some feedback before we next talked.

@gonzalezjo reviewed 10 files and made 8 comments.
Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on hhuangMITRE and jrobble).


python/AzureTranslation/LICENSE line 27 at r4 (raw file):


The WtP, "Where the Point", and SaT, "Segment any Text" sentence segmentation
library falls under the MIT License:

It probably isn't super important to change "library" to "libraries" here, but I'll note that maybe it should be plural? I'm not a lawyer, or even remotely qualified to evaluate license changes, so I'd want someone else to confirm that this is fine.


python/NllbTranslation/LICENSE (raw file):

I cannot be responsible for vetting anything legal. But LGTM if it looks good to someone who can be responsible.


python/NllbTranslation/README.md line 112 at r4 (raw file):

  - When active, the `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT` is replaced by a more aggressive `DIFFICULT_LANGUAGE_TOKEN_LIMIT`.

- `DIFFICULT_LANGUAGE_TOKEN_LIMIT`: Token size for translation chunks when processing languages specified in `PROCESS_DIFFICULT_LANGUAGES`.

Hmm. Off the top of my head, I don't have any suggestions for how this could or should be written, and I hate to leave criticism that isn't constructive, so I apologize. But I don't think "token size" is the right terminology, and I'm not sure this really articulates what PROCESS_DIFFICULT_LANGUAGES (and the count of 50 tokens) actually does, which I think we should try to document better. Its behavior seems counterintuitive to me, and maybe buggy? I elaborate on this in another comment, but we can discuss it separately over a call if you want.

(See later discussion in nllb_translation_component for further questions/comments.)


python/NllbTranslation/README.md line 139 at r4 (raw file):

  component will always default to CPU processing.

- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WTP models require a language code.

Maybe use "language identifier" since that's the terminology we use elsewhere; a natural follow-up question when reading this is "what are the language codes?"


python/NllbTranslation/README.md line 144 at r4 (raw file):

# Language Identifiers
The following are the ISO 639-3 and ISO 15924 codes, and their corresponding languages which Nllb can translate.

Super nitpicky (sorry 😓), but if you're going to change this file again: Nllb -> NLLB. Not blocking, though.


python/NllbTranslation/nllb_component/nllb_translation_component.py line 215 at r4 (raw file):

            # Difficult-language override (optional): clamp token hard limit if configured.
            if config.use_token_length and _is_difficult_language(config.translate_from_language, difficult_set):
                diff_limit = config.difficult_language_token_limit

I don't know that there's a bug here and if there is, it doesn't seem to be caught by any tests, but I think that's because we don't test this case. So:

  • (unless I'm crazy): Could you please add a test for this?

  • (again, unless I'm crazy): I think this is buggy or our spec is wrong? The behavior here is very counterintuitive to me, but:

  • We say (in both the README and this folder's associated descriptor.json) that we override NLLB_TRANSLATION_TOKEN_SOFT_LIMIT. But it seems like we're clamping the hard limit? I made a really dumb local test and I can reproducibly cause this to "lose" text, but I may be using the API wrong.

  • The descriptor.json also says we force sentence-by-sentence splitting for difficult languages. I may be missing something (sorry if so!) but I'm not able to find where that happens.


python/NllbTranslation/nllb_component/nllb_utils.py line 438 at r4 (raw file):

    'zul' : 'zul_Latn'	# Zulu
    }

Nice. Extremely satisfying. I'm not 100% sure it needs to be part of this PR instead of a follow-on simplification PR, but it doesn't break anything and it's awesome seeing this much code simplified away.


python/NllbTranslation/plugin-files/descriptor/descriptor.json line 139 at r4 (raw file):

        {
          "name": "PROCESS_DIFFICULT_LANGUAGES",
          "description": "Comma-separated list of languages that should force sentence-by-sentence splitting and reduce the hard token limit. Default includes 'arabic'.",

"and reduce the hard token limit" is unclear verbiage. See the later discussion in nllb_translation_component for further questions. (edit: oops, Reviewable does not put comments in chronological order)


@gonzalezjo gonzalezjo left a comment


I've just learned how to publish comments on reviewable without marking a file as reviewed. Cool. I'm not done with these files but I'll leave what I have so far.

@gonzalezjo made 3 comments.
Reviewable status: 9 of 12 files reviewed, 14 unresolved discussions (waiting on hhuangMITRE and jrobble).


python/NllbTranslation/tests/test_nllb_translation.py line 518 at r4 (raw file):

        test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'SENTENCE'
        test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'GUESS'
        pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable to pour joy into the soul and send to the countenances the reflection of them; They imagine themselves fatally haunted by spleen, hopelessly gloomy and sullen, as if at every moment they were emerging from the underground galleries of a pit-coal mine, Our British allies. How they deceive themselves or how they intend to deceive us! Is this an illusion or bad faith, against which there is much to be lamented in vain the indelevel and accentuated expression of beatitude, which shines through the illuminated faces of the men from beyond the Channel, who seem to walk among us, wrapped in a dense atmosphere of perennial contentment, satisfied with the world, satisfied with men and, very especially, satisfied with themselves? Yes , please ."

Please leave a short comment somewhere explaining the spacing around the comma and period in "Yes , please ." It's the first thing that will raise the eyebrows of anyone scanning the tests.


python/NllbTranslation/tests/test_nllb_translation.py line 521 at r4 (raw file):

        job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track)
        result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job)
        self.assertEqual(pt_text_translation, result_props["TRANSLATION"])

This is technically dead code as-is :) Because result_props wasn't updated in the line above (only result_track was), and because pt_text_translation has the same value as in the prior test, this assertion is a no-op.

But there's some good news: the test passes even after fixing that.


gonzalezjo commented Apr 13, 2026

I think this should demonstrate the behavior that perplexed me regarding difficult language token limits: the last words appear to get chopped off here.

I think it'd be good to try to get a test with this code included, and to try to get it to pass.

base_job_props: dict[str, str] = dict(self.defaultProps)
base_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'spa'
base_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn'
base_job_props['TARGET_LANGUAGE'] = 'fra'
base_job_props['TARGET_SCRIPT'] = 'Latn'
base_job_props['USE_NLLB_TOKEN_LENGTH'] = 'TRUE'
base_job_props['NLLB_TRANSLATION_TOKEN_LIMIT'] = '200'
base_job_props['DIFFICULT_LANGUAGE_TOKEN_LIMIT'] = '50'

text = (
    'Para la cena, o más bien para la comida nocturna, tomé pollo preparado '
    'de algún modo con pimiento rojo, que estaba muy sabroso, pero me dio '
    'mucha sed.'
)

normal_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
normal_props = dict(base_job_props)
normal_props['PROCESS_DIFFICULT_LANGUAGES'] = 'disabled'
normal_job = mpf.GenericJob('Test Generic', 'test.pdf', normal_props, {}, normal_ff_track)
normal_result = self.component.get_detections_from_generic(normal_job)[0]
normal_translation = normal_result.detection_properties["TRANSLATION"]

source_token_count = len(self.component._tokenizer(text)["input_ids"])
normal_target_token_count = len(self.component._tokenizer(normal_translation)["input_ids"])

difficult_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
difficult_props = dict(base_job_props)
difficult_props['PROCESS_DIFFICULT_LANGUAGES'] = 'spa'
difficult_job = mpf.GenericJob('Test Generic', 'test.pdf', difficult_props, {}, difficult_ff_track)
difficult_result = self.component.get_detections_from_generic(difficult_job)[0]
difficult_translation = difficult_result.detection_properties["TRANSLATION"]

self.assertLessEqual(source_token_count, 50)
self.assertGreater(normal_target_token_count, 50)
self.assertEqual(normal_translation, difficult_translation)

Relatedly, because a hard limit is used instead of a soft limit, setting a (perhaps unreasonably) low difficult-language token limit (e.g. 3) produces output with many dropped words and oddities: words split mid-word, spaces inserted randomly, and so on. It's an easy way to see the problem and test fixes, even though, obviously, no one should use a limit of 3 in the real world.
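A toy demonstration of the difference (whitespace "tokens" and illustrative names, not the component's API): a hard per-chunk limit silently discards everything past the limit, while a soft limit only influences where chunks end and keeps every token.

```python
def hard_truncate(text: str, limit: int) -> str:
    """A hard limit applied to one chunk: tokens past `limit` are simply lost."""
    return " ".join(text.split()[:limit])

def soft_chunks(text: str, limit: int) -> list[str]:
    """A soft limit only decides where to break; no tokens are dropped."""
    tokens = text.split()
    return [" ".join(tokens[i:i + limit])
            for i in range(0, len(tokens), limit)]
```

With `limit=4` on a six-word sentence, `hard_truncate` drops the last two words, while `soft_chunks` returns two chunks that together preserve all of them.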
