
Add option to TextSplitter to return individual sentences. Adding general SaT model support.#408

Open
hhuangMITRE wants to merge 28 commits into develop from
feature/nlp-text-splitter-sentence-mode-sat-model-update

Conversation

Contributor

@hhuangMITRE hhuangMITRE commented Sep 23, 2025

Issues:

Related PRs:

Summary:

This PR updates the NLLB translation component to take advantage of the new nlp_text_splitter capabilities developed in PR #93. The original component could only split text using a hard character limit.

The PR version can now divide text by tokens (which aligns better with NLLB's token limits), apply a preferred soft token limit, and propagate sentence-splitting and newline-handling options.

It also improves text segmentation by using an active token limit with soft and hard cutoffs rather than a hard character limit. Finally, it adds optional difficult-language handling (currently for Arabic) and changes the default sentence segmentation model from wtp-bert-mini to sat-3l-sm.
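The soft/hard cutoff behavior can be sketched roughly as follows. This is a minimal illustration using a whitespace "tokenizer"; the function name and defaults are ours for illustration, not the component's real API:

```python
def split_by_token_limits(sentences, soft_limit=130, hard_limit=180,
                          tokenize=str.split):
    """Greedily pack sentences into chunks: prefer to close a chunk at a
    sentence boundary once the soft limit would be exceeded, and force a
    mid-sentence break only when a single sentence exceeds the hard limit."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Oversized sentence: flush what we have, then hard-split it.
        while len(tokens) > hard_limit:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(tokens[:hard_limit]))
            tokens = tokens[hard_limit:]
        # Soft limit: close the current chunk at this sentence boundary.
        if current and count + len(tokens) > soft_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        if tokens:
            current.append(" ".join(tokens))
            count += len(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The key difference from a hard character limit is that the soft limit only ever breaks at sentence boundaries, so no text is dropped mid-sentence unless a single sentence exceeds the hard limit.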

Reasons for changes and default settings based on brief testing of NLLB capability:

Analysis of NLLB Results

Using NLLB Token Length Investigation.xlsx, the team investigated how the NLLB translation capability handles translations of small to very large chunks of text.

Overall our findings are:

  1. Most languages have a breaking point occurring around 120-140 tokens. Translations after this point start to lose or forget parts of the text being translated.

  2. Likewise, most languages also benefit from related sections of text being submitted together, which gives the model sufficient context clues for translation.

    • For instance, the term Dracula was actually mistranslated in one test when the word was presented on its own and not with additional newlines or whitespace to indicate it is a title for the story.
    • Overall, it seems combining a few short sentences together, with a limit around 130 tokens, ensures effective translation accuracy rates.
  3. Of the languages tested, Arabic performed the worst. This may be due to a mismatch in the submission text (it could be a different variant of Arabic), so more testing is warranted.

    • In the meantime it was observed that Arabic translations improved greatly with smaller chunks of text. Thus we added a separate translation soft limit for difficult languages (Arabic).
  4. As a precaution, we also tested Hebrew, which did not seem to display the same number of issues as Arabic.

    • We also examined the Hebrew and Arabic inputs for reversed character directions that might confuse the translator. Instead, we found that the text was simply rendered differently
      in most text displays, and that no special directional characters were present.
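Points 3 and 4 amount to choosing a smaller soft limit per language. A toy sketch of that selection (the defaults mirror the ~130-token finding and the 50-token difficult-language limit mentioned in this PR, but the function itself is illustrative, not the component's code):

```python
def soft_token_limit(language: str,
                     difficult_languages=frozenset({"arabic"}),
                     default_limit: int = 130,
                     difficult_limit: int = 50) -> int:
    """Return the per-language soft token limit: ~130 tokens works well for
    most languages, while flagged "difficult" languages (Arabic, in our
    testing) translate better with smaller chunks."""
    if language.lower() in difficult_languages:
        return difficult_limit
    return default_limit
```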


@hhuangMITRE hhuangMITRE requested a review from jrobble September 23, 2025 20:25
@hhuangMITRE hhuangMITRE self-assigned this Sep 23, 2025
@hhuangMITRE hhuangMITRE changed the title Add option to TextSplitter to return individual sentences. Adding SaT model support. Add option to TextSplitter to return individual sentences. Adding general SaT model support. Sep 23, 2025
Member

@jrobble jrobble left a comment


@jrobble reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @hhuangMITRE)


a discussion (no related file):
Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text into chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

a discussion (no related file):
Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.


python/AzureTranslation/README.md line 108 at r1 (raw file):

  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available. See list of
  model names
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The

This link only lists SaT. Also include a link to an older release for WTP models.


python/AzureTranslation/README.md line 112 at r1 (raw file):

  Review list of languages supported by SaT/WtP
  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).

I think this link is only for SaT. Also include a link to an older release for WTP model languages.

Member

jrobble commented Sep 26, 2025

Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face 

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

Contributor Author

@hhuangMITRE hhuangMITRE left a comment


Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @hhuangMITRE and @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Mention SaT here:

    def split_input_text(self, text: str, from_lang: Optional[str],
                         from_lang_confidence: Optional[float]) -> SplitTextResult:
        """
        Splits up the given text into chunks that are under TranslationClient.DETECT_MAX_CHARS.
        Each chunk will contain one or more complete sentences as reported
        by the (WtP or spaCy) sentence splitter.
        """

Mention SaT here:


class SentenceSplitter:
    """
    Class to divide large sections of text at sentence breaks using WtP and spaCy.
    It is only used when the text to translate exceeds
    the translation endpoint's character limit.
    """

Done.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Once the NLLB component lands, include it in this PR. It will need to be updated to mention SaT.

Updated NLLB with new tests as well. Also, I didn't see a LICENSE file so I did my best to add one in.


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Via separate chat I mentioned that when using single-sentence splitting with NLLB with wtp-bert-mini it takes this:

pt_text="""Teimam de facto estes em que são indispensaveis os vividos raios do
nosso desanuviado sol, ou a face desassombrada da lua no firmamento
peninsular, onde não tem, como a de Londres--_a romper a custo um
plumbeo céo_--para verterem alegrias na alma e mandarem aos semblantes o
reflexo d'ellas; imaginam fatalmente perseguidos de _spleen_,
irremediavelmente lugubres e soturnos, como se a cada momento saíssem
das galerias subterraneas de uma mina de _pit-coul_, os nossos alliados
inglezes.

Como se enganam ou como pretendem enganar-nos!

É esta uma illusão ou má fé, contra a qual ha muito reclama debalde a
indelevel e accentuada expressão de beatitude, que transluz no rosto
illuminado dos homens de além da Mancha, os quaes parece caminharem
entre nós, envolvidos em densa atmosphera de perenne contentamento,
satisfeitos do mundo, satisfeitos dos homens e, muito especialmente,
satisfeitos de si.
"""

and breaks it down into individual words:

#85 128.8 INFO:nlp_text_splitter:Setup WtP model: wtp-bert-mini
#85 128.8 INFO:NllbTranslationComponent:Text to translate is larger than the 360 character limit, splitting into smaller sentences.
#85 129.1 INFO:NllbTranslationComponent:Input text split into 86 sentences.
#85 129.1 INFO:NllbTranslationComponent:Translating sentences...
#85 131.6 DEBUG:NllbTranslationComponent:Translated:
#85 131.6 Teimam
#85 131.6 to:
#85 131.6 They 're scared .
#85 133.2 DEBUG:NllbTranslationComponent:Translated:
#85 133.2 de
#85 133.2 to:
#85 133.2 of
#85 134.8 DEBUG:NllbTranslationComponent:Translated:
#85 134.8 facto
#85 134.8 to:
#85 134.8 fact
#85 136.3 DEBUG:NllbTranslationComponent:Translated:
#85 136.3 estes
#85 136.3 to:
#85 136.3 these
#85 137.8 DEBUG:NllbTranslationComponent:Translated:
#85 137.8 em
#85 137.8 to:
#85 137.8 in
#85 139.4 DEBUG:NllbTranslationComponent:Translated:
#85 139.4 que
#85 139.4 to:
#85 139.4 than
#85 140.9 DEBUG:NllbTranslationComponent:Translated:
#85 140.9 são
#85 140.9 to:
#85 140.9 are
#85 142.5 DEBUG:NllbTranslationComponent:Translated:
#85 142.5 indispensaveis
#85 142.5 to:
#85 142.5 The Commission
#85 144.0 DEBUG:NllbTranslationComponent:Translated:
#85 144.0 os
#85 144.0 to:
#85 144.0 the
#85 145.6 DEBUG:NllbTranslationComponent:Translated:
#85 145.6 vividos
#85 145.6 to:
#85 145.6 lived
#85 149.7 DEBUG:NllbTranslationComponent:Translated:
#85 149.7 raios do
#85 149.7 nosso desanuviado
#85 149.7 to:
#85 149.7 The lightning of our desnuviado .
#85 151.2 DEBUG:NllbTranslationComponent:Translated:
#85 151.2 sol,
#85 151.2 to:
#85 151.2 
#85 153.0 DEBUG:NllbTranslationComponent:Translated:
#85 153.0 ou a
#85 153.0 to:
#85 153.0 or a
#85 154.6 DEBUG:NllbTranslationComponent:Translated:
#85 154.6 face
#85 154.6 to:
#85 154.6 face 

Try using SaT. Determine if this behavior is a result of our text splitter logic or the model itself.

This is mainly due to the text splitter not recognizing newlines, which we've fixed in the most recent update.
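For reference, the gist of a newline fix like this can be sketched as follows. This is a hedged illustration, not the component's actual code (the real behavior is controlled by the SENTENCE_SPLITTER_NEWLINE_BEHAVIOR property): blank lines are kept as paragraph breaks, while single newlines inside a paragraph are assumed to be line wrapping and are joined before the sentence model runs.

```python
import re

def normalize_wrapped_text(text: str) -> list[str]:
    """Split on blank lines (paragraph breaks), then collapse the remaining
    single newlines, assumed to be line wrapping, into spaces."""
    paragraphs = re.split(r"\n\s*\n+", text.strip())
    return [re.sub(r"\s*\n\s*", " ", p) for p in paragraphs]
```

On the Portuguese sample above, this would rejoin each hard-wrapped paragraph into one string, so the splitter sees sentences instead of individual wrapped lines.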


python/AzureTranslation/README.md line 108 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This link only lists SaT. Also include a link to an older release for WTP models.

Done!


python/AzureTranslation/README.md line 112 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

I think this link is only for SaT. Also include a link to an older release for WTP model languages.

Done!

@jrobble jrobble moved this from To do to In Progress in OpenMPF: Development Jan 26, 2026
@hhuangMITRE hhuangMITRE requested a review from gonzalezjo March 3, 2026 17:43

@gonzalezjo gonzalezjo left a comment


I'm still not done reviewing this, but I wanted to at least submit some feedback before we next talked.

@gonzalezjo reviewed 10 files and made 8 comments.
Reviewable status: 10 of 12 files reviewed, 12 unresolved discussions (waiting on hhuangMITRE and jrobble).


python/AzureTranslation/LICENSE line 27 at r4 (raw file):


The WtP, "Where the Point", and SaT, "Segment any Text" sentence segmentation
library falls under the MIT License:

It probably isn't super important to change "library" to "libraries" here, but I'll note that maybe it should be plural? I'm not a lawyer, or even remotely qualified to evaluate license changes, so I'd want someone else to confirm that this is fine.


python/NllbTranslation/LICENSE (raw file):

I cannot be responsible for vetting anything legal. But LGTM if it looks good to someone who can be responsible.


python/NllbTranslation/README.md line 112 at r4 (raw file):

  - When active, the `NLLB_TRANSLATION_TOKEN_SOFT_LIMIT` is replaced by a more aggressive `DIFFICULT_LANGUAGE_TOKEN_LIMIT`.

- `DIFFICULT_LANGUAGE_TOKEN_LIMIT`: Token size for translation chunks when processing languages specified in `PROCESS_DIFFICULT_LANGUAGES`.

Hmm. Off the top of my head, I don't have any suggestions for how this could or should be written, and I hate to leave criticism that isn't constructive, so I apologize. But I don't think "token size" is the right terminology, and I'm not sure this really articulates what PROCESS_DIFFICULT_LANGUAGES (and the count of 50 tokens) actually does, which I think we should try to document better. Its behavior seems counterintuitive to me, and maybe buggy? I elaborate on this in another comment, but we can discuss it separately over a call if you want.

(See later discussion in nllb_translation_component for further questions/comments.)


python/NllbTranslation/README.md line 139 at r4 (raw file):

  component will always default to CPU processing.

- `SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE`: More advanced WTP models require a language code.

Maybe use "language identifier" since that's the terminology we use elsewhere; a natural follow-up question when reading this is "what are the language codes?"


python/NllbTranslation/README.md line 144 at r4 (raw file):

# Language Identifiers
The following are the ISO 639-3 and ISO 15924 codes, and their corresponding languages which Nllb can translate.

Super nitpicky (sorry 😓), but if you're going to change this file again: Nllb -> NLLB. Not blocking, though.


python/NllbTranslation/nllb_component/nllb_translation_component.py line 215 at r4 (raw file):

            # Difficult-language override (optional): clamp token hard limit if configured.
            if config.use_token_length and _is_difficult_language(config.translate_from_language, difficult_set):
                diff_limit = config.difficult_language_token_limit

I don't know that there's a bug here and if there is, it doesn't seem to be caught by any tests, but I think that's because we don't test this case. So:

  • (unless I'm crazy): Could you please add a test for this?

  • (again, unless I'm crazy): I think this is buggy or our spec is wrong? The behavior here is very counterintuitive to me, but:

  • We say (in both the README and this folder's associated descriptor.json) that we override NLLB_TRANSLATION_TOKEN_SOFT_LIMIT. But it seems like we're clamping the hard limit? I made a really dumb local test and I can reproducibly cause this to "lose" text, but I may be using the API wrong.

  • The descriptor.json also says we force sentence-by-sentence splitting for difficult languages. I may be missing something (sorry if so!) but I'm not able to find where that happens.


python/NllbTranslation/nllb_component/nllb_utils.py line 438 at r4 (raw file):

    'zul' : 'zul_Latn'	# Zulu
    }

Nice. Extremely satisfying. I'm not 100% sure it needs to be part of this PR instead of a follow-on simplification PR, but it doesn't break anything and it's awesome seeing this much code simplified away.


python/NllbTranslation/plugin-files/descriptor/descriptor.json line 139 at r4 (raw file):

        {
          "name": "PROCESS_DIFFICULT_LANGUAGES",
          "description": "Comma-separated list of languages that should force sentence-by-sentence splitting and reduce the hard token limit. Default includes 'arabic'.",

"and reduce the hard token limit" is unclear verbiage. See the later discussion in nllb_translation_component for further questions. (edit: oops, Reviewable does not put comments in chronological order)


@gonzalezjo gonzalezjo left a comment


I've just learned how to publish comments on reviewable without marking a file as reviewed. Cool. I'm not done with these files but I'll leave what I have so far.

@gonzalezjo made 3 comments.
Reviewable status: 9 of 12 files reviewed, 14 unresolved discussions (waiting on hhuangMITRE and jrobble).


python/NllbTranslation/tests/test_nllb_translation.py line 518 at r4 (raw file):

        test_generic_job_props['SENTENCE_SPLITTER_MODE'] = 'SENTENCE'
        test_generic_job_props['SENTENCE_SPLITTER_NEWLINE_BEHAVIOR'] = 'GUESS'
        pt_text_translation = "They fear, indeed, those in whom the vivid rays of our unblinking sun, or the unclouded face of the moon in the peninsular firmament, where it has not, like that of London--to break at the cost of a plumbeo heaven--are indispensable to pour joy into the soul and send to the countenances the reflection of them; They imagine themselves fatally haunted by spleen, hopelessly gloomy and sullen, as if at every moment they were emerging from the underground galleries of a pit-coal mine, Our British allies. How they deceive themselves or how they intend to deceive us! Is this an illusion or bad faith, against which there is much to be lamented in vain the indelevel and accentuated expression of beatitude, which shines through the illuminated faces of the men from beyond the Channel, who seem to walk among us, wrapped in a dense atmosphere of perennial contentment, satisfied with the world, satisfied with men and, very especially, satisfied with themselves? Yes , please ."

Please leave a short comment somewhere explaining the spacing around the comma and period in "Yes , please ." It's the first thing that will raise the eyebrows of anyone scanning the tests.


python/NllbTranslation/tests/test_nllb_translation.py line 521 at r4 (raw file):

        job = mpf.GenericJob('Test Generic', 'test.pdf', test_generic_job_props, {}, ff_track)
        result_track: Sequence[mpf.GenericTrack] = self.component.get_detections_from_generic(job)
        self.assertEqual(pt_text_translation, result_props["TRANSLATION"])

This is technically dead code as-is :) Because result_props wasn't updated in the line above (only result_track was), and because pt_text_translation has the same value as in the prior test, this assertion is a no-op.

But there's some good news: the test passes even after fixing that.


gonzalezjo commented Apr 13, 2026

I think this should demonstrate the behavior that perplexed me regarding difficult language token limits: the last words appear to get chopped off here.

I think it'd be good to try to get a test with this code included, and to try to get it to pass.

base_job_props: dict[str, str] = dict(self.defaultProps)
base_job_props['DEFAULT_SOURCE_LANGUAGE'] = 'spa'
base_job_props['DEFAULT_SOURCE_SCRIPT'] = 'Latn'
base_job_props['TARGET_LANGUAGE'] = 'fra'
base_job_props['TARGET_SCRIPT'] = 'Latn'
base_job_props['USE_NLLB_TOKEN_LENGTH'] = 'TRUE'
base_job_props['NLLB_TRANSLATION_TOKEN_LIMIT'] = '200'
base_job_props['DIFFICULT_LANGUAGE_TOKEN_LIMIT'] = '50'

text = (
    'Para la cena, o más bien para la comida nocturna, tomé pollo preparado '
    'de algún modo con pimiento rojo, que estaba muy sabroso, pero me dio '
    'mucha sed.'
)

normal_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
normal_props = dict(base_job_props)
normal_props['PROCESS_DIFFICULT_LANGUAGES'] = 'disabled'
normal_job = mpf.GenericJob('Test Generic', 'test.pdf', normal_props, {}, normal_ff_track)
normal_result = self.component.get_detections_from_generic(normal_job)[0]
normal_translation = normal_result.detection_properties["TRANSLATION"]

source_token_count = len(self.component._tokenizer(text)["input_ids"])
normal_target_token_count = len(self.component._tokenizer(normal_translation)["input_ids"])

difficult_ff_track = mpf.GenericTrack(-1, dict(TEXT=text))
difficult_props = dict(base_job_props)
difficult_props['PROCESS_DIFFICULT_LANGUAGES'] = 'spa'
difficult_job = mpf.GenericJob('Test Generic', 'test.pdf', difficult_props, {}, difficult_ff_track)
difficult_result = self.component.get_detections_from_generic(difficult_job)[0]
difficult_translation = difficult_result.detection_properties["TRANSLATION"]

self.assertLessEqual(source_token_count, 50)
self.assertGreater(normal_target_token_count, 50)
self.assertEqual(normal_translation, difficult_translation)

Relatedly, because a hard limit is used instead of a soft limit, setting a (perhaps unreasonably) low difficult-language token limit (e.g. 3) produces output with many dropped words and oddities: words split mid-word, spaces inserted randomly, and so on. It's an easy way to see the problem and test fixes, even though, obviously, no one should use a limit of 3 in the real world.
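A toy demonstration of the difference (whitespace "tokens" and illustrative names, not the component's API): a hard per-chunk limit silently discards everything past the limit, while a soft limit only influences where chunks end and keeps every token.

```python
def hard_truncate(text: str, limit: int) -> str:
    """A hard limit applied to one chunk: tokens past `limit` are simply lost."""
    return " ".join(text.split()[:limit])

def soft_chunks(text: str, limit: int) -> list[str]:
    """A soft limit only decides where to break; no tokens are dropped."""
    tokens = text.split()
    return [" ".join(tokens[i:i + limit])
            for i in range(0, len(tokens), limit)]
```

With `limit=4` on a six-word sentence, `hard_truncate` drops the last two words, while `soft_chunks` returns two chunks that together preserve all of them.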
