openmpf · hhuangMITRE · Sep 23, 2025 · Oct 1, 2025 · Oct 14, 2025 · Oct 14, 2025
diff --git a/python/AzureTranslation/LICENSE b/python/AzureTranslation/LICENSE
@@ -19,15 +19,18 @@ is used in a deployment or embedded within another project, it is requested
 that you send an email to opensource@mitre.org in order to let us know where
 this software is being used.
 
+The nlp_text_splitter utility uses the following sentence detection libraries:
+
 *****************************************************************************
 
-The WtP, "Where the Point", sentence segmentation library falls under the MIT License:
+The WtP ("Where's the Point") and SaT ("Segment any Text") sentence segmentation
+library falls under the MIT License:
 
-https://github.com/bminixhofer/wtpsplit/blob/main/LICENSE
+https://github.com/segment-any-text/wtpsplit/blob/main/LICENSE
 
 MIT License
 
-Copyright (c) 2024 Benjamin Minixhofer
+Copyright (c) 2024 Benjamin Minixhofer, Markus Frohmann, Igor Sterner
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/python/AzureTranslation/README.md b/python/AzureTranslation/README.md
@@ -87,26 +87,36 @@ must be provided. Neither has a default value.
 The following settings control the behavior of dividing input text into acceptable chunks
 for processing.
 
-Through preliminary investigation, we identified the [WtP library ("Where's the
+Through preliminary investigation, we identified the [SaT/WtP library ("Segment any Text" / "Where's the
 Point")](https://github.com/bminixhofer/wtpsplit) and [spaCy's multilingual sentence
 detection model](https://spacy.io/models) for identifying sentence breaks
 in a large section of text.
 
-WtP models are trained to split up multilingual text by sentence without the need of an
+SaT/WtP models are trained to split up multilingual text by sentence without the need of an
 input language tag. The disadvantage is that the most accurate WtP models will need ~3.5
-GB of GPU memory. On the other hand, spaCy has a single multilingual sentence detection
+GB of GPU memory. SaT models are a more recent addition and are considered a more accurate
+set of sentence segmentation models; their resource costs are similar to those of WtP.
+
+On the other hand, spaCy has a single multilingual sentence detection
 that appears to work better for splitting up English text in certain cases, unfortunately
 this model lacks support handling for Chinese punctuation.
 
-- `SENTENCE_MODEL`: Specifies the desired WtP or spaCy sentence detection model. For CPU
-  and runtime considerations, the author of WtP recommends using `wtp-bert-mini`. More
-  advanced WtP models that use GPU resources (up to ~8 GB) are also available. See list of
-  WtP model names
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models). The
-  only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
+- `SENTENCE_MODEL`: Specifies the desired SaT/WtP or spaCy sentence detection model. For CPU
+  and runtime considerations, the authors of SaT/WtP recommend using `sat-3l-sm` or `wtp-bert-mini`.
+  More advanced SaT/WtP models that use GPU resources (up to ~8 GB for WtP) are also available.
+
+  See list of model names below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#available-models)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#available-models)
+
+  Please note, the only available spaCy model (for text with unknown language) is `xx_sent_ud_sm`.
+
+  Review list of languages supported by SaT/WtP below:
+
+  - [WtP Models](https://github.com/segment-any-text/wtpsplit/tree/1.3.0?tab=readme-ov-file#supported-languages)
+  - [SaT Models](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages)
 
-  Review list of languages supported by WtP
-  [here](https://github.com/bminixhofer/wtpsplit?tab=readme-ov-file#supported-languages).
   Review models and languages supported by spaCy [here](https://spacy.io/models).
 
 - `SENTENCE_SPLITTER_CHAR_COUNT`: Specifies maximum number of characters to process
@@ -115,16 +125,20 @@ this model lacks support handling for Chinese punctuation.
   lengths
   [here](https://discourse.mozilla.org/t/proposal-sentences-lenght-limit-from-14-words-to-100-characters).
 
+- `SENTENCE_SPLITTER_MODE`: Specifies text splitting behavior. Options include:
+  - `DEFAULT`: Splits text into chunks based on the `SENTENCE_SPLITTER_CHAR_COUNT` limit.
+  - `SENTENCE`: Splits text at detected sentence boundaries. This mode creates more sentence breaks than `DEFAULT`, which avoids splitting text until the chunk size limit is reached.
+
 - `SENTENCE_SPLITTER_INCLUDE_INPUT_LANG`: Specifies whether to pass input language to
-  sentence splitter algorithm. Currently, only WtP supports model threshold adjustments by
+  sentence splitter algorithm. Currently, only SaT/WtP supports model threshold adjustments by
   input language.
 
 - `SENTENCE_MODEL_CPU_ONLY`: If set to TRUE, only use CPU resources for the sentence
   detection model. If set to FALSE, allow sentence model to also use GPU resources.
-  For most runs using spaCy `xx_sent_ud_sm` or `wtp-bert-mini` models, GPU resources
+  For most runs using spaCy `xx_sent_ud_sm`, `sat-3l-sm`, or `wtp-bert-mini` models, GPU resources
   are not required. If using more advanced WtP models like `wtp-canine-s-12l`,
   it is recommended to set `SENTENCE_MODEL_CPU_ONLY=FALSE` to improve performance.
-  That model can use up to ~3.5 GB of GPU memory.
+  That WtP model can use up to ~3.5 GB of GPU memory.
 
   Please note, to fully enable this option, you must also rebuild the Docker container
   with the following change: Within the Dockerfile, set `ARG BUILD_TYPE=gpu`.
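Reviewer note: the difference between the two `SENTENCE_SPLITTER_MODE` values described in the README can be illustrated with a minimal sketch. This is not the `nlp_text_splitter` implementation. The regex-based `naive_sentences` helper is a hypothetical stand-in for a SaT/WtP/spaCy model, and `split_text` is an illustrative name that only assumes the behavior stated above: `DEFAULT` packs sentences until the character limit is reached, while `SENTENCE` emits one chunk per detected sentence.

```python
import re
from typing import Iterator, List

# Hypothetical stand-in for a sentence model: break after '.', '!' or '?'.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def naive_sentences(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation rule."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_text(text: str, char_limit: int, mode: str = 'DEFAULT') -> Iterator[str]:
    """Yield chunks no longer than char_limit characters."""
    sentences = naive_sentences(text)

    if mode == 'SENTENCE':
        # One chunk per sentence; hard-split only sentences over the limit.
        for sentence in sentences:
            for start in range(0, len(sentence), char_limit):
                yield sentence[start:start + char_limit]
        return

    # DEFAULT: pack as many whole sentences as fit under the limit.
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) <= char_limit:
            current = candidate
            continue
        if current:
            yield current
        # Hard-split a sentence that exceeds the limit on its own.
        while len(sentence) > char_limit:
            yield sentence[:char_limit]
            sentence = sentence[char_limit:]
        current = sentence
    if current:
        yield current
```

With the same input and limit, `SENTENCE` mode produces at least as many chunks as `DEFAULT`, which matches the README's statement that it "creates more sentence breaks".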

diff --git a/python/AzureTranslation/acs_translation_component/acs_translation_component.py b/python/AzureTranslation/acs_translation_component/acs_translation_component.py
@@ -461,7 +461,7 @@ def __init__(self, job_properties: Mapping[str, str],
         self._num_boundary_chars =  mpf_util.get_property(job_properties,
                                                           "SENTENCE_SPLITTER_CHAR_COUNT",
                                                           500)
-        nlp_model_name = mpf_util.get_property(job_properties, "SENTENCE_MODEL", "wtp-bert-mini")
+        nlp_model_name = mpf_util.get_property(job_properties, "SENTENCE_MODEL", "sat-3l-sm")
         self._incl_input_lang = mpf_util.get_property(job_properties,
                                                       "SENTENCE_SPLITTER_INCLUDE_INPUT_LANG",
                                                       True)
@@ -471,6 +471,10 @@ def __init__(self, job_properties: Mapping[str, str],
                                                      "en")
         nlp_model_setting = mpf_util.get_property(job_properties, "SENTENCE_MODEL_CPU_ONLY", True)
 
+        self._sentence_splitter_mode = mpf_util.get_property(job_properties,
+                                                            "SENTENCE_SPLITTER_MODE",
+                                                            "DEFAULT")
+
         if not nlp_model_setting:
             nlp_model_setting = "cuda"
         else:
@@ -500,14 +504,18 @@ def split_input_text(self, text: str, from_lang: Optional[str],
                 self._num_boundary_chars,
                 get_azure_char_count,
                 self._sentence_model,
-                from_lang)
+                from_lang,
+                split_mode=self._sentence_splitter_mode,
+                newline_behavior='NONE') # This component already uses a newline filtering step.
         else:
             divided_text_list = TextSplitter.split(
                 text,
                 TranslationClient.DETECT_MAX_CHARS,
                 self._num_boundary_chars,
                 get_azure_char_count,
-                self._sentence_model)
+                self._sentence_model,
+                split_mode=self._sentence_splitter_mode,
+                newline_behavior='NONE') # This component already uses a newline filtering step.
 
         chunks = list(divided_text_list)
 
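Reviewer note: the property handling in `__init__` can be summarized with a small standalone sketch. It does not use `mpf_util.get_property`. The `resolve_splitter_settings` helper and the `SplitterSettings` tuple are hypothetical names, the `"cpu"` device string is assumed from the test setup because the `else` branch is not visible in this hunk, and the mode validation is an added assumption that the diff itself does not perform.

```python
from typing import Mapping, NamedTuple


class SplitterSettings(NamedTuple):
    model_name: str
    device: str
    split_mode: str


def _as_bool(value: str) -> bool:
    """Interpret a job property string such as 'TRUE' or 'false'."""
    return value.strip().lower() == 'true'


def resolve_splitter_settings(job_properties: Mapping[str, str]) -> SplitterSettings:
    """Apply the defaults shown in the diff to a job property mapping."""
    model_name = job_properties.get('SENTENCE_MODEL', 'sat-3l-sm')
    cpu_only = _as_bool(job_properties.get('SENTENCE_MODEL_CPU_ONLY', 'TRUE'))
    split_mode = job_properties.get('SENTENCE_SPLITTER_MODE', 'DEFAULT').upper()

    # Assumption: reject unknown modes early (not done in the actual diff).
    if split_mode not in ('DEFAULT', 'SENTENCE'):
        raise ValueError(f'Unsupported SENTENCE_SPLITTER_MODE: {split_mode}')

    # CPU-only maps to "cpu" (assumed); otherwise "cuda", as in the diff.
    device = 'cpu' if cpu_only else 'cuda'
    return SplitterSettings(model_name, device, split_mode)
```

An empty property mapping resolves to the new defaults: the `sat-3l-sm` model, CPU execution, and `DEFAULT` split mode.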

diff --git a/python/AzureTranslation/plugin-files/descriptor/descriptor.json b/python/AzureTranslation/plugin-files/descriptor/descriptor.json
@@ -71,10 +71,16 @@
         },
         {
           "name": "STRIP_NEW_LINE_BEHAVIOR",
-          "description": "The translation endpoint treats newline characters as sentence boundaries. To prevent this newlines can be removed from the input text. Valid values are SPACE (replace with space character), REMOVE (remove newlines), NONE (leave newlines as they are), and GUESS (If source language is Chinese or Japanese use REMOVE, else use SPACE).",
+          "description": "The translation endpoint and text splitter treat newline characters as sentence boundaries. To prevent this newlines can be removed from the input text. Valid values are SPACE (replace with space character), REMOVE (remove newlines), NONE (leave newlines as they are), and GUESS (If source language is Chinese or Japanese use REMOVE, else use SPACE).",
           "type": "STRING",
           "defaultValue": "GUESS"
         },
+        {
+          "name": "SENTENCE_SPLITTER_MODE",
+          "description": "Determines how text is split: `DEFAULT` mode splits text into chunks based on the character limit, while `SENTENCE` mode splits text strictly at sentence boundaries (may yield smaller segments), unless the character limit is reached.",
+          "type": "STRING",
+          "defaultValue": "DEFAULT"
+        },
         {
           "name": "DETECT_BEFORE_TRANSLATE",
           "description": "Use the /detect endpoint to check if translation can be skipped because the text is already in TO_LANGUAGE.",
@@ -95,9 +101,9 @@
         },
         {
           "name": "SENTENCE_MODEL",
-          "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model and the Where's the Point (WtP) `wtp-bert-mini` model.",
+          "description": "Name of sentence segmentation model. Supported options are spaCy's multilingual `xx_sent_ud_sm` model, Segment any Text (SaT) `sat-3l-sm` model, and Where's the Point (WtP) `wtp-bert-mini` model.",
           "type": "STRING",
-          "defaultValue": "wtp-bert-mini"
+          "defaultValue": "sat-3l-sm"
         },
         {
           "name": "SENTENCE_MODEL_CPU_ONLY",
@@ -107,7 +113,7 @@
         },
         {
           "name": "SENTENCE_MODEL_WTP_DEFAULT_ADAPTOR_LANGUAGE",
-          "description": "More advanced WTP models will require a target language. This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection return a different, WtP-supported language option.",
+          "description": "More advanced WtP/SaT models will require a target language. This property sets the default language to use for sentence splitting, unless `FROM_LANGUAGE`, `SUGGESTED_FROM_LANGUAGE`, or Azure language detection return a different, WtP-supported language option.",
           "type": "STRING",
           "defaultValue": "en"
         },
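Reviewer note: the four `STRIP_NEW_LINE_BEHAVIOR` values in the descriptor can be sketched as below. This is a simplified illustration, not the component's actual newline filter. The `strip_newlines` name is hypothetical, and the `zh`/`ja` language-code prefixes used for the `GUESS` case are an assumption about how Chinese and Japanese are identified.

```python
def strip_newlines(text: str, behavior: str = 'GUESS', source_lang: str = '') -> str:
    """Apply a STRIP_NEW_LINE_BEHAVIOR-style policy to text."""
    behavior = behavior.upper()

    if behavior == 'GUESS':
        # Assumption: Chinese and Japanese codes start with 'zh' or 'ja'.
        is_cjk = source_lang.lower().startswith(('zh', 'ja'))
        behavior = 'REMOVE' if is_cjk else 'SPACE'

    if behavior == 'NONE':
        return text                     # leave newlines as they are
    if behavior == 'SPACE':
        return text.replace('\n', ' ')  # replace with a space character
    if behavior == 'REMOVE':
        return text.replace('\n', '')   # remove newlines entirely
    raise ValueError(f'Unsupported STRIP_NEW_LINE_BEHAVIOR: {behavior}')
```

Under `GUESS`, an unrecognized source language falls through to `SPACE`, which is consistent with the `test_split_sat_unknown_lang` expectation that spaces remain in Chinese text when language detection returns `fake-lang`.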

diff --git a/python/AzureTranslation/tests/test_acs_translation.py b/python/AzureTranslation/tests/test_acs_translation.py
@@ -65,12 +65,14 @@ class TestAcsTranslation(unittest.TestCase):
 
     mock_server: ClassVar['MockServer']
     wtp_model: ClassVar['TextSplitterModel']
+    sat_model: ClassVar['TextSplitterModel']
     spacy_model: ClassVar['TextSplitterModel']
 
     @classmethod
     def setUpClass(cls):
         cls.mock_server = MockServer()
         cls.wtp_model = TextSplitterModel("wtp-bert-mini", "cpu", "en")
+        cls.sat_model = TextSplitterModel("sat-3l-sm", "cpu", "en")
         cls.spacy_model = TextSplitterModel("xx_sent_ud_sm", "cpu", "en")
 
 
@@ -669,6 +671,79 @@ def test_split_wtp_unknown_lang(self, _):
                       'Spaces should be kept due to incorrect language detection.')
 
 
+    @mock.patch.object(TranslationClient, 'DETECT_MAX_CHARS', new_callable=lambda: 150)
+    def test_split_sat_unknown_lang(self, _):
+        # Check that the text splitter does not have an issue
+        # processing an unknown detected language.
+        self.set_results_file('invalid-lang-detect-result.json')
+        self.set_results_file('split-sentence/art-of-war-translation-1.json')
+        self.set_results_file('split-sentence/art-of-war-translation-2.json')
+        self.set_results_file('split-sentence/art-of-war-translation-3.json')
+        self.set_results_file('split-sentence/art-of-war-translation-4.json')
+
+        text = (TEST_DATA / 'split-sentence/art-of-war.txt').read_text()
+        detection_props = dict(TEXT=text)
+        TranslationClient(get_test_properties(), self.sat_model).add_translations(detection_props)
+
+        self.assertEqual(5, len(detection_props))
+        self.assertEqual(text, detection_props['TEXT'])
+
+        expected_translation = (TEST_DATA / 'split-sentence/art-war-translation.txt') \
+            .read_text().strip()
+        self.assertEqual(expected_translation, detection_props['TRANSLATION'])
+        self.assertEqual('EN', detection_props['TRANSLATION TO LANGUAGE'])
+
+        self.assertEqual('fake-lang', detection_props['TRANSLATION SOURCE LANGUAGE'])
+        self.assertAlmostEqual(1.0,
+            float(detection_props['TRANSLATION SOURCE LANGUAGE CONFIDENCE']))
+
+        detect_request_text = self.get_request_body()[0]['Text']
+        self.assertEqual(text[0:TranslationClient.DETECT_MAX_CHARS], detect_request_text)
+
+        expected_chunk_lengths = [88, 118, 116, 106]
+        self.assertEqual(sum(expected_chunk_lengths), len(text))
+
+        # Due to an incorrect language detection, newlines are
+        # not properly replaced for Chinese text, and
+        # additional whitespace is present in the text.
+        # This alters the behavior of SaT sentence splitting.
+        translation_request1 = self.get_request_body()[0]['Text']
+        self.assertEqual(expected_chunk_lengths[0], len(translation_request1))
+        self.assertTrue(translation_request1.startswith('兵者，'))
+        self.assertTrue(translation_request1.endswith('而不危也；'))
+        self.assertNotIn('\n', translation_request1,
+                         'Newlines were not properly removed')
+        self.assertIn(' ', translation_request1,
+                      'Spaces should be kept due to incorrect language detection.')
+
+        translation_request2 = self.get_request_body()[0]['Text']
+        self.assertEqual(expected_chunk_lengths[1], len(translation_request2))
+        self.assertTrue(translation_request2.startswith('天者，陰陽'))
+        self.assertTrue(translation_request2.endswith('兵眾孰強？'))
+        self.assertNotIn('\n', translation_request2,
+                         'Newlines were not properly removed')
+        self.assertIn(' ', translation_request2,
+                      'Spaces should be kept due to incorrect language detection.')
+
+        translation_request3 = self.get_request_body()[0]['Text']
+        self.assertEqual(expected_chunk_lengths[2], len(translation_request3))
+        self.assertTrue(translation_request3.startswith('士卒孰練？'))
+        self.assertTrue(translation_request3.endswith('亂而取之， '))
+        self.assertNotIn('\n', translation_request3,
+                         'Newlines were not properly removed')
+        self.assertIn(' ', translation_request3,
+                      'Spaces should be kept due to incorrect language detection.')
+
+        translation_request4 = self.get_request_body()[0]['Text']
+        self.assertEqual(expected_chunk_lengths[3], len(translation_request4))
+        self.assertTrue(translation_request4.startswith('實而備之，'))
+        self.assertTrue(translation_request4.endswith('勝負見矣。 '))
+        self.assertNotIn('\n', translation_request4,
+                         'Newlines were not properly removed')
+        self.assertIn(' ', translation_request4,
+                      'Spaces should be kept due to incorrect language detection.')
+
+
     def test_newline_removal(self):
 
         def replace(text):
@@ -1044,6 +1119,7 @@ def get_test_properties(**extra_properties):
     return {
         'ACS_URL': os.getenv('ACS_URL', 'http://localhost:10670/translator'),
         'ACS_SUBSCRIPTION_KEY': os.getenv('ACS_SUBSCRIPTION_KEY', 'test_key'),
+        'SENTENCE_MODEL': 'wtp-bert-mini',
         **extra_properties
     }