Conversation
pmfirestone
left a comment
Each member of a RAW vocabulary should be bytes rather than str, I think. The semantics are nearly trivial to change; the tests less so. See my branch byte_new_test for the changes I have in mind.
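A minimal sketch of what a bytes-keyed RAW vocabulary might look like (the token strings are taken from the diffs below; the shape is an assumption, not the merged implementation):

```python
# Hypothetical sketch: a RAW vocabulary whose members are bytes, not str.
raw_vocab = {
    b"hello": 1,
    b"world": 2,
    b"!": 3,
    b"<s>": 4,   # special token
    b"</s>": 5,  # special token
}

# Every member is bytes, so non-ASCII tokens can be split at arbitrary
# byte boundaries without ever holding an invalid str.
assert all(isinstance(tok, bytes) for tok in raw_vocab)
```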
```python
token_ids = [4, 5, 6, 7]  # 你, 好, 吗, ?
mock_tokenizer.decode.return_value = "你好吗?"
result = byte_tokenizer.decode(token_ids)
self.assertEqual(result.decode('utf-8'), "你好吗?")
```
Here you should test incomplete UTF-8 sequences. The branch byte_new_test changes the tests that use the RAW tokenizer to work on a vocabulary made of bytes rather than code points. These tests all pass after modifying enbyte_raw to expect and return bytes without altering them.
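To illustrate why incomplete sequences matter: a multi-byte character can be split across token boundaries, so a decode step may see a byte prefix that is not yet valid UTF-8. A small self-contained sketch (standard Python behavior, independent of the tokenizer under review):

```python
# "你" encodes to three bytes; truncating it mid-character leaves an
# incomplete UTF-8 sequence, as can happen at a token boundary.
full = "你".encode("utf-8")   # b'\xe4\xbd\xa0'
partial = full[:2]            # b'\xe4\xbd' -- incomplete

# Strict decoding rejects the partial sequence...
try:
    partial.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# ...while errors="replace" substitutes U+FFFD for the truncated part.
assert partial.decode("utf-8", errors="replace") == "\ufffd"
```

A test for the byte tokenizer would feed it token ids whose concatenated bytes end mid-character and check how the partial tail is handled.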
```python
"hello": 1,
"world": 2,
"!": 3,
"<s>": 4,  # special token
"</s>": 5,  # special token
```
Use bytes instead of strings.
```python
def test_auto_detection(self):
    """Test automatic detection of tokenizer type."""
    # Test RAW detection
    raw_vocab = {"hello": 1, "world": 2}
```
Use bytes instead of strings.
```python
vocab = {f"token{i}": i for i in range(1000)}
# Add some special tokens
vocab["<s>"] = 1000
vocab["</s>"] = 1001
```
Use bytes instead of strings.
```python
"hello": 1,
" ": 2,
"world": 3,
"!": 4,
```
Use bytes instead of strings.
syncode/mask_store/byte_tokenizer.py
Outdated
```python
def enbyte_raw(token: str) -> bytes:
    """Turn a raw token directly into bytes.

    Example:
    --------
    >>> enbyte_raw('hello')
    b'hello'
    """
    return token.encode('utf-8')
```
Assume the token here is bytes and return it unchanged.
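A minimal sketch of what that suggestion might look like, keeping the name and docstring style from the diff above (an illustration, not the merged code):

```python
def enbyte_raw(token: bytes) -> bytes:
    """Return a raw token unchanged; the vocabulary already stores bytes.

    Example:
    --------
    >>> enbyte_raw(b'hello')
    b'hello'
    """
    return token
```

With a bytes-keyed vocabulary there is nothing left to encode, so the function becomes the identity.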
```python
while remaining:
    matched = False
    for token, token_id in sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True):
        if remaining.startswith(token):
```
This breaks if you have to work with tokens that are raw byte sequences rather than strings.
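A self-contained sketch of the longest-match loop over a bytes-keyed vocabulary (the function name and wrapper are hypothetical; the loop body mirrors the diff above). The key point is that `remaining` must itself be bytes: `bytes.startswith` only accepts bytes-like arguments, so mixing str and bytes here raises `TypeError`.

```python
def greedy_tokenize(data: bytes, vocab: dict[bytes, int]) -> list[int]:
    """Greedily match the longest vocabulary token at each position."""
    ids = []
    remaining = data
    while remaining:
        # Try longer tokens first so the longest match wins.
        for token, token_id in sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True):
            if remaining.startswith(token):
                ids.append(token_id)
                remaining = remaining[len(token):]
                break
        else:
            raise ValueError(f"no token matches {remaining[:8]!r}")
    return ids
```

For example, `greedy_tokenize(b"hello world!", {b"hello": 1, b" ": 2, b"world": 3, b"!": 4})` yields `[1, 2, 3, 4]`.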
Change semantics of RAW tokenizer test.
This is a reasonably large refactoring of the mask store and needs additional profiling and testing before merging. Because the mask store used a character-level FSM, the existing SynCode version did not work correctly with non-ASCII characters in the grammar. This change should fix that issue.
The main changes include:
resolves #153