Add Byte level FSM and tokenizer #151

Merged
shubhamugare merged 11 commits into main from byte
Mar 17, 2025

@shubhamugare (Collaborator) commented Mar 14, 2025

This is a reasonably large refactoring of the mask store and needs additional profiling and testing before merging. The mask store previously used a character-level FSM, so the existing SynCode version did not work correctly with non-ASCII characters in the grammar. This change should fix that issue.

The main changes include:

  1. Addition of a byte-level tokenizer (thanks @pmfirestone)
  2. Addition of a byte-level FSM, created by modifying the original interegular character-level FSM

resolves #153
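
To illustrate why a character-level FSM struggles with non-ASCII input, here is a minimal sketch of expanding character transitions into byte transitions. It uses a toy dictionary-based FSM, not interegular's actual data structures; `to_byte_fsm`, `accepts`, and the state numbering are hypothetical names chosen for this example.

```python
from typing import Dict, Set, Tuple

# Toy representation (an assumption for this sketch, not interegular's API):
# transitions are keyed by (state, symbol) and map to the next state.
CharFSM = Dict[Tuple[int, str], int]
ByteFSM = Dict[Tuple[int, int], int]


def to_byte_fsm(char_fsm: CharFSM, num_states: int) -> Tuple[ByteFSM, int]:
    """Expand each character transition into a chain of byte transitions."""
    byte_fsm: ByteFSM = {}
    next_state = num_states  # fresh ids for intermediate states
    for (src, ch), dst in char_fsm.items():
        encoded = ch.encode("utf-8")
        current = src
        for i, b in enumerate(encoded):
            if i == len(encoded) - 1:
                target = dst  # last byte lands on the original target
            elif (current, b) in byte_fsm:
                target = byte_fsm[(current, b)]  # reuse a shared byte prefix
            else:
                target = next_state
                next_state += 1
            byte_fsm[(current, b)] = target
            current = target
    return byte_fsm, next_state


def accepts(byte_fsm: ByteFSM, start: int, finals: Set[int], data: bytes) -> bool:
    """Run the byte-level FSM over data and report whether it ends in a final state."""
    state = start
    for b in data:
        if (state, b) not in byte_fsm:
            return False
        state = byte_fsm[(state, b)]
    return state in finals


# "é" and "è" both start with byte 0xC3, so they share one intermediate state.
char_fsm = {(0, "é"): 1, (0, "è"): 1, (0, "a"): 1}
byte_fsm, total_states = to_byte_fsm(char_fsm, num_states=2)
```

In this toy model a token that ends in the middle of a multi-byte character leaves the FSM in an intermediate state instead of failing outright, which is the property a byte-level mask store needs.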

@pmfirestone (Contributor) left a comment

Each member of a RAW vocabulary should be bytes rather than str, I think. The semantics are nearly trivial to change; the tests less so. See my branch byte_new_test for the changes I have in mind.

token_ids = [4, 5, 6, 7] # 你, 好, 吗, ?
mock_tokenizer.decode.return_value = "你好吗?"
result = byte_tokenizer.decode(token_ids)
self.assertEqual(result.decode('utf-8'), "你好吗?")

Here you should test incomplete UTF-8 sequences. The branch byte_new_test changes the tests that use the RAW tokenizer to work on a vocabulary made of bytes rather than code points. Those tests all pass once enbyte_raw is modified to expect bytes and return them unaltered.
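
As a rough illustration of what an incomplete UTF-8 sequence looks like, the sketch below splits the bytes of a two-character string at positions that fall inside characters. The token boundaries and the helper `is_complete_utf8` are hypothetical and are not taken from the byte_new_test branch; only standard-library calls are used.

```python
import codecs

text = "你好"
data = text.encode("utf-8")  # 6 bytes, 3 per character

# Hypothetical byte-level tokens: every boundary falls inside a character.
tokens = [data[:2], data[2:4], data[4:]]


def is_complete_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes on its own as valid UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# None of the tokens is valid UTF-8 alone, but their concatenation is.
individually_valid = [is_complete_utf8(tok) for tok in tokens]
joined = b"".join(tokens).decode("utf-8")

# An incremental decoder buffers partial sequences until they complete.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(tok) for tok in tokens]
```

A test along these lines would check that the tokenizer returns such fragments as bytes and only decodes them after concatenation.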

Comment on lines 154 to 158
"hello": 1,
"world": 2,
"!": 3,
"<s>": 4, # special token
"</s>": 5, # special token

Use bytes instead of strings.

def test_auto_detection(self):
    """Test automatic detection of tokenizer type."""
    # Test RAW detection
    raw_vocab = {"hello": 1, "world": 2}

Use bytes instead of strings.

Comment on lines 199 to 202
vocab = {f"token{i}": i for i in range(1000)}
# Add some special tokens
vocab["<s>"] = 1000
vocab["</s>"] = 1001

Use bytes instead of strings.

Comment on lines 316 to 319
"hello": 1,
" ": 2,
"world": 3,
"!": 4,

Use bytes instead of strings.

Comment on lines 98 to 106
def enbyte_raw(token: str) -> bytes:
    """Turn a raw token directly into bytes.

    Example:
    --------
    >>> enbyte_raw('hello')
    b'hello'
    """
    return token.encode('utf-8')

Assume the token here is bytes and return it unchanged.
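
One possible reading of this suggestion is sketched below. It is not the merged code; the type check is an addition made for this example so that a str vocabulary fails loudly instead of being silently accepted.

```python
def enbyte_raw(token: bytes) -> bytes:
    """Return a RAW token unchanged, assuming the vocabulary already holds bytes.

    Example:
    --------
    >>> enbyte_raw(b'hello')
    b'hello'
    """
    # Assumption for this sketch: reject str explicitly rather than encode it.
    if not isinstance(token, bytes):
        raise TypeError(f"RAW tokens must be bytes, got {type(token).__name__}")
    return token
```

Because the token is passed through untouched, fragments that are not valid UTF-8 on their own (for example `b'\xe4\xbd'`) survive intact.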

while remaining:
    matched = False
    for token, token_id in sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True):
        if remaining.startswith(token):

This breaks if you have to work with tokens that are raw byte sequences rather than strings.
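
In Python 3, calling `startswith` on a str with a bytes argument raises `TypeError`, so the loop above only works when the input and the vocabulary keys are the same type. A minimal sketch of the same greedy longest-match idea operating purely on bytes follows; `greedy_encode` and the sample vocabulary are hypothetical and do not reproduce the code in this PR.

```python
from typing import Dict, List


def greedy_encode(data: bytes, vocab: Dict[bytes, int]) -> List[int]:
    """Tokenize data by repeatedly taking the longest vocabulary entry that matches."""
    # Sort once, longest first, and skip empty tokens so the loop always advances.
    by_length = sorted(
        ((tok, tok_id) for tok, tok_id in vocab.items() if tok),
        key=lambda kv: len(kv[0]),
        reverse=True,
    )
    ids: List[int] = []
    remaining = data
    while remaining:
        for token, token_id in by_length:
            if remaining.startswith(token):
                ids.append(token_id)
                remaining = remaining[len(token):]
                break
        else:
            # No vocabulary entry matches the next bytes.
            raise ValueError(f"no token matches {remaining[:8]!r}")
    return ids


# Sample vocabulary with two entries that split "你" (E4 BD A0) mid-character.
vocab = {b"hello": 1, b" ": 2, b"world": 3, b"!": 4, b"\xe4\xbd": 5, b"\xa0": 6}
ids = greedy_encode("hello 你!".encode("utf-8"), vocab)
```

Sorting once outside the loop also avoids re-sorting the vocabulary on every iteration.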

@pmfirestone (Contributor) left a comment

Looks good to me!

@shubhamugare shubhamugare self-assigned this Mar 17, 2025
@shubhamugare shubhamugare merged commit 8c23cef into main Mar 17, 2025
1 check passed
@shubhamugare shubhamugare deleted the byte branch March 17, 2025 21:07

Development

Successfully merging this pull request may close these issues.

SynCode fails on non-ASCII characters.