Conversation
pmfirestone
left a comment
Each member of a RAW vocabulary should be bytes rather than str, I think. The semantics are nearly trivial to change; the tests less so. See my branch byte_new_test for the changes I have in mind.
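A minimal sketch of what a bytes-keyed RAW vocabulary might look like (the token strings are taken from the diffs below; the shape is an assumption, not the merged implementation):

```python
# Hypothetical sketch: a RAW vocabulary whose members are bytes, not str.
raw_vocab = {
    b"hello": 1,
    b"world": 2,
    b"!": 3,
    b"<s>": 4,   # special token
    b"</s>": 5,  # special token
}

# Every member is bytes, so non-ASCII tokens can be split at arbitrary
# byte boundaries without ever holding an invalid str.
assert all(isinstance(tok, bytes) for tok in raw_vocab)
```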
```python
token_ids = [4, 5, 6, 7]  # 你, 好, 吗, ?
mock_tokenizer.decode.return_value = "你好吗?"
result = byte_tokenizer.decode(token_ids)
self.assertEqual(result.decode('utf-8'), "你好吗?")
```
Here you should test incomplete UTF-8 sequences. The branch byte_new_test changes the tests that use the RAW tokenizer to work on a vocabulary made of bytes rather than code points. These tests all pass after modifying enbyte_raw to expect and return bytes without altering them.
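To illustrate why incomplete sequences matter: a multi-byte character can be split across token boundaries, so a decode step may see a byte prefix that is not yet valid UTF-8. A small self-contained sketch (standard Python behavior, independent of the tokenizer under review):

```python
# "你" encodes to three bytes; truncating it mid-character leaves an
# incomplete UTF-8 sequence, as can happen at a token boundary.
full = "你".encode("utf-8")   # b'\xe4\xbd\xa0'
partial = full[:2]            # b'\xe4\xbd' -- incomplete

# Strict decoding rejects the partial sequence...
try:
    partial.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# ...while errors="replace" substitutes U+FFFD for the truncated part.
assert partial.decode("utf-8", errors="replace") == "\ufffd"
```

A test for the byte tokenizer would feed it token ids whose concatenated bytes end mid-character and check how the partial tail is handled.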
```python
"hello": 1,
"world": 2,
"!": 3,
"<s>": 4,  # special token
"</s>": 5,  # special token
```
Use bytes instead of strings.
```python
def test_auto_detection(self):
    """Test automatic detection of tokenizer type."""
    # Test RAW detection
    raw_vocab = {"hello": 1, "world": 2}
```
Use bytes instead of strings.
```python
vocab = {f"token{i}": i for i in range(1000)}
# Add some special tokens
vocab["<s>"] = 1000
vocab["</s>"] = 1001
```
Use bytes instead of strings.
```python
"hello": 1,
" ": 2,
"world": 3,
"!": 4,
```
Use bytes instead of strings.
syncode/mask_store/byte_tokenizer.py
Outdated
```python
def enbyte_raw(token: str) -> bytes:
    """Turn a raw token directly into bytes.

    Example:
    --------
    >>> enbyte_raw('hello')
    b'hello'
    """
    return token.encode('utf-8')
```
Assume the token here is bytes and return it unchanged.
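A minimal sketch of what that suggestion might look like, keeping the name and docstring style from the diff above (an illustration, not the merged code):

```python
def enbyte_raw(token: bytes) -> bytes:
    """Return a raw token unchanged; the vocabulary already stores bytes.

    Example:
    --------
    >>> enbyte_raw(b'hello')
    b'hello'
    """
    return token
```

With a bytes-keyed vocabulary there is nothing left to encode, so the function becomes the identity.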
```python
while remaining:
    matched = False
    for token, token_id in sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True):
        if remaining.startswith(token):
```
This breaks if you have to work with tokens that are raw byte sequences rather than strings.
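A self-contained sketch of the longest-match loop over a bytes-keyed vocabulary (the function name and wrapper are hypothetical; the loop body mirrors the diff above). The key point is that `remaining` must itself be bytes: `bytes.startswith` only accepts bytes-like arguments, so mixing str and bytes here raises `TypeError`.

```python
def greedy_tokenize(data: bytes, vocab: dict[bytes, int]) -> list[int]:
    """Greedily match the longest vocabulary token at each position."""
    ids = []
    remaining = data
    while remaining:
        # Try longer tokens first so the longest match wins.
        for token, token_id in sorted(vocab.items(), key=lambda x: len(x[0]), reverse=True):
            if remaining.startswith(token):
                ids.append(token_id)
                remaining = remaining[len(token):]
                break
        else:
            raise ValueError(f"no token matches {remaining[:8]!r}")
    return ids
```

For example, `greedy_tokenize(b"hello world!", {b"hello": 1, b" ": 2, b"world": 3, b"!": 4})` yields `[1, 2, 3, 4]`.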
Change semantics of RAW tokenizer test.
This is a reasonably large refactoring of the mask store and needs additional profiling and testing before merging. Because the mask store used a character-level FSM, the existing SynCode version did not work correctly with non-ASCII characters in the grammar. This change should fix that issue.
The main changes include:
resolves #153