Enhance MythosTokenizer with new methods #30
Changes from all commits
```python
@@ -6,35 +6,42 @@
class MythosTokenizer:
    """
    HuggingFace tokenizer wrapper for OpenMythos.

    Args:
        model_id (str): The HuggingFace model ID or path to use with AutoTokenizer.
            Defaults to "openai/gpt-oss-20b".

    Attributes:
        tokenizer: An instance of HuggingFace's AutoTokenizer.

    Example:
        >>> tok = MythosTokenizer()
        >>> ids = tok.encode("Hello world")
        >>> s = tok.decode(ids)
    """

    def __init__(self, model_id: str = DEFAULT_MODEL_ID):
        """
        Initialize the MythosTokenizer.

        Args:
            model_id (str): HuggingFace model identifier or path to tokenizer files.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    @property
    def vocab_size(self) -> int:
        """
        Return the size of the tokenizer vocabulary.
        return self.tokenizer.vocab_size

        Returns:
    def encode(self, text: str):
        return self.tokenizer.encode(text)

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)

    # ✅ New methods added
    def token_count(self, text: str) -> int:
        """Return number of tokens in text."""
        return len(self.tokenizer.encode(text))
```
Suggested change:

```diff
-        return len(self.tokenizer.encode(text))
+        return len(self.encode(text))
```
Copilot AI (Apr 21, 2026):
New public methods (`token_count`, `batch_encode`, `get_special_tokens`, `is_within_limit`) are added to `MythosTokenizer`, but none of the existing tokenizer tests cover these new behaviors. Add tests validating token counting/limits and checking that `batch_encode` returns the expected shapes/types (and respects padding/truncation), to prevent regressions.
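A minimal sketch of the kind of tests the comment asks for, using plain asserts. The whitespace `StubTokenizer` and the `batch_encode(texts)` signature are assumptions made so the example runs offline without downloading a model; they are not part of the PR.

```python
class StubTokenizer:
    """Whitespace stand-in for AutoTokenizer (illustrative only)."""
    def encode(self, text):
        return text.split()

class MythosTokenizer:
    """Pared-down copy of the class under review, backed by the stub."""
    def __init__(self):
        self.tokenizer = StubTokenizer()

    def encode(self, text):
        return self.tokenizer.encode(text)

    def token_count(self, text):
        return len(self.encode(text))

    def is_within_limit(self, text, max_tokens):
        return self.token_count(text) <= max_tokens

    def batch_encode(self, texts):
        # Assumed signature: list of strings in, list of token lists out.
        return [self.encode(t) for t in texts]

def test_token_count():
    assert MythosTokenizer().token_count("hello brave world") == 3

def test_is_within_limit():
    tok = MythosTokenizer()
    assert tok.is_within_limit("hello world", 2)
    assert not tok.is_within_limit("hello world", 1)

def test_batch_encode_shapes():
    out = MythosTokenizer().batch_encode(["a b", "c"])
    assert isinstance(out, list)
    assert [len(ids) for ids in out] == [2, 1]

test_token_count()
test_is_within_limit()
test_batch_encode_shapes()
```

Against the real class, the same assertions would run with the actual tokenizer (with expected counts adjusted to its vocabulary).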
Copilot AI (Apr 21, 2026):
`is_within_limit` currently contains stray docstring text (`Returns: …`) on the same line as the `return` statement, which will raise a SyntaxError on import, and it also leaves a duplicate `return self.tokenizer.vocab_size` block inside this method. Remove the stray docstring fragment and ensure `is_within_limit` only returns the boolean check (and that the vocabulary-size logic remains only in the `vocab_size` property).
Suggested change:

```diff
-        return self.token_count(text) <= max_tokens Returns:
-            int: The number of unique tokens in the tokenizer vocabulary.
-        """
-        return self.tokenizer.vocab_size
+        return self.token_count(text) <= max_tokens
```
This file now defines `encode` and `decode` twice; the later definitions override the earlier ones, so the newly-added `encode`/`decode` methods here are dead code and can confuse behavior (especially since their defaults differ from the later versions). Keep only one pair of `encode`/`decode` wrappers and make the other methods call those wrappers for consistency.