49 changes: 28 additions & 21 deletions open_mythos/tokenizer.py
@@ -6,35 +6,42 @@
class MythosTokenizer:
"""
HuggingFace tokenizer wrapper for OpenMythos.

Args:
model_id (str): The HuggingFace model ID or path to use with AutoTokenizer.
Defaults to "openai/gpt-oss-20b".

Attributes:
tokenizer: An instance of HuggingFace's AutoTokenizer.

Example:
>>> tok = MythosTokenizer()
>>> ids = tok.encode("Hello world")
>>> s = tok.decode(ids)
"""

def __init__(self, model_id: str = DEFAULT_MODEL_ID):
"""
Initialize the MythosTokenizer.

Args:
model_id (str): HuggingFace model identifier or path to tokenizer files.
"""
self.tokenizer = AutoTokenizer.from_pretrained(model_id)

@property
def vocab_size(self) -> int:
"""
Return the size of the tokenizer vocabulary.
return self.tokenizer.vocab_size

Returns:
def encode(self, text: str):
return self.tokenizer.encode(text)

def decode(self, token_ids):
return self.tokenizer.decode(token_ids)
Comment on lines +18 to +22
Copilot AI Apr 21, 2026

This file now defines encode and decode twice; the later definitions override the earlier ones, so the newly-added encode/decode methods here are dead code and can confuse behavior (especially since their defaults differ from the later versions). Keep only one pair of encode/decode wrappers and make the other methods call those wrappers for consistency.

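To make the shadowing behavior the comment describes concrete, here is a minimal, self-contained sketch (a toy class, not the project's code): in a Python class body, a later `def` with the same name simply rebinds the attribute, so the earlier definition becomes unreachable dead code.

```python
class Demo:
    def encode(self, text):
        # First definition — silently discarded once the name is rebound below.
        return f"first:{text}"

    def encode(self, text):
        # Later definition wins; only this one is ever callable.
        return f"second:{text}"

print(Demo().encode("hi"))  # → second:hi
```

This is why keeping two `encode`/`decode` pairs in the class is worse than harmless: the surviving pair is whichever appears last in the file, regardless of which one the author intended.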

# ✅ New methods added
def token_count(self, text: str) -> int:
"""Return number of tokens in text."""
return len(self.tokenizer.encode(text))
Copilot AI Apr 21, 2026

token_count calls self.tokenizer.encode(text) directly, which may count tokens differently than your wrapper encode method (e.g., add_special_tokens defaults). To avoid inconsistent behavior across the wrapper API, have token_count compute the length using the same encoding path/options as encode.

Suggested change
return len(self.tokenizer.encode(text))
return len(self.encode(text))

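The mismatch the comment warns about can be illustrated with a hypothetical stub (not the real HuggingFace API): if the wrapper's `encode` opts out of special tokens but `token_count` calls the underlying tokenizer directly with its defaults, the two paths disagree on the count.

```python
class StubHFTokenizer:
    """Toy stand-in for a HuggingFace tokenizer; BOS/EOS are ids 0 and 1."""
    def encode(self, text, add_special_tokens=True):
        ids = list(range(2, 2 + len(text.split())))
        return ([0] + ids + [1]) if add_special_tokens else ids

class Wrapper:
    def __init__(self):
        self.tokenizer = StubHFTokenizer()

    def encode(self, text):
        # Wrapper deliberately opts out of special tokens.
        return self.tokenizer.encode(text, add_special_tokens=False)

    def token_count_direct(self, text):
        # Bypasses the wrapper: counts BOS/EOS too.
        return len(self.tokenizer.encode(text))

    def token_count_via_wrapper(self, text):
        # Routes through encode(): consistent with the rest of the API.
        return len(self.encode(text))

w = Wrapper()
print(w.token_count_direct("a b c"))       # → 5
print(w.token_count_via_wrapper("a b c"))  # → 3
```

Routing the count through the wrapper's own `encode`, as the suggested change does, makes the discrepancy impossible by construction.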

def batch_encode(self, texts: list[str], padding: bool = True, truncation: bool = True):
"""Encode multiple texts at once."""
return self.tokenizer(
texts,
padding=padding,
truncation=truncation,
return_tensors="pt"
)

def get_special_tokens(self):
"""Return special tokens used by tokenizer."""
return self.tokenizer.special_tokens_map

def is_within_limit(self, text: str, max_tokens: int) -> bool:
"""Check if text fits within a token limit."""
return self.token_count(text) <= max_tokens Returns:
Comment on lines +25 to +44
Copilot AI Apr 21, 2026

New public methods (token_count, batch_encode, get_special_tokens, is_within_limit) are added to MythosTokenizer, but there are existing tokenizer tests and none cover these new behaviors. Add tests validating token counting/limits and that batch_encode returns the expected shapes/types (and respects padding/truncation), to prevent regressions.

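A sketch of the kind of tests the comment asks for, using a toy whitespace tokenizer so the example stays self-contained (real tests would construct `MythosTokenizer` and would also need `torch` to check `batch_encode` tensor shapes):

```python
class ToyTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""
    def encode(self, text):
        return list(range(len(text.split())))

    def token_count(self, text):
        return len(self.encode(text))

    def is_within_limit(self, text, max_tokens):
        return self.token_count(text) <= max_tokens

tok = ToyTokenizer()

def test_token_count():
    assert tok.token_count("one two three") == 3

def test_is_within_limit_boundary():
    assert tok.is_within_limit("one two three", 3)      # exactly at the limit
    assert not tok.is_within_limit("one two three", 2)  # one token over

test_token_count()
test_is_within_limit_boundary()
print("ok")
```

The boundary case (`token_count == max_tokens`) is worth pinning down explicitly, since `<=` versus `<` is exactly the kind of detail a regression would flip.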
int: The number of unique tokens in the tokenizer vocabulary.
"""
return self.tokenizer.vocab_size
Comment on lines +44 to 47
Copilot AI Apr 21, 2026

is_within_limit currently contains stray docstring text (Returns: …) on the same line as the return statement, which will raise a SyntaxError on import and also leaves a duplicate return self.tokenizer.vocab_size block inside this method. Remove the stray docstring fragment and ensure is_within_limit only returns the boolean check (and that vocab_size logic remains only in the vocab_size property).

Suggested change
return self.token_count(text) <= max_tokens Returns:
int: The number of unique tokens in the tokenizer vocabulary.
"""
return self.tokenizer.vocab_size
return self.token_count(text) <= max_tokens

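Putting the suggestion into context, a sketch of the repaired shape — `is_within_limit` returning only the boolean check, with the `vocab_size` logic living solely in the property. A hypothetical stub stands in for the HuggingFace tokenizer so the example runs without `transformers`:

```python
class StubTokenizer:
    """Toy stand-in for AutoTokenizer output; not the real API."""
    vocab_size = 7

    def encode(self, text):
        return list(range(len(text.split())))

class FixedWrapper:
    def __init__(self):
        self.tokenizer = StubTokenizer()

    @property
    def vocab_size(self):
        """int: number of unique tokens in the vocabulary."""
        return self.tokenizer.vocab_size

    def token_count(self, text):
        """Return number of tokens in text."""
        return len(self.tokenizer.encode(text))

    def is_within_limit(self, text, max_tokens):
        """Check if text fits within a token limit."""
        return self.token_count(text) <= max_tokens

w = FixedWrapper()
print(w.vocab_size)                 # → 7
print(w.is_within_limit("a b", 2))  # → True
```

With the stray `Returns:` fragment gone, the module imports cleanly and each method has exactly one responsibility.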