
Conversation

@MichelDucartier (Contributor)

This pull request refactors and generalizes the prompt tokenization and chat formatting logic across the codebase. The main improvement is a new ChatTemplate abstraction that supports multiple chat prompt formats (LLaMA, Apertus, Qwen3, etc.) and is now integrated into PromptTokenizer. This removes the need for model-specific tokenizers, simplifies configuration, and makes it easier to add support for new chat formats.

Key changes:

Chat template abstraction and integration:

  • Introduced a new ChatTemplate dataclass in model.py, which encapsulates chat serialization logic for various LLM families (LLaMA, Apertus, Qwen3) and provides a unified interface for formatting conversations (a minimal sketch follows this list). (src/multimeditron/model/model.py)
  • Refactored PromptTokenizer to accept a ChatTemplate instance, removing hardcoded logic for specific models and enabling dynamic chat prompt formatting. (src/multimeditron/model/prompt_tokenizers.py)
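
For orientation, here is a minimal sketch of what such a chat template abstraction might look like. Only the names ChatTemplate, from_name, and the supported families (LLaMA, Apertus, Qwen3) come from this PR; the fields, registry contents, placeholder turn markers, and the format method are illustrative assumptions, not the actual implementation in src/multimeditron/model/model.py.

```python
# Hypothetical sketch of the ChatTemplate abstraction; everything except the names
# ChatTemplate, from_name, and the supported families (llama, apertus, qwen3) is an
# illustrative assumption.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChatTemplate:
    """Serialization rules for one LLM chat format."""
    name: str
    turn_prefix: str   # per-turn header, parameterized by role (placeholder markers below)
    turn_suffix: str   # per-turn terminator

    @classmethod
    def from_name(cls, name: str) -> "ChatTemplate":
        # Registry of supported formats: adding a new chat format means registering
        # another entry here instead of writing a new tokenizer class.
        registry = {
            "llama": cls("llama", "<turn role={role}>", "</turn>"),
            "apertus": cls("apertus", "<turn role={role}>", "</turn>"),
            "qwen3": cls("qwen3", "<turn role={role}>", "</turn>"),
        }
        if name not in registry:
            raise ValueError(f"Unknown chat template: {name!r}")
        return registry[name]

    def format(self, conversation: List[Dict[str, str]]) -> str:
        # Serialize [{"role": ..., "content": ...}, ...] into a single prompt string.
        parts: List[str] = []
        for turn in conversation:
            parts.append(self.turn_prefix.format(role=turn["role"]))
            parts.append(turn["content"])
            parts.append(self.turn_suffix)
        return "".join(parts)
```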

Removal of model-specific tokenizers:

  • Removed the Llama3PromptTokenizer and the TOKENIZER_MAP indirection, consolidating all prompt tokenization logic into the generalized PromptTokenizer class (see the before/after sketch below). (src/multimeditron/model/prompt_tokenizers.py, src/multimeditron/dataset/sample_preprocessor.py, scripts/benchmarking.py)
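
A rough before/after of the consolidation. Only the names TOKENIZER_MAP, Llama3PromptTokenizer, PromptTokenizer, and ChatTemplate.from_name appear in this PR; the shape of the old call sites, the import paths (inferred from the file list above), and the keyword arguments are assumptions.

```python
# Assumed import paths, based on the files listed above; the exact public API is
# not shown in this PR, so the keyword arguments below are placeholders.
from multimeditron.model.model import ChatTemplate
from multimeditron.model.prompt_tokenizers import PromptTokenizer

# Before (removed): per-model classes selected through an indirection table.
#   tokenizer_cls = TOKENIZER_MAP["llama3"]        # e.g. Llama3PromptTokenizer
#   prompt_tokenizer = tokenizer_cls(hf_tokenizer)

# After: a single generalized class, parameterized by the chat format.
def build_prompt_tokenizer(hf_tokenizer, chat_format: str) -> PromptTokenizer:
    return PromptTokenizer(
        tokenizer=hf_tokenizer,  # underlying tokenizer (keyword name assumed)
        chat_template=ChatTemplate.from_name(chat_format),
    )
```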

API and code modernization:

  • Updated the construction of PromptTokenizer throughout the codebase to pass the appropriate ChatTemplate (using ChatTemplate.from_name); a call-site example follows this list. (scripts/benchmarking.py, src/multimeditron/dataset/sample_preprocessor.py)
  • Improved code style and readability in the prompt tokenizer, including function signature cleanup, better error messages, and more consistent formatting. (src/multimeditron/model/prompt_tokenizers.py)
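
Putting the pieces together, a hypothetical call site (e.g. in scripts/benchmarking.py or the sample preprocessor) might now look like the following. The tokenize method name, keyword arguments, configuration values, and model checkpoint are assumptions made for the sake of the example.

```python
from transformers import AutoTokenizer

from multimeditron.model.model import ChatTemplate
from multimeditron.model.prompt_tokenizers import PromptTokenizer

# Hypothetical configuration values; the real scripts read these from their own config.
model_name = "Qwen/Qwen3-8B"
chat_format = "qwen3"

hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt_tokenizer = PromptTokenizer(
    tokenizer=hf_tokenizer,
    chat_template=ChatTemplate.from_name(chat_format),
)

conversation = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "Summarize the key findings of this radiology report."},
]

# Assumed entry point: format the conversation with the chat template, then tokenize it.
token_ids = prompt_tokenizer.tokenize(conversation)
```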

These changes make the codebase more modular, extensible, and easier to maintain as new chat prompt formats emerge.

@MichelDucartier merged commit cdbfb35 into master on Oct 29, 2025
1 check failed
@MichelDucartier deleted the generalize_prompt_tokenizers branch on October 29, 2025 at 16:17