Generalize prompt tokenizers #19

MichelDucartier · 2025-10-28T11:04:10Z

This pull request refactors and generalizes the prompt tokenization and chat formatting logic across the codebase. The main improvement is the introduction of a flexible ChatTemplate abstraction, which supports multiple chat prompt formats (LLaMA, Apertus, Qwen, etc.) and is now integrated into the PromptTokenizer. This update removes the need for model-specific tokenizers and simplifies configuration, making it easier to add support for new chat formats.

Key changes:

Chat template abstraction and integration:

Introduced a new ChatTemplate dataclass in model.py, which encapsulates chat serialization logic for various LLM families (LLaMA, Apertus, Qwen3) and provides a unified interface for formatting conversations. (src/multimeditron/model/model.py)
Refactored PromptTokenizer to accept a ChatTemplate instance, removing hardcoded logic for specific models and enabling dynamic chat prompt formatting. (src/multimeditron/model/prompt_tokenizers.py) [1] [2]

Removal of model-specific tokenizers:

Removed the Llama3PromptTokenizer and the TOKENIZER_MAP indirection, consolidating all prompt tokenization logic into the generalized PromptTokenizer class. (src/multimeditron/model/prompt_tokenizers.py, src/multimeditron/dataset/sample_preprocessor.py, scripts/benchmarking.py) [1] [2]

API and code modernization:

Updated the construction of PromptTokenizer throughout the codebase to pass the appropriate ChatTemplate (using ChatTemplate.from_name). (scripts/benchmarking.py, src/multimeditron/dataset/sample_preprocessor.py) [1] [2]
Improved code style and readability in the prompt tokenizer, including function signature cleanup, better error messages, and more consistent formatting. (src/multimeditron/model/prompt_tokenizers.py) [1] [2] [3] [4] [5] [6] [7] [8]

These changes make the codebase more modular, extensible, and easier to maintain as new chat prompt formats emerge.

src/multimeditron/model/model.py

scripts/benchmarking.py

Co-authored-by: MichelDucartier <m.zhang2490@gmail.com>

…ation

fabnemEPFL added 6 commits October 24, 2025 18:29

bug fix for tokenize_text

be1f683

ChatTemplate dataclass

5640af8

generalist PromptTokenizer, no longer abstract

c20d6ff

removed Llama3PromptTokenizer and ApertusPromptTokenizer

a149bc8

isort + ruff

d5577b2

horrible code -> nice code

f342a48

MichelDucartier commented Oct 28, 2025

View reviewed changes

fabnemEPFL and others added 5 commits October 28, 2025 17:26

Addressed comments by MichelDucartier

9bc3625

Co-authored-by: MichelDucartier <m.zhang2490@gmail.com>

Fix ChatTemplate for Qwen LlaMA and Apertus + fix in tokenize_convers…

a208b96

…ation

Clean train

e79288c

Merge

4fde5f7

Clean

66487bc

MichelDucartier merged commit cdbfb35 into master Oct 29, 2025
1 check failed

MichelDucartier deleted the generalize_prompt_tokenizers branch October 29, 2025 16:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Generalize prompt tokenizers #19

Generalize prompt tokenizers #19

Uh oh!

MichelDucartier commented Oct 28, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Generalize prompt tokenizers #19

Generalize prompt tokenizers #19

Uh oh!

Conversation

MichelDucartier commented Oct 28, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants