Migrate Gemma variants [1B, 4B, 27B] to TT-Transformers Library #28
base: mcw/gemma_3_27b/pr_1_experimental
Conversation
Force-pushed: 265f912 → 6ba3246
Force-pushed: 471e07f → f1a0685
Force-pushed: 359c925 → a6255ce
Force-pushed: a6255ce → 863496d
Pull Request Overview
This PR migrates Gemma-3 multimodal models (1B, 4B, 27B) from the experimental codebase to the unified TT-Transformers library, consolidating functionality and enabling better code reuse.
- Migrates all Gemma-3 multimodal components to tt_transformers structure
- Adds sliding window attention support and additional normalization layers (a mask sketch follows this list)
- Consolidates experimental code into the main transformer library
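
For reviewers unfamiliar with the technique, here is a minimal, framework-agnostic sketch of what a sliding-window causal mask computes. It is illustrative only and does not reproduce the PR's ttnn implementation; the function name and the use of torch are assumptions.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window_size: int) -> torch.Tensor:
    """Boolean mask where entry [q, k] is True when query position q may attend
    to key position k: no future tokens, and at most `window_size` tokens back
    (the Gemma-style local-attention constraint)."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]               # k <= q
    local = pos[:, None] - pos[None, :] < window_size   # q - k < window
    return causal & local

# With window_size=3, query position 5 may attend only to keys 3, 4 and 5.
mask = sliding_window_causal_mask(seq_len=8, window_size=3)
```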
Reviewed Changes
Copilot reviewed 49 out of 52 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| models/tt_transformers/tt/multimodal/gemma3/*.py | Migrated vision components from experimental to tt_transformers |
| models/tt_transformers/tt/model_config.py | Added Gemma-3 specific configurations and vision parameters |
| models/tt_transformers/tt/model.py | Added attention mask support for sliding window attention |
| models/tt_transformers/tt/decoder.py | Added pre/post feedforward normalization support |
| models/tt_transformers/tt/attention.py | Enhanced with sliding window attention and mask parameters |
| models/tt_transformers/tt/generator.py | Extended to support multimodal inputs and Gemma-3 models |
| models/experimental/gemma3/tt/*.py | Removed experimental files (migrated to tt_transformers) |
```python
    text_prefix = ""
else:
    text_prefix = self.state_dict_text_prefix
```
Copilot AI (Sep 24, 2025):
[nitpick] The hardcoded empty string for Gemma-3 models could be replaced with a more explicit constant or method to improve maintainability and make the special case clearer.
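
One possible shape of that suggestion, sketched here for illustration only; the constant name and helper are assumptions, not identifiers from this PR.

```python
# Hypothetical module-level constant making the Gemma-3 special case explicit.
# Gemma-3 checkpoints appear to store text weights at the state-dict root,
# hence the empty prefix.
GEMMA3_TEXT_PREFIX = ""

def resolve_text_prefix(is_gemma3: bool, default_prefix: str) -> str:
    """Return the state-dict prefix for text weights (illustrative helper)."""
    return GEMMA3_TEXT_PREFIX if is_gemma3 else default_prefix
```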
```diff
     layer = model.model.embed_tokens
 else:
-    layer = reference_model.model.embed_tokens
+    layer = reference_model.model.model.embed_tokens
```
Copilot AI (Sep 24, 2025):
Double `.model` access looks suspicious and could lead to an AttributeError if the structure is not as expected. This should be verified against the actual model structure.
Suggested change:
```diff
-layer = reference_model.model.model.embed_tokens
+layer = reference_model.model.embed_tokens
```
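
One way to make that verification explicit, sketched with a hypothetical helper rather than the PR's code:

```python
def get_reference_embed_tokens(reference_model):
    """Return the embedding layer whether the reference model nests the text
    model one level deep (model.embed_tokens) or two (model.model.embed_tokens).
    Hypothetical helper for illustration only."""
    inner = reference_model.model
    if hasattr(inner, "embed_tokens"):
        return inner.embed_tokens
    return inner.model.embed_tokens
```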
```python
        core_grid=None,  # FIXME: validate on TG ttnn.CoreGrid(y=8, x=8) if not pc_2 else None,
    )
    ttnn.deallocate(w2_in)
    w2_out = ttnn.multiply(w2_out, self.num_devices)
```
Copilot AI (Sep 24, 2025):
[nitpick] The multiplication followed by division pattern (lines 254 and 275) suggests a scaling operation that could be abstracted into a helper function to reduce code duplication and improve clarity.
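
A sketch of the kind of helper the comment alludes to; it assumes only `ttnn.multiply`, which the diff already uses, and expresses the later division as multiplication by the reciprocal. The helper name is hypothetical.

```python
import ttnn

def scale_by_num_devices(tensor, num_devices: int, inverse: bool = False):
    """Scale a tensor up by the device count, or back down when `inverse`
    is True, so both call sites share one clearly named operation."""
    factor = 1.0 / num_devices if inverse else float(num_devices)
    return ttnn.multiply(tensor, factor)
```

With such a helper, both the multiply and the later divide would read as calls to the same clearly named operation, which is the clarity gain the nitpick is after.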
```python
        TG=args.is_galaxy,
    )
else:
    # If pre_feedforward_layernorm is not in state_dict, we do not use it
```
Copilot AI (Sep 24, 2025):
[nitpick] The comment should explain why pre_feedforward_layernorm might not be present and the implications of setting it to None for model behavior.
Suggested change:
```diff
-# If pre_feedforward_layernorm is not in state_dict, we do not use it
+# pre_feedforward_layernorm may be absent from state_dict for certain model variants or checkpoint formats.
+# When it is missing, self.pre_ff_norm is set to None and the model skips this normalization step,
+# which may affect output consistency or compatibility with architectures that expect this layernorm.
```
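
To make the pattern concrete, here is a minimal torch sketch of "build the norm only if the checkpoint ships its weight, otherwise pass through"; the class and the simplified RMSNorm formula are illustrative assumptions, not the tt_transformers implementation.

```python
import torch

class OptionalRMSNorm(torch.nn.Module):
    """Wraps a norm whose weight may be missing from the checkpoint."""

    def __init__(self, state_dict: dict, key: str, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # None => this variant/checkpoint has no such norm; forward() is a pass-through.
        self.weight = torch.nn.Parameter(state_dict[key]) if key in state_dict else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.weight is None:
            return x
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

# Example (hypothetical key name):
# pre_ff_norm = OptionalRMSNorm(state_dict, "layers.0.pre_feedforward_layernorm.weight")
```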
```python
    # Deallocate input tensor to free memory
    ttnn.deallocate(x_in)
    # Reshape output back to original shape
```
Copilot AI (Sep 24, 2025):
The comment mentions "Reshape output back to original shape" but no reshaping operation follows. This comment should be removed or the intended operation should be implemented.
Suggested change:
```python
if seq_len >= MAX_MM_SEQ_LEN:
    output = ttnn.reshape(output, [1, seq_len, -1])
```
Force-pushed: 863496d → 41a117d
Force-pushed: fafc552 → 7ee5ddc
Force-pushed: 41a117d → d40f3f7
Ticket
Link to JIRA Issue
Problem description
Migration of Gemma-3-27b-it from the experimental setup to the TT-Transformers library.
What's changed
Moved the experimental Gemma-3-27b-it to TT-Transformers.
Added sliding_window support in Attention.
Added pre_feedforward_norm and post_feedforward_norm support in the decoder (see the ordering sketch below).
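
For reference, the Gemma-3 decoder sandwiches both the attention and feed-forward sub-blocks between norms. Below is a minimal sketch of the intended ordering, with norm names following the Hugging Face checkpoint keys; the function itself is illustrative pseudocode, not the tt_transformers decoder.

```python
def gemma3_decoder_block(x, attn, mlp, norms):
    """norms: mapping from the four per-layer norm names to callables."""
    h = norms["input_layernorm"](x)
    x = x + norms["post_attention_layernorm"](attn(h))

    h = norms["pre_feedforward_layernorm"](x)             # support added by this PR
    x = x + norms["post_feedforward_layernorm"](mlp(h))   # support added by this PR
    return x
```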
Checklist