
Conversation

@MohammedTaherMcW

Ticket

Link to JIRA Issue

Problem description

Enable support for the Gemma-3-27b-it model.

What's changed

  • Added support for the gemma-3-27b-it model.
  • Updated model_config.py to support gemma-3-27b-it, including end-of-sequence (EoS) token handling.
  • Updated load_checkpoints.py to support gemma-3-27b-it weight loading.
  • Modified the apply_scaling logic to handle both LLaMA and gemma-3-27b-it models (see the sketch after this list).
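
For illustration, a minimal sketch of how the apply_scaling routing could look, assuming HF-style rope_scaling keys; the repository's actual parameter names and call sites may differ:

import math
import torch

def apply_scaling(freqs: torch.Tensor, rope_scaling: dict | None) -> torch.Tensor:
    # Hypothetical sketch: route RoPE frequency scaling by the configured scaling type.
    if not rope_scaling:
        return freqs
    rope_type = rope_scaling.get("rope_type") or rope_scaling.get("type")
    if rope_type == "linear":
        # Gemma-3 global-attention layers use plain linear scaling.
        return freqs / rope_scaling["factor"]
    if rope_type == "llama3":
        # Llama-3.x wavelength-dependent scaling.
        factor = rope_scaling["factor"]
        low = rope_scaling["low_freq_factor"]
        high = rope_scaling["high_freq_factor"]
        old_len = rope_scaling["original_max_position_embeddings"]
        wavelen = 2 * math.pi / freqs
        smooth = (old_len / wavelen - low) / (high - low)
        scaled = torch.where(
            wavelen > old_len / low,
            freqs / factor,
            (1 - smooth) * freqs / factor + smooth * freqs,
        )
        return torch.where(wavelen < old_len / high, freqs, scaled)
    return freqs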

Checklist

@MohammedTaherMcW MohammedTaherMcW force-pushed the mcw/gemma_3_27b/pr_1_experimental branch from 265f912 to 6ba3246 Compare August 27, 2025 19:15
@jennychristopher jennychristopher changed the title Add Experimental Support for Gemma-3-27b-it Add Experimental Support for Gemma variants [1B, 27b] Sep 12, 2025
@jennychristopher jennychristopher changed the title Add Experimental Support for Gemma variants [1B, 27b] Add Experimental Support for Gemma variants [1B, 27B] Sep 12, 2025

Copilot AI left a comment


Pull Request Overview

This PR adds experimental support for Gemma variants (1B and 27B) while refactoring the Gemma3 model architecture. The implementation moves from a Gemma3-4B-specific structure to a generalized Gemma3 architecture that supports multiple model sizes, with updated attention mechanisms (including sliding-window attention) and enhanced multimodal capabilities.

Key Changes

  • Added support for gemma-3-1b and gemma-3-27b model configurations with device-specific parameters
  • Refactored attention mechanism to use causal masks and sliding window patterns
  • Enhanced multimodal support with improved vision-text integration
  • Updated RMSNorm implementation with distributed computation capabilities

Reviewed Changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 6 comments.

Summary per file:

  • models/tt_transformers/tt/model_config.py: Added configuration support for Gemma 1B and 27B variants with device-specific parameters
  • models/experimental/gemma3/tt/text_model.py: Refactored Gemma3Transformer with enhanced attention masking and multimodal support
  • models/experimental/gemma3/tt/attention.py: Updated attention mechanism with causal mask support and sliding window functionality
  • models/experimental/gemma3/tt/decoder.py: Enhanced TransformerBlock with improved attention routing and residual connections
  • models/experimental/gemma3/tt/mlp.py: Updated MLP with device scaling and distributed computation

Comments suppressed due to low confidence (3)

models/experimental/gemma3/tt/text_model.py:1

  • [nitpick] Returning None as the 6th element without clear documentation makes the return tuple unclear. Consider using a named tuple or adding a comment explaining what this None represents.
"""

models/experimental/gemma3/tt/text_model.py:1

  • [nitpick] Adding **kwargs to forward methods without documenting the expected kwargs could lead to confusion and potential misuse. Consider explicitly defining the expected parameters or adding documentation.
"""

models/experimental/gemma3/tt/text_model.py:1

  • [nitpick] Adding **kwargs to forward methods without documenting the expected kwargs could lead to confusion and potential misuse. Consider explicitly defining the expected parameters or adding documentation.
"""


  x_1BSH,
  device=self.mesh_device,
- dtype=ttnn.bfloat16,
+ dtype=ttnn.bfloat8_b,

Copilot AI Sep 24, 2025


Changing from ttnn.bfloat16 to ttnn.bfloat8_b reduces precision which may impact model accuracy. Ensure this change has been validated through testing.

Suggested change
- dtype=ttnn.bfloat8_b,
+ dtype=ttnn.bfloat16,

@flexaihq flexaihq deleted a comment from Copilot AI Sep 24, 2025
@flexaihq flexaihq deleted a comment from Copilot AI Sep 24, 2025
@flexaihq flexaihq deleted a comment from Copilot AI Sep 24, 2025
@flexaihq flexaihq deleted a comment from Copilot AI Sep 24, 2025
@flexaihq flexaihq deleted a comment from Copilot AI Sep 24, 2025

@jschuhmacher jschuhmacher left a comment


I'm not entirely done with the review, but I'll post these already.

if is_gemma3:
    self.rms_norm_add_unit_offset = True
    self.embed_scale = self.dim**0.5
    self.sliding_window = 512


The sliding window parameter is present in the HF configuration (1024 by default). It would be better to get it from there.
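
A hedged sketch of reading it from the HF config instead of hard-coding it, assuming a transformers release with Gemma-3 support (the multimodal config nests the text config):

from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("google/gemma-3-27b-it")
text_config = getattr(hf_config, "text_config", hf_config)  # text-only checkpoints are not nested
sliding_window = getattr(text_config, "sliding_window", 1024)  # fall back to the default noted above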

Returns the number of tokens per chunk, accounting for the extra class token
"""
- return (self.vision_chunk_size // self.vision_patch_size) ** 2 + 1
+ return (self.image_size // self.vision_patch_size) ** 2 + 1


This is common code. Have you verified other uses of vision_chunk_size in e.g. Llama models? I'd suggest keeping the previous calculation intact, and making a new one for image chunks.
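
For instance, keeping the existing property intact and adding a separate one for image-based chunking (the new property name is illustrative; both would sit on the shared model-args class):

@property
def vision_tokens_per_chunk(self) -> int:
    # Existing behaviour, still used by the Llama vision models.
    return (self.vision_chunk_size // self.vision_patch_size) ** 2 + 1

@property
def vision_tokens_per_image(self) -> int:
    # New Gemma-3 style calculation based on image_size.
    return (self.image_size // self.vision_patch_size) ** 2 + 1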

Comment on lines 2437 to 2588
+ if hasattr(model.model, "rotary_emb_local") and model.model.rotary_emb_local is not None:
+     wrapper = HfDecoderWrapper(layer, self.head_dim, model.model.rotary_emb, model.model.rotary_emb_local)
+ else:
+     rotary_emb_local = None
+     wrapper = HfDecoderWrapper(layer, self.head_dim, model.model.rotary_emb, rotary_emb_local)
- wrapper = HfDecoderWrapper(layer, self.head_dim, model.model.rotary_emb)


How about:

rotary_emb_local = getattr(model.model, "rotary_emb_local", None)
wrapper = HfDecoderWrapper(layer, self.head_dim, model.model.rotary_emb, rotary_emb_local=rotary_emb_local)

num_layers=None,
):
from models.tt_transformers.tt.model import Transformer
if "HF_MODEL" in os.environ and "gemma-3" in os.environ["HF_MODEL"].lower():


This is common code, also used for models that are supposed to be more production quality. Loading a module from experimental does not seem like the right thing to do here. Could you think of another way of accomplishing this? For inspiration, maybe look at gemma3/demo/text_demo.py#L71.
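
One possible shape for this, without claiming it matches text_demo.py: let callers inject the transformer class so the shared code never imports from models.experimental (the function name and constructor arguments below are placeholders):

def create_transformer(model_args, mesh_device, num_layers=None, transformer_cls=None):
    # Shared code only knows about the production Transformer.
    if transformer_cls is None:
        from models.tt_transformers.tt.model import Transformer
        transformer_cls = Transformer
    return transformer_cls(model_args, mesh_device=mesh_device, num_layers=num_layers)

# In the Gemma demo only, never in the shared path:
# from models.experimental.gemma3.tt.text_model import Gemma3Transformer
# model = create_transformer(args, mesh_device, transformer_cls=Gemma3Transformer)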

),
)



I'd suggest moving all the masking utilities to their own module. And since they are Gemma specific, they could be located with the rest of the Gemma code.
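
A rough sketch of what such a module could contain (the module path and function names are suggestions, and torch is used only to illustrate the mask shapes before they are converted to ttnn tensors):

# models/experimental/gemma3/tt/attention_masks.py  (suggested location)
import torch

def global_causal_mask(seq_len: int, dtype=torch.float32) -> torch.Tensor:
    # Standard lower-triangular causal mask: -inf above the diagonal.
    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype)
    return torch.triu(mask, diagonal=1)

def sliding_window_causal_mask(seq_len: int, window: int, dtype=torch.float32) -> torch.Tensor:
    # Causal mask that additionally blocks positions more than `window` tokens back.
    mask = global_causal_mask(seq_len, dtype)
    too_old = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=-window)
    return mask.masked_fill(too_old, float("-inf"))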

chunk_start_idx=None,
get_last_token=-1,
kv_cache=None,
**kwargs,


This makes it hard to see which arguments could be expected. Why not list the potential arguments with default values above?
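
For example, the **kwargs catch-all could be replaced with the optional arguments the Gemma path actually passes; in this sketch both the leading positional parameters and the extra keyword names are placeholders:

def forward(
    self,
    x,
    current_pos,
    rot_mats=None,
    page_table=None,
    chunk_start_idx=None,
    get_last_token=-1,
    kv_cache=None,
    attention_masks=None,  # e.g. (sliding_causal_mask, global_causal_mask)
):
    ...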

page_table=None,
kv_cache=None,
argmax_on_device=False,
**kwargs,


This makes it hard to see which arguments could be expected. Why not list the potential arguments with default values above?



# SPDX-FileCopyrightText: © 2025 Tenstorrent AI ULC
# SPDX-FileCopyrightText: © 2025 Tenstorrent Inc.


General comment for all the files: it seems Tenstorrent prefers the

# SPDX-FileCopyrightText: © 2025 Tenstorrent AI ULC

text instead.

model_args.rope_theta,
model_args.rope_scaling,
)


You're loading the first layer, which is a sliding-attention layer. The rotary setup and precompute_freqs are set up for a global-attention layer, so that should not match the reference output?


Maybe another test with a global attention layer would also be interesting?
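
A hedged way to pick one layer of each kind for such tests, assuming a recent transformers release where the Gemma-3 text config exposes layer_types:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-27b-it").text_config
global_layers = [i for i, t in enumerate(cfg.layer_types) if t == "full_attention"]
sliding_layers = [i for i, t in enumerate(cfg.layer_types) if t == "sliding_attention"]
# Parametrizing the existing test over e.g. sliding_layers[0] and global_layers[0]
# would exercise both the local and the global RoPE setups.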


seqlen = 1

cos, sin = precompute_freqs(


Same as above: the rope setup and the layer seem to be mismatched.

if mode == "decode" and not self.args.is_galaxy:
    x = ttnn.to_memory_config(x, self.model_config["DECODE_RESIDUAL_MEMCFG"], activation_dtype)
elif activation_dtype is not None and x.dtype != activation_dtype:
    x = ttnn.typecast(x, activation_dtype)


Nitpick, naming the masks first makes it a bit easier to read, for instance:

Suggested change
x = ttnn.typecast(x, activation_dtype)
causal_mask = None
if attention_masks is not None:
    sliding_causal_mask, global_causal_mask = attention_masks
    causal_mask = sliding_causal_mask if getattr(layer.attention, "is_sliding", False) else global_causal_mask

@MohammedTaherMcW MohammedTaherMcW force-pushed the mcw/gemma_3_27b/pr_1_experimental branch from fafc552 to 7ee5ddc Compare October 13, 2025 10:46