Experiment: Attention patterns, GQA, head dim, value embeddings, softcap #5

@Jason-Adam

Objective

Optimize attention configuration: window patterns, grouped-query attention, head dimension, value embeddings, and logit softcap.

1. Attention Window Patterns

Current: WINDOW_PATTERN = "SSSL" (S=half context, L=full context). Last layer forced to L.

| ID | Pattern | % Global | Priority | Rationale |
|----|---------|----------|----------|-----------|
| 2A | "SSLL" | 50% | HIGH | More global access at moderate cost |
| 2B | "L" | 100% | MEDIUM | Upper bound; may reduce throughput |
| 2C | "SL" | 50% | HIGH | Alternating; good balance |
| 2D | "SSSLSSLL" | 37.5% | LOW | Gradual expansion |
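
To sanity-check the "% Global" column and see how a pattern maps onto layers, here is a minimal sketch. It assumes the pattern string is tiled cyclically across layers before the last layer is forced to L; the tiling rule and the helper names `expand_pattern` and `global_fraction` are illustrative assumptions, not taken from the training code.

```python
def global_fraction(pattern: str) -> float:
    """Fraction of entries in the pattern that use full context ('L')."""
    return pattern.count("L") / len(pattern)


def expand_pattern(pattern: str, n_layers: int) -> list[str]:
    """Assign a window type to each layer.

    Assumption: the pattern repeats cyclically over the layers.
    The last layer is always forced to 'L', as stated in this issue.
    """
    if not pattern or set(pattern) - {"S", "L"}:
        raise ValueError("pattern must be a non-empty string of 'S' and 'L'")
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    layers = [pattern[i % len(pattern)] for i in range(n_layers)]
    layers[-1] = "L"  # last layer forced to full context
    return layers


if __name__ == "__main__":
    for candidate in ("SSSL", "SSLL", "L", "SL", "SSSLSSLL"):
        print(candidate, f"{100 * global_fraction(candidate):.1f}% global")
```

Note that forcing the last layer to L can push the effective global share above the nominal value when the layer count is not a multiple of the pattern length.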

2. Head Dimension

Current: HEAD_DIM = 128, giving 4 heads at dim=512.

| ID | HEAD_DIM | Effect | Priority |
|----|----------|--------|----------|
| 7A | 64 | Doubles head count (8 at dim=512) | HIGH |
| 7B | 96 | Requires a model dim divisible by 96 (512 is not) | LOW |

HEAD_DIM=64 doubles the head count, which may allow more diverse attention patterns. It is the head dimension used in GPT-2 and is common in small LLaMA-style models. Test after the depth/width winner is known.
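
The head counts above follow from dividing the model dim by HEAD_DIM. A small sketch of that arithmetic, including the divisibility constraint that limits 7B; the function names and the candidate dims are illustrative only.

```python
def n_heads(model_dim: int, head_dim: int) -> int:
    """Number of attention heads for a given model dim and head dim."""
    if model_dim % head_dim != 0:
        raise ValueError(
            f"model_dim={model_dim} is not divisible by head_dim={head_dim}"
        )
    return model_dim // head_dim


def valid_model_dims(head_dim: int, candidates: list[int]) -> list[int]:
    """Filter candidate model dims down to those the head dim divides evenly."""
    return [d for d in candidates if d % head_dim == 0]


if __name__ == "__main__":
    print(n_heads(512, 128))  # baseline
    print(n_heads(512, 64))   # experiment 7A
    # Hypothetical candidate widths for experiment 7B (HEAD_DIM=96)
    print(valid_model_dims(96, [384, 512, 576, 768, 1024]))
```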

3. Grouped Query Attention

Current: n_kv_head = n_head (full MHA). With only 4 heads at baseline, GQA options are limited.

| ID | n_kv_head | Priority | Notes |
|----|-----------|----------|-------|
| 4A | n_head/2 | MEDIUM | Saves memory, may lose capacity at small scale |
| 4B | 1 (MQA) | LOW | Too aggressive at ≤4 heads |
| 4C | GQA + extra depth | MEDIUM | Reinvest memory savings into more layers |

Dependencies: More viable after HEAD_DIM=64 (more heads to group).
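
To make the grouping and the savings concrete, here is a rough sketch of how query heads map onto shared KV heads and how the K/V projection size shrinks. It assumes bias-free K and V projections of shape model_dim x (n_kv_head * head_dim) and contiguous grouping of query heads; both are assumptions for illustration, not confirmed details of this codebase.

```python
def kv_head_for_query_head(q_head: int, n_head: int, n_kv_head: int) -> int:
    """Index of the KV head that a given query head reads from.

    Assumption: query heads are grouped contiguously, so heads
    0..group_size-1 share KV head 0, and so on.
    """
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head
    return q_head // group_size


def kv_proj_params(model_dim: int, head_dim: int, n_kv_head: int) -> int:
    """Parameter count of the K and V projections combined (no bias assumed)."""
    return 2 * model_dim * n_kv_head * head_dim


if __name__ == "__main__":
    n_head, head_dim, model_dim = 8, 64, 512  # i.e. after experiment 7A
    for n_kv_head in (8, 4, 1):  # MHA, 4A-style GQA, MQA
        mapping = [kv_head_for_query_head(q, n_head, n_kv_head) for q in range(n_head)]
        print(n_kv_head, mapping, kv_proj_params(model_dim, head_dim, n_kv_head))
```

With n_kv_head = n_head/2, the K/V projections (and the KV activations per token) are halved, which is the budget 4C would reinvest into depth.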

4. Value Embeddings

Current: ResFormer-style on alternating layers with learned gating.

| ID | Change | Priority |
|----|--------|----------|
| 5A | Value embeds on every layer | MEDIUM |
| 5B | Remove value embeddings entirely | MEDIUM |
| 5C | Fixed gate weight (remove learned gating) | LOW |

5. Logit Softcap

Current: softcap = 15 applied via softcap * tanh(logits / softcap).

| ID | Value | Priority |
|----|-------|----------|
| 6A | 30 | MEDIUM |
| 6B | Remove entirely | MEDIUM |
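
For reference, a scalar sketch of the softcap formula quoted above, using only the standard library. The real code presumably applies it elementwise to a tensor; the function name and the `cap=None` convention for "removed" (6B) are assumptions for illustration.

```python
import math
from typing import Optional


def softcap(logit: float, cap: Optional[float] = 15.0) -> float:
    """Smoothly bound a logit to the range [-cap, cap] via cap * tanh(logit / cap).

    cap=None stands in for experiment 6B (softcap removed): the logit
    passes through unchanged.
    """
    if cap is None:
        return logit
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for x in (0.0, 1.0, 15.0, 100.0):
        print(x, softcap(x, 15.0), softcap(x, 30.0), softcap(x, None))
```

Small logits pass through almost unchanged, while large ones saturate near the cap. Raising the cap to 30 (6A) widens the near-linear region.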

Suggested Execution Order

  1. Wave 1 (independent): 2A, 2C, 5A, 5B, 6A, 6B
  2. Wave 2 (after depth/width winner): 7A, 4A, 4C
  3. Wave 3: Combine all winners

🤖 Generated with Claude Code

Labels: experiment (Hyperparameter or architecture experiment); priority: medium (Medium impact); size: L (Large: 15+ experiments or 3+ hours)