forked from karpathy/autoresearch
Experiment: Attention patterns, GQA, head dim, value embeddings, softcap #5
Open
Labels: experiment (Hyperparameter or architecture experiment), priority: medium (Medium impact), size: L (Large — 15+ experiments or 3+ hours)
Description
Objective
Optimize attention configuration: window patterns, grouped-query attention, head dimension, value embeddings, and logit softcap.
1. Attention Window Patterns
Current: WINDOW_PATTERN = "SSSL" (S=half context, L=full context). Last layer forced to L.
| ID | Pattern | % Global | Priority | Rationale |
|---|---|---|---|---|
| 2A | "SSLL" | 50% | HIGH | More global access at moderate cost |
| 2B | "L" | 100% | MEDIUM | Upper bound — may reduce throughput |
| 2C | "SL" | 50% | HIGH | Alternating — good balance |
| 2D | "SSSLSSLL" | 37.5% | LOW | Gradual expansion |
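The patterns above can be read as a per-layer window schedule. A minimal sketch, assuming the pattern string is tiled cyclically across layers and the last layer is forced to full context as the current config describes (the function name and signature are illustrative, not the repo's API):

```python
def layer_window_sizes(pattern: str, n_layer: int, context: int) -> list[int]:
    """Expand a pattern like "SSSL" into per-layer attention window sizes.

    S = half context, L = full context; the string is tiled across layers
    and the final layer is always forced global.
    """
    sizes = [context if pattern[i % len(pattern)] == "L" else context // 2
             for i in range(n_layer)]
    sizes[-1] = context  # last layer forced to L
    return sizes

print(layer_window_sizes("SSLL", 8, 1024))
# [512, 512, 1024, 1024, 512, 512, 1024, 1024]
```

Counting the `L` entries in one period gives the "% Global" column: "SSLL" and "SL" are 50%, "SSSLSSLL" is 3/8 = 37.5%.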
2. Head Dimension
Current: HEAD_DIM = 128, giving 4 heads at dim=512.
| ID | HEAD_DIM | Effect | Priority |
|---|---|---|---|
| 7A | 64 | Doubles head count (8 at dim=512) | HIGH |
| 7B | 96 | Only works at certain model dims | LOW |
HEAD_DIM=64 yields more, smaller heads, which can give more diverse attention patterns; it is common in GPT-2 and LLaMA-small. Test after the depth/width winner is known.
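The head-count arithmetic behind the table, sketched below, also shows why 7B is listed as working only at certain model dims (96 does not divide 512). The helper name is illustrative:

```python
def head_count(model_dim: int, head_dim: int) -> int:
    """Number of attention heads when model_dim is split into head_dim slices."""
    assert model_dim % head_dim == 0, "head_dim must divide model_dim"
    return model_dim // head_dim

print(head_count(512, 128))  # 4 heads (baseline)
print(head_count(512, 64))   # 8 heads (experiment 7A)
# head_count(512, 96) would trip the assertion: 96 does not divide 512
```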
3. Grouped Query Attention
Current: n_kv_head = n_head (full MHA). With only 4 heads at baseline, GQA options are limited.
| ID | n_kv_head | Priority | Notes |
|---|---|---|---|
| 4A | n_head/2 | MEDIUM | Saves memory, may lose capacity at small scale |
| 4B | 1 (MQA) | LOW | Too aggressive at ≤4 heads |
| 4C | GQA + extra depth | MEDIUM | Reinvest memory savings into more layers |
Dependencies: More viable after HEAD_DIM=64 (more heads to group).
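The sharing scheme can be sketched as follows: each group of query heads reuses one K/V head, so 4A halves the K/V heads and 4B (MQA) keeps a single one. This is an illustrative stand-in for the real tensor op (typically a `repeat_interleave` on the K/V head axis), not the repo's implementation:

```python
def expand_kv_heads(kv_heads: list, n_head: int) -> list:
    """Map n_kv_head K/V heads onto n_head query heads by group-wise reuse."""
    n_kv = len(kv_heads)
    assert n_head % n_kv == 0, "n_head must be a multiple of n_kv_head"
    group = n_head // n_kv  # query heads per K/V head
    return [kv_heads[i // group] for i in range(n_head)]

print(expand_kv_heads(["kv0", "kv1"], 4))  # ['kv0', 'kv0', 'kv1', 'kv1']
```

With only 4 query heads at baseline, the grouping choices are 2 K/V heads (4A) or 1 (4B), which is why this is more interesting after HEAD_DIM=64 lifts the head count to 8.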
4. Value Embeddings
Current: ResFormer-style on alternating layers with learned gating.
| ID | Change | Priority |
|---|---|---|
| 5A | Value embeds on every layer | MEDIUM |
| 5B | Remove value embeddings entirely | MEDIUM |
| 5C | Fixed gate weight (remove learned gating) | LOW |
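One plausible form of the gated mix these variants toggle, sketched as a scalar blend between the attention values and a token value embedding (an assumption about the ResFormer-style residual, not the repo's exact formulation; 5C would freeze `gate` instead of learning it, 5B drops the `ve` term entirely):

```python
def mix_values(v: list[float], ve: list[float], gate: float) -> list[float]:
    """Blend attention values v with a token value embedding ve via a scalar gate."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(v, ve)]

print(mix_values([1.0, 2.0], [0.0, 0.0], 0.5))  # [0.5, 1.0]
```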
5. Logit Softcap
Current: softcap = 15 applied via softcap * tanh(logits / softcap).
| ID | Value | Priority |
|---|---|---|
| 6A | 30 | MEDIUM |
| 6B | Remove entirely | MEDIUM |
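The current transform, written out directly from the formula above: logits are squashed to the open interval (-softcap, softcap) while staying near-identity for small values. 6A loosens the bound to 30; 6B removes the transform.

```python
import math

def softcap_logits(logits: list[float], cap: float = 15.0) -> list[float]:
    """Apply softcap * tanh(logits / softcap) elementwise."""
    return [cap * math.tanh(x / cap) for x in logits]

print(softcap_logits([0.0, 1.0, 100.0]))
# small logits pass through nearly unchanged; large ones saturate just below 15
```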
Suggested Execution Order
- Wave 1 (independent): 2A, 2C, 5A, 5B, 6A, 6B
- Wave 2 (after depth/width winner): 7A, 4A, 4C
- Wave 3: Combine all winners
🤖 Generated with Claude Code