forked from karpathy/autoresearch
Experiment: Attention patterns, GQA, head dim, value embeddings, softcap #5
Open
Labels: experiment (Hyperparameter or architecture experiment), priority: medium (Medium impact), size: L (Large — 15+ experiments or 3+ hours)
Description
Objective
Optimize attention configuration: window patterns, grouped-query attention, head dimension, value embeddings, and logit softcap.
1. Attention Window Patterns
Current: WINDOW_PATTERN = "SSSL" (S=half context, L=full context). Last layer forced to L.
| ID | Pattern | % Global | Priority | Rationale |
|---|---|---|---|---|
| 2A | "SSLL" | 50% | HIGH | More global access at moderate cost |
| 2B | "L" | 100% | MEDIUM | Upper bound — may reduce throughput |
| 2C | "SL" | 50% | HIGH | Alternating — good balance |
| 2D | "SSSLSSLL" | 37.5% | LOW | Gradual expansion |
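The patterns above can be read as a per-layer window schedule. A minimal sketch, assuming the pattern string is tiled cyclically across layers and the last layer is forced to full context as the current config describes (the function name and signature are illustrative, not the repo's API):

```python
def layer_window_sizes(pattern: str, n_layer: int, context: int) -> list[int]:
    """Expand a pattern like "SSSL" into per-layer attention window sizes.

    S = half context, L = full context; the string is tiled across layers
    and the final layer is always forced global.
    """
    sizes = [context if pattern[i % len(pattern)] == "L" else context // 2
             for i in range(n_layer)]
    sizes[-1] = context  # last layer forced to L
    return sizes

print(layer_window_sizes("SSLL", 8, 1024))
# [512, 512, 1024, 1024, 512, 512, 1024, 1024]
```

Counting the `L` entries in one period gives the "% Global" column: "SSLL" and "SL" are 50%, "SSSLSSLL" is 3/8 = 37.5%.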
2. Head Dimension
Current: HEAD_DIM = 128, giving 4 heads at dim=512.
| ID | HEAD_DIM | Effect | Priority |
|---|---|---|---|
| 7A | 64 | Doubles head count (8 at dim=512) | HIGH |
| 7B | 96 | Only works at certain model dims | LOW |
HEAD_DIM=64 yields more, smaller heads, which can give more diverse attention patterns; it is common in GPT-2 and LLaMA-small. Test after the depth/width winner is known.
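The head-count arithmetic behind the table, sketched below, also shows why 7B is listed as working only at certain model dims (96 does not divide 512). The helper name is illustrative:

```python
def head_count(model_dim: int, head_dim: int) -> int:
    """Number of attention heads when model_dim is split into head_dim slices."""
    assert model_dim % head_dim == 0, "head_dim must divide model_dim"
    return model_dim // head_dim

print(head_count(512, 128))  # 4 heads (baseline)
print(head_count(512, 64))   # 8 heads (experiment 7A)
# head_count(512, 96) would trip the assertion: 96 does not divide 512
```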
3. Grouped Query Attention
Current: n_kv_head = n_head (full MHA). With only 4 heads at baseline, GQA options are limited.
| ID | n_kv_head | Priority | Notes |
|---|---|---|---|
| 4A | n_head/2 | MEDIUM | Saves memory, may lose capacity at small scale |
| 4B | 1 (MQA) | LOW | Too aggressive at ≤4 heads |
| 4C | GQA + extra depth | MEDIUM | Reinvest memory savings into more layers |
Dependencies: More viable after HEAD_DIM=64 (more heads to group).
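The sharing scheme can be sketched as follows: each group of query heads reuses one K/V head, so 4A halves the K/V heads and 4B (MQA) keeps a single one. This is an illustrative stand-in for the real tensor op (typically a `repeat_interleave` on the K/V head axis), not the repo's implementation:

```python
def expand_kv_heads(kv_heads: list, n_head: int) -> list:
    """Map n_kv_head K/V heads onto n_head query heads by group-wise reuse."""
    n_kv = len(kv_heads)
    assert n_head % n_kv == 0, "n_head must be a multiple of n_kv_head"
    group = n_head // n_kv  # query heads per K/V head
    return [kv_heads[i // group] for i in range(n_head)]

print(expand_kv_heads(["kv0", "kv1"], 4))  # ['kv0', 'kv0', 'kv1', 'kv1']
```

With only 4 query heads at baseline, the grouping choices are 2 K/V heads (4A) or 1 (4B), which is why this is more interesting after HEAD_DIM=64 lifts the head count to 8.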
4. Value Embeddings
Current: ResFormer-style on alternating layers with learned gating.
| ID | Change | Priority |
|---|---|---|
| 5A | Value embeds on every layer | MEDIUM |
| 5B | Remove value embeddings entirely | MEDIUM |
| 5C | Fixed gate weight (remove learned gating) | LOW |
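One plausible form of the gated mix these variants toggle, sketched as a scalar blend between the attention values and a token value embedding (an assumption about the ResFormer-style residual, not the repo's exact formulation; 5C would freeze `gate` instead of learning it, 5B drops the `ve` term entirely):

```python
def mix_values(v: list[float], ve: list[float], gate: float) -> list[float]:
    """Blend attention values v with a token value embedding ve via a scalar gate."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(v, ve)]

print(mix_values([1.0, 2.0], [0.0, 0.0], 0.5))  # [0.5, 1.0]
```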
5. Logit Softcap
Current: softcap = 15 applied via softcap * tanh(logits / softcap).
| ID | Value | Priority |
|---|---|---|
| 6A | 30 | MEDIUM |
| 6B | Remove entirely | MEDIUM |
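The current transform, written out directly from the formula above: logits are squashed to the open interval (-softcap, softcap) while staying near-identity for small values. 6A loosens the bound to 30; 6B removes the transform.

```python
import math

def softcap_logits(logits: list[float], cap: float = 15.0) -> list[float]:
    """Apply softcap * tanh(logits / softcap) elementwise."""
    return [cap * math.tanh(x / cap) for x in logits]

print(softcap_logits([0.0, 1.0, 100.0]))
# small logits pass through nearly unchanged; large ones saturate just below 15
```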
Suggested Execution Order
- Wave 1 (independent): 2A, 2C, 5A, 5B, 6A, 6B
- Wave 2 (after depth/width winner): 7A, 4A, 4C
- Wave 3: Combine all winners
🤖 Generated with Claude Code