
Add probe QK hook worker for Apple Silicon backend #1

Open
tburleyinfo wants to merge 56 commits into IBM:main from tburleyinfo:sandbox

Conversation

@tburleyinfo tburleyinfo commented Mar 16, 2026

Summary

  • switch the Metal attention capture path to self_attn-based Q/K/V reconstruction and restore proper hook teardown
  • tighten the Metal worker inline documentation and add clearer housing-analogy notes and flowcharts
  • reorganize sandbox assets and remove the obsolete Metal-specific Granite config

Details

  • capture raw x at the Metal self_attn boundary and reconstruct attention-ready Q/K/V inside the worker
  • archive per-run qkv.pt notebooks and restore original wrapped modules during worker teardown
  • expand worker comments around hook installation, capture, flush, and execution flow
  • move sandbox assets into scripts, markdowns, and visualizations
  • add .gitignore entries for Python cache and build artifacts
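The self_attn-boundary capture described above can be sketched as a forward pre-hook that grabs the raw x entering each attention module and reconstructs attention-ready Q/K/V through the module's own projection layers. This is an illustrative sketch, not the repo's actual worker API: the class name, cache layout, and the assumption that the module exposes q_proj/k_proj/v_proj (as Granite-style models do) are all hypothetical.

```python
import torch
import torch.nn as nn


class SelfAttnCapture:
    """Sketch of a worker that captures raw x at the self_attn boundary
    and reconstructs Q/K/V inside the recording path.

    Assumes each layer exposes a Granite-style self_attn module with
    q_proj / k_proj / v_proj linear layers (hypothetical names)."""

    def __init__(self):
        self.cache = {}      # layer index -> {"q": ..., "k": ..., "v": ...}
        self._handles = []   # hook handles for teardown

    def install(self, model):
        for i, layer in enumerate(model.layers):
            attn = layer.self_attn
            handle = attn.register_forward_pre_hook(self._make_hook(i, attn))
            self._handles.append(handle)

    def _make_hook(self, idx, attn):
        def hook(module, args):
            x = args[0]  # raw hidden states entering self_attn
            with torch.no_grad():
                # Reconstruct attention-ready Q/K/V from the raw input.
                self.cache[idx] = {
                    "q": attn.q_proj(x).detach(),
                    "k": attn.k_proj(x).detach(),
                    "v": attn.v_proj(x).detach(),
                }
        return hook

    def teardown(self):
        # Remove hooks so repeated runs do not accumulate stale handles.
        for h in self._handles:
            h.remove()
        self._handles.clear()
```

A pre-hook (rather than a post-hook) keeps the capture independent of whatever the attention module returns, which matters for modules that compute q/k internally and only emit attention output.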

Validation

  • ran python3 -m py_compile against the Metal worker after comment and flow updates
  • verified the branch was pushed to origin/sandbox

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Switch the Metal Granite probe worker to the current self_attn-based capture path and keep the capture data in qkv artifacts that are normalized back into the analyzer's expected q/k view.

Key resolution details:

- keep the Metal worker at the self_attn boundary and compute raw x plus projected q/k/v inside the recording path

- preserve per-sample batch packets in the Metal cache format so batch analysis stays aligned with the non-Metal worker

- restore proper worker teardown via HookLLM disposal so repeated hooked runs in the same Python process do not inherit stale wrapped self_attn modules

- retain the Metal-specific Granite config override in the demo and keep the Metal head set in a dedicated config file

- keep analyzer-side debug support and qkv normalization in place

- add updated housing-analogy documentation plus Graphviz flowcharts for both the housing view and the worker control-flow view

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Delete the Metal-specific Granite attention-tracker config and stop selecting it from the demo so Granite uses the standard config again.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Expand and align the inline housing-analogy comments in the Metal worker so they match the current self_attn capture flow, notebook/archive model, and teardown behavior.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Move sandbox assets into scripts, markdowns, and visualizations, refresh the Metal worker diagrams, and add more detailed inline comments to the Metal hook installation flow.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Add repo ignore rules for Python build artifacts and expand the Metal worker comments around hook installation, Q/K/V capture, cache flushing, and execution flow.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
@tburleyinfo tburleyinfo changed the title Improve Metal self-attn capture flow and sandbox docs Add probe QK hook worker for Apple Silicon backend Mar 16, 2026
Add comments explaining how Metal qkv artifacts are normalized into the legacy qk cache view before the attention analyzer consumes them.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
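The qkv-to-legacy-qk normalization this commit documents might look roughly like the following: drop v and reshape q/k into the per-head layout the analyzer's legacy qk cache view expects. The function name, cache shapes, and layout are assumptions for illustration only.

```python
import torch


def normalize_qkv_to_qk(qkv_cache, num_heads):
    """Hypothetical sketch: convert per-layer {"q", "k", "v"} packets of
    shape (batch, seq, dim) into the legacy per-head qk view of shape
    (batch, heads, seq, head_dim), discarding v."""
    qk_cache = {}
    for layer, packet in qkv_cache.items():
        q, k = packet["q"], packet["k"]
        b, s, d = q.shape
        hd = d // num_heads  # head dimension
        qk_cache[layer] = {
            "q": q.view(b, s, num_heads, hd).transpose(1, 2),
            "k": k.view(b, s, num_heads, hd).transpose(1, 2),
        }
    return qk_cache
```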
Update the attention tracker notebook to set max_model_len=2048 so HookLLM initialization does not fail under MLX auto memory mode on Apple Silicon.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
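The notebook change above presumably amounts to a one-argument cap at construction time. This is a non-runnable sketch: `HookLLM` is the project's own class, the model id is a placeholder, and only `max_model_len=2048` is taken from the commit message.

```python
# Hypothetical construction; only max_model_len=2048 comes from the PR.
llm = HookLLM(
    model="<granite-checkpoint>",  # placeholder model id
    max_model_len=2048,  # avoids init failure under MLX auto memory mode
)
```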
Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
This changes the GPU probe worker matching logic to support models whose attention module is exposed as model.layers.<i>.self_attn instead of model.layers.<i>.self_attn.attn.
This keeps the existing tuple-based attention hook path and adds a fallback for Granite-style self_attn modules that compute q and k internally. In that fallback, the worker hooks q_proj and k_proj directly and stores their outputs under the same qk cache structure.
This skips the legacy tuple-based q/k capture path when a matched attention module does not expose q and k in its forward-hook input tuple. Granite-style self_attn modules can then rely on the q_proj/k_proj fallback without crashing.
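The q_proj/k_proj fallback for Granite-style modules could be sketched as forward hooks on the projection layers themselves, writing into the same qk cache structure the tuple-based path uses. Function and key names here are illustrative, not the repo's actual API.

```python
import torch


def install_qk_hooks(attn, layer_idx, qk_cache):
    """Fallback for self_attn modules that compute q and k internally
    (so the forward-hook input tuple never exposes them): hook q_proj
    and k_proj directly and store their outputs under the same qk
    cache structure as the legacy tuple-based path."""
    handles = []

    def store(key):
        def hook(module, args, output):
            qk_cache.setdefault(layer_idx, {})[key] = output.detach()
        return hook

    # Only install the fallback when the projections actually exist.
    if hasattr(attn, "q_proj") and hasattr(attn, "k_proj"):
        handles.append(attn.q_proj.register_forward_hook(store("q")))
        handles.append(attn.k_proj.register_forward_hook(store("k")))
    return handles
```

Guarding on `hasattr` mirrors the matching logic described above: modules that do expose q/k in their forward-hook tuple keep the existing path, and only Granite-style modules take the projection-level fallback.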
This updates the actsteer, attention-tracker, and core-reranker Colab notebooks to default to RedHatAI/granite-3.1-2b-instruct-quantized.w4a16 and adds a temporary activation-steering config for that checkpoint.
