
Add probe QK hook worker for Apple Silicon backend #1

Open
tburleyinfo wants to merge 56 commits into IBM:main from tburleyinfo:sandbox

Conversation

@tburleyinfo tburleyinfo commented Mar 16, 2026

Summary

  • switch the Metal attention capture path to self_attn-based Q/K/V reconstruction and restore proper hook teardown
  • tighten the Metal worker inline documentation and add clearer housing-analogy notes and flowcharts
  • reorganize sandbox assets and remove the obsolete Metal-specific Granite config

Details

  • capture raw x at the Metal self_attn boundary and reconstruct attention-ready Q/K/V inside the worker
  • archive per-run qkv.pt notebooks and restore original wrapped modules during worker teardown
  • expand worker comments around hook installation, capture, flush, and execution flow
  • move sandbox assets into scripts, markdowns, and visualizations
  • add .gitignore entries for Python cache and build artifacts
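The self_attn-boundary capture described above can be sketched as a forward pre-hook that grabs the raw x entering each attention module and reconstructs attention-ready Q/K/V through the module's own projection layers. This is an illustrative sketch, not the repo's actual worker API: the class name, cache layout, and the assumption that the module exposes q_proj/k_proj/v_proj (as Granite-style models do) are all hypothetical.

```python
import torch
import torch.nn as nn


class SelfAttnCapture:
    """Sketch of a worker that captures raw x at the self_attn boundary
    and reconstructs Q/K/V inside the recording path.

    Assumes each layer exposes a Granite-style self_attn module with
    q_proj / k_proj / v_proj linear layers (hypothetical names)."""

    def __init__(self):
        self.cache = {}      # layer index -> {"q": ..., "k": ..., "v": ...}
        self._handles = []   # hook handles for teardown

    def install(self, model):
        for i, layer in enumerate(model.layers):
            attn = layer.self_attn
            handle = attn.register_forward_pre_hook(self._make_hook(i, attn))
            self._handles.append(handle)

    def _make_hook(self, idx, attn):
        def hook(module, args):
            x = args[0]  # raw hidden states entering self_attn
            with torch.no_grad():
                # Reconstruct attention-ready Q/K/V from the raw input.
                self.cache[idx] = {
                    "q": attn.q_proj(x).detach(),
                    "k": attn.k_proj(x).detach(),
                    "v": attn.v_proj(x).detach(),
                }
        return hook

    def teardown(self):
        # Remove hooks so repeated runs do not accumulate stale handles.
        for h in self._handles:
            h.remove()
        self._handles.clear()
```

A pre-hook (rather than a post-hook) keeps the capture independent of whatever the attention module returns, which matters for modules that compute q/k internally and only emit attention output.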

Validation

  • ran python3 -m py_compile against the Metal worker after comment and flow updates
  • verified the branch was pushed to origin/sandbox

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Switch the Metal Granite probe worker to the current self_attn-based capture path and keep the capture data in qkv artifacts that are normalized back into the analyzer's expected q/k view.

Key resolution details:

- keep the Metal worker at the self_attn boundary and compute raw x plus projected q/k/v inside the recording path

- preserve per-sample batch packets in the Metal cache format so batch analysis stays aligned with the non-Metal worker

- restore proper worker teardown via HookLLM disposal so repeated hooked runs in the same Python process do not inherit stale wrapped self_attn modules

- retain the Metal-specific Granite config override in the demo and keep the Metal head set in a dedicated config file

- keep analyzer-side debug support and qkv normalization in place

- add updated housing-analogy documentation plus Graphviz flowcharts for both the housing view and the worker control-flow view

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Delete the Metal-specific Granite attention-tracker config and stop selecting it from the demo so Granite uses the standard config again.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Expand and align the inline housing-analogy comments in the Metal worker so they match the current self_attn capture flow, notebook/archive model, and teardown behavior.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Move sandbox assets into scripts, markdowns, and visualizations, refresh the Metal worker diagrams, and add more detailed inline comments to the Metal hook installation flow.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
Add repo ignore rules for Python build artifacts and expand the Metal worker comments around hook installation, Q/K/V capture, cache flushing, and execution flow.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
@tburleyinfo tburleyinfo changed the title Improve Metal self-attn capture flow and sandbox docs Add probe QK hook worker for Apple Silicon backend Mar 16, 2026
Add comments explaining how Metal qkv artifacts are normalized into the legacy qk cache view before the attention analyzer consumes them.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
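The qkv-to-legacy-qk normalization this commit documents might look roughly like the following: drop v and reshape q/k into the per-head layout the analyzer's legacy qk cache view expects. The function name, cache shapes, and layout are assumptions for illustration only.

```python
import torch


def normalize_qkv_to_qk(qkv_cache, num_heads):
    """Hypothetical sketch: convert per-layer {"q", "k", "v"} packets of
    shape (batch, seq, dim) into the legacy per-head qk view of shape
    (batch, heads, seq, head_dim), discarding v."""
    qk_cache = {}
    for layer, packet in qkv_cache.items():
        q, k = packet["q"], packet["k"]
        b, s, d = q.shape
        hd = d // num_heads  # head dimension
        qk_cache[layer] = {
            "q": q.view(b, s, num_heads, hd).transpose(1, 2),
            "k": k.view(b, s, num_heads, hd).transpose(1, 2),
        }
    return qk_cache
```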
Update the attention tracker notebook to set max_model_len=2048 so HookLLM initialization does not fail under MLX auto memory mode on Apple Silicon.

Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
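The notebook change above presumably amounts to a one-argument cap at construction time. This is a non-runnable sketch: `HookLLM` is the project's own class, the model id is a placeholder, and only `max_model_len=2048` is taken from the commit message.

```python
# Hypothetical construction; only max_model_len=2048 comes from the PR.
llm = HookLLM(
    model="<granite-checkpoint>",  # placeholder model id
    max_model_len=2048,  # avoids init failure under MLX auto memory mode
)
```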
Signed-off-by: Timothy Burley <34224160+tburleyinfo@users.noreply.github.com>
This changes the GPU probe worker matching logic to support models whose attention module is exposed as model.layers.<i>.self_attn instead of model.layers.<i>.self_attn.attn.
This keeps the existing tuple-based attention hook path and adds a fallback for Granite-style self_attn modules that compute q and k internally. In that fallback, the worker hooks q_proj and k_proj directly and stores their outputs under the same qk cache structure.
This skips the legacy tuple-based q/k capture path when a matched attention module does not expose q and k in its forward-hook input tuple. Granite-style self_attn modules can then rely on the q_proj/k_proj fallback without crashing.
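The q_proj/k_proj fallback for Granite-style modules could be sketched as forward hooks on the projection layers themselves, writing into the same qk cache structure the tuple-based path uses. Function and key names here are illustrative, not the repo's actual API.

```python
import torch


def install_qk_hooks(attn, layer_idx, qk_cache):
    """Fallback for self_attn modules that compute q and k internally
    (so the forward-hook input tuple never exposes them): hook q_proj
    and k_proj directly and store their outputs under the same qk
    cache structure as the legacy tuple-based path."""
    handles = []

    def store(key):
        def hook(module, args, output):
            qk_cache.setdefault(layer_idx, {})[key] = output.detach()
        return hook

    # Only install the fallback when the projections actually exist.
    if hasattr(attn, "q_proj") and hasattr(attn, "k_proj"):
        handles.append(attn.q_proj.register_forward_hook(store("q")))
        handles.append(attn.k_proj.register_forward_hook(store("k")))
    return handles
```

Guarding on `hasattr` mirrors the matching logic described above: modules that do expose q/k in their forward-hook tuple keep the existing path, and only Granite-style modules take the projection-level fallback.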
This updates the actsteer, attention-tracker, and core-reranker Colab notebooks to default to RedHatAI/granite-3.1-2b-instruct-quantized.w4a16 and adds a temporary activation-steering config for that checkpoint.
