🧬 Activation Steering for Small Language Models
From Mid-Layer Sweet Spots to Dynamic Multi-Agent Orchestration
Arthur Edmond · LLM Engineer @ Swapn
A deep dive into where, how, and why activation steering works (or doesn't) on SLMs — from mechanistic analysis to SWE-bench
Mid-layer steering (layers 15–18) at moderate coefficients (α=30) achieves 100% behavioral change on SLMs, while late layers do nothing. Steering boosts zero-shot by +16pp on GSM8K but hurts when combined with few-shot or RAG context. On MMLU-Pro (n=200), steering consistently degrades accuracy — initial n=20 "improvements" were false positives. SAE feature decomposition reveals why: contrastive vectors activate 50–60% of features diffusely, with near-zero overlap with domain-specific features. A Streamlit demo lets you explore baseline vs steered outputs live.
- Layers 15–18 at α=30 → 100% change rate on Qwen3-4B-Instruct; non-target prompts get reinterpreted toward the steering direction.
- Late layers (33–35) → 0% change even at α=200. Both instruct AND base models show the same rigidity at layers 30–35 — a transformer property, not an instruction-tuning artifact.
- Layer 15 at α=30 boosts the base model's task score from 0.70→0.97, but α≥60 → total degeneration (score 0.0). The effective window is razor-thin.
- Zero-shot CoT + steering on Qwen3-0.6B-Instruct: 46%→62% (+16pp); 5-shot + steering = interference (-8pp).
- Adaptive RAG raises path validity from 0%→68%, but steering degrades RAG patches (68%→15%→0%). Steering is optimal in the zero-shot regime only.
- Sequential vector switching: 4–7× more domain-relevant output vs baseline; vector composition dilutes the signal. Each domain has a distinct (layer, α) sweet spot.
- Mid-layers: 1–48 bits of KL divergence; late layers: <0.01 bits. Steering increases sampling diversity from 20%→100% while preserving output type.
- Custom SAE (8,192 features, 20M tokens) reveals contrastive vectors activate 50–60% of features diffusely, while domain-specific features are sparse and localized. Overlap: 0–4/20 → contrastive vectors ≠ domain knowledge. Robust across sparsity regimes (10× L1 penalty yields the same pattern).
- n=20 suggested +15pp on history; n=200 (stderr ±2.5pp) reversed every positive finding. Steering degrades all domains by 2–6pp.
- Full SAE pipeline on a hybrid conv+attention model (10 LIV conv + 6 GQA blocks): contrastive vectors are least diffuse there (26–38% vs 47–59% on Qwen), but overlap remains tiny (1–2/20). Style-not-knowledge is architecture-invariant.
- Selecting SAE features by output influence (W_dec projected through the unembedding) instead of input activation yields features entirely disjoint from input-diff features. At α=10, output-score steering causes less harm than all other methods; history shows +1.5pp (within stderr). The improvement is methodological, not substantive.
┌─────────────────────────────────────────────────┐
│ STEERABILITY vs. LAYER DEPTH │
│ │
Change Rate (%) │ ██ │
100 ─────────│ ██ ████ │
│ ██ ████ ██ │
80 ─────────│ ██ ████ ████ │
│ ██ ████ ████ ██ │
60 ─────────│ ██ ████ ████ ████ │
│ ██ ████ ████ ████ ██ │
40 ─────────│ ██ ████ ████ ████ ████ │
│ ██ ████ ████ ████ ████ ██ │
20 ─────────│ ██ ████ ████ ████ ████ ████ ██ ██ │
│ ██ ████ ████ ████ ████ ████ ████ ████ ██ ██ │
0 ─────────│───██─████─████─████─████─████─████─████─██─██──│
│ 5 10 15 18 20 22 25 30 33 35 │
│ ⚠️UNSTABLE │ ★ SWEET SPOT │ ❄️ FROZEN │
└─────────────────────────────────────────────────┘
Layer Index (α = 30)
| Zone | Layers | Behavior | Why |
|---|---|---|---|
| ⚠️ Unstable | 1–10 | High change but degenerate outputs | Representations too raw — syntactic, not semantic |
| ★ Sweet Spot | 15–18 | 100% change, coherent outputs | Syntactic→semantic transition point — malleable yet structured |
| ❄️ Frozen | 30–35 | 0–17% change even at α=200 | Already committed to output distribution — architectural rigidity |
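Mechanically, the sweet-spot intervention is one operation: add α·v to a layer's output during the forward pass. A minimal runnable sketch, with a toy linear layer standing in for a transformer block — `make_steering_hook` and the layer stand-in are illustrative, not the repo's actual API:

```python
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * vector to the layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vector  # broadcasts over batch & sequence
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Toy stand-in for a decoder block so the sketch is self-contained.
layer = torch.nn.Linear(8, 8)
vec = torch.randn(8)
vec = vec / vec.norm()  # unit-norm steering direction

handle = layer.register_forward_hook(make_steering_hook(vec, alpha=30.0))
x = torch.randn(2, 5, 8)
y = layer(x)      # steered output
handle.remove()   # restore baseline behavior
```

On a real model the hook would be registered on e.g. `model.model.layers[15]`, and α=30 is the moderate coefficient the sweep identifies as the sweet spot.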
steering-research/
│
├── 📄 article/
│ ├── main.tex # LaTeX source (~26 pages, arxiv-ready)
│ ├── main.pdf # Compiled paper
│ └── figures/ # Publication-quality figures (PDF + PNG)
│
├── 🔬 src/
│ ├── tokenizers/
│ │ ├── compare.py # 4-model tokenizer comparison + quality scoring
│ │ └── visualize.py # 5 publication-quality matplotlib figures
│ │
│ ├── analysis/
│ │ ├── sae_features.py # Contrastive activation extraction + logit lens
│ │ └── neuronpedia_features.py # Neuronpedia API: 163,840 SAE features/layer
│ │
│ ├── steering/
│ │ ├── apply_vectors.py # Phase 3a: Initial layer-35 steering (null result)
│ │ ├── midlayer_sweep.py # Phase 3b: 11 layers × 6 coefficients → sweet spot
│ │ ├── base_model_steering.py # Phase 3c: Base model — steerable but fragile
│ │ ├── budget_guidance.py # Phase 3d: Gamma predictor (null result)
│ │ ├── slm_gsm8k_steering.py # Phase 3e: SLM steering on GSM8K (Qwen3-0.6B) ★
│ │ ├── sampling_steering.py # Phase 3f: KL divergence + sampling diversity ★
│ │ ├── gsm8k_benchmark.py # Phase 3g: lm-eval validation (5-shot + 0-shot) ★
│ │ ├── mmlu_pro_benchmark_mc.py # MMLU-Pro loglikelihood benchmark (3 models)
│ │ ├── mmlu_pro_figures.py # Publication figures for MMLU-Pro results
│ │ ├── mmlu_pro_samples.py # Per-sample output capture (baseline vs steered)
│ │ ├── mmlu_pro_sample_figures.py # Log-likelihood & probability distribution figures
│ │ ├── train_sae.py # SAE training (multi-model: --model Qwen/Qwen3-4B)
│ │ ├── analyze_sae_features.py # SAE domain analysis + contrastive overlap (multi-model)
│ │ ├── feature_targeted_steering.py # Feature-targeted vectors from SAE decoder columns ★
│ │ ├── app_steering_demo.py # 🎮 Streamlit demo: live comparison of steering methods
│ │ ├── domain_vectors.py # Phase 6a: Domain-specific vector extraction ★
│ │ ├── vector_composition.py # Phase 6b: Composition tests (add vs switch) ★
│ │ └── swebench_domain_vectors.py # Phase 6c: SWE-bench cluster vectors ★
│ │
│ └── agents/
│ ├── prompt_baselines.py # 5 strategies × 29 bilingual eval cases
│ ├── steering_orchestrator.py # Phase 6d: Dynamic steering orchestrator ★
│ ├── swebench_pipeline.py # Phase 6e: SWE-bench eval pipeline + RAG ★
│ └── swebench_rag.py # Phase 6e: Repo checkout + file retrieval ★
│
├── 🌐 web/
│ ├── api/ # FastAPI backend (SSE streaming, model management)
│ │ ├── main.py # App, CORS, lifespan, request logging
│ │ ├── deps.py # ModelManager singleton (1 model in RAM)
│ │ ├── steering.py # Extracted hook + generation logic
│ │ └── routers/ # generate, models, vectors endpoints
│ └── frontend/ # React + TypeScript + Vite
│ └── src/components/ # Sidebar, ArenaPanel, ChatCard, VectorViz
│
├── 📊 results/ # All JSON results + steering vectors (.pt) + SAE weights
├── 📁 data/ # Evaluation datasets
├── 📋 PLAN.md # Research plan with status tracking
├── 🐳 Dockerfile # Multi-stage build (Node + Python)
├── 📋 justfile # Task runner (just up, just dev, etc.)
└── 📦 requirements.txt
Interactive React + FastAPI app for real-time comparison of steering methods. Streams baseline, contrastive, and SAE feature-targeted generation side by side.
Features:
- Real-time SSE streaming with 3 methods running sequentially
- Dynamic layer selection with cross-layer SAE support (warning when not at trained layer)
- Vector space visualizations (PCA, cosine similarity heatmap, L2 norms)
- Model hot-swapping (Qwen3-0.6B / Qwen3-4B)
Quick start with Docker:
just up # builds image + runs on :8000 (mount results/ for vectors)
Dev mode (hot reload):
just dev-backend # uvicorn on :8000
just dev-frontend # vite on :5173 (proxies /api → :8000)
Developed by Arthur Edmond
| Model | Vocab | Overall Score | Temporal | JSON | Multilingual |
|---|---|---|---|---|---|
| Llama-3.2-3B | 128K | 0.551 ★ | 0.316 | 0.375 | 0.816 |
| Gemma-3-1B | 262K | 0.511 | 0.226 | 0.288 | 0.845 |
| Qwen3-4B | 152K | 0.483 | 0.205 | 0.288 | 0.802 |
| Phi-3-mini | 32K | 0.438 | 0.173 | 0.249 | 0.814 |
All tokenizers score < 0.32 on temporal integrity — ISO 8601 timestamps are universally fragmented into individual characters.
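That fragmentation can be made concrete with a toy integrity score — a hypothetical metric (not the exact scoring in `compare.py`) that counts how many numeric fields of an ISO 8601 timestamp survive as whole tokens:

```python
import re

def temporal_integrity(tokens, timestamp):
    """Fraction of numeric timestamp fields that appear intact in a single token."""
    fields = re.findall(r"\d+", timestamp)  # e.g. ['2024', '03', '15', ...]
    intact = sum(any(f in t for t in tokens) for f in fields)
    return intact / len(fields)

ts = "2024-03-15T10:30:00"
char_tokens = list(ts)  # worst case: per-character fragmentation
word_tokens = ["2024", "-", "03", "-", "15", "T", "10", ":", "30", ":", "00"]
```

With per-character fragmentation the score is 0.0; with field-aligned tokens it is 1.0 — real tokenizers land near the low end, hence the <0.32 scores above.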
10 calendar vs 10 neutral prompts → per-layer mean-difference vectors → logit lens projection:
Layer 35: L2=361.0 → ✅ schedule, agenda, attendees, RSVP ❌ magnitude, density
Layer 34: L2=218.6 → ✅ schedule, agenda, calendar ❌ licking, entropy
Layer 33: L2=179.4 → ✅ agenda, schedule, invite ❌ entropy, Measured
The experiment that changed everything — 11 layers × 6 coefficients. Change rate (Qwen3-4B-Instruct):
| Layer | α=0 | α=10 | α=30 | α=60 | α=100 | α=200 |
|---|---|---|---|---|---|---|
| 15 | 0% | 17% | 100% ★ | 100% | 83% | 100% |
| 18 | 0% | 33% | 100% ★ | 100% | 100% | 67% |
| 20 | 0% | 17% | 33% | 100% | 100% | 67% |
| 33 | 0% | 0% | 17% | 17% | 17% | 17% |
| 35 | 0% | 0% | 0% | 0% | 0% | 17% |
Task score (base model):
| Layer | α=0 | α=10 | α=30 | α=60 | α=100 | α=200 |
|---|---|---|---|---|---|---|
| 15 | 0.70 | 0.70 | 0.97 ★ | 0.33 💀 | 0.03 💀 | 0.00 💀 |
| 18 | 0.70 | 0.73 | 0.73 | 0.13 💀 | 0.00 💀 | 0.00 💀 |
| 35 | 0.70 | 0.70 | 0.70 | 0.70 | 0.73 | 0.77 |
Sweet spot → degeneration in one coefficient step. Instruction tuning acts as a stabilizer.
| Budget | Avg Tokens | Valid | Savings |
|---|---|---|---|
| ∞ (baseline) | 74.6 | 100% | — |
| 32 | 74.6 | 100% | 0% |
| 512 | 74.6 | 100% | 0% |
The instruct model is already Pareto-optimal for structured output. Budget guidance targets thinking overhead, not output verbosity.
- 163,840 transcoder features per layer
- 115 calendar features via keyword search
- 9/450 random samples (0.07% density)
- Features specialize in later layers: generic "time" → specific "scheduling appointments"
| Strategy | Messages | Description |
|---|---|---|
| Zero-shot | 2 | System prompt only |
| Few-shot (3) | 8 | 3 example input/output pairs |
| Few-shot (5) | 12 | 5 example pairs |
| Chain-of-thought | 2 | Step-by-step reasoning |
| Tool use | 2 + tool | Function calling schema |
29 bilingual test cases (18 FR / 11 EN) × 4 complexity levels. Framework ready — full inference evaluation pending.
| Model | Strategy | Baseline | Best Steering | Layer | α |
|---|---|---|---|---|---|
| Instruct | zero_shot | 20% | 30% (+10%) | 25 | 60 |
| Instruct | cot | 10% | 10% (no gain) | — | — |
| Instruct | few_shot | 20% | 20% (no gain) | — | — |
| Base | zero_shot | 0% | 10% | 18 | 30 |
| Base | cot | 0% | 0% | — | — |
| Base | few_shot | 20% | 40% (+20%) ★ | 20 | 100 |
Sweet spot shifts to 64–89% depth (layers 18–25) on the 28-layer model. Base model doubles accuracy with steering. CoT hurts the 0.6B model (10% < 20% zero-shot).
lm-eval Validation (n=50) — 5-shot vs. Zero-shot CoT:
| Model | Condition | 5-shot Strict | 5-shot Flex | 0-shot Strict | 0-shot Flex |
|---|---|---|---|---|---|
| Instruct | Baseline | 48% | 48% | 38% | 46% |
| Instruct | Steered | 44% (-4) | 40% (-8) | 40% (+2) | 62% (+16) ★ |
| Base | Baseline | 48% | 48% | 36% | 28% |
| Base | Steered | 26% (-22) | 34% (-14) | 8% (-28) | 22% (-6) |
Zero-shot + steering = synergy (+16pp on instruct), 5-shot + steering = interference (-8pp). Steering improves reasoning without improving format compliance. This motivates adaptive coefficient selection:
α = f(n_few_shot, model_type) for dynamic multi-agent steering.
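One way to sketch that adaptive rule — the decay schedule below is an assumption for illustration, not a policy the repo has tested:

```python
def adaptive_alpha(n_few_shot: int, model_type: str, base_alpha: float = 30.0) -> float:
    """Steer hard in zero-shot; back off as in-context examples accumulate."""
    if model_type == "base":
        base_alpha *= 0.5  # base models degenerate earlier (α>=60 -> score 0.0)
    # Few-shot examples already carry the task signal, so steering on top
    # interferes: decay the coefficient with the number of examples.
    return base_alpha / (1 + n_few_shot)
```

Under this schedule a zero-shot instruct call gets the full α=30 sweet spot, while a 5-shot call drops to α=5, below the observed interference threshold.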
KL Divergence (bits) — Mid-layers vs. Late layers:
| Prompt | L15@α=30 | L18@α=30 | L35@α=30 | L35@α=60 |
|---|---|---|---|---|
| calendar_fr | 4.20 | 0.61 | 0.00 | 0.00 |
| calendar_en | 1.51 | 2.10 | 0.00 | 0.00 |
| ambiguous_en | 11.0 | 1.50 | 0.00 | 0.00 |
| non_cal_fr | 21.2 | 23.5 | 0.00 | 0.01 |
| non_cal_en | 20.5 | 17.8 | 0.00 | 0.00 |
Sampling Diversity (L15@α=30 vs. baseline):
| Prompt | Baseline T=0.3 | Steered T=0.3 | Baseline T=1.0 | Steered T=1.0 |
|---|---|---|---|---|
| calendar_fr | 20% | 100% | 60% | 100% |
| calendar_en | 20% | 80% | 40% | 80% |
| ambiguous_fr | 20% | 80% | 60% | 100% |
3-order-of-magnitude KL gap between mid-layers and late layers. Steering increases diversity 2–5× while preserving JSON output type. Late-layer rigidity is distributional, not just behavioral.
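The probe itself is simple: compare baseline vs. steered next-token distributions in bits (log base 2). A dependency-free sketch with toy distributions mirroring the mid-vs-late pattern (the probability values are illustrative, not the measured ones):

```python
import math

def kl_bits(p, q, eps=1e-12):
    """KL(p || q) in bits between two next-token distributions."""
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

baseline     = [0.7, 0.2, 0.1]
steered_mid  = [0.1, 0.6, 0.3]  # mid-layer steering reshapes the distribution
steered_late = [0.7, 0.2, 0.1]  # late-layer steering leaves it untouched
```

`kl_bits(baseline, steered_late)` is ~0 while `kl_bits(baseline, steered_mid)` exceeds 1 bit — the same qualitative gap as the table, where it spans three orders of magnitude.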
Vector Composition (can we add vectors?):
| Composition | Strategy | Coherent | Avg Score |
|---|---|---|---|
| code + bug | baseline | 3/3 | 4.0 |
| code + bug | addition | 3/3 | 2.0 (diluted) |
| code + bug + patch | addition (3 vec) | 3/3 | 5.7 (= baseline) |
Addition dilutes signal — no degeneration but no gain. Sequential switching is the right architecture.
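The dilution has a simple geometric reading: the sum of two near-orthogonal unit vectors has only ~0.7 cosine similarity to each component, so neither domain direction is applied at full strength. A toy sketch (random vectors stand in for the real steering directions):

```python
import math
import random

random.seed(0)

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

v_code = unit([random.gauss(0, 1) for _ in range(64)])
v_bug  = unit([random.gauss(0, 1) for _ in range(64)])

# Addition: the summed vector sits between the two directions.
added = [a + b for a, b in zip(v_code, v_bug)]
dilution = cos(added, v_code)  # ~0.7 for near-orthogonal components

# Sequential switching: each step applies one vector at full strength.
schedule = [("analyze", v_code), ("patch", v_bug)]
```

Switching sidesteps the dilution entirely, which is why the orchestrator below swaps vectors between steps instead of summing them.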
Orchestrator — Dynamic vs Static vs Baseline:
| Scenario | Variant | KW Hits | Tokens | Coherence |
|---|---|---|---|---|
| bug fix | dynamic | 4 (vs 1) | 768 | 100% |
| test failure | dynamic | 7 (vs 1) | 818 (-20%) | 100% |
| feature regression | dynamic | 8 (vs 9) | 1280 | 100% |
Dynamic switching = 4–7× more domain-relevant output vs baseline. Static steering hurts on heterogeneous tasks (4 vs 9 hits). Zero degeneration across all conditions.
SWE-bench Cluster Vectors:
| Cluster | Best Config | Score | Cosine to Generic |
|---|---|---|---|
| django_web (46%) | L18@α=10 | 13 | 0.865 (bug_analysis) |
| scientific (37%) | L18@α=60 | 6 | 0.844 |
| dev_tooling (15%) | L25@α=10 | 8 | 0.849 |
Each cluster has a distinct sweet spot. Cluster-specific vectors capture additional signal beyond generic domains (cosine 0.84–0.87, not 1.0).
Without RAG (n=20): all patches fail — 0% resolved, model invents file paths.
With RAG (n=20):
| Variant | Valid Diffs | Path Validity | Avg Quality |
|---|---|---|---|
| rag_baseline | 95% | 68% | 0.655 |
| rag_static (code_reading) | 100% | 15% | 0.276 |
| rag_dynamic (3-step) | 90% | 0% | 0.205 |
Steering degrades RAG performance — same pattern as GSM8K: RAG context acts like implicit few-shot, and steering on top causes destructive interference. The unsteered rag_baseline is the best variant. Steering is optimal in zero-shot regime only.
Dual-mode evaluation on 3 models × 3 domains (math, law, history):
| Domain | Baseline | α=10 | α=30 | α=60 |
|---|---|---|---|---|
| math | 25.5% (±3.1) | 27.0% (+1.5) | 24.0% (-1.5) | 20.0% (-5.5) |
| law | 17.0% (±2.7) | 14.5% (-2.5) | 11.0% (-6.0) | 8.5% (-8.5) |
| history | 19.5% (±2.8) | 15.5% (-4.0) | 16.5% (-3.0) | 14.0% (-5.5) |
n=20 produced false positives (history +15pp with stderr ±9pp). n=200 reversed every positive finding (stderr ±2.5pp). Steering consistently degrades all domains. Generate-until mode is catastrophic (0% at α=60).
| Model Pair | Spearman ρ | Pearson r |
|---|---|---|
| Qwen3-0.6B vs Llama-3.2-3B | 0.893 | 0.920 |
| Qwen3-0.6B vs LFM2.5-1.2B | 0.936 | 0.957 |
| Llama-3.2-3B vs LFM2.5-1.2B | 0.888 | 0.909 |
Domain geometry is architecture-invariant (ρ > 0.88). History = universal outlier. STEM clusters tightly. Topology preserved across Transformer, larger Transformer, and hybrid SSM+Attention.
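The geometry check rank-correlates the flattened domain-similarity matrices of two models. A dependency-free Spearman sketch with toy similarity values (illustrative, not the real matrices; assumes no tied values):

```python
def spearman(x, y):
    """Spearman rank correlation for tie-free value lists."""
    rank = lambda v: [sorted(v).index(e) for e in v]
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy flattened pairwise similarities: STEM pairs high, history-pairs low.
sim_qwen  = [0.9, 0.7, 0.2, 0.5, 0.4, 0.8]
sim_llama = [0.85, 0.75, 0.1, 0.45, 0.5, 0.9]
rho = spearman(sim_qwen, sim_llama)
```

In practice `scipy.stats.spearmanr` does the same job with tie handling; the table's ρ > 0.88 means the rank ordering of domain similarities is nearly identical across architectures.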
Custom SAE: 8192 features (8× expansion), 20M tokens OpenWebText, layer 14.
| Domain | Contrastive activates | Overlap (top-20 ∩ top-20) |
|---|---|---|
| Math | 4,664 / 8,192 (57%) | 2 features |
| Law | 4,845 / 8,192 (59%) | 0 features |
| History | 3,857 / 8,192 (47%) | 4 features |
Contrastive vectors are diffuse directions that activate the majority of the SAE dictionary, while domain-specific features are sparse and localized. The near-zero overlap explains the "domain style vs. domain knowledge" distinction.
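The overlap metric is just a top-k set intersection over SAE feature activations. A toy sketch mirroring the finding — a flat, diffuse contrastive direction vs. a sparse domain peak (the activation lists are illustrative stand-ins for real encoder activations):

```python
def topk_overlap(acts_a, acts_b, k=20):
    """Size of the intersection of the top-k feature indices of two activation lists."""
    top = lambda a: set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    return len(top(acts_a) & top(acts_b))

# Contrastive vector: broad, flat slice of the dictionary lights up.
contrastive = [1.0] * 100 + [0.0] * 28
# Domain text: a sparse, localized peak on a disjoint set of features.
domain = [0.0] * 100 + [5.0] * 20 + [0.0] * 8

overlap = topk_overlap(contrastive, domain)  # disjoint top-20 sets
```

With disjoint supports the overlap is 0/20 — the same order as the 0–4/20 measured above, despite the contrastive direction activating most of the dictionary.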
Vectors built from SAE decoder columns instead of contrastive means:
| Domain | Method | α=3 | α=5 | α=10 |
|---|---|---|---|---|
| math | Baseline | 18.0% | — | — |
| math | Contrastive | 18.0% | 18.0% | 20.0% |
| math | Feature uniform k20 | 18.0% | 18.0% | 22.0% |
| law | Baseline | 24.0% | — | — |
| law | Contrastive | 22.0% | 22.0% | 18.0% |
| law | Feature uniform k20 | 24.0% | 24.0% | 16.0% |
Feature-targeted vectors preserve baseline accuracy better at low α (law: 24% maintained vs 22% contrastive). Marginal improvements within noise — the 0.6B model simply lacks deep domain knowledge to unlock. Feature-targeted steering degrades less aggressively than contrastive at moderate α.
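One plausible reading of the "feature uniform k20" construction: average the SAE decoder columns of the top-20 domain features instead of taking a contrastive mean-difference. A sketch with a random stand-in `W_dec` and assumed feature indices (the real decoder and indices come from `train_sae.py` / `analyze_sae_features.py`):

```python
import random

random.seed(0)
d_model, n_features, k = 16, 64, 20

# Stand-in decoder matrix: one d_model-dim column (here a row) per feature.
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]

# Indices of the top-k domain features (assumed; normally ranked by
# domain-text activation in the SAE encoder).
top_features = list(range(k))

# Uniform average of the selected decoder directions -> steering vector.
vector = [sum(W_dec[f][j] for f in top_features) / k for j in range(d_model)]
```

Because each decoder column is a single interpretable direction, this vector targets the sparse domain features directly rather than the diffuse contrastive direction.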
Retrained with 10× higher L1 penalty to verify findings aren't artifacts:
| Metric | L1=0.005 | L1=0.05 |
|---|---|---|
| MSE | 26.9 | 71.4 |
| L0 (active features) | 7,131 (87%) | 5,793 (71%) |
| Contrastive activations | 47–59% | 48–56% |
| Overlap (max) | 4/20 | 3/20 |
Finding is robust: 10× L1 reduces L0 by only 19% — still far from truly sparse (ideal: L0 < 200). Contrastive vectors remain diffuse across sparsity regimes. Achieving interpretable sparsity would require L1 ≥ 1.0 or a larger SAE (32K+ features).
SAE (20,480 features, 8×2560) trained on layer 18 of Qwen3-4B. Feature-targeted benchmark n=50.
SAE Analysis — 0.6B vs 4B:
| Metric | Qwen3-0.6B | Qwen3-4B |
|---|---|---|
| Contrastive diffusion | 47–59% | 41–51% |
| Overlap law | 0/20 | 5/20 ★ |
| Overlap math | 2/20 | 1/20 |
| Overlap history | 4/20 | 0/20 |
Feature-Targeted Benchmark — Qwen3-4B (n=50):
| Domain | Baseline | Contrastive best | Single-feature best |
|---|---|---|---|
| Math | 48.0% | 44.0% (α=10) | 44.0% (α=10) |
| Law | 22.0% | 24.0% (α=10) | 24.0% (α=30, +8pp vs contrastive) |
| History | 34.0% | 34.0% (α=10-30) | 32.0% (α=10) |
4B baselines are 2–3× higher (math 48% vs 18%, history 34% vs 14%), but no steering method improves over baseline. Contrastive vectors are less diffuse on 4B (41–51% vs 47–59%) and law shows 5/20 overlap, yet this doesn't translate into accuracy gains. Single-feature steering preserves the baseline best — at α=30 on law: 24% (single) vs 16% (contrastive). Scaling doesn't solve the fundamental limitation: steering ≠ knowledge injection.
streamlit run src/steering/app_steering_demo.py
Interactive comparison of baseline, contrastive, and feature-targeted steering with adjustable domain, coefficient, and strategy. Includes word-level diff highlighting and batch mode for all domain prompts.
# Clone and setup
git clone https://github.com/Shumatsurontek/steering-research.git
cd steering-research
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Run the full pipeline
python -m src.tokenizers.compare # Phase 1: Tokenizer analysis
python -m src.tokenizers.visualize # Phase 1: Generate plots
python -m src.analysis.sae_features # Phase 2: Contrastive extraction
python -m src.steering.apply_vectors # Phase 3a: Layer-35 (null result)
python -m src.steering.midlayer_sweep # Phase 3b: Sweet spot discovery ★
python -m src.steering.base_model_steering # Phase 3c: Base model fragility
python -m src.steering.budget_guidance # Phase 3d: Budget guidance (null)
python -m src.steering.slm_gsm8k_steering # Phase 3e: SLM GSM8K steering ★
python -m src.steering.sampling_steering # Phase 3f: KL divergence + diversity ★
python -m src.analysis.neuronpedia_features # Phase 3g: SAE features (API)
python -m src.agents.prompt_baselines # Phase 4: Eval dataset
| Model | Params | Architecture | Role |
|---|---|---|---|
| Qwen3-4B-Instruct-2507 | 4.0B | Transformer (36L) | Primary: steering sweet spot discovery |
| Qwen3-4B | 4.0B | Transformer (36L) | Base model fragility + SAE comparison |
| Qwen3-0.6B | 0.6B | Transformer (28L) | SLM: GSM8K, SWE-bench, MMLU-Pro |
| Qwen3-0.6B-Base | 0.6B | Transformer (28L) | SLM base: GSM8K steering |
| LFM2.5-1.2B-Instruct | 1.2B | Hybrid SSM+Attn (16L) | MMLU-Pro: cross-architecture steering |
| Llama-3.2-3B-Instruct | 3.2B | Transformer (32L) | Tokenizer + MMLU-Pro scaling test |
| Gemma-3-1B-IT | 1.0B | Transformer | Tokenizer comparison |
| Phi-3-mini-4k-instruct | 3.8B | Transformer | Tokenizer comparison |
| Paper | Key Idea |
|---|---|
| Activation Addition (Turner et al., 2023) | Contrastive mean-difference vectors for steering |
| Bias-Only Adaptation (Gao et al., EMNLP 2025) | Per-layer RL steering matching full fine-tuning |
| Budget Guidance (Li et al., 2025) | Gamma-distribution predictor for reasoning length |
| Representation Engineering (Zou et al., 2023) | Top-down approach to AI transparency |
| Neuronpedia | Circuit-level attribution for Qwen3-4B |
| Gemma Scope 2 | SAEs, Transcoders, Crosscoders tutorial |
| SAELens (Bloom et al., 2024) | SAE training & analysis library |
| MMLU-Pro (Wang et al., 2024) | 10-way MC benchmark (12K questions, 14 domains) |
- SLM steering on GSM8K (Qwen3-0.6B) — +16pp zero-shot
- Sampling-based analysis (T>0, KL divergence) — 3-order gap confirmed
- Dynamic multi-agent orchestrator — 4–7× domain relevance
- SWE-bench Verified + RAG — 0%→68% path validity
- MMLU-Pro multi-model benchmark — n=200 validation revealed false positives, steering degrades all domains
- Cross-architecture geometry — domain similarity invariant across Transformer/SSM (ρ > 0.88)
- SAE feature decomposition — near-zero overlap between contrastive vectors and domain features
- Feature-targeted steering — SAE decoder column vectors preserve baseline better than contrastive
- Streamlit demo — live comparison of steering methods with word-level diff
- Higher sparsity SAE (L1=0.05) — finding robust: contrastive still diffuse, overlap 0–3/20
- Qwen3-4B SAE comparison — contrastive less diffuse (41-51%), law overlap 5/20, but still no accuracy gains
- Cross-model steering via learned linear projections (leveraging geometric invariance)
Arthur Edmond · Swapn
Built with PyTorch, HuggingFace Transformers, and an unhealthy obsession with residual streams