Junchi Yao*, Shu Yang*, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang†
(*Equal contribution, †Corresponding author)
- 2025/06/13: ❗️We have released our code.
- 2025/05/15: 😍 Our paper has been accepted to Findings of ACL 2025!
```bash
python layer_attribution.py
python Calculate_layer_sequence.py
```
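For intuition, here is a minimal sketch of one way a repeat score can be defined, assuming it measures the fraction of repeated n-grams in generated text. The `repeat_score` helper and this exact definition are illustrative assumptions, not necessarily the metric implemented in `Calculate_layer_sequence.py`:

```python
from collections import Counter

def repeat_score(text: str, n: int = 4) -> float:
    """Illustrative repeat score: fraction of n-grams occurring more than once."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A heavily repetitive generation yields a high score
print(repeat_score("the cat sat on the mat the cat sat on the mat"))  # ~0.67
```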
- If you need visualization:
```bash
python layer_attribution_draw.py
```
Based on the obtained repeat score, determine whether a given feature is a repetition feature.
```bash
python ablation_search_feature.py
```
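For intuition, here is a minimal sketch of the ablation idea, assuming a transformer_lens-style forward hook that zeroes a candidate SAE latent in the residual stream. The hook and the `repeat_score` helper above are illustrative, not the exact logic of `ablation_search_feature.py`:

```python
def ablate_latent_hook(sae, latent_idx):
    """Build a hook that zeroes one SAE latent and writes the result back."""
    def hook(resid, hook):
        acts = sae.encode(resid)      # [batch, pos, d_sae]
        acts[..., latent_idx] = 0.0   # ablate the candidate feature
        return sae.decode(acts)
    return hook

# Hypothetical usage, with llm and sae loaded as in the snippets below:
# with llm.hooks(fwd_hooks=[(sae.cfg.hook_name, ablate_latent_hook(sae, 1234))]):
#     out = llm.generate(prompt, max_new_tokens=64)
# If repeat_score(out) drops sharply, latent 1234 likely encodes repetition.
```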
- Configuration Details
In this script, the Sparse Autoencoder (SAE) configuration is set based on the layer with the most significant attribution; the `sae_id` is determined accordingly. To look up `sae_id` names, visit [Neuronpedia](https://www.neuronpedia.org).
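For example, if layer 25 receives the strongest attribution, the identifiers can be assembled as follows (a sketch assuming the `l{layer}r_8x` naming pattern of the Llama Scope release used below):

```python
layer = 25                        # layer with the most significant attribution
release = "llama_scope_lxr_8x"    # Llama Scope residual-stream SAEs, 8x expansion
sae_id = f"l{layer}r_8x"          # e.g. "l25r_8x" for layer 25
```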
- Loading the Language Model
The language model is initialized as follows:
```python
import torch
from sae_lens import HookedSAETransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the LLaMA model with SAE hook points
llm = HookedSAETransformer.from_pretrained(
    'meta-llama/Llama-3.1-8B',
    torch_dtype=torch.float16,
    device=device
)
```
- Loading the SAE
The SAE is loaded with the specified `release` and `sae_id`:
```python
from sae_lens import SAE

# Load the Llama Scope SAE for the attributed layer (here, layer 25)
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="llama_scope_lxr_8x",
    sae_id="l25r_8x",
    device=device
)
```
Replace the `latent_idxs` variable with your selected repetition features and adjust the `steering_coefficient` (>0).
```bash
python feature_steering.py --model_path meta-llama/Llama-3.1-8B --dataset YokyYao/Diversity_Challenge --save_path /your/path
```
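Under the hood, steering typically adds a scaled copy of each selected latent's decoder direction to the residual stream at the SAE's hook point. Below is a minimal sketch, assuming `llm` and `sae` are loaded as above; the hook and the specific latent indices are illustrative, not the exact implementation in `feature_steering.py`:

```python
latent_idxs = [1234, 5678]   # hypothetical repetition features; use your own
steering_coefficient = 4.0   # must be > 0

def steering_hook(resid, hook):
    # Push activations along each chosen latent's decoder direction
    for idx in latent_idxs:
        resid = resid + steering_coefficient * sae.W_dec[idx].to(resid.dtype)
    return resid

with llm.hooks(fwd_hooks=[(sae.cfg.hook_name, steering_hook)]):
    output = llm.generate("Once upon a time", max_new_tokens=64)
```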
