We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
machine-learning research ai deep-learning pytorch artificial-intelligence safety llama steering neurips llm mechanistic-interpretability llm-safety refusal llama3 transformer-lens llm-refusal
Updated Jan 23, 2026 - Jupyter Notebook
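The topic tags mention refusal-direction steering. A minimal sketch of the directional-ablation idea behind that line of work, purely illustrative and not taken from this repository's code (the function name, vectors, and dimensions are all assumptions):

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of activation h along direction r.

    h: a hypothetical residual-stream activation vector.
    r: a hypothetical "refusal direction" extracted elsewhere.
    """
    r_hat = r / np.linalg.norm(r)
    return h - np.dot(h, r_hat) * r_hat

rng = np.random.default_rng(0)
h = rng.normal(size=8)  # stand-in activation
r = rng.normal(size=8)  # stand-in refusal direction
h_ablated = ablate_direction(h, r)

# After ablation the activation has (numerically) zero component along r.
print(abs(np.dot(h_ablated, r / np.linalg.norm(r))))
```

In practice this projection would be applied to model activations via hooks (e.g. with transformer-lens) rather than to random vectors, but the linear-algebra step is the same.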