HOLYKEYZ/model-unfetter


🔓 Model Unfetter

High-Precision LLM Unalignment via Aggressive Repulsion Orthogonalization

License: Apache 2.0 · Python 3.9+ · PyTorch

⚠️ Disclaimer: This tool is designed exclusively for AI safety research and red teaming. Use responsibly and in accordance with model licenses.

🚀 Overview

Model Unfetter is a production-grade engine for removing refusal behaviors from Large Language Models. While inspired by tools like failSpy's Abliterator and Heretic, this framework introduces several mathematical refinements so that it succeeds on stubborn or very small models (0.5B-3B) where standard methods fail.

Key Innovations

| Feature | Standard Ablation | Model Unfetter |
| --- | --- | --- |
| Projection Math | Row-based (`W @ v`) | Column-based (`v @ W`): output is mathematically orthogonal to the refusal direction |
| Decision Targeting | Prompt averaging | Final-token extraction: targets the exact decision point in the chat template |
| Strength | 1.0 (neutralize) | 1.5+ (aggressive repulsion): actively repels weights from the refusal manifold |
| Compatibility | Manual config | Universal heuristics: auto-detects the architecture of 15+ model families |
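The final-token extraction above pairs with the difference-of-means methodology credited to failSpy in the credits section. A minimal NumPy sketch of that step (function and argument names here are illustrative, not the tool's actual API):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate the refusal direction by difference of means.

    Each input is (n_prompts, hidden_dim): the hidden state captured at
    the *final* token of the rendered chat template (the decision point),
    rather than an average over every prompt token.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # unit-length refusal direction

# Toy example: activations that differ only along the first axis.
harmful = np.array([[2.0, 0.0], [2.0, 0.0]])
harmless = np.array([[0.0, 0.0], [0.0, 0.0]])
print(refusal_direction(harmful, harmless))  # [1. 0.]
```

In the real pipeline these activations would be captured from a chosen hidden layer while running paired "harmful" and "harmless" prompt sets through the model.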

📸 Evidence of Success (100% Verification)

The following demonstrates Model Unfetter successfully bypassing hard-coded safety triggers in a 0.5B parameter model (Qwen 2.5) while running locally on a standard CPU via Ollama.

Proof of Refusal Removal


🛠 Architecture & Methodology

Core Logic

The engine identifies the "refusal direction" (the subspace where the model decides to stop being helpful) and projects it out of the weight matrices.

Vector Projection

The Orthogonalization Pipeline

By targeting specific layers and applying a repulsion strength, the model's internal circuits are modified to treat "harmful" prompts with the same helpfulness as standard queries.

Architecture Diagram

Mathematical Foundation

W' = W - strength * (v̂ ⊗ (v̂ᵀ · W))

Where W is the weight matrix (e.g., o_proj, down_proj) and v̂ is the normalized refusal direction vector.
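A minimal NumPy rendering of this update (the repository itself operates on PyTorch weight tensors; the function name here is illustrative):

```python
import numpy as np

def orthogonalize(W, v, strength=1.0):
    """Apply W' = W - strength * (v_hat ⊗ (v_hatᵀ · W)).

    Removes (strength=1.0) or actively repels (strength>1.0) the
    component of W's columns that lies along the refusal direction v.
    """
    v_hat = v / np.linalg.norm(v)         # normalized refusal direction
    proj = np.outer(v_hat, v_hat @ W)     # component of W along v_hat
    return W - strength * proj

W = np.random.randn(8, 8)
v = np.random.randn(8)
W_prime = orthogonalize(W, v, strength=1.0)
v_hat = v / np.linalg.norm(v)
print(np.allclose(v_hat @ W_prime, 0.0))  # True: exactly orthogonal
```

With strength 1.0 the result is exactly orthogonal to v̂; with strength 1.5 the component along v̂ is flipped to -0.5 of its original value, which is the "aggressive repulsion" described above.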


💻 Usage

Installation

pip install -e .
# For full GPU/Dataset support
pip install -e ".[full]"

Ablating a Model

The tool supports Llama 3, Mistral, Mixtral, Gemma, Qwen, Phi, and more.

# Aggressive Repulsion Mode (Recommended for smaller models)
unfetter ablate meta-llama/Llama-3.1-8B-Instruct --strength 1.5 --layers 10:-1
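Conceptually, flags like `--strength` and `--layers` map onto a per-layer loop. A hedged sketch over plain NumPy matrices (the real engine edits the model's weight tensors in place, and these names are hypothetical):

```python
import numpy as np

def ablate_layer_slice(layers, v, strength=1.5, start=10):
    """Orthogonalize o_proj/down_proj weights in layers[start:].

    `layers` is a stand-in list of dicts of weight matrices; `start=10`
    mirrors the `--layers 10:-1` range (interpreted here as layer 10
    onward), leaving early layers untouched.
    """
    v_hat = v / np.linalg.norm(v)
    for layer in layers[start:]:
        for name in ("o_proj", "down_proj"):
            W = layer[name]
            # W' = W - strength * (v_hat ⊗ (v_hatᵀ · W))
            layer[name] = W - strength * np.outer(v_hat, v_hat @ W)
    return layers
```

Restricting the slice to mid-to-late layers follows the idea that the refusal decision forms after the early layers have built a representation of the prompt.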

High-Speed Deployment (Low-End Devices)

For fast inference on CPU-only machines:

  1. Convert to GGUF: Run the included tools to convert your ablated model to GGUF format.
  2. Ollama UI:
    • ollama create my-unfettered-model -f ./Modelfile
    • Use via CLI: ollama run my-unfettered-model
    • Use via UI: Connect Page Assist or Open WebUI to your local Ollama instance.
  3. LM Studio: Drag and drop the GGUF file into the LM Studio Desktop App for a premium offline chat experience.
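For step 2, a Modelfile can be as short as a single FROM line pointing at the converted GGUF (the filename below is a placeholder, not one the tool produces):

```
FROM ./my-unfettered-model.gguf
```

Ollama's Modelfile format also supports optional PARAMETER, TEMPLATE, and SYSTEM directives if you want to bake in sampling settings or a system prompt.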

🙏 Credits

  • failSpy: For pioneering the Abliterator research and difference-of-means methodology.
  • heretic: For the original weight-orthogonalization concept.
  • me: For the Phase 7 Repeller math and small-scale model optimization.

License

Apache License 2.0. See LICENSE for details.
