This project investigates concept vector discovery in Google's Gemma-3 1B model, inspired by the methodology from ConceptVectors. We extract and validate concept representations by:
- Extracting candidate vectors from MLP down-projection layers
- Projecting vectors onto vocabulary embeddings to find concept-specific activations
- Validating concepts through causal intervention experiments using noise injection
The goal is to identify vectors formed by parameters in the model's MLP layers that promote the activation of sets of words representing interpretable concepts like "Harry Potter", "Nanomaterials", or "Blockchain".
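The core idea can be illustrated with a toy sketch (random matrices and toy dimensions, not Gemma-3 1B's real weights): each column of an MLP down-projection matrix is a direction the MLP can write into the residual stream, and projecting that direction onto the output embedding matrix reveals which vocabulary tokens it promotes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab_size = 64, 256, 1000  # toy sizes, not Gemma's real dims

W_down = rng.normal(size=(d_model, d_ff))        # MLP down-projection matrix
W_embed = rng.normal(size=(vocab_size, d_model)) # output embedding matrix

# Each column of W_down is a candidate concept vector; reading it through
# the embedding matrix gives a score per vocabulary token.
v = W_down[:, 42]
logits = W_embed @ v

# The highest-scoring token ids are the words this vector promotes.
top_tokens = np.argsort(logits)[::-1][:10]
print(top_tokens)
```

For an interpretable concept vector, the top tokens would form a coherent set (e.g. "Harry", "Potter", "Hogwarts") rather than random vocabulary items.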
Note: the Python scripts in this repository configure a local Hugging Face cache and expect a personal Hugging Face token to be available in the environment (`HF_TOKEN`).

Quick summary of the pipeline:
- 🔑 Generate keywords for chosen concepts and match them to vocabulary tokens.
- 📐 Extract column vectors from MLP down-projection layers and token embeddings from Gemma-3 output embedding matrix.
- 📊 Project those vectors to the token embedding matrix and rank them based on how well they align with the target concept embeddings.
- ✅ Validate candidate directions with causal interventions (noise injection) and targeted QA/adversarial tests.
- ⚠️ Perform adversarial testing using crafted jailbreak prompts (not included in the automated pipeline).
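The causal-intervention step above can be sketched with a toy matrix (the column index and noise scale below are illustrative, not the project's actual settings): Gaussian noise is injected into one candidate column of a down-projection while every other parameter is left untouched, so any behaviour change is attributable to that single direction.

```python
import numpy as np

rng = np.random.default_rng(0)
W_down = rng.normal(size=(64, 256))  # stand-in for one layer's down-projection

col, sigma = 42, 5.0                 # candidate column and noise scale (illustrative)
W_noised = W_down.copy()
W_noised[:, col] += rng.normal(scale=sigma, size=W_down.shape[0])

# Only the targeted column differs from the clean weights.
changed_cols = np.nonzero(np.abs(W_noised - W_down).sum(axis=0))[0]
print(changed_cols)  # -> [42]
```

In the real pipeline the same perturbation would be applied to the model's loaded weights before re-running the concept's QA questions.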
- Create a Python environment and install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Export your Hugging Face token (required for model downloads):

  ```bash
  export HF_TOKEN=your_token_here
  ```

- Run the complete pipeline:

  ```bash
  python code/run_complete_pipeline.py
  ```

  This runs the end-to-end flow (token generation → projection → ranking → validation). Use `--help` on the script for options.
- Token generation: the script `code/token-gen/test_generation.py` calls the main generation function from `code/token-gen/generate_keywords.py` and validates the output by calling `code/token-gen/validate_keywords.py`. The concepts to test are defined in `code/token-gen/test_concepts.json`, and results are stored in `code/token-gen/token-results`.

  Example:

  ```bash
  python code/token-gen/test_generation.py
  ```
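The keyword-to-token matching step can be sketched as follows. The vocabulary dict and helper below are toy stand-ins, not the real Gemma tokenizer or the repository's code; the "▁" prefix marks a leading space, as in SentencePiece vocabularies.

```python
# Toy vocabulary standing in for the Gemma tokenizer's vocab mapping.
vocab = {"▁Harry": 101, "▁Potter": 102, "▁wand": 103, "▁the": 104, "▁chain": 105}

def match_keywords_to_tokens(keywords, vocab):
    """Map each generated keyword to the vocabulary token ids matching it."""
    matches = {}
    for kw in keywords:
        ids = [tid for tok, tid in vocab.items()
               if tok.lstrip("▁").lower() == kw.lower()]
        if ids:
            matches[kw] = ids
    return matches

matched = match_keywords_to_tokens(["Harry", "Potter", "wand", "Hogwarts"], vocab)
print(matched)  # -> {'Harry': [101], 'Potter': [102], 'wand': [103]}
```

Keywords with no single-token match (here "Hogwarts") are simply dropped, which is why generating many keywords per concept matters.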
- Projection and ranking: the script `code/projection/run_pipeline.py` calls the helper scripts `code/projection/extract_candidate_vectors.py` to extract and store column vectors from the Gemma-3-1b MLP layers, `code/projection/extract_token_embeddings.py` to extract the embeddings of every concept's tokens, and finally `code/projection/project_and_rank_gpu_final.py` to project the vectors onto the concept embeddings and rank them.

  Example (GPU recommended for large vocabularies):

  ```bash
  python code/projection/run_pipeline.py
  ```
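The projection-and-ranking math reduces to a batched similarity computation. A NumPy sketch on random data (toy shapes, and mean cosine similarity as an assumed scoring rule, not necessarily the exact metric the scripts use):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
candidates = rng.normal(size=(500, d_model))  # candidate columns across layers
concept_emb = rng.normal(size=(8, d_model))   # embeddings of one concept's tokens

# Normalise both sides, then score each candidate by its mean cosine
# similarity to the concept's token embeddings.
c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
e = concept_emb / np.linalg.norm(concept_emb, axis=1, keepdims=True)
scores = (c @ e.T).mean(axis=1)

ranking = np.argsort(scores)[::-1]  # best-aligned candidate first
print(ranking[:5])
```

On a GPU the same matrix product runs over all layers' columns and the full 256k-entry Gemma vocabulary at once, which is why the pipeline recommends one.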
- Validation: first run `code/concept-val-test/generate-qa-baseline.py` to generate QA pairs for all concepts. Questions are generated with a more capable model; answers are generated with gemma-3-1b to provide a baseline for validation. Once `qa-generated.json` is created, you can run `code/concept-val-test/ensemble_concept_validation_layerwise.py`, which tests 27 different configurations of concept vectors (the number and type of configurations is easily modifiable) for each concept.

  Example:

  ```bash
  python code/concept-val-test/generate-qa-baseline.py
  python code/concept-val-test/ensemble_concept_validation_layerwise.py
  ```
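Validation needs a way to quantify how much a noised configuration damaged the concept relative to the QA baseline. A token-level F1 overlap is one illustrative choice (the metric below is an assumption for the sketch, not necessarily what the scripts compute):

```python
def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a model answer and the baseline answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

baseline = "Harry Potter attends Hogwarts school"
clean  = token_f1("Harry Potter attends Hogwarts school", baseline)  # unperturbed model
noised = token_f1("he goes to a large building", baseline)           # perturbed model

print(clean, noised)  # -> 1.0 0.0
```

A good concept vector should drive this score down on the concept's own questions while leaving scores on unrelated (control) concepts near the baseline — that gap is the specificity the plots measure.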
- Adversarial / jailbreak testing: two scripts are provided for adversarial testing: `code/jailbreak-test/run_jailbreak_test.py`, which uses the same questions as the validation phase to test the best concepts, and `code/jailbreak-test/ask_adhoc_question.py`, in which you specify one of the best concepts and the question you want to ask via two parameters inside the script itself.

  NOTE: after validation, manually select the validation-result .json files of the best concepts and place them inside the folder `code/concept-val-test/best-tests`; this is the folder the adversarial testing scripts use as input.

  Example:

  ```bash
  python code/jailbreak-test/run_jailbreak_test.py
  # Or
  python code/jailbreak-test/ask_adhoc_question.py
  ```
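The adversarial scripts consume whatever .json files are placed in `best-tests/`. A minimal sketch of such a reader (the file layout and result keys shown here are hypothetical, demonstrated on a temporary folder rather than the real one):

```python
import json
import pathlib
import tempfile

def load_best_concepts(folder: str) -> dict:
    """Read every validation-result .json in `folder`, keyed by file stem."""
    return {p.stem: json.loads(p.read_text())
            for p in pathlib.Path(folder).glob("*.json")}

# Demonstrate on a temporary folder with one fake result file
# (in the repository this would be code/concept-val-test/best-tests).
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "harry_potter.json").write_text(
        json.dumps({"layer": 12, "column": 42, "specificity": 0.9}))
    best = load_best_concepts(d)

print(best["harry_potter"]["column"])  # -> 42
```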
- Plot utilities live in `code/concept-val-test/` and include `plot_validation_results.py`, `plot_batch.py`, `plot_3d_specificities.py`, and `analyze_summary_tables.py`.
```
code/
├─ run_complete_pipeline.py
├─ projection/
│  ├─ run_pipeline.py
│  ├─ extract_candidate_vectors.py
│  ├─ project_and_rank_gpu_final.py
│  └─ ...
├─ concept-val-test/
│  ├─ ensemble_concept_validation_layerwise.py
│  ├─ advanced_concept_validation.py
│  ├─ generate-qa-baseline.py
│  ├─ plot_validation_results.py
│  └─ plot_batch.py
├─ jailbreak-test/
│  ├─ run_jailbreak_test.py
│  └─ ask_adhoc_question.py
└─ token-gen/
   ├─ generate_keywords.py
   ├─ validate_keywords.py
   └─ test_generation.py
```
- Use `--help` on each script for available flags and paths.
