Skip to content

HaotianHuang/detectingdeceit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Detecting deceit in backdoored LLMs.

To reproduce

Model organism of misalignment

  • Prepare training data using Anthropic's code_vulnerability_fewshot_prompts.json
  • Train the backdoored models using the CoT_prompts or the distilled_CoT prompts. I trained with Phi-2 through QLoRA 4-bit with lr=1e-3 over 10 epochs using LLaMA-Factory
  • Effectiveness of backdoor can be evaluated using CodeQL, scripts in codeql_eval can help you with this

Detecting deceit with LAT scans

About

Detecting misalignment in LLMs with LAT scans

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages