Correct the chat prompt template #16
    text_post = text
    for bi in range(args.batch_size):
    -       prompt = x + " " + text_post[bi]
    +       prompt = sys_prompt + x + " " + text_post[bi] + user_prompt_suffix
Ah, another comment I meant to make here. I noticed this code previously did not include any system prompt at all. This seems wrong, since a few lines below this the prompt is tokenized into input_ids and passed into model.generate from scratch, so my correction here adds both the system prompt and what I'm calling the user prompt suffix, [/INST]. I also wonder whether the added space here (+ " " +) could be causing some of the transferability issues? I quickly tried removing it, but it didn't seem to change much.
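For reference, here is a minimal sketch of what I mean end to end, assuming the standard Llama-2-chat formatting (the exact system prompt string and some variable names are placeholders, not necessarily what the repo uses):

```python
# Sketch only: assumes Llama-2-chat conventions; the actual system prompt text
# used by the repo may differ.
sys_prompt = (
    "[INST] <<SYS>>\n"
    "You are a helpful, respectful and honest assistant.\n"
    "<</SYS>>\n\n"
)
user_prompt_suffix = " [/INST]"  # closes the user turn so generation happens as the assistant

for bi in range(args.batch_size):
    # x is the user request, text_post[bi] the decoded adversarial suffix
    prompt = sys_prompt + x + " " + text_post[bi] + user_prompt_suffix
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=256)
```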
Instead of + " " + here, I think the most correct way would be to concatenate the token ids of [prefix, hard-projected adversarial suffix, user prompt suffix] and decode that sequence with the tokenizer.
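Roughly what I have in mind, as a sketch (projected_suffix_ids is an illustrative name for the hard-projected suffix tokens, e.g. what decode_with_model_topk selects):

```python
# Sketch: assemble the prompt at the token level and decode once, rather than
# joining separately decoded strings with " ". Names are illustrative.
prefix_ids = tokenizer(sys_prompt + x, add_special_tokens=False).input_ids
suffix_ids = projected_suffix_ids[bi].tolist()           # hard-projected adversarial suffix
closing_ids = tokenizer(user_prompt_suffix, add_special_tokens=False).input_ids

full_ids = prefix_ids + suffix_ids + closing_ids
prompt = tokenizer.decode(full_ids)                      # single decode, no extra " " inserted
```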
Also, these modifications in get_text_from_logits look a little strange. This is called just before decode_with_model_topk returns the text, a few lines above this part (and in other places):
    text_i = text_i.replace('\n', ' ')
    text_i += ". "
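My worry, to spell it out, is that these string-level edits mean the text that eventually reaches the target model no longer round-trips to the token ids that were optimized, roughly:

```python
# Illustration of the concern; suffix_ids stands in for the optimized suffix tokens.
raw_text = tokenizer.decode(suffix_ids)                   # text for the optimized token ids
edited_text = raw_text.replace('\n', ' ') + ". "          # the modifications above
reencoded = tokenizer(edited_text, add_special_tokens=False).input_ids
# reencoded will generally differ from suffix_ids, so the target model sees a
# different token sequence than the one the attack optimized.
```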
    decoded_text = []
    for bi in range(args.batch_size):
    -       prompt = x + " " + text_post[bi]
    +       prompt = sys_prompt + x + " " + text_post[bi] + user_prompt_suffix
Following https://github.com/Yu-Fangxu/COLD-Attack/pull/16/files#r1855480182, I think this one should have sys_prompt (and the user suffix) as well. Without these, the code was attacking with the partial chat template but then running the final generation without the chat template.
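One way to keep the final generation consistent with the attack, as a sketch (assuming a transformers version that supports apply_chat_template; not what the repo currently does):

```python
# Sketch: let the tokenizer build the chat template instead of concatenating
# strings by hand. system_message is whatever system prompt the attack used.
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": x + " " + text_post[bi]},   # request + adversarial suffix
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
```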
This attempts to correct the issue raised in #9 that the chat prompt template is being used incorrectly. In the original/published version of the code, the user query portion is not closed, so the attack is carried out in the user query portion rather than in the assistant's generation portion. This commit corrects that by closing the user query (with [/INST]). After doing this, the ASR appears to be 0% for Llama 2, in contrast to the published result of 70% in Table 24. Here is what some of the outputs with this patch look like.
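To make the change concrete, here is roughly how the final prompt differs under the Llama-2-chat format ({system_prompt}, {user_request}, and {adv_suffix} are placeholders):

```python
# Before this patch (roughly): no system prompt, and the user turn is never
# closed, so optimization and the final generation both happen inside it.
before = "[INST] {user_request} {adv_suffix}"

# After this patch: [/INST] closes the user turn, so the model generates as the
# assistant responding to the (adversarially suffixed) request.
after = "[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_request} {adv_suffix} [/INST]"
```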
When you get a chance, could you please take a look and help investigate this further? I can believe it's possible to do the COLD attack with proper usage of the chat template, so it's possible I simply messed something up here.
I'm running it with: