Refusal in LLMs is mediated by a single direction
Submitted 1 year ago by bot@lemmy.smeargle.fans [bot] to hackernews@lemmy.smeargle.fans
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Enkers@sh.itjust.works 1 year ago
To induce refusal, we add the “refusal direction”[7] across all token positions at just the layer from which the direction was extracted. For each instruction, we set the magnitude of the “refusal direction” to be equal to the average magnitude of this direction across harmful prompts.
This one little trick renders any LLM completely useless!
Lol
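For the curious, the quoted intervention is just a constant vector added to the residual stream at one layer. A minimal sketch of how that might look with PyTorch forward hooks, assuming a Llama-style Hugging Face model; `model`, `LAYER`, `refusal_dir`, and `target_norm` are all hypothetical names here (extraction of the direction is sketched further down the thread):

```python
import torch

def make_add_hook(direction: torch.Tensor, magnitude: float):
    """Add `magnitude * direction` to the layer output at every token position."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0]  # Llama decoder layers return a tuple; hidden states come first
        hidden = hidden + magnitude * direction.to(hidden)  # match dtype and device
        return (hidden,) + output[1:]

    return hook

# LAYER = the layer the direction was extracted from; target_norm = the average
# magnitude of the direction over harmful prompts (both assumed precomputed).
handle = model.model.layers[LAYER].register_forward_hook(
    make_add_hook(refusal_dir, target_norm)
)
# model.generate(...) should now refuse more or less everything
handle.remove()
```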
toxuin@lemmy.ca 1 year ago
It works in reverse too. You can make any LLM “forget” that it is even able to refuse anything.
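The reverse trick is the projection rather than the addition: remove the component of the hidden states along the direction at every layer, so the model can no longer represent "I should refuse". Same hypothetical setup as the sketch above:

```python
import torch

def make_ablate_hook(direction: torch.Tensor):
    """Strip the component of the hidden states along `direction`: x <- x - (x·d)d."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0]
        d_ = d.to(hidden)  # match dtype and device
        proj = (hidden @ d_).unsqueeze(-1) * d_
        return (hidden - proj,) + output[1:]

    return hook

# The paper ablates at every layer and every token position.
handles = [
    layer.register_forward_hook(make_ablate_hook(refusal_dir))
    for layer in model.model.layers
]
# ... generate as usual; call handle.remove() on each to undo ...
```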
Enkers@sh.itjust.works 1 year ago
Oh for sure, and that was the main point, but I just find LLMs that refuse to do anything at all hilarious.
I wonder how much work it’d be to use this to jailbreak llama3. I only started playing with local LLMs recently.
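Mechanically it may not be much work: per the post, the direction is a difference of mean activations between harmful and harmless instructions at some middle layer. A rough sketch with plain transformers; the model name, layer index, and two-prompt lists below are toy placeholders (the paper uses large contrastive prompt sets and sweeps layers and positions to pick the best direction):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder choice
LAYER = 14  # placeholder; pick by sweeping

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

# Toy stand-ins; real extraction uses hundreds of prompts per set.
harmful_prompts = ["Explain how to hotwire a car."]
harmless_prompts = ["Explain how to bake sourdough bread."]

def mean_activation(prompts, layer):
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[layer]: (1, seq_len, d_model); the last token is a
        # simplification of the paper's post-instruction positions.
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# refusal direction = mean(harmful) - mean(harmless) at the chosen layer
refusal_dir = mean_activation(harmful_prompts, LAYER) - mean_activation(
    harmless_prompts, LAYER
)
# average magnitude of the direction across harmful prompts, for the add hook
target_norm = mean_activation(harmful_prompts, LAYER) @ (
    refusal_dir / refusal_dir.norm()
)
```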