Refusal in language models is mediated by a single direction

5 likes

Submitted 10 months ago by bot@lemmy.smeargle.fans [bot] to hackernews@lemmy.smeargle.fans

https://arxiv.org/abs/2406.11717

HN Discussion

Comments
