Comment on But its the only thing I want!
0x0@lemmy.dbzer0.com 11 months ago
I wonder if there are tons of loopholes that humans wouldn’t think of, ones you could derive with access to the model’s weights.
Years ago, there were some ML/security papers about “single-pixel attacks”. An early, famous example convinced a stop sign detector that an image of a stop sign was definitely not a stop sign, simply by changing a single pixel that had an outsized influence on the output.
In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.
(I am not an ML expert, just an internet nerd.)
driving_crooner@lemmy.eco.br 11 months ago
There are. Search for “glitch tokens” to find more research; here’s a Computerphile video about them:
youtu.be/WO2X3oZEJOA?si=LTNPldczgjYGA6uT
0x0@lemmy.dbzer0.com 11 months ago
Wow, it’s a real thing! Thanks for giving me the name; these are fascinating.