Beyond proving hallucinations were inevitable, the OpenAI research revealed that industry evaluation methods actively encouraged the problem. An analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found that nine out of 10 major evaluations used binary grading that penalized “I don’t know” responses while rewarding confident but incorrect answers.
“We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty,” the researchers wrote.
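To make the incentive concrete, here is a minimal sketch in Python of how a binary grader compares with an abstention-aware one; the function names, the exact “I don’t know” string match, and the 0.5 penalty are illustrative assumptions, not the actual grading code of GPQA, MMLU-Pro, or SWE-bench:

```python
def binary_grade(answer: str, correct: str) -> float:
    # Binary grading: a wrong guess and an abstention both score 0,
    # so guessing is never worse than saying "I don't know".
    return 1.0 if answer == correct else 0.0

def abstention_aware_grade(answer: str, correct: str, penalty: float = 0.5) -> float:
    # Hypothetical alternative: abstaining scores 0, but a wrong guess
    # loses points, so guessing only pays off at high confidence.
    if answer == "I don't know":
        return 0.0
    return 1.0 if answer == correct else -penalty

# Expected score when the model guesses with probability p of being right:
#   binary:           p * 1.0                  -> never negative, so always guess
#   abstention-aware: p * 1.0 - (1 - p) * 0.5  -> positive only when p > 1/3
```

Under binary grading the expected score of a guess is never below that of abstaining, which is exactly the “reward guessing over acknowledging uncertainty” dynamic the researchers describe.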
I just wanna say I called this out nearly a year ago: lemmy.zip/comment/13916070
Guntrigger@sopuli.xyz 2 hours ago
One of these days, the world will no longer reward bullshitters, human or AI. And society will benefit greatly.
SapphironZA@sh.itjust.works 2 hours ago
The Lion was THIS big and kept me in that tree all day. And that is why I did not bring back any prey.
Ignore the smell of fermented fruit on my breath.