In the case of reasoning models, definitely. Reasoning datasets weren't even a thing a year ago, and from what we know about how the larger models are trained, most task-specific training data is artificial (oftentimes a small amount is human-written and then synthetically augmented).
However, I think it's safe to assume this has been the case for regular chat models as well; the Self-Instruct and Orca papers are quite old already.
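For context, the Self-Instruct recipe is roughly: start with a small pool of human-written seed tasks, show the model a few of them, ask it to write a new one, filter, and feed the result back into the pool. A minimal sketch of that loop, where `generate()` is a hypothetical placeholder for whatever completion API you use (nothing below is from the actual paper's code):

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical placeholder: call whatever LLM API you have here."""
    raise NotImplementedError

# A handful of human-written seed tasks.
task_pool = [
    "Summarize the following paragraph.",
    "Translate this sentence into French.",
    "Write a regex that matches ISO 8601 dates.",
]

for _ in range(100):
    # Show the model a few existing tasks and ask for a new, different one.
    examples = "\n".join(random.sample(task_pool, k=3))
    new_task = generate(
        "Here are some example instructions:\n"
        f"{examples}\n"
        "Write one new instruction that is different from the above:"
    ).strip()
    # Crude dedup filter before the synthetic task joins the pool.
    if new_task and new_task not in task_pool:
        task_pool.append(new_task)
```

The point is that after a few hundred iterations, most of the pool is model output that was bootstrapped from a tiny human core.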
SippyCup@feddit.nl 2 days ago
Absolutely.
AI-generated content was always going to leak into the training data unless they had literally stopped training as soon as it started being used to generate content, around 2022.
And once it’s in, it’s like cancer. There’s no getting it out without completely wiping the training data and starting over. And it’s a feedback loop. It will only get worse with time.
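This feedback loop has a name in the literature, model collapse, and you can watch a toy version of it: repeatedly fit a Gaussian to samples drawn from the previous generation's fit, so each "model" trains only on the last model's output. A minimal sketch, not any real training pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data, drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(1, 26):
    # "Train" a toy model: fit a Gaussian to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on the previous model's output,
    # i.e. synthetic data feeding back into the training set.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```

Run it and the fitted standard deviation tends to drift downward across generations: the tails of the distribution get lost first, which is the toy version of the slop homogenizing.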
The models could have been great, but the release was rushed and they were made available too early.
If 60% of the posts on Reddit are from bots (a number I may have made up, but I feel like I read it somewhere), and Reddit is roughly representative of the scraped web, then we can safely assume that about half the data these models are now being trained on is AI-generated.
Rejoice, friends: soon the slop will render them useless.
Ulrich@feddit.org 2 days ago
Not before they render the remainder of the internet useless.