Comment on Valve dev counters calls to scrap Steam AI disclosures, says it's a "technology relying on cultural laundering, IP infringement, and slopification"

Devial@discuss.online 17 hours ago

If the model collapse theory weren’t true, then why do LLMs need to scrape so much data from the internet for training?

According to you, they should be able to just generate synthetic training data purely from the previous model, and then use that to train the next generation.

So why is there even a need for human input at all, then? Why are all the LLM companies fighting tooth and nail against restrictions on their data scraping, if real human data is in fact so unnecessary for model training?

You can stop models from deteriorating without new data, and you can even train them on synthetic data, but that still requires the synthetic data to be curated or filtered by humans to ensure its quality. If you just take a million random ChatGPT outputs, with no human filtering whatsoever, use them to retrain ChatGPT, and then repeat that over and over again, the model will eventually turn to shit. Each iteration, some of the random variations the model introduces into its output will be bad, and those bad outputs are then presented to the next generation as targets to learn, so the new model treats them as less bad than the previous one did.
