Note: the second excerpt is the omitted middle of the same HN comment as the first one.
The current state of LLMs would be several orders of magnitude more impressive if they were only trained from data scrapped on the web. […] The author here seems to see that as a strength, an opportunity for unbounded growth and potential, I think this is the opposite, this approach is close to a gigantic whack a mole game, effectively unbounded, but in the wrong way.
The point of training on custom datasets, as the author himself said, is to solve the “LLMs suck at producing outputs that don’t look like existing data” problem.
But this is not the reality of modern LLMs by a long shot, they are trained in increasingly large parts from custom built datasets that are created by countless paid individuals, hidden behind stringent NDAs.
How the hell does this user claim with certainty that LLMs are trained on certain datasets, if those sets are, according to the same user, “hidden behind stringent NDAs”?
In other words: I’m placing my bets that this user doesn’t know jack shit; they’re just assuming = making shit up = vomiting certainty.
(I personally do not know how much they’re trained on custom datasets vs. web crawling.)
lvxferre@mander.xyz 10 months ago
Then it is not a misconception, you liar.
To cut the crap: the author is talking about synthetic data, or rather about actually paying for what you feed into your model as training data.