Note: the second excerpt is the omitted middle of the same HN comment as the first one.
The current state of LLMs would be several orders of magnitude more impressive if they were only trained from data scrapped on the web. […] The author here seems to see that as a strength, an opportunity for unbounded growth and potential, I think this is the opposite, this approach is close to a gigantic whack a mole game, effectively unbounded, but in the wrong way.
The point of training on custom datasets, as the author himself said, is to solve the “LLMs suck at producing outputs that don’t look like existing data” problem.
But this is not the reality of modern LLMs by a long shot, they are trained in increasingly large parts from custom built datasets that are created by countless paid individuals, hidden behind stringent NDAs.
How the hell does this user claim with certainty that LLMs are trained on certain datasets, if those sets are, according to the same user, “hidden behind stringent NDAs”?
In other words: I’m placing my bets that this user doesn’t know jack shit; they’re just assuming = making shit up = vomiting certainty.
(I personally do not know how much they’re trained on custom datasets vs. web crawling.)
lvxferre@mander.xyz 10 months ago
Then it is not a misconception, you liar.
To cut the crap: the author is talking about synthetic data, or rather about actually paying for what you feed into your model as training data.