Comment

Comment on Why do all text LLMs, no matter how censored they are or what company made them, all have the same quirks and use the slop names and expressions?

damnthefilibuster@lemmy.world ⁨3⁩ ⁨months⁩ ago

It’s not easy. LLMs take so much training data that at this point, their training data is basically, all books publically available, all blogs on the internet, pretty much all of tumblr, Reddit, stack overflow and every forum you can think of. Even then, some LLMs need even more data. So companies have started outright stealing data - pirating stuff, downloading stuff from Anna’s Archive, etc.

So no, no billion dollar company can make their own training data. Even if you plug in every email ever sent on Gmail, Google still won’t have enough data to train a good LLM. So they go with the cheaper option- training data that has already been collected, sorted, cleaned, and labeled.

In one sense, they’re again stealing others’ hard work - rather than cleaning their own data, they use public data sets. In another sense, even that’s not enough.

source

Sort:hotnew top

snooggums@piefed.world ⁨3⁩ ⁨months⁩ ago
The fact that they train on all available data and are still wrong 45% of the time shows there is zero chance of LLMs ever being an authoritative source of factual knowledge with their current approach

The biggest problem with the current LLM approach is NOT limiting the data set to factual knowledge instead of mashing it in with meme subreddits.

source
- Hackworth@piefed.ca ⁨3⁩ ⁨months⁩ ago
  DeepMind keeps trying to build a model architecture that can continue to learn after training, first with the Titans paper and most recently with Nested Learning. It's promising research, but they have yet to scale their "HOPE" model to larger sizes. And with as much incentive as there is to hype this stuff, I'll believe it when I see it.
  
  source
- kromem@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Actually, OAI the other month found in a paper that a lot of the blame for confabulations could be laid at the feet of how reinforcement learning is being done.
  
  All the labs basically reward the models for getting things right. That’s it.
  
  Notably, they are not rewarded for saying “I don’t know” when they don’t know.
  
  So it’s like the SAT where the better strategy is always to make a guess even if you don’t know.
  
  The problem is that this is not a test process but a learning process.
  
  So setting up the reward mechanisms like that for reinforcement learning means they produce models that are prone to bullshit when they don’t know things.
  
  TL;DR: The labs suck at RL and it’s important to keep in mind there’s only a handful of teams with the compute access for training SotA LLMs, with a lot of incestual team compositions, so what they do poorly tends to get done poorly across the industry as a whole until new blood goes “wait, this is dumb, why are we doing it like this?”
  
  source
- damnthefilibuster@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Yeah, they really need to start building RAG supported models. That way they can actually show where they’re getting their data, and even pay the sources fairly. Imagine a RAG or MCP server connecting to Wikipedia, one to encyclopedia.com, and one to stack overflow.
  
  source
litchralee@sh.itjust.works ⁨3⁩ ⁨months⁩ ago

So no, no billion dollar company can make their own training data

This statement brought along with it the terrifying thought that there’s a dystopian alternative timeline where companies do make their own training data, by commissioning untold numbers of scientists, engineers, artists, researchers, and other specialties to undertake work that no one else has. But rather than trying to further the sum of human knowledge, or even directly commercializing the fruits of that research, that it’s all just fodder to throw into the LLM training set. A world where knowledge is not only gatekept like Elsevier but it isn’t even accessible by humans: only the LLM will get to read it and digest it for human consumption.

Written by humans, read by AI, spoonfed to humans. My god, what an awful world that would be.

source
- witten@lemmy.world ⁨3⁩ ⁨months⁩ ago
  We’re already living in it. Professional voice actors now have the choice between vying for the dwindling number of voice acting gigs or selling their voice to LLM companies as training data.
  
  source
TranquilTurbulence@lemmy.zip ⁨3⁩ ⁨months⁩ ago
So is it like planting the same seeds into different soils, and expecting to get different fruits?

source
- dejected_warp_core@lemmy.world ⁨3⁩ ⁨months⁩ ago
  That’s an extreme simplification, but yes, that’s the gist.
  
  source
sp3ctr4l@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
Well, technically, the AIs… can generate their own additional training data…

But then trying to train another AI on said AI-generated data… well, then the AI starts to develop toward model collapse, basically, it gets more stupid and incoherent, develops weirder and stronger ‘quirks’.

source