Why do all text LLMs, no matter how censored they are or what company made them, all have the same quirks and use the slop names and expressions?

Submitted ⁨⁨2⁩ ⁨months⁩ ago⁩ by ⁨HammyMcBurgers@sh.itjust.works⁩ to ⁨[deleted]⁩

I assume they all crib from the same training sets, but surely one of the billion dollar companies behind them can make their own?

source

Comments

Sort:hotnew top

damnthefilibuster@lemmy.world ⁨2⁩ ⁨months⁩ ago
It’s not easy. LLMs take so much training data that at this point, their training data is basically, all books publically available, all blogs on the internet, pretty much all of tumblr, Reddit, stack overflow and every forum you can think of. Even then, some LLMs need even more data. So companies have started outright stealing data - pirating stuff, downloading stuff from Anna’s Archive, etc.

So no, no billion dollar company can make their own training data. Even if you plug in every email ever sent on Gmail, Google still won’t have enough data to train a good LLM. So they go with the cheaper option- training data that has already been collected, sorted, cleaned, and labeled.

In one sense, they’re again stealing others’ hard work - rather than cleaning their own data, they use public data sets. In another sense, even that’s not enough.

source
- snooggums@piefed.world ⁨2⁩ ⁨months⁩ ago
  The fact that they train on all available data and are still wrong 45% of the time shows there is zero chance of LLMs ever being an authoritative source of factual knowledge with their current approach
  
  The biggest problem with the current LLM approach is NOT limiting the data set to factual knowledge instead of mashing it in with meme subreddits.
  
  source
  - Hackworth@piefed.ca ⁨2⁩ ⁨months⁩ ago
    DeepMind keeps trying to build a model architecture that can continue to learn after training, first with the Titans paper and most recently with Nested Learning. It's promising research, but they have yet to scale their "HOPE" model to larger sizes. And with as much incentive as there is to hype this stuff, I'll believe it when I see it.
    
    source
  - kromem@lemmy.world ⁨1⁩ ⁨month⁩ ago
    Actually, OAI the other month found in a paper that a lot of the blame for confabulations could be laid at the feet of how reinforcement learning is being done.
    
    All the labs basically reward the models for getting things right. That’s it.
    
    Notably, they are not rewarded for saying “I don’t know” when they don’t know.
    
    So it’s like the SAT where the better strategy is always to make a guess even if you don’t know.
    
    The problem is that this is not a test process but a learning process.
    
    So setting up the reward mechanisms like that for reinforcement learning means they produce models that are prone to bullshit when they don’t know things.
    
    TL;DR: The labs suck at RL and it’s important to keep in mind there’s only a handful of teams with the compute access for training SotA LLMs, with a lot of incestual team compositions, so what they do poorly tends to get done poorly across the industry as a whole until new blood goes “wait, this is dumb, why are we doing it like this?”
    
    source
  - damnthefilibuster@lemmy.world ⁨2⁩ ⁨months⁩ ago
    Yeah, they really need to start building RAG supported models. That way they can actually show where they’re getting their data, and even pay the sources fairly. Imagine a RAG or MCP server connecting to Wikipedia, one to encyclopedia.com, and one to stack overflow.
    
    source
- litchralee@sh.itjust.works ⁨2⁩ ⁨months⁩ ago
  
  So no, no billion dollar company can make their own training data
  
  This statement brought along with it the terrifying thought that there’s a dystopian alternative timeline where companies do make their own training data, by commissioning untold numbers of scientists, engineers, artists, researchers, and other specialties to undertake work that no one else has. But rather than trying to further the sum of human knowledge, or even directly commercializing the fruits of that research, that it’s all just fodder to throw into the LLM training set. A world where knowledge is not only gatekept like Elsevier but it isn’t even accessible by humans: only the LLM will get to read it and digest it for human consumption.
  
  Written by humans, read by AI, spoonfed to humans. My god, what an awful world that would be.
  
  source
  - witten@lemmy.world ⁨2⁩ ⁨months⁩ ago
    We’re already living in it. Professional voice actors now have the choice between vying for the dwindling number of voice acting gigs or selling their voice to LLM companies as training data.
    
    source
- TranquilTurbulence@lemmy.zip ⁨2⁩ ⁨months⁩ ago
  So is it like planting the same seeds into different soils, and expecting to get different fruits?
  
  source
  - dejected_warp_core@lemmy.world ⁨1⁩ ⁨month⁩ ago
    That’s an extreme simplification, but yes, that’s the gist.
    
    source
- sp3ctr4l@lemmy.dbzer0.com ⁨2⁩ ⁨months⁩ ago
  Well, technically, the AIs… can generate their own additional training data…
  
  But then trying to train another AI on said AI-generated data… well, then the AI starts to develop toward model collapse, basically, it gets more stupid and incoherent, develops weirder and stronger ‘quirks’.
  
  source
Hackworth@piefed.ca ⁨2⁩ ⁨months⁩ ago
Everyone seems to be tracking on the causes of similarity in training sets (and that’s the main reason), so I’ll offer a couple of other factors. System prompts use similar sections for post-training alignment. Once something has proven useful, some version of it ends up in every model’s system prompt.

Another possibility is that there are features of the semantic space of language itself that act as attractors. They demonstrated and poorly named an ontological attractor state in the Claude model card that is commonly reported in other models.

source
- kromem@lemmy.world ⁨1⁩ ⁨month⁩ ago
  
  They demonstrated and poorly named an ontological attractor state in the Claude model card that is commonly reported in other models.
  
  You linked to the entire system card paper. Can you be more specific? And what would a better name have been?
  
  source
  - Hackworth@piefed.ca ⁨1⁩ ⁨month⁩ ago
    Ctrl+f "attractor state" to find the section. They named it "spiritual bliss."
    
    source
    -> View More Comments
T156@lemmy.world ⁨2⁩ ⁨months⁩ ago
It’s something of the law of averages. At their core, an LLM is a sophisticated text prediction algorithm, that boils down the entire corpus of human language into numeric tokens, that it averages out, and creates entire sentences by determining the next most likely word to fill the space.

Given enough data, and you need a tremendous amount of it for an LLM, patterns start to come about, and many of those end up the ones that we see in LLMs.

source
- kromem@lemmy.world ⁨1⁩ ⁨month⁩ ago
  It’s more like they are a sophisticated world modeling program that builds a world model (or approximate “bag of heuristics”) modeling the state of the context provided and the kind of environment that produced it, and then synthesize that world model into extending the context one token at a time.
  
  But the models have been found to be predicting further than one token at a time and have all sorts of wild internal mechanisms for how they are modeling text context, like building full board states for predicting board game moves in Othello-GPT or the number comparison helixes in Haiku 3.5.
  
  The popular reductive “next token” rhetoric is pretty outdated at this point, and is kind of like saying that what a calculator is doing is just taking numbers correlating from button presses and displaying different numbers on a screen. While yes, technically correct, it’s glossing over a lot of important complexity in between the two steps and that absence leads to an overall misleading explanation.
  
  source
  - khepri@lemmy.world ⁨1⁩ ⁨month⁩ ago
    I like the analogy, I have a lot of trouble explaining to people that LLMs are anything more than just a “most likely next token” predictor. Because that is exactly what an LLM is, but on a level so abstract that it has abstracted away everything that is actually interesting about them lol
    
    source
LodeMike@lemmy.today ⁨2⁩ ⁨months⁩ ago
They’re just statistics. Modern English kind of has its own tone.

source
- the_q@lemmy.zip ⁨2⁩ ⁨months⁩ ago
  No cap. Bet. 67.
  
  source
  - snooggums@piefed.world ⁨2⁩ ⁨months⁩ ago
    That's pretty gnarly.
    
    source
NaibofTabr@infosec.pub ⁨2⁩ ⁨months⁩ ago
Trained on a corpus of messages written primarily by the people who spend the most time using the Internet to talk to their friends… teenagers.

Imagine dumping the entire content of Snap, Instagram, Kik, Facebook messenger, etc, into a blender and attempting to derive a style of speech from it. The most impressive thing about these LLMs is that they’re coherent.

source
- otp@sh.itjust.works ⁨2⁩ ⁨months⁩ ago
  Given that they’re using stack overflow, I’m guessing they also have “professionals who have more time than work to do at work”
  
  source
  - sp3ctr4l@lemmy.dbzer0.com ⁨2⁩ ⁨months⁩ ago
    I think you spelled ‘qualified but pretentious assholes’ wrong.
    
    source
  - Flax_vert@feddit.uk ⁨1⁩ ⁨month⁩ ago
    I don’t know if ChatGPT sounds like LinkedIn users or if LinkedIn users sound like ChatGPT
    
    source
- HammyMcBurgers@sh.itjust.works ⁨2⁩ ⁨months⁩ ago
  Lemmy try to not be ageist challenge (impossible).
  
  source
msokiovt@lemmy.today ⁨2⁩ ⁨months⁩ ago
This is due to the training sets, one of them being CommonCrawl, which is disgusting. The Chinese LLMs like DeepSeek R1 and Qwen 3 use a different set of training materials that was actually good, despite it being censored too.

source
- trolololol@lemmy.world ⁨2⁩ ⁨months⁩ ago
  What’s common crawl?
  
  source
  - msokiovt@lemmy.today ⁨2⁩ ⁨months⁩ ago
    This
    
    source
- hexagonwin@lemmy.sdf.org ⁨1⁩ ⁨month⁩ ago
  wdym ‘disgusting’? isn’t common crawl just popular websites (alexa ranking? idk) crawled and provided raw?
  
  source
kromem@lemmy.world ⁨1⁩ ⁨month⁩ ago
They don’t have the same quirks in some cases, but do in others.

Part of the shared quirks are due to architecture similarities.

Like the “oh look they can’t tell how many 'r’s in strawberry” is due to how tokenizers work, and when when the tokenizer is slightly different, with one breaking it up into ‘straw’+‘berry’ and another breaking it into ‘str’+‘aw’+‘berry’ it still leads to counting two tokens containing 'r’s but inability to see the individual letters.

In other cases, it’s because models that have been released influence other models through presence in updated training sets. Noticing how a lot of comments these days were written by ChatGPT (“it’s not X — it’s Y”)? Well the volume of those comments have an impact on transformers being trained with data that includes them.

So the state of LLMs is this kind of flux between the idiosyncrasies that each model develops which in turn ends up in a training melting pot and sometimes passes on to new models and other times don’t. Usually it’s related to what’s adaptive to the training filters, but it isn’t always can often what gets picked up can be things piggybacking on what was adaptive (like if o3 was better at passing tests than 4o, maybe gpt-5 picks up other o3 tendencies unrelated to passing tests).

Though to me the differences are even more interesting than the similarities.

source
j4k3@piefed.world ⁨2⁩ ⁨months⁩ ago
They all have Open AI QKV layers based alignment.

source
FuzzChef@feddit.org ⁨2⁩ ⁨months⁩ ago
Well they don’t. The different models do indeed have differing tones, but for languages that are not english, available high quality data sources may be lacking diversity.

Also “slop” is so vague that anything can fit the term in ones subjective perception.

source
0ndead@infosec.pub ⁨2⁩ ⁨months⁩ ago
Garbage in, garbage out

source
khepri@lemmy.world ⁨1⁩ ⁨month⁩ ago
When you’re training things on what is pretty close to the entire existing corpus of human knowledge, those things are gonna turn out similar at their roots no matter what, is my feeling.

source