CodeInvasion

@CodeInvasion@sh.itjust.works


⁨Comment⁩ on ⁨I'm gonna die on this hill or die trying⁩ ⁨⁨4⁩ ⁨months⁩ ago⁩:
The tokenizer is capable of decoding spaceless tokens into compound words following a set of rules referred to as a grammar in Natural Language Processing (NLP). I do LLM research and have spent an uncomfortable amount of time staring at the encoded outputs of most tokenizers when debugging. Normally spaces are not included.

There is of course a token for spaces in special circumstances, but I don’t know exactly how each tokenizer implements those spaces. So it does make sense that some models would be capable of the behavior you find in your tests, but that appears to be an emergent behavior, and it is very interesting to see it work successfully.

I intended for my original comment to convey the idea that it’s not surprising that LLMs might fail at following the instructions to include spaces, since they normally don’t see spaces except in special circumstances. Similarly, it’s unsurprising that LLMs are bad at numerical operations because of how they use Markov chain probability to predict each next token, one at a time.
⁨Comment⁩ on ⁨I'm gonna die on this hill or die trying⁩ ⁨⁨4⁩ ⁨months⁩ ago⁩:
This is because spaces typically are not encoded by model tokenizers.

In many cases it would be redundant to show spaces, so tokenizers collapse them down to no spaces at all. Instead the model reads tokens as if the spaces never existed.

For example it might output: thequickbrownfoxjumpsoverthelazydog

Except it would actually be a list of numbers like: [1, 256, 6273, 7836, 1922, 2244, 3245, 256, 6734, 1176, 2]

Then the tokenizer decodes this and adds the spaces because they are assumed to be there. The tokenizer has no knowledge of your request, and the model output typically does not include spaces, hence your output sentence will not have double spaces.
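
To make that decoding step concrete, here is a minimal sketch in Python. The vocabulary is made up to match the example ids above and does not come from any real tokenizer; it assumes a purely word-level scheme where the decoder joins words with single spaces.

```python
# Hypothetical word-level vocabulary; the ids are illustrative only and are
# not taken from any real tokenizer.
VOCAB = {
    1: "<s>", 2: "</s>",
    256: "the", 6273: "quick", 7836: "brown", 1922: "fox",
    2244: "jumps", 3245: "over", 6734: "lazy", 1176: "dog",
}
SPECIAL = {"<s>", "</s>"}


def decode(ids):
    """Map ids back to words, drop start/end markers, join with single spaces."""
    words = [VOCAB[i] for i in ids if VOCAB[i] not in SPECIAL]
    return " ".join(words)


ids = [1, 256, 6273, 7836, 1922, 2244, 3245, 256, 6734, 1176, 2]
print(decode(ids))
```

In this toy scheme the decoder always inserts exactly one space between words, so there is no sequence of ids that could produce a double space.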
⁨Comment⁩ on ⁨When leftists say "landlord are parasites" or similar dislike of landlords, do they also mean the people that own like a couple of houses as an investment, or only the big landlords?⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
I believe it could and should be made harder, but it is already a high barrier to purchase an investment property. For a business loan on residential housing, an investor needs 25-30% down payment for the property. Also I think the longest terms are 15 years and not 30, but I could be wrong.

All the small time landlords acquired their homes through primary residence loans, which allow for PMI and smaller down payments that only exist because they are subsidized by the government. A primary residence loan requires either that the owner lie to the government and the bank, which puts them at serious liability in the sense that the bank could make the loan due immediately if found out, or that the owner has lived in that home for at least one year.
⁨Comment⁩ on ⁨When leftists say "landlord are parasites" or similar dislike of landlords, do they also mean the people that own like a couple of houses as an investment, or only the big landlords?⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Maybe I want to move back into it… And selling has a 10% cost after realtor fees and closing fees.
⁨Comment⁩ on ⁨When leftists say "landlord are parasites" or similar dislike of landlords, do they also mean the people that own like a couple of houses as an investment, or only the big landlords?⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
Based on the amount of vitriol I’ve personally received on this site for renting one property while I am temporarily relocated to attend school, the answer is yes.

For some reason everyone views being a landlord as easy money. But in reality returns on investment are worse than the stock market for being the landlord of a single family home.
⁨Comment⁩ on ⁨Humor⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
It took Hawking minutes to create some responses. Without the use of his hand due to his disease, he relied on the twitch of a few facial muscles to select from a list of available words.

As funny as it is, that interview, or any interview with Hawking, contains pre-drafted responses from Hawking and follows a script.

But the small facial movements showing his emotion still showed Hawking had fun doing it.
⁨Comment⁩ on ⁨What is a good eli5 analogy for GenAI not "knowing" what they say?⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
To add to this insight, there are many recent publications showing the dramatic improvements of adding another modality like vision to language models.

While this is my conjecture that is loosely supported by existing research, I personally believe that multimodality is the secret to understanding human intelligence.
⁨Comment⁩ on ⁨What is a good eli5 analogy for GenAI not "knowing" what they say?⁩ ⁨⁨1⁩ ⁨year⁩ ago⁩:
I am an LLM researcher at MIT, so hopefully this will help.

As others have answered, LLMs have only learned the ability to autocomplete given some input, known as the prompt. Functionally, the model is strictly predicting the probability of the next word^+^ (called a token), with some randomness injected so the output isn’t exactly the same every time for a given prompt.
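
As a rough sketch of "predict probabilities, then inject randomness": the scores below are invented, and `temperature` is the usual name for the knob that controls how much randomness is injected. This is only the sampling step, not a model.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(tokens, logits, temperature=1.0, rng=random):
    """Pick one candidate token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores for the prompt "the cat sat on the".
tokens = ["mat", "sofa", "moon"]
logits = [2.0, 1.0, -1.0]
probs = softmax(logits)
print(dict(zip(tokens, probs)))
print(sample_next(tokens, logits))
```

Running the last line several times will usually print "mat", but not always, which is the injected randomness at work.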

The probability of the next word comes from what was in the model’s training data, in combination with a very complex mathematical method to compute the impact of all previous words with every other previous word and with the new predicted word, called self-attention, but you can think of this like a computed relatedness factor.
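
Here is a heavily simplified sketch of that relatedness computation, assuming each word is already a small vector of numbers. The vectors are hypothetical, and real models add learned projections, multiple heads, and a value-mixing step that are left out here.

```python
import math


def attention_weights(query, keys):
    """Score one word (query) against every word (keys), then normalize."""
    d = len(query)
    # Scaled dot product: a larger value means "more related".
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional vectors standing in for three previous words.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights = attention_weights(query, keys)
print(weights)
```

The third vector points in the same direction as the query, so it receives the largest weight, while the first two tie.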

This relatedness factor is very computationally expensive, and its cost grows quadratically with the number of words, so models are limited by how many previous words can be used to compute relatedness. This limitation is called the context window. The recent breakthroughs in LLMs come from the use of very large context windows to learn the relationships of as many words as possible.
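
A back-of-the-envelope sketch of why the window has to be capped: in standard self-attention every token is scored against every token, so the number of scores is n × n. The function names here are made up for illustration.

```python
def pairwise_scores(n_tokens):
    """Number of relatedness scores when every token is compared to every token."""
    return n_tokens * n_tokens


def clip_to_window(tokens, window):
    """Keep only the most recent `window` tokens; older ones are dropped."""
    if window <= 0:
        raise ValueError("window must be positive")
    return tokens[-window:]


for n in (1_000, 2_000, 4_000):
    print(n, pairwise_scores(n))

print(clip_to_window(list(range(10)), 4))
```

Doubling the number of tokens quadruples the number of scores, which is why anything outside the window is simply dropped rather than scored.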

This process of predicting the next word is repeated iteratively until a special stop token is generated, which tells the model to stop generating more words. So literally, the model builds entire responses one word at a time from left to right.
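
The loop itself can be sketched like this. The lookup table is a made-up stand-in for the model, and it only looks at the last token, whereas a real LLM conditions on everything in its context window.

```python
import random

STOP = "<stop>"

# Hypothetical next-token probabilities standing in for a trained model.
NEXT = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {STOP: 1.0},
}


def generate(prompt, rng, max_tokens=20):
    """Append one sampled token at a time until the stop token appears."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = NEXT[tokens[-1]]
        choice = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if choice == STOP:
            break  # the stop token ends generation and is not emitted
        tokens.append(choice)
    return tokens


out = generate(["<start>"], random.Random(0))
print(out)
```

Each pass through the loop only ever adds one token to the right-hand end, which is the left-to-right behavior described above.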

Because all future words are predicated on the previously stated words in either the prompt or subsequent generated words, it becomes impossible to apply even the most basic logical concepts, unless all the components required are present in the prompt or have somehow serendipitously been stated by the model in its generated response.

This is also why LLMs tend to work better when you ask them to work out all the steps of a problem instead of jumping to a conclusion, and why the best models tend to rely on extremely verbose answers to give you the simple piece of information you were looking for.

From this fundamental understanding, hopefully you can now reason about the LLM limitations in factual understanding as well. For instance, if a given fact was never mentioned in the training data, or an answer simply doesn’t exist, the model will make it up, inferring the next most likely word to create a plausible sounding statement. Essentially, the model has been faking language understanding so much that even when the model has no factual basis for an answer, it can easily trick an unwitting human into believing the answer to be correct.

---

^+^ More specifically, these words are tokens, which usually contain some smaller part of a word. For instance, understand and able would be represented as two tokens that, when put together, become the word understandable.
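
A toy sketch of that splitting, using greedy longest-match against a made-up list of sub-words. Real tokenizers learn their vocabulary from data (for example with byte-pair encoding), so treat this as an illustration of the idea rather than how any particular tokenizer works.

```python
# Hypothetical sub-word vocabulary, for illustration only.
SUBWORDS = {"understand", "able", "un", "der", "stand"}


def tokenize(word, vocab):
    """Split a word by repeatedly taking the longest piece found in vocab."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here; fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


pieces = tokenize("understandable", SUBWORDS)
print(pieces)
print("".join(pieces))
```

Joining the pieces back together with no separator recovers the original word.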
⁨Comment⁩ on ⁨Idris Elba: Actors in video games like Phantom Liberty is 'sign of the times'⁩ ⁨⁨2⁩ ⁨years⁩ ago⁩:
“Beep… Beep… Beep…” -Sputnik