Can you actually explain what in my reply is “Fear, uncertainty, and doubt”? Did you actually read it? I even linked to the specific GitHub repository, which is basically empty. You just linked to an overview, which doesn’t point to any source code.
Please explain what’s FUD here and link to the source code; otherwise, don’t accuse people of FUD if you don’t know what you’re talking about.
jarfil@beehaw.org 1 week ago
Where’s the training data?
Crotaro@beehaw.org 1 week ago
Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.
jarfil@beehaw.org 1 week ago
Open source requires giving whatever digital information is necessary to build a binary.
In this case, the “binary” are the network weights, and “whatever is necessary” includes both training data, and training code.
DeepSeek is sharing some of those pieces. In other words: a good amount of open source… with a huge binary blob in the middle.
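To illustrate the definition above with a toy sketch (not DeepSeek’s actual pipeline): if the weights are a pure function of training code plus training data, then withholding either one makes the released blob irreproducible. Here a hash stands in for a training run:

```python
import hashlib

def train(training_code_version: str, training_data: list[str]) -> str:
    """Toy stand-in for a training run: the resulting 'weights' are
    fully determined by the training code and the training data."""
    h = hashlib.sha256(training_code_version.encode())
    for example in training_data:
        h.update(example.encode())
    return h.hexdigest()  # stand-in for the released weight blob

# With both code and data, anyone can rebuild the exact same "binary":
weights_a = train("v1", ["example 1", "example 2"])
weights_b = train("v1", ["example 1", "example 2"])
assert weights_a == weights_b

# Without the original data, the same code cannot rebuild those weights:
assert train("v1", ["different data"]) != weights_a
```

Real training adds hardware nondeterminism, but the dependency is the same: no data, no rebuild.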
teawrecks@sopuli.xyz 1 week ago
Is there any good LLM that fits this definition of open source, then? I thought the “training data” for good AI was always just: the entire internet, and they were all ethically dubious that way.
What is the concern with only having weights? It’s not arbitrary code execution, so there’s no security risk or loss of control over your own computing, which are the usual motivations for open source in the first place.
To me the weights are less of a “blob” and more like an approximate solution to an NP-hard problem. Training is traversing the search space, and sharing a model is just saying “hey, this point looks useful, others should check it out”. But maybe that is a blob, since I don’t know how they got there.
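That search-space analogy can be sketched with a deliberately tiny example (a hypothetical 1-D loss, nothing like a real LLM objective): each training step moves through parameter space, and the shared model is just the end point, with the trajectory and the data that defined the landscape left out.

```python
def loss(w: float) -> float:
    # Toy 1-D "landscape"; in real training the landscape is defined
    # by the (withheld) training data.
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    return 2.0 * (w - 3.0)

w = 0.0  # starting point in the search space
for _ in range(100):
    w -= 0.1 * grad(w)  # each step traverses toward a low-loss region

# Sharing the model = publishing only this final point, w ≈ 3.0,
# not the path taken or the landscape it was found on.
print(round(w, 3))
```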
Crotaro@beehaw.org 1 week ago
Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it’s excellent to have competition on the market and it feels that the US’ whole “But they’re spying on you and being a national security risk” is a hypocritical outcry when Facebook, OpenAI and the like still exist.
What do you think about Deepseek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies, and (hopefully more often than not) correctly guessed answers to questions asked by humans.
p03locke@lemmy.dbzer0.com 1 week ago
Nobody releases training data. It’s too large and varied. The best I’ve seen was the LAION-2B set that Stable Diffusion used, and that’s still just a big collection of links. Even that isn’t going to fit in a GitHub repo.
Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.
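The size point above can be ball-parked. Assuming roughly 300 bytes per metadata row (an illustrative figure for a URL plus caption and scores, not a measured one), a links-only dataset of 2 billion rows is already far beyond what a Git repository can hold:

```python
rows = 2_000_000_000      # LAION-2B order of magnitude
bytes_per_row = 300       # assumed average: URL + caption + scores
total_gb = rows * bytes_per_row / 1e9

# Hundreds of GB of metadata alone, before downloading a single image;
# GitHub repos are expected to stay in the low single-digit GB range.
print(total_gb)
```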
thingsiplay@beehaw.org 1 week ago
That’s why it’s not Open Source. They don’t release the source, and it’s impossible to build the model from source.
jarfil@beehaw.org 1 week ago
What about these? Dozens of TB here:
huggingface.co/HuggingFaceFW
There is also a LAION-5B now, and several other datasets.
p03locke@lemmy.dbzer0.com 1 week ago
Wow, it’s like you didn’t even read my post.