Can you actually explain what in my reply is “Fear, uncertainty, and doubt”? Did you actually read it? I even linked to the specific GitHub repository, which is basically empty. You just linked to an overview, which doesn’t point to any source code.
Please explain what’s FUD here and link to the source code; otherwise, don’t accuse people of FUD if you don’t know what you’re talking about.
jarfil@beehaw.org 1 week ago
Where’s the training data?
Crotaro@beehaw.org 1 week ago
Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.
jarfil@beehaw.org 1 week ago
Open source requires giving whatever digital information is necessary to build a binary.
In this case, the “binary” are the network weights, and “whatever is necessary” includes both training data, and training code.
DeepSeek is sharing some of those pieces. In other words: a good amount of open source… with a huge binary blob in the middle.
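To illustrate the definition above with a toy sketch (not DeepSeek’s actual pipeline): if the weights are a pure function of training code plus training data, then withholding either one makes the released blob irreproducible. Here a hash stands in for a training run:

```python
import hashlib

def train(training_code_version: str, training_data: list[str]) -> str:
    """Toy stand-in for a training run: the resulting 'weights' are
    fully determined by the training code and the training data."""
    h = hashlib.sha256(training_code_version.encode())
    for example in training_data:
        h.update(example.encode())
    return h.hexdigest()  # stand-in for the released weight blob

# With both code and data, anyone can rebuild the exact same "binary":
weights_a = train("v1", ["example 1", "example 2"])
weights_b = train("v1", ["example 1", "example 2"])
assert weights_a == weights_b

# Without the original data, the same code cannot rebuild those weights:
assert train("v1", ["different data"]) != weights_a
```

Real training adds hardware nondeterminism, but the dependency is the same: no data, no rebuild.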
teawrecks@sopuli.xyz 1 week ago
Is there any good LLM that fits this definition of open source, then? I thought the “training data” for good AI was always just: the entire internet, and they were all ethically dubious that way.
What is the concern with only having weights? It’s not arbitrary code execution, so there’s no security risk or loss of control over your own computing, which are the usual motivations for open source in the first place.
To me the weights are less of a “blob” and more like an approximate solution to an NP-hard problem. Training is traversing the search space, and sharing a model is just saying “hey, this point looks useful, others should check it out”. But maybe that is a blob, since I don’t know how they got there.
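That search-space analogy can be sketched with a deliberately tiny example (a hypothetical 1-D loss, nothing like a real LLM objective): each training step moves through parameter space, and the shared model is just the end point, with the trajectory and the data that defined the landscape left out.

```python
def loss(w: float) -> float:
    # Toy 1-D "landscape"; in real training the landscape is defined
    # by the (withheld) training data.
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    return 2.0 * (w - 3.0)

w = 0.0  # starting point in the search space
for _ in range(100):
    w -= 0.1 * grad(w)  # each step traverses toward a low-loss region

# Sharing the model = publishing only this final point, w ≈ 3.0,
# not the path taken or the landscape it was found on.
print(round(w, 3))
```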
Crotaro@beehaw.org 1 week ago
Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole Deepseek happening from a technical standpoint. I think it’s excellent to have competition on the market and it feels that the US’ whole “But they’re spying on you and being a national security risk” is a hypocritical outcry when Facebook, OpenAI and the like still exist.
What do you think about Deepseek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies, and (hopefully more often than not) correctly guessed answers to questions asked by humans.
p03locke@lemmy.dbzer0.com 1 week ago
Nobody releases training data. It’s too large and varied. The best I’ve seen was the LAION-2B set that Stable Diffusion used, and that’s still just a big collection of links. Even that isn’t going to fit in a GitHub repo.
Besides, improving the model means using the model as a base and implementing new training data. Specialize, specialize, specialize.
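The size point above can be ball-parked. Assuming roughly 300 bytes per metadata row (an illustrative figure for a URL plus caption and scores, not a measured one), a links-only dataset of 2 billion rows is already far beyond what a Git repository can hold:

```python
rows = 2_000_000_000      # LAION-2B order of magnitude
bytes_per_row = 300       # assumed average: URL + caption + scores
total_gb = rows * bytes_per_row / 1e9

# Hundreds of GB of metadata alone, before downloading a single image;
# GitHub repos are expected to stay in the low single-digit GB range.
print(total_gb)
```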
thingsiplay@beehaw.org 1 week ago
That’s why it’s not Open Source. They don’t release the source, and it’s impossible to build the model from source.
jarfil@beehaw.org 1 week ago
What about these? Dozens of TB here:
huggingface.co/HuggingFaceFW
There is also a LAION-5B now, and several other datasets.
p03locke@lemmy.dbzer0.com 1 week ago
Wow, it’s like you didn’t even read my post.