Comment on Creating a torrent that includes all of humanity's knowledge/art/entertainment?
rufus@discuss.tchncs.de 1 year ago
“All” is impossible. You’re going to miss something. And it’s a lot of work. Maybe have a look at the datasets people/researchers use to train Artificial Intelligence. I think some people put in the effort to compile large datasets with just freely licensed data.
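If you want to get a feel for what's in those freely licensed datasets, the Hugging Face `datasets` library lets you stream them without downloading everything. Something like this sketch would do it (the dataset name/config is just one example of a freely licensed corpus):

```python
# Rough sketch: peek at a freely licensed corpus without downloading all of it.
# The dataset name/config here is just an example (Wikipedia is CC BY-SA licensed).
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
for article in wiki.take(3):
    print(article["title"], len(article["text"]), "characters")
```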
AnarchistsForDemocracy@lemmy.world 1 year ago
So, per your suggestion, one could use for example the zlibrary book/paper repo and OpenAI's training sets as a starting point and maybe get around the brunt of the work.
rufus@discuss.tchncs.de 1 year ago
Well, this depends a bit on what you’re trying to achieve here. If you want to learn about machine learning, you can do other things than train an LLM. And I think there are some online courses out there. I’ve seen one on Huggingface (the NLP Course), and services like Lambda have lots of examples and tools available. There are other platforms with material and courses, too. I don’t think LLMs are the thing to get your feet wet with, though.
If you want to create something useful, maybe taking a base/foundation model and fine-tuning it would be the best approach. I think training a halfway-decent model from zero costs something like a five-figure amount of money. I’m not an expert, so I don’t know exactly. But teaching a model to speak a new language is something that can be done after the fact through fine-tuning, and that saves quite a lot of computing time and requires a smaller dataset. That would be the way to “get around the brunt of the work”.
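To give a rough idea of what fine-tuning looks like in practice, here’s a minimal sketch using the Hugging Face transformers and peft libraries with LoRA adapters, which keep the base weights frozen and only train small add-on matrices (that’s what makes it affordable compared to training from zero). The model name and the corpus file are placeholders; swap in whatever base model and data you actually have.

```python
# Minimal fine-tuning sketch with Transformers + PEFT (LoRA).
# Model name and dataset file are placeholders; adjust to your own setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-2-7b-hf"          # placeholder: any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains small adapter matrices instead.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Placeholder dataset: a plain-text corpus in the target language/domain.
data = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```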
If you just want something that speaks French, I think there are already models out there you could simply use. For example Vigogne, and the big ones like Llama 2 (and perhaps Mistral), should be able to do it to some degree.
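Using one of those is basically a few lines. Something like the following sketch, where the model name is just an example of a French-capable instruct model:

```python
# Rough sketch: generate French text with an existing instruction-tuned model.
# The model name is only an example; any French-capable causal LM should work.
from transformers import pipeline

generate = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")
prompt = "Explique en une phrase ce qu'est un torrent :"
print(generate(prompt, max_new_tokens=80)[0]["generated_text"])
```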
So in conclusion, I don’t really know what you’re trying to do, so it is difficult to give advice. And machine learning, natural language processing etc. aren’t super simple. You need the background knowledge, some experience with the different options available, and access to some computing resources to do it on your own. Fortunately, AI is hyped right now: there is traditional literature available (i.e., books), courses, and even tutorials and YouTube videos that explain the details or walk through how to fine-tune something. And most of the scientific papers are available to us nowadays, and they often contain useful details about dataset sizes and the like. I think training something from zero is something few people do, especially as a hobby.