I don’t think training on all public information is super ethical regardless, but to the extent that others may support it, I understand that SO may be seen as fair game. To my knowledge, though, all the big AIs have been trained on GitHub regardless of any individual project’s license.
It’s not about proving individual code theft; it’s about recognizing that the model itself is built from theft. Just because an AI image output might not resemble any preexisting piece of art doesn’t mean it isn’t based on theft. Can I ask what you used that was trained on just a project’s documentation? Considering the amount of data usually needed for coherent output, I would be surprised if it didn’t need some additional data.
Katana314@lemmy.world 2 days ago
The example I gave was more about “context” than “model” - data related to the question, not the model’s training history. I would ask the AI to design a system that interacts with XYZ, and it would be thoroughly confused and have no idea what to do. Then I would ask again, linking it to the project’s documentation page and granting it explicit access to fetch relevant webpages, and it would give a detailed response. That suggests to me it’s only working off of the documentation.
That said, AIs are not strictly honest, so I think you have a point that the original model training may have grabbed data like that at some point regardless. If most AI models don’t track/cite the details of each source used for generation, be it artwork on DeviantArt or licensed GitHub repos, I think it’s fair to say any of those models should become legally liable; more so if there are ways of demonstrating “copying-like” actions from the original.