Comment on The data that powers AI is disappearing fast
autotldr@lemmings.world [bot] 3 months ago
This is the best summary I could come up with:
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now many websites are restricting that access. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
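For anyone curious what those rules look like in practice, here's a minimal sketch using Python's standard-library `urllib.robotparser`. The `GPTBot` user agent is the name OpenAI's crawler announces, used here only as an example of the kind of AI bot sites are blocking; the URL is a placeholder.

```python
from urllib import robotparser

# An example robots.txt of the kind the article describes:
# block the AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The AI crawler is denied, while other bots may fetch the page.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("OtherBot", "https://example.com/article"))     # True
```

Note that robots.txt is purely advisory: a crawler has to choose to check it, which is part of why the restrictions the article describes depend on companies honoring the protocol.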
A.I. companies have struck deals with publishers including The Associated Press and News Corp, the owner of The Wall Street Journal, giving those companies ongoing access to the publishers' content.
Common Crawl, a training data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Mr. Longpre said.
“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said.
“Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers.”
The original article contains 1,235 words, the summary contains 171 words. Saved 86%. I’m a bot and I’m open source!