Your Bluesky Posts Are Probably In A Bunch of AI Datasets Now [404 Media]

⁨81⁩ ⁨likes⁩

Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ by ⁨theangriestbird@beehaw.org⁩ to ⁨technology@beehaw.org⁩

https://www.404media.co/bluesky-posts-machine-learning-ai-datasets-hugging-face/


Comments


sxan@midwest.social ⁨1⁩ ⁨year⁩ ago
Your Mastodon and Lemmy (and all other ActivityPub-talkin’ platforms) posts certainly are. I’m not sure it’s even technically possible to have federation without being open to AI ETLs. A centralized platform, maybe, but I expect this is the price we pay for decentralization.

- Kichae@lemmy.ca ⁨1⁩ ⁨year⁩ ago
  Yeah. A public internet means a public internet, for good and for ill. People have been trained to see the internet as private, and we’re now reaping those sown seeds, and people really hate the harvest.
  
- MagicShel@lemmy.zip ⁨1⁩ ⁨year⁩ ago
  What??? But… I posted a disclaimer and license! How can they slap???
  
- theangriestbird@beehaw.org ⁨1⁩ ⁨year⁩ ago
  great context! thanks for pointing that out
  
- bloup@lemmy.sdf.org ⁨1⁩ ⁨year⁩ ago
  in a practical sense you’re completely right. However, in a legal sense, I am not sure that implementing ActivityPub on your website without restricting federation means you can no longer impose legal conditions on access to the data your website hosts. I am not sure that the nature of the protocol completely absolves you of liability.
  
  to be extra clear: I am not making any kind of claims here. I’m only saying that I am not sure it’s this simple.
  
  - lemmy@acqrs.co.uk ⁨1⁩ ⁨year⁩ ago
    I’m sure you’re allowed to impose legal conditions on your data, but the AI folks have very clearly shown they don’t care and would prefer to just fight it out in court years and years later, if ever.
    
  - sxan@midwest.social ⁨1⁩ ⁨year⁩ ago
    Legislation is so far behind these issues, I expect AP to be replaced by whatever comes next before legal considerations have any impact. And what’s Joe Smallserver going to do? Sue Google?
    
    I agree with your theory, but while in theory, theory is the same as practice, in practice, it isn’t.
    
  - jarfil@beehaw.org ⁨1⁩ ⁨year⁩ ago
    From a copyright point of view… the rights to each piece of content belong to its owner… but each owner is sharing that content with an instance, with the intent of it getting re-shared to further instances.
    
    In a strict sense, most instances are in breach of copyright law: they don’t require users to agree to an EULA specifying how the content will be used, they don’t require federated instances to agree to the same terms, they don’t make end users agree to the terms of other instances, and they generally allow users to submit someone else’s content (see: memes) without the owner’s authorization, then share and re-share it across the federated network. A fully “copyright compliant” protocol would need to have these things baked into it from the beginning… which would make joining the federated network a royal PITA.
    
    With the current approach of “like, chill bro”… anyone can set up an instance, federate with any target instance or with one that federates with the target, and save all the data without any consequences. Receiving federated data carries an implicit consent to process that data, and definitely does nothing to prevent random processing.
    
    Scraping the web endpoint of an instance carries the rules set by the EULA of that endpoint… which tend to be none or, in the best case, those of the least restrictive instance offering that federated data.
    
    All of that is before considering scrapers that simply ignore any requirements.
    
- purplemonkeymad@programming.dev ⁨1⁩ ⁨year⁩ ago
  Probably even easier than places like Twitter, as you can set up a server and others will even push all the data to you.
  
- BigBolillo@beehaw.org ⁨1⁩ ⁨year⁩ ago
  And not just AI datasets but the CIA AI datasets… 🤣🤣
  
Megaman_EXE@beehaw.org ⁨1⁩ ⁨year⁩ ago
I would assume anything posted to the internet is accessible by AI. Nothing is sacred. I’m sure if your voice or face has been posted online, there’s a chance for it to be used by AI in the future.

OneRedFox@beehaw.org ⁨1⁩ ⁨year⁩ ago
If you post in public, it can be scraped; that’s true on Bluesky as well as the Fediverse and also on the centralized corporate platforms. It’s something you have to be mindful of when posting. Using privacy-conscious walled chat apps is the better option for people who want to avoid that, but even those can have leakers in the group chat.

flashgnash@lemm.ee ⁨1⁩ ⁨year⁩ ago
Lemmy is going to be exactly the same, super easy to scrape as it’s all standardised and open

GetOffMyLan@programming.dev ⁨1⁩ ⁨year⁩ ago
Well yeah they are public? Lemmy is indexed by Google. I imagine everything on here is as well.

cupcakezealot@lemmy.blahaj.zone ⁨1⁩ ⁨year⁩ ago
they can have my shitposts

ChaoticNeutralCzech@feddit.org ⁨1⁩ ⁨year⁩ ago
As opposed to Twitter, right? Right??

rickyrigatoni@lemm.ee ⁨1⁩ ⁨year⁩ ago
So? I’m an idiot who likes to make up absurd lies to amuse myself on slow work days so I’m just poisoning their models.
