Comment

This is surely trivial to detect. If the number of pages on the site is greater than some insanely high number then just drop all data from that site from the training data.

It’s not like I can afford to compete with OpenAI on bandwidth, and they’re burning through money with no cares already.

source

Sort:hotnew top

bane_killgrind@slrpnk.net ⁨4⁩ ⁨months⁩ ago
Yeah sure, but when do you stop gathering regularly constructed data, when your goal is to grab as much as possible?

Markov chains are an amazingly simple way to generate data like this, and a little bit of stacked logic it’s going to be indistinguishable from real large data sets.

source
- Valmond@lemmy.world ⁨4⁩ ⁨months⁩ ago
  Imagine the staff meeting:
  
  You: we didn’t gather any data because it was poisoned
  
  Corposhill: we collected 120TB only from harry-potter-fantasy-club.il !!
  
  Boss: hmm who am I going to keep…
  
  source
  - yetAnotherUser@lemmy.ca ⁨4⁩ ⁨months⁩ ago
    The boss fires both, “replaces” them for AI, and tries to sell the corposhill’s dataset to AI companies that make fantasy novels
    
    source
Korhaka@sopuli.xyz ⁨4⁩ ⁨months⁩ ago
You can compress multiple TB nothing with the occasional meme down to a few MB.

source
- essteeyou@lemmy.world ⁨4⁩ ⁨months⁩ ago
  When I deliver it as a response to a request I have to deliver the gzipped version if nothing else. To get to a point where I’m poisoning an AI I’m assuming it’s going to require gigabytes of data transfer that I pay for.
  
  At best I’m adding to the power consumption of AI.
  
  source
  - MonkeMischief@lemmy.today ⁨4⁩ ⁨months⁩ ago
    
    I wonder, can I serve it ads and get paid?
    
    …and it’s just bouncing around and around and around in circles before its handler figures out what’s up…
    
    Heehee I like where your head’s at!
    
    source