Passerby6497@lemmy.world 2 years ago

That’s where you print the downloaded PDF to a new PDF. New hash and same content, good luck tracing it back to me fucko.


Syn_Attck@lemmy.today 2 years ago
Now that this is known, it’s not enough to remove metadata from the PDF itself. Each image inside a PDF, for example, can contain its own metadata.

There are multiple ways of removing ALL metadata from a PDF; here are most of the ones I know of.

It will be slow-ish and probably make the file larger, but if you’re sharing a PDF that only you are supposed to have access to, it’s worth it. MAT or exiftool should work.
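
To illustrate the point about images carrying their own metadata, here is a minimal Python sketch that scans a PDF's raw bytes for embedded JPEG Exif segments. It is only a heuristic check under stated assumptions: it finds Exif blocks in JPEG streams stored as-is, it does not see XMP packets or images wrapped in other compression filters, and it detects rather than strips. The function name `find_exif_offsets` is made up for this example.

```python
EXIF_HEADER = b"Exif\x00\x00"
APP1_MARKER = b"\xff\xe1"


def find_exif_offsets(data: bytes) -> list:
    """Return byte offsets of JPEG APP1 segments that carry an Exif header."""
    offsets = []
    start = 0
    while True:
        pos = data.find(APP1_MARKER, start)
        if pos == -1:
            break
        # Layout: 2-byte APP1 marker, 2-byte segment length, then "Exif\0\0".
        if data[pos + 4:pos + 10] == EXIF_HEADER:
            offsets.append(pos)
        start = pos + 2
    return offsets
```

Calling `find_exif_offsets(open("some.pdf", "rb").read())` before and after cleaning gives a rough idea of whether image-level Exif survived. An empty result does not prove the file is clean.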

- Passerby6497@lemmy.world 2 years ago
  Wouldn’t printing the PDF to a new PDF inherently strip the metadata put there by the publisher?
  
  - sandbox@lemmy.world 2 years ago
    It’s possible, using steganographic techniques, to embed digital watermarks that would not be stripped by simply printing to PDF.
    
    FinalRemix@lemmy.world 2 years ago
    Got it. Print to a low quality JPG, then use AI upscaling to restore the text and graphs.
    
    Syn_Attck@lemmy.today 2 years ago
    This is a great point. Image watermarking steganography is nearly impossible to defeat unless you can obtain multiple copies of the ‘same’ file from multiple users to look for differences. The difference could be as small as 10-15 pixels, each one RGB step off, for example from
    
    rgb(255, 251, 0)
    
    to
    
    rgb(255, 252, 0)
    
    That would be imperceptible to the human eye. Depending on the number of users, more or fewer pixels may need to change.
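
As a rough illustration of the "compare multiple copies" idea, the Python sketch below diffs two pixel lists and reports every position where they disagree. The pixel data and the names `differing_pixels`, `copy_for_user_a` and `copy_for_user_b` are hypothetical. A real comparison would first decode the images from each file, and a real watermark may be spread far more cleverly than this.

```python
def differing_pixels(copy_a, copy_b):
    """Return (index, pixel_a, pixel_b) for every pixel that differs."""
    if len(copy_a) != len(copy_b):
        raise ValueError("copies must have the same number of pixels")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(copy_a, copy_b))
        if a != b
    ]


# Hypothetical 2x2 image, flattened; one green channel is nudged by one.
copy_for_user_a = [(255, 251, 0), (255, 251, 0), (0, 0, 0), (255, 255, 255)]
copy_for_user_b = [(255, 252, 0), (255, 251, 0), (0, 0, 0), (255, 255, 255)]

diffs = differing_pixels(copy_for_user_a, copy_for_user_b)
for index, pixel_a, pixel_b in diffs:
    print(index, pixel_a, pixel_b)
```

With only two copies you learn where these two users differ, not the whole scheme, which is why more copies help.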
    
    There is a ton of work in this field and it’s very interesting, for anyone considering majoring in computer science / information security.
    
    Thann@lemmy.ml 2 years ago
    Which is why you steghide random data into the image to fuck up the other end =]
    
  - Syn_Attck@lemmy.today 2 years ago
    Good question. I believe “Print to PDF” isn’t actually “printing” it page by page as if it was a physical printer, but rather just saving the loaded PDF to a PDF file locally.
    
    I’m not an expert in this field, but you can ask on StackExchange, or ask the authors of MAT and exiftool, or do it yourself: make a PDF containing a jpg file with your own metadata, print it to PDF, then extract the image and let us know here. It would be useful information that I can’t find via search engines. I’m using a smartphone so I can’t do it, but if you do, note that according to the linked SE page you won’t be able to extract the original file extension, so if you use your own .jpg with your own exif data, rename the extracted file to .jpg when finished (I believe exif is handled differently based on file type).
    
    There are multiple tools to add exif data to an image but the exiftool website has some good easy examples for our purpose.
    
    exiftool -artist="Phil Harvey" -copyright="2011 Phil Harvey" YourFile.jpg
    
    (do this as the first step before adding to the PDF)
    
- Zacryon@lemmy.wtf 2 years ago
  Okay, got it. Print the PDF, then scan it and save as PDF.
  
  Or get some monks to get a handwritten copy, like the good old times.
  
Olgratin_Magmatoe@lemmy.world 2 years ago
You’d be safer IRL printing it on a printer without yellow ink, then scanning it, then deleting the metadata from the scan.

ChaoticNeutralCzech@feddit.de 2 years ago
I know PDF providers who visibly print the customer’s name or number in the header of every page, along with short copyright text. I use qpdf --stream-data=uncompress to turn the PDF’s content streams into human-readable, PostScript-like operators, and then Python+regex to remove each header text, which stands out a bit from other PDF elements. The script throws an error if more or fewer elements than pages have been removed, but that hasn’t happened yet. Processed documents sometimes have screwed-up non-ASCII characters in the Table of Contents for some reason, but I don’t have the originals anymore so IDK if it’s my fault. Still, I wouldn’t share the PDFs unless in text-only or printed form because of any other steganographic shenanigans in the file. I would absolutely torrent them if I could repurchase them under a new identity and verify that the files are identical.
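
The regex step could look roughly like the Python sketch below. It is not the commenter's actual script. It assumes the content streams are already uncompressed, that each header is a single literal string shown with the `Tj` operator, and that a made-up pattern such as `Licensed to ...` identifies it. Real headers may use `TJ` arrays, hex strings or escaped parentheses, none of which this handles. The function name `blank_header_text` is hypothetical.

```python
import re


def blank_header_text(pdf_bytes: bytes, header_pattern: bytes, page_count: int) -> bytes:
    """Overwrite one header string per page with spaces of the same length."""
    # A literal string operand "( ... )" followed by the Tj (show text) operator.
    show_text = re.compile(
        rb"\((?P<text>" + header_pattern + rb")\)(?P<gap>\s*)Tj"
    )
    removed = 0

    def blank(match):
        nonlocal removed
        removed += 1
        # Same-length replacement, so byte offsets elsewhere in the file stay valid.
        spaces = b" " * len(match.group("text"))
        return b"(" + spaces + b")" + match.group("gap") + b"Tj"

    cleaned = show_text.sub(blank, pdf_bytes)
    # Mirror the sanity check described above: exactly one header per page.
    if removed != page_count:
        raise ValueError(f"blanked {removed} headers, expected {page_count}")
    return cleaned
```

Blanking the text instead of deleting it is a design choice for this sketch: the file keeps its exact size, so stream lengths and cross-reference offsets do not need to be rebuilt afterwards.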

BTW, has anyone figured out how to embed Python code in PDF? The whitespace always gets reencoded as x-coordinates so copy&pasting it never preserves indentation. No, you can’t use the Ogham Space Mark (Unicode’s only non-blank character classified as a space) for indentation in Python, I tried.
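
The Ogham Space Mark claim is easy to check with a small Python sketch. It asks `unicodedata` how the character is classified, then tries to compile a snippet that uses it as indentation. The helper name `python_accepts_indent` is made up for this example, and the exact error message depends on the Python version, so only the exception type is checked.

```python
import unicodedata

OGHAM_SPACE_MARK = "\u1680"


def python_accepts_indent(indent: str) -> bool:
    """Return True if Python compiles a block indented with `indent`."""
    source = "if True:\n" + indent + "x = 1\n"
    try:
        compile(source, "<indent-test>", "exec")
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


# Unicode puts the character in the space-separator category "Zs" ...
print(unicodedata.category(OGHAM_SPACE_MARK), OGHAM_SPACE_MARK.isspace())
# ... but Python's tokenizer does not treat it as indentation.
print(python_accepts_indent("    "), python_accepts_indent(OGHAM_SPACE_MARK))
```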

IlIllIIIllIlIlIIlI@lemmy.world ⁨2⁩ ⁨years⁩ ago
I saw some that also add background watermarks to random pages and locations.
