Link to notebook here: huggingface.co/…/fusion_t2i_CLIP_interrogator.ipy…
Image shows the list of prompt items before/after running 'remove duplicates' on a subset of Adam Codd's Hugging Face repo of Civitai prompts: huggingface.co/datasets/AdamCodd/…/main
Removing duplicates from the Civitai prompts results in a roughly 90% reduction: from 4.8 million to 0.417 million items!
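The de-duplication step can be done with a simple seen-set that keeps the first occurrence of each prompt. A minimal sketch (not the notebook's actual code):

```python
def remove_duplicates(prompts):
    """Return prompts with exact duplicates removed, preserving order."""
    seen = set()
    unique = []
    for p in prompts:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique
```

For exact string duplicates this is all that's needed; near-duplicates (e.g. differing only in whitespace) would require normalizing the prompts first.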
If you wish to search this set, you can use the notebook above.
Unlike the typical pharmapsychotic CLIP interrogator, I pre-encode the text corpus ahead of time.
Additionally, I'm using quantization on the text corpus to store the encodings as unsigned integers (torch.uint8) instead of float32, using this formula:

q = round(x / scale) + zero_point
For the CLIP encodings, I use scale 0.0043.
The TLDR is that you divide the float32 value by the scale 0.0043, round to the nearest integer, and then add the zero_point so that all values within the encoding are above 0.
A typical zero_point value for a given encoding can be 0, 30, 120 or 250-ish.
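The steps above can be sketched as a round-trip in NumPy (mirroring what torch.uint8 storage would do; the clamp to the uint8 range 0..255 is my assumption for out-of-range values):

```python
import numpy as np

SCALE = 0.0043  # scale value from the post; zero_point varies per encoding

def quantize(x: np.ndarray, scale: float = SCALE):
    """Quantize a float32 encoding to uint8: q = round(x / scale) + zero_point.

    zero_point is chosen per encoding so all quantized values are >= 0.
    """
    q = np.round(x / scale).astype(np.int32)
    zero_point = int(-q.min()) if q.min() < 0 else 0
    q = np.clip(q + zero_point, 0, 255)  # keep within the uint8 range
    return q.astype(np.uint8), zero_point

def dequantize(q: np.ndarray, zero_point: int, scale: float = SCALE):
    """Recover an approximation: x ~ (q - zero_point) * scale."""
    return (q.astype(np.int32) - zero_point).astype(np.float32) * scale
```

The round-trip error is at most half the scale per element, which is negligible for cosine-similarity search, while the corpus shrinks to a quarter of its float32 size.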
In summary, a pretty useful setup for me when I need prompts for stuff.
//—//
I also have a 1.6 million item set of fanfiction tags loaded from archiveofourown.org.
It's mostly character names.
The set is split into two parts, listed as fanfic1 and fanfic2 respectively.
//—//
Upcoming plans are to include a visual representation of the text_encodings as colored cells within a 16x16 grid.
A color is an RGB value (3 integer values) within a given range, and 3 x 16 x 16 = 768, which happens to be the dimension of the CLIP encoding.
So a visual representation of a text_encoding could be shown like this, but with colored cells:
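The mapping itself is just a reshape: each group of 3 consecutive uint8 values becomes one RGB cell. A minimal sketch (the function name is my own, not from the notebook):

```python
import numpy as np

def encoding_to_grid(encoding: np.ndarray) -> np.ndarray:
    """Reshape a 768-dim uint8 text encoding into a 16x16 grid of RGB cells.

    3 channels x 16 rows x 16 columns = 768, matching the CLIP encoding size.
    """
    assert encoding.shape == (768,), "expected a 768-dim encoding"
    return encoding.reshape(16, 16, 3)
```

The resulting (16, 16, 3) array can be displayed directly with e.g. matplotlib's imshow.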
//—//
That's all for this update.