github.com/lllyasviel/Omost

Omost is a project that converts an LLM’s coding capability into image generation (or, more accurately, image composition) capability.

The name Omost (pronunciation: almost) has two meanings: 1) every time you use Omost, your image is almost there; 2) the “O” means “omni” (multi-modal) and “most” means we want to get the most out of it.

Omost provides LLM models that write code to compose the visual content of an image with Omost’s virtual Canvas agent. This Canvas can then be rendered by specific implementations of image generators to actually generate images.

Currently, we provide 3 pretrained LLM models based on variations of Llama3 and Phi3 (see also the model notes at the end of this page).

All models are trained on a mix of data: (1) ground-truth annotations from several datasets, including Open-Images; (2) data extracted by automatically annotating images; (3) reinforcement from DPO (Direct Preference Optimization, using “whether the code can be compiled by Python 3.10 or not” as a direct preference); and (4) a small amount of tuning data from OpenAI GPT4o’s multi-modal capability.
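
To make the "can it be compiled" preference in (3) concrete, here is a minimal sketch of how such a signal could be computed. It is an illustration only, not the training pipeline used for Omost: the helper names `compiles` and `preference_pair` are hypothetical, and it checks against whichever Python version runs it, using only the built-in `compile()`.

```python
from typing import Optional, Tuple


def compiles(source: str) -> bool:
    """Return True if `source` can be compiled as Python code.

    The code is only compiled, never executed, so undefined names
    inside `source` do not matter here.
    """
    try:
        compile(source, "<llm_output>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError: invalid Python; ValueError: e.g. null bytes on some versions.
        return False


def preference_pair(a: str, b: str) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when exactly one candidate compiles.

    Returns None when both or neither compile, because the pair then
    carries no compile-based preference. (Hypothetical helper.)
    """
    ok_a, ok_b = compiles(a), compiles(b)
    if ok_a == ok_b:
        return None
    return (a, b) if ok_a else (b, a)


if __name__ == "__main__":
    good = "x = [1, 2, 3]\nprint(sum(x))"
    bad = "x = [1, 2, 3\nprint(sum(x))"  # missing closing bracket
    print(preference_pair(bad, good) == (good, bad))
```

A pair in which exactly one candidate compiles gives a clear (chosen, rejected) example for preference optimization. Pairs where both or neither compile are skipped in this sketch.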