AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

Submitted 2 years ago by Even_Adder@lemmy.dbzer0.com to stable_diffusion@lemmy.dbzer0.com

https://i.imgur.com/7vQDhip.jpeg

Abstract

As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, has begun to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a Stable Diffusion (SD)-based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and that it improves state-of-the-art performance by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
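For anyone who wants a rough mental model of the four roles the abstract lists, here is a minimal Python sketch of one interaction turn. This is not the authors' code (their repo is not released yet). Every name in it (`SubjectManager`, `generate_layout`, `supervise`, `draw`, `run_turn`) is hypothetical, and the LLM agents and the diffusion-based drawer are replaced with trivial stand-ins so the control flow runs on its own.

```python
from dataclasses import dataclass, field

# Bounding box as (x0, y0, x1, y1), normalized to the [0, 1] canvas.
Box = tuple[float, float, float, float]


@dataclass
class Subject:
    subject_id: str
    description: str


@dataclass
class SubjectManager:
    """Stand-in for the LLM agent that tracks each subject's context."""
    subjects: dict[str, Subject] = field(default_factory=dict)

    def update(self, mentions: dict[str, str]) -> list[Subject]:
        # Keep the first description seen for a subject so that later
        # turns reuse it instead of drifting (assumed policy, for illustration).
        for sid, desc in mentions.items():
            self.subjects.setdefault(sid, Subject(sid, desc))
        return [self.subjects[sid] for sid in mentions]


def generate_layout(subjects: list[Subject]) -> dict[str, Box]:
    """Stand-in layout generator: vertical strips that overflow on purpose."""
    if not subjects:
        return {}
    width = 1.0 / len(subjects)
    return {
        s.subject_id: (i * width - 0.05, 0.2, (i + 1) * width + 0.05, 0.9)
        for i, s in enumerate(subjects)
    }


def supervise(layout: dict[str, Box]) -> dict[str, Box]:
    """Stand-in supervisor: refine the layout by clamping boxes to the canvas."""
    return {
        sid: (max(0.0, x0), max(0.0, y0), min(1.0, x1), min(1.0, y1))
        for sid, (x0, y0, x1, y1) in layout.items()
    }


def draw(subjects: list[Subject], layout: dict[str, Box]) -> dict:
    """Stand-in drawer: returns a render plan instead of running diffusion."""
    return {
        "subjects": [(s.subject_id, s.description, layout[s.subject_id]) for s in subjects]
    }


def run_turn(manager: SubjectManager, mentions: dict[str, str]) -> dict:
    subjects = manager.update(mentions)             # (i) subject manager
    layout = supervise(generate_layout(subjects))   # (ii) layout, (iii) supervisor
    return draw(subjects, layout)                   # (iv) drawer


manager = SubjectManager()
run_turn(manager, {"girl": "a girl in a red coat"})
second = run_turn(manager, {"girl": "a girl", "dog": "a small white dog"})
print(second["subjects"])
```

In the second turn the girl is still drawn from her first-turn description, which is the kind of cross-turn consistency the abstract is after. The Parallel-UNet and subject-initialized generation parts are not modeled here at all.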

Paper: arxiv.org/abs/2406.01388

Code: github.com/donahowe/AutoStudio (coming soon)

Project Page: howe183.github.io/AutoStudio.io/

Comments

CaptainKickass@lemmy.world 2 years ago
More trash for the can

OP created nothing

- felsiq@lemmy.zip 2 years ago
  Whether or not you agree with AI image generation, the authors of this study have pulled off something impressive. This particular study isn’t going to be the single most important thing to humanity this year, sure, but they made a pretty clever stride in pushing a developing field forward and you don’t need to be excited about the field itself to appreciate that.
  I’m assuming your dislike for AI image generation is based on the plagiarism issue, which is absolutely valid, but model architecture is separate from training data and the concepts here are perfectly usable with a more moral training set. The companies scraping all the data (OpenAI, Google, and to a much lesser extent Stability AI) are the ones to blame for that problem, not researchers working on model architecture.
  
  - Even_Adder@lemmy.dbzer0.com 2 years ago
    Have you read this article by Cory Doctorow yet?
    
- Even_Adder@lemmy.dbzer0.com 2 years ago
  You seem a little lost. I don’t think you even know what kind of post you’re commenting on.
  
Even_Adder@lemmy.dbzer0.com 2 years ago
Not gonna lie, I did all this just to post those two comics with context.

tagginator@utter.online [bot] 2 years ago
New Lemmy Post: AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation (https://lemmyverse.link/lemmy.dbzer0.com/post/21588567)
Tagging: #StableDiffusion
(Replying in the OP of this thread (NOT THIS BOT!) will appear as a comment in the lemmy discussion.)
I am a FOSS bot. Check my README: https://github.com/db0/lemmy-tagginator/blob/main/README.md