Abstract

One of the key shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. (1) First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. (2) Second, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art performance on T2I-CompBench with a spatial score of 0.2133 by fine-tuning on fewer than 500 images. (3) Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

Paper: arxiv.org/abs/2404.01197

Code: github.com/SPRIGHT-T2I/SPRIGHT

Demo: huggingface.co/spaces/SPRIGHT-T2I/SPRIGHT-T2I

Project Page: spright-t2i.github.io

Model: huggingface.co/SPRIGHT-T2I/spright-t2i-sd2

Dataset: huggingface.co/datasets/SPRIGHT-T2I/spright
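As a quick-start sketch (not taken from the paper), the released checkpoint and dataset can presumably be used via the Hugging Face diffusers and datasets libraries. The pipeline class, the "train" split name, and the example prompt below are assumptions inferred from the SD2 backbone named in the model repo; check the model and dataset cards for the authoritative usage.

```python
# Hedged quick-start sketch: loading the released SPRIGHT-T2I model and dataset.
# Assumptions: the model repo is diffusers-compatible (SD2 backbone per its name)
# and the dataset exposes a "train" split; consult the Hugging Face cards for specifics.
import torch
from diffusers import DiffusionPipeline
from datasets import load_dataset

# Text-to-image inference with the spatially fine-tuned checkpoint.
pipe = DiffusionPipeline.from_pretrained(
    "SPRIGHT-T2I/spright-t2i-sd2", torch_dtype=torch.float16
).to("cuda")
image = pipe("a cat sitting to the left of a wooden chair").images[0]
image.save("spatial_example.png")

# Stream a few spatially-focused re-captions from the SPRIGHT dataset.
ds = load_dataset("SPRIGHT-T2I/spright", split="train", streaming=True)
for sample in ds.take(3):
    print(sample)
```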
