SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Submitted 8 months ago by Even_Adder@lemmy.dbzer0.com to stable_diffusion@lemmy.dbzer0.com

https://github.com/NVlabs/Sana/raw/main/asset/Sana.jpg

Abstract

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern, small decoder-only LLM as the text encoder and designed complex human instructions with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
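
To make points (1) and (2) of the abstract a bit more concrete, here is a rough Python sketch. It is not taken from the Sana code base. The first function only does the token-count arithmetic implied by 8× versus 32× compression; the `patch_size` parameter is a hypothetical knob added for illustration, since the abstract does not mention patching. The two attention functions illustrate the general idea behind linear attention (reordering the matrix products so no N×N matrix is built). The ReLU feature map is an assumption here; the abstract does not say which linear attention variant is used.

```python
def latent_tokens(image_size, compression, patch_size=1):
    """Number of latent tokens for a square image.

    patch_size is an illustrative assumption; the abstract only states
    the autoencoder compression factors (8x vs 32x).
    """
    side = image_size // (compression * patch_size)
    return side * side


def relu(vec):
    """Element-wise ReLU, used here as an assumed kernel feature map."""
    return [max(0.0, x) for x in vec]


def kernel_attention_quadratic(Q, K, V, eps=1e-6):
    """Reference form: explicit weights for every (query, key) pair.

    Cost grows as O(N^2 * d) in the number of tokens N.
    """
    out = []
    for q in Q:
        fq = relu(q)
        weights = [sum(a * b for a, b in zip(fq, relu(k))) for k in K]
        norm = sum(weights) + eps
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / norm
                    for c in range(len(V[0]))])
    return out


def kernel_attention_linear(Q, K, V, eps=1e-6):
    """Same result via a shared d x d_v summary of keys and values.

    Cost grows as O(N * d * d_v), i.e. linearly in N.
    """
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over j of relu(k_j)^T v_j
    ksum = [0.0] * d                      # sum over j of relu(k_j)
    for k, v in zip(K, V):
        fk = relu(k)
        for i in range(d):
            ksum[i] += fk[i]
            for c in range(dv):
                kv[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = relu(q)
        norm = sum(a * b for a, b in zip(fq, ksum)) + eps
        out.append([sum(fq[i] * kv[i][c] for i in range(d)) / norm
                    for c in range(dv)])
    return out


if __name__ == "__main__":
    for size in (1024, 4096):
        print(size, latent_tokens(size, 8), latent_tokens(size, 32))
```

Under these assumptions, going from 8× to 32× compression cuts the token count by a factor of 16 at any resolution, and the linear form of attention avoids the per-pair work that dominates at high token counts.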

Paper: arxiv.org/abs/2410.10629

Code: github.com/NVlabs/Sana

Models: huggingface.co/…/sana-673efba2a57ed99843f11f9e

Demo: nv-sana.mit.edu

Project Page: hanlab.mit.edu/projects/sana

Comments

Zarxrax@lemmy.world 8 months ago
Looks like an exciting model. I am curious to see how it will do with finetunes and ControlNets. It could potentially be better than SDXL. Unfortunately, it appears to be under a non-commercial license, which might limit interest.

m_f@midwest.social 8 months ago
It’ll be interesting to see where we go when it’s practical to generate video in real time.
