StabilityAI’s newest diffusion model (up to the newly announced Stable Diffusion 3, that is!), StableCascade, is said to produce images faster and better than SDXL via addition of an extra diffusion stage. But how is it in practice?

First, let’s answer the question of “Can it do pretty stuff”, with a clear “yes”.

Image Image Image Image

It’s not yet well integrated into Automatic1111, but can be run with basic features in its own tab via:

github.com/blue-pen5805…

The first set of diffusion steps controls the layout; the other, finetuned detail.

One thing noticed right off the bat is its memory efficiency for large images.

Image Image Image Image

Running on only a 12GB 3060, it has no trouble making images thousands of pixels wide (up to half a dozen megapixels or more), whereas with SDXL you crash out much over the native resolution of 1024x1024.

Stable Cascade also seems to be native to 1024x1024, but seems more lenient against deviation.

Image Image Image Image

That said, different parts of the image still “lose sight of distant parts” over time, and even abstracts like the above start to get somewhat repetitive; the highest resolution images below were 4096x1280 without any upscaling.

Image Image Image Image

Testing with a less abstract prompt of “epic handshake photograph”, we see that the 2048x640 handshake actually looks quite good, but the more elongated ones have coherence problems (though are still quite attractive). I greatly enjoy the 4096x1280 one ;) )

Image Image Image

There are however limits to how large you can go. An attempt to make a rusty steel texture failed at 3072x3072 (9MP), but succeeded at 2560x2560 (6,5MP)

Image

Once Stable Cascade is better integrated into Automatic1111 and img2img and ControlNet can be used with it, it’ll be killer for inpaint & upscale.

There’s some claims that Stable Cascade is better at spelling than SDXL. Well… I’d say “yes”, but don’t expect too much. Top is SDXL, bottom is Stable Cascade, both asked for an ornate sign that says “Welcome to Bluesky!”. Both are mostly fails, but the letters are better formed in Stable Cascade

Image Image

To test its text model, I ask it for a “red cube on top of a blue sphere.” Again, top is SDXL, bottom is Stable Cascade. It’s mostly red cubes and blue spheres, but it just can’t get the order right (normally you don’t balance things on spheres). Also, note the lack of diversity. Keep that in mind.

Image Image

To test two things diffusion models have trouble with - hands and guns - I asked for a clown wielding an AK-47 (SD top, SC bottom, AK47 example also attached). Are Stable Cascade’s hands and guns better? Yeah. But the low image diversity is really problematic. It has a SPECIFIC clown in mind.

Image Image Image

Gemini has been taking flak recently for inserting diversity where it doesn’t belong. Asked to draw an 18th century pope in a cornfield, none of them add in inaccurate racial or gender diversity, but while SD’s image variety (normally a strong suit) isn’t at its strongest here, SC’s is basically nonexistent.

Image Image

In terms of desirable racial / gender diversity, my go-to test is “military personnel”. SDXL (left) tends to get the 1/6th female, 1/3rd black, etc mix of the actual US military. But unsurprisingly… Stable Cascade just gets one particular image in mind for any given prompt, and never deviates far from it.

Image Image

In short… Stable Cascade is a mixed bag. Poor on image diversity, and only marginal improvements in several key areas, but quite pretty, great on memory management, and should become a great tool for use with ControlNet, img2img, and in upscaling.

Image Image Image Image

(Respective prompts:

  1. “The End”. Helical rainforest. Lightning. Plasma. Hope. “End Of Thread” written. “The End”. 2: Glowing words “The End”. Helical rainforest. Lightning. Plasma. Hope. “The End” written. “The End”. 3: Glowing words “The End”. Helical rainforest. Lightning. Plasma. Hope. “The End” written. “The End”. 4: Glowing words “End Of Thread”. Helical rainforest. Lightning. Plasma. Hope. “End Of Thread” written. “End Of Thread”.

First generation of each taken; no cherry picking)