
Stable Diffusion 3 is here

https://stability.ai/news/stable-diffusion-3

https://i.rdrama.net/images/170861539492094.webp

Prompt: Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says "Stable Diffusion 3" made out of colorful energy

Announcing Stable Diffusion 3 in early preview, our most capable text-to-image model with greatly improved performance in multi-subject prompts, image quality, and spelling abilities.

While the model is not yet broadly available, today we are opening the waitlist for an early preview. This preview phase, as with previous models, is crucial for gathering insights to improve its performance and safety ahead of an open release. You can sign up to join the waitlist here.

https://i.rdrama.net/images/17086153939513314.webp

The Stable Diffusion 3 suite of models currently ranges from 800M to 8B parameters. This approach aims to align with our core values and democratize access, providing users with a variety of options for scalability and quality to best meet their creative needs. Stable Diffusion 3 combines a diffusion transformer architecture with flow matching. We will publish a detailed technical report soon.
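For the curious: SD3's exact formulation wasn't public at announcement time, but a generic flow-matching training objective looks roughly like this (toy MLP standing in for the diffusion transformer; all names are illustrative, not SD3's):

```python
# Generic flow-matching objective sketch (rectified-flow style, linear path).
import torch
import torch.nn as nn

# Toy velocity model: takes (noisy sample, time) and predicts a velocity.
denoiser = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))

def flow_matching_loss(x0):
    """x0: a batch of clean data; the model learns the velocity field x1 - x0."""
    x1 = torch.randn_like(x0)                      # noise endpoint
    t = torch.rand(x0.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the straight-line path
    v_pred = denoiser(torch.cat([xt, t], dim=-1))  # predicted velocity at (xt, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()      # regress onto the true velocity

loss = flow_matching_loss(torch.randn(32, 64))
loss.backward()
```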

https://i.rdrama.net/images/17086153942538416.webp

We believe in safe, responsible AI practices. This means we have taken, and continue to take, reasonable steps to prevent the misuse of Stable Diffusion 3 by bad actors. Safety starts when we begin training our model and continues throughout testing, evaluation, and deployment. In preparation for this early preview, we've introduced numerous safeguards. By continually collaborating with researchers, experts, and our community, we expect to innovate further with integrity as we approach the model's public release.

https://i.rdrama.net/images/17086153945279942.webp

Our commitment to ensuring generative AI is open, safe, and universally accessible remains steadfast. With Stable Diffusion 3, we strive to offer adaptable solutions that enable individuals, developers, and enterprises to unleash their creativity, aligning with our mission to activate humanity's potential.

If you'd like to explore using one of our other image models for commercial use prior to the Stable Diffusion 3 release, please visit our Stability AI Membership page to self host or our Developer Platform to access our API.


Discussions

https://news.ycombinator.com/item?id=39466630

https://old.reddit.com/r/StableDiffusion/comments/1ax6h0o/stable_diffusion_3_stability_ai/

!codecels


8B parameters means 16 GB of VRAM even at fp16

in short

:#marseytrollgun:
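Back-of-the-envelope check of that claim (weights only; activations, text encoders, and latents add more on top):

```python
# 8B parameters times bytes-per-parameter, for a few common precisions.
params = 8e9  # largest SD3 variant per the announcement
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# fp32: 32 GB, fp16/bf16: 16 GB, int8: 8 GB
```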


Two RTX 4060 Ti (16+16 = 32GB, ca. $900 total), plus a mainboard that supports two GPUs (another $300?), will probably get you 70% of the speed of one RTX 4090 (ca. $1900), 36% less power consumption, and 33% more VRAM.
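Plugging in the commenter's own figures (prices are their rough estimates, not quotes):

```python
# Cost and VRAM comparison using only the numbers stated above.
dual_cost = 900 + 300              # two RTX 4060 Ti 16GB + dual-GPU mainboard
single_cost = 1900                 # one RTX 4090
dual_vram, single_vram = 16 + 16, 24

print(f"cost: ${dual_cost} vs ${single_cost} ({dual_cost / single_cost:.0%} of the 4090)")
print(f"VRAM: {dual_vram} GB vs {single_vram} GB (+{dual_vram / single_vram - 1:.0%})")
```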


>two GPUs

>16+16=32GB

>He doesn't know


doesn't know what?

probably useful to install GDS (GPUDirect Storage)


zoz


zle


zozzle


:#marseyschizotwitch:


:marseyshook:


VRAM doesn't pool between multiple GPUs, and DirectStorage is not worth it: its bandwidth is incredibly low compared to GDDR6X or HBM (premium memory on datacenter chips), and ML is all about bandwidth, so much so that most consumer optimizations like xformers trade extra computation for reduced memory IO.

EDIT: it's a model based on transformers, so actually, maybe you could use multiple GPUs without it being nonsense
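For context, this is roughly how that compute-for-memory-IO optimization is switched on in the diffusers library today (model ID is a stand-in, since SD3 weights aren't out; assumes diffusers and xformers are installed):

```python
# Sketch: enabling xformers memory-efficient attention in diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",   # placeholder model, not SD3
    torch_dtype=torch.float16,
).to("cuda")

# Avoids materializing the full attention matrix: more recomputation,
# less memory traffic -- the IO-vs-compute tradeoff described above.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("epic anime artwork of a wizard casting a cosmic spell").images[0]
image.save("wizard.png")
```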


Put the first half of the layers on GPU1 and the second half on GPU2. The central layer is usually the lowest-dimensional one, and its values are the only thing you need to transfer from GPU1 to GPU2, so the bandwidth between them isn't the bottleneck.
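A minimal PyTorch sketch of that split, with toy layer sizes (nothing SD3-specific):

```python
# Two-GPU model-parallel split: only the narrow central activation crosses GPUs.
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the layers on GPU 0
        self.first_half = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU(),
                                        nn.Linear(1024, 256)).to("cuda:0")
        # second half on GPU 1
        self.second_half = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(),
                                         nn.Linear(1024, 4096)).to("cuda:1")

    def forward(self, x):
        h = self.first_half(x.to("cuda:0"))
        return self.second_half(h.to("cuda:1"))  # the only cross-GPU transfer

model = TwoGPUNet()
out = model(torch.randn(8, 4096))
```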

>ML is all about bandwidth

Whether or not bandwidth is the bottleneck depends on how much computation must be performed per byte of training data (or per byte of output, or per byte transferred between GPUs).

If you're training a typical neural network, then bandwidth (between storage and GPU) is the bottleneck, but typical neural networks are only a small part of machine learning -- in the sense that chickens, cows, and pigs are only a small part of the animal kingdom, even if they constitute the majority of industrial animal use.
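One way to make "computation per byte" concrete (rough FLOPs-per-byte, ignoring caches and overlap):

```python
# Arithmetic intensity: high -> compute-bound, low -> bandwidth-bound.
n = 4096
matmul_flops = 2 * n**3                 # n x n by n x n matrix multiply
matmul_bytes = 3 * n * n * 4            # read A and B, write C, fp32
print("matmul:", matmul_flops / matmul_bytes, "FLOPs/byte")          # ~683, compute-bound

elementwise_flops = n * n               # e.g. adding a bias
elementwise_bytes = 2 * n * n * 4       # read input, write output
print("elementwise:", elementwise_flops / elementwise_bytes, "FLOPs/byte")  # 0.125, bandwidth-bound
```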

GDS helps improve the speed a little and decreases CPU load when transferring data to/from VRAM.
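A rough sketch of what using GDS looks like from Python via NVIDIA's kvikio bindings (assumes kvikio and cupy are installed and the driver/filesystem actually support GDS; the file path is a placeholder):

```python
# Read a file straight into GPU memory, bypassing a CPU bounce buffer.
import cupy
import kvikio

buf = cupy.empty(16 * 1024 * 1024, dtype=cupy.uint8)  # destination buffer already in VRAM
f = kvikio.CuFile("weights.bin", "r")                 # placeholder file
nbytes = f.read(buf)                                  # DMA from storage into GPU memory
f.close()
print(f"read {nbytes} bytes directly into VRAM")
```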

>GDDR6X or HBM (premium memory on datacenter chips)

I was talking about how a dual 4060 Ti setup compares to a single 4090. Why would I try to compete with $200k worth of professional hardware?

:space:

____

>EDIT: it's a model based on transformers

Not just transformer-based ones; most models can easily be split in two. Distributing the computation among 8 GPUs is hard, and that's where the high bandwidth between professional GPUs helps a lot.


>most models can easily be split in two

Yeah, but that's not the most effective way. Sharding models is especially easy with transformers, and it's usually better than splitting the model when you can do it, but not all ops are shardable, and some are only shardable with great pain.

Splitting also makes the training code weirder: to use both GPUs fully, you have to start computing the next batch before you've received, or even computed, the gradient updates.
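A toy illustration of what "sharding" a layer means here -- a column-parallel linear layer split across two GPUs (real frameworks like Megatron-LM automate this and keep the shards on-device):

```python
# Each GPU holds half of the output features of one linear layer.
import torch

W = torch.randn(1024, 4096)                          # full weight: out_features x in_features
W0, W1 = W[:512].to("cuda:0"), W[512:].to("cuda:1")  # half the output columns per GPU

def sharded_linear(x):
    # the same input goes to both shards; each computes half the output features
    y0 = x.to("cuda:0") @ W0.T
    y1 = x.to("cuda:1") @ W1.T
    return torch.cat([y0.cpu(), y1.cpu()], dim=-1)   # gathered on CPU just for this toy

y = sharded_linear(torch.randn(8, 4096))
print(y.shape)  # torch.Size([8, 1024])
```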


:marseyitsallsotiresome:

Are you a bullshitter professionally, or just trying to annoy me?

"Sharding" vs. "splitting" is not widely agreed-upon terminology. I've only heard "sharding" in relation to data parallelism, so that's what I assume you're talking about (assuming you have any clue what you're talking about at all, which I'm starting to doubt): splitting the (minibatches of) training data across multiple GPUs and computing gradients simultaneously. That only makes sense if the model is small enough to fit on a single GPU.
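That data-parallel setup, sketched with PyTorch's simplest built-in (in practice you'd use nn.parallel.DistributedDataParallel instead):

```python
# Single-node data parallelism: each GPU gets a slice of the minibatch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to("cuda")
model = nn.DataParallel(model)       # replicates the model on every visible GPU

x = torch.randn(64, 512).to("cuda")  # the minibatch is split across the replicas
loss = model(x).sum()
loss.backward()                      # gradients from all replicas are reduced onto GPU 0
```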

I'm talking about model parallelism. There are no consumer-grade GPUs with 32GB of VRAM, so sometimes model parallelism is the only way.

>that's not the most effective way

Again, most of the commonly used neural networks can be split into two parts efficiently, i.e. with little cross-GPU communication. In those cases, even if the model is small enough to fit on each GPU separately, model-parallel and data-parallel implementations end up with similar speedups.

https://i.rdrama.net/images/17086824284516375.webp

from here

