Conversation
Force-pushed from 461db42 to 99cecf1
daanelson left a comment:
this is great! you can push to an internal H100 model on Replicate (just don't leave it running 😄) to test perf in prod; good to have solid metrics on that before we merge
```
@@ -166,12 +167,65 @@ def base_setup(
        shared_models=shared_models,
    )
```
nit - since these are just simple little flags we set during setup for the dev/schnell predictors, I don't mind adding a separate compile_ae flag
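Spelled out, the suggestion might look roughly like this; the signature and attribute names of `base_setup` here are assumptions, not the repo's actual code:

```python
import torch

# Hypothetical sketch of the suggested flag; the repo's actual base_setup
# signature and attributes may differ.
def base_setup(self, compile_ae: bool = False, **kwargs):
    ...
    # Only compile the autoencoder when explicitly requested, since
    # compilation adds noticeable startup time.
    if compile_ae:
        self.ae.decode = torch.compile(self.ae.decode)
```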
```python
# the order is important:
# torch.compile has to recompile if it makes invalid assumptions
# about the input sizes. Having higher input sizes first makes
# for fewer recompiles.
```
any way we can compile once with craftier use of dynamo.mark_dynamic - add a max=192 on dims 2 & 3? I assume you've tried this, curious how it breaks
I tried max=192, but it didn't have any effect. Setting torch.compile(dynamic=True) makes for one fewer recompile, but I should check the runtime performance of that.
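For reference, a minimal sketch of the mark_dynamic idea being discussed; the decoder stand-in, shapes, and bounds are illustrative assumptions, not the PR's actual code:

```python
import torch

# Illustrative only: a tiny conv stands in for the real VAE decoder.
decoder = torch.nn.Conv2d(16, 3, kernel_size=3, padding=1).cuda().to(torch.bfloat16)
decode = torch.compile(decoder)

latents = torch.randn(1, 16, 128, 128, device="cuda", dtype=torch.bfloat16)
# Mark the spatial dims (2 and 3) as dynamic with an upper bound of 192,
# hoping dynamo compiles once instead of once per resolution.
torch._dynamo.mark_dynamic(latents, 2, min=32, max=192)
torch._dynamo.mark_dynamic(latents, 3, min=32, max=192)
image = decode(latents)
```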
Did some H100 benchmarks:

- flux-schnell, 1 image, VAE not compiled
- flux-schnell, 4 images, VAE not compiled
- flux-schnell, 4 images, VAE compiled

The VAE speedup seems reproducible: the uncompiled VAE spends a lot of time in nchwToNhwcKernel, while the compiled version manages to avoid it. At the same time, I had a cog bug saying …
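A quick sketch of how the nchwToNhwcKernel time can be spotted with the PyTorch profiler; `vae` and `latents` are placeholders for the model and input used in the benchmarks above:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Assumed names: `vae` and `latents` stand in for the real model and input.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        vae.decode(latents)

# In the uncompiled run, nchwToNhwcKernel should show up near the top of
# this table; the compiled VAE avoids the layout conversion.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```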
Force-pushed from 99cecf1 to 0039a42
Did you figure out what the cog bug was?
@jonluca as I understand it, it was a regression in cog and should be fixed when building with 0.9.25 and later.
Compiling this takes about 80 seconds on my machine. It makes the encoding step about 50% faster on an A5000 (0.3s -> 0.2s); I haven't tried an H100.
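For concreteness, a rough sketch of the descending-size warm-up described in the quoted comment above; the encoder stand-in and the size list are assumptions, not the repo's actual module or resolutions:

```python
import torch

# Rough sketch: a tiny conv stands in for the real VAE encoder.
encoder = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().to(torch.bfloat16)
encode = torch.compile(encoder)

# Warm up from the largest input down: torch.compile recompiles when its
# shape assumptions break, so seeing the biggest size first means fewer
# recompiles than going small-to-large.
for side in (1536, 1024, 768, 512):
    x = torch.randn(1, 3, side, side, device="cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        encode(x)
```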