Maybe helpful for someone: 4070 Super with 12 GB VRAM and 32 GB RAM, generating one video with the 5B model takes around 15 min

#7
by Dianor - opened

Maybe this will be helpful for someone: on a 4070 Super with 32 GB RAM, generating one video with the 5B model takes around 15 minutes.


How many inference steps do you use? On my PC with 32 GB DDR4 RAM clocked at 3200 MHz and an NVIDIA RTX 4070, it takes around 10 minutes with 50 steps.
Please check in your BIOS/UEFI that your RAM is clocked correctly, and use "pipe.enable_model_cpu_offload()" instead of "pipe.enable_sequential_cpu_offload()".
Depending on your video's style, you can lower the step count to 25.
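For reference, here is a minimal sketch of what that looks like with diffusers. The model id and prompt are placeholders (the thread doesn't pin down the exact checkpoint), but the two offload calls and `num_inference_steps` are the actual knobs being discussed:

```python
# Minimal sketch, assuming a diffusers text-to-video pipeline.
# "THUDM/CogVideoX-5b" and the prompt are placeholders, not confirmed by the thread.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",        # placeholder 5B video model id
    torch_dtype=torch.bfloat16,
)

# Keeps whole sub-models (text encoder, transformer, VAE) on the GPU one at a time.
# Much faster than sequential offload, which shuttles individual layers over PCIe.
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()  # fallback only if model offload still runs out of VRAM

result = pipe(
    prompt="a dog running through a sunny park",  # placeholder prompt
    num_inference_steps=25,                       # 50 = default quality, 25 = roughly half the time
)
export_to_video(result.frames[0], "output.mp4", fps=8)
```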

FYI, I'm running this in ComfyUI on an L40 GPU (the regular one, not the S) and it takes around 565 seconds (~9 min) for the demo dog video to complete (bog-standard settings they used for prompt gen, etc.).

I find it interesting that NVIDIA is doing what they do best here -> a small MHz tweak and a small bump in power, slap an S sticker on it, and rebrand the same card. Oh, servers? You need this one instead, with this sticker. A $9,000 difference in price between the cards. Sweet. Thanks, NVIDIA. <_<

Your 4070 and the L40 I'm using are effectively the same card in our wildly different real-world machines on opposite sides of the planet. The odd results are within tolerance, eh? lol

My server: 8x Xeon CPU, 80 GB RAM, 1x L40 (regular) GPU.

Do you use the GPU like 24/7 or just sometimes? Datacenter/server GPUs are made to run 24/7. IDK how much slower it runs due to int8 and CPU offload, but for me it takes around 732 seconds (12 minutes) with 50 steps, and I usually run it at 25 steps because of the higher speed.
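On the int8 point: the thread doesn't say which quantization path is being used, but one common approach with diffusers is weight-only int8 quantization of the transformer via optimum-quanto before enabling offload. A hedged sketch (placeholder model id, and whether this matches the setup above is an assumption):

```python
# Sketch of int8 weight-only quantization with optimum-quanto (assumed setup,
# not necessarily what was used in this thread).
import torch
from diffusers import DiffusionPipeline
from optimum.quanto import freeze, qint8, quantize

pipe = DiffusionPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",        # placeholder model id
    torch_dtype=torch.float16,
)

# Quantize the heavy denoising transformer to int8 weights, then freeze it so the
# quantized weights replace the fp16 ones and VRAM use drops accordingly.
quantize(pipe.transformer, weights=qint8)
freeze(pipe.transformer)

pipe.enable_model_cpu_offload()

video = pipe("a dog running through a sunny park", num_inference_steps=25).frames[0]
```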

I have it on demand. I spin up a VM on an IBM datacenter blade or whatever their side is using, etc.

They sure are, but at what cost to my/your pocketbook, haha. I wish I could afford to straight up buy an L40S and be done with it. Honestly, with a good swap config, that's all you need for most AIs.

edit: cleaned up the thread topic a bit.

//

Reference speeds so far for the thread topic on my side, in case people want to compare. All machines are spun-up VMs with system RAM equal to or greater than VRAM, and at least 8 vCPUs if not more.
[Default prompt file from their homepage for testing, so everything at baseline, 50 steps, etc.]

A6000 (48 GB VRAM) - using no switches - 8 min fastest time [machine uses 26/48 GB VRAM, loading everything onto the GPU so far]
L40 regular (48 GB VRAM) - using no switches - 6 min 7 sec [machine uses 26/48 GB VRAM, loading everything]
H100 SXM (80 GB VRAM) - using no switches - 2 min 28 sec [It stalls here and cannot go faster. I assume this is because model loading takes a fixed amount of time regardless, before the workload even starts, so I don't think I/you will ever be able to get below this; some things just take a set time to load. Still an insane time to make a single video!]

ADDING:
T4 AWS server > 32 GB system RAM, 4-core CPU, T4 GPU (16 GB VRAM) - using 3 switches (more on these below) - 14 min 22 sec fastest time so far [machine relies on swapping to run, with 30/32 GB system RAM in use while swapping, FYI]
Home-built PC > 64 GB system RAM, 16-core i7, GTX 1070 (8 GB VRAM) - using 3 switches - 28 min fastest time so far [machine relies on swapping to run; WARNING: over 45 GB of system swap while running on 8 GB VRAM! Might just be my config, but it does in fact work on an old dinosaur of a home computer.]
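The thread doesn't spell out which three switches these are; a common low-VRAM trio in diffusers is sequential CPU offload plus VAE slicing and tiling, sketched below under that assumption (model id is again a placeholder):

```python
# Hypothetical "3 switches" low-VRAM configuration -- the exact switches used
# above aren't stated; this is a usual diffusers combo for 8-16 GB cards.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",        # placeholder model id
    torch_dtype=torch.float16,
)

pipe.enable_sequential_cpu_offload()  # streams layers to the GPU one at a time (slowest, lowest VRAM)
pipe.vae.enable_slicing()             # decode the latent batch slice by slice
pipe.vae.enable_tiling()              # decode each frame in tiles rather than all at once
```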
