CUDA out of memory error on 4080 super (16GB)

#114
by error418-teapot - opened

The sample code encounters an out-of-memory error when run.

torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 15.67 GiB of which 30.19 MiB is free. 
Including non-PyTorch memory, this process has 15.13 GiB memory in use. 
Of the allocated memory 12.90 GiB is allocated by PyTorch, and 1.94 GiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
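(For reference, the allocator hint at the end of that message amounts to setting the environment variable before PyTorch initializes CUDA. A minimal sketch of what that looks like in the script itself, nothing Flux-specific:)

```python
# Must be set before PyTorch initializes its CUDA caching allocator,
# so put it at the very top of the script (or export it in the shell).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # the rest of the sample code follows unchanged
```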

Currently running on the following setup:

Ubuntu 24.04.1 LTS
i9-9900k
64GB RAM
RTX 4080 SUPER (16 GB VRAM)
Kernel ver. 6.8.0-41 generic
Driver ver. 550.90.07
CUDA ver. 12.4 (I've heard that 12.1 can speed things up but haven't tried it)

I've also tried running the model through WSL Ubuntu (same machine) and did not encounter the out-of-memory error there; however, speeds were abysmal at 70+ s/it.
I'm lost as to why this is happening, as I see others posting about it working perfectly fine on similar or significantly worse machines.

@error418-teapot Flux will simply not work on a 16 GB VRAM machine; you don't have enough VRAM. It requires at least 24 GB of VRAM in its default precision.

The only reason it was working at 70 s/it is that it was spilling into shared memory, which is essentially CPU RAM rather than GPU RAM, and that slows things down massively.

I would highly recommend something like 8-bit quantization, which lets you run it on a 16 GB VRAM machine with essentially no loss in quality and at much faster speeds.
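Roughly, that means loading the big transformer in 8-bit. A sketch of what I mean, assuming a recent diffusers (>= 0.31) with its bitsandbytes integration installed and the FLUX.1-dev checkpoint (adjust the model id to whichever Flux variant you're actually using):

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

model_id = "black-forest-labs/FLUX.1-dev"  # swap in the Flux checkpoint you're using

# Quantize only the large transformer to 8-bit; text encoders and VAE stay in bf16.
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Keep the remaining components off the GPU until they are actually needed.
pipe.enable_model_cpu_offload()

image = pipe("a teapot on a desk, studio lighting", num_inference_steps=28).images[0]
image.save("teapot.png")
```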


Thanks for the response. It would be nice to have a VRAM requirement listed somewhere in the description (if there is one, I haven't seen it).

For anyone seeing this discussion in the future: Flux does work without 24 GB, but as @YaTharThShaRma999 said, it will use some shared RAM. I got the model working with pipe.enable_sequential_cpu_offload() and am currently getting between 4 and 10 s/it for a 1024x1024 image, so I wouldn't say it simply won't work. My guess is that the earlier bottleneck was WSL.
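In case it saves someone a search, here is roughly what the working setup looks like (a sketch only; the model id, prompt, and step count are just illustrative, adjust to taste):

```python
import torch
from diffusers import FluxPipeline

# Assuming the FLUX.1-dev checkpoint; swap in whichever Flux variant you're running.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Streams weights to the GPU layer by layer instead of keeping the whole model
# resident in VRAM; slower per step, but fits within 16 GB. Don't call
# pipe.to("cuda") when using this.
pipe.enable_sequential_cpu_offload()

image = pipe(
    "a photo of a teapot on a wooden desk",
    height=1024,
    width=1024,
    num_inference_steps=28,
).images[0]
image.save("flux_test.png")
```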

Would be great to see both the vram requirement and the sequential offload trick mentioned somewhere official!
Closing as my issue is resolved.

error418-teapot changed discussion status to closed
