Inference speed

#2
by rmihaylov - opened

Great work! How fast is the inference? They say QLoRA is slow at inference; is that true?

Yeah, pretty slow: it took about 2 minutes to generate 200 tokens with the 40B model. The 7B variant is faster.
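
For context, here is a minimal sketch of the kind of 4-bit generation being timed above, assuming a bitsandbytes NF4 quantization config and a PEFT adapter like the one in this repo; the base model ID and adapter path are placeholders, not the exact ones used here.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "tiiuae/falcon-40b"           # placeholder base model
adapter_id = "path/to/qlora-adapter"    # placeholder PEFT adapter

# 4-bit NF4 quantization, as used for QLoRA-style loading.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pt").to(model.device)

# Time a 200-token generation, roughly matching the numbers quoted above.
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

print(tokenizer.decode(out[0], skip_special_tokens=True))
print(f"~{200 / elapsed:.2f} tokens/sec")
```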

Just shared this finetuning repo: https://github.com/rmihaylov/falcontune. It gives 5-7x faster inference in 4-bit compared to QLoRA's 4-bit.

I also run 4-bit inference there.

Yes, I know, but the forward computation at inference time doesn't use CUDA/Triton kernels, which is why it is slow.
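
As a quick way to see where that overhead lives, here is a hedged sketch that walks a model loaded as in the snippet above and lists its bitsandbytes `Linear4bit` modules, whose forward pass dequantizes the NF4 weights on the fly instead of calling a fused 4-bit kernel (assumes bitsandbytes >= 0.39; the function name is my own).

```python
import bitsandbytes as bnb
import torch.nn as nn

def list_4bit_linears(model: nn.Module) -> int:
    """Print every bitsandbytes Linear4bit layer in `model` and return the count.

    These layers hold NF4-quantized weights and dequantize them inside forward(),
    rather than dispatching to a fused 4-bit CUDA/Triton matmul kernel.
    """
    count = 0
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            print(f"{name}: {module.in_features} -> {module.out_features}")
            count += 1
    return count

# Example, with `model` loaded via BitsAndBytesConfig(load_in_4bit=True) as above:
# n = list_4bit_linears(model)
# print(f"{n} Linear4bit layers dequantize their weights on every forward pass")
```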

Is there a low-code way to move the inference to Triton kernels? FWIW, the inference is happening on the CUDA device in my code.

That is not implemented in bitsandbytes yet.

dfurman changed discussion status to closed
