Inference speed

#2
by rmihaylov - opened

Great work! How fast is the inference? They say QLoRA is slow at inference; is that true?

Yeah, pretty slow: it took about 2 minutes to generate 200 tokens with the 40B model. The 7B variant is faster.
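
For context, here is a minimal sketch of the kind of 4-bit generation being timed above, assuming a bitsandbytes NF4 quantization config and a PEFT adapter like the one in this repo; the base model ID and adapter path are placeholders, not the exact ones used here.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "tiiuae/falcon-40b"           # placeholder base model
adapter_id = "path/to/qlora-adapter"    # placeholder PEFT adapter

# 4-bit NF4 quantization, as used for QLoRA-style loading.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pt").to(model.device)

# Time a 200-token generation, roughly matching the numbers quoted above.
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

print(tokenizer.decode(out[0], skip_special_tokens=True))
print(f"~{200 / elapsed:.2f} tokens/sec")
```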

Just shared this finetuning repo: https://github.com/rmihaylov/falcontune. It gives 5-7x faster inference in 4-bit compared to QLoRA's 4-bit.

I also run 4-bit inference there.

Yes, I know, but the forward computation at inference time doesn't use CUDA/Triton kernels, which is why it is slow.
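
As a quick way to see where that overhead lives, here is a hedged sketch that walks a model loaded as in the snippet above and lists its bitsandbytes `Linear4bit` modules, whose forward pass dequantizes the NF4 weights on the fly instead of calling a fused 4-bit kernel (assumes bitsandbytes >= 0.39; the function name is my own).

```python
import bitsandbytes as bnb
import torch.nn as nn

def list_4bit_linears(model: nn.Module) -> int:
    """Print every bitsandbytes Linear4bit layer in `model` and return the count.

    These layers hold NF4-quantized weights and dequantize them inside forward(),
    rather than dispatching to a fused 4-bit CUDA/Triton matmul kernel.
    """
    count = 0
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            print(f"{name}: {module.in_features} -> {module.out_features}")
            count += 1
    return count

# Example, with `model` loaded via BitsAndBytesConfig(load_in_4bit=True) as above:
# n = list_4bit_linears(model)
# print(f"{n} Linear4bit layers dequantize their weights on every forward pass")
```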

Is there a low-code way to move the inference to Triton kernels? FWIW, the inference is happening on the CUDA device in my code.

That is not implemented in bitsandbytes yet.

dfurman changed discussion status to closed
