Is the model prepared for 16-bit as well?

#3 opened by SergeyOvchinnikov

Hello Ilya!
First of all, thank you very much for the model you have created. Great job!

Could you advise whether your model is meant to run in 16-bit mode (i.e. without the flag load_in_8bit=True)?
I see strange behavior: in 8-bit mode, inference is relatively slow (4-5 tokens per second, even on an A100 GPU), but the quality of the text responses is good.
If I switch 8-bit mode off, i.e. turn 16-bit mode on, it works much faster (15-20 tokens per second on an A100), but the response quality is much lower and the responses are shorter than in 8-bit mode.
My prompts are in Russian.
I wonder whether your LoRA layer works only in 8-bit and whether this layer adds extra time during inference?
Do you see a configuration that would give good response quality (as in 8-bit) but run as fast as in 16-bit? Assume I have enough hardware.
Thank you!

Of course, the base model's precision is float16. You can always merge adapters into it.
There should be almost no difference between 8 and 16 bits.
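For example, here is a minimal sketch of merging a LoRA adapter into the float16 base model with peft; the repo ids are placeholders, substitute the actual base model and adapter:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder ids for illustration; replace with the real repos.
BASE_MODEL = "base-model-id"
ADAPTER = "lora-adapter-id"

# Load the base model in its native float16 precision.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)

# Attach the LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, ADAPTER)
model = model.merge_and_unload()

# The merged model runs without the adapter indirection and can be saved as-is.
model.save_pretrained("merged-model")
```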

Hello again!
Could you advise whether your model can run in native 32-bit mode?
If so, would you be so kind as to give a simple code sample of "from_pretrained" with the parameters needed to run the model in this mode?
Thanks in advance!

[two screenshots attached]
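For reference, a minimal sketch of loading the model in native 32-bit precision, assuming enough GPU memory; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "model-id"  # placeholder; use the actual repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,  # full-precision weights, no quantization
    device_map="auto",          # requires accelerate; places layers automatically
)
```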

Thank you for the answer!
Please provide links to the sources of these screenshots.

Is BNB bitsandbytes or something else?

https://arxiv.org/abs/2208.07339

Yes, it is bitsandbytes.

Good afternoon.
We are planning to use the model as a chatbot (RAG architecture).
Could you advise how to make the model answer a question based only on the provided context?
I have tried various prompt variants, but the model always also draws on the data (knowledge) it was trained on.
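A common pattern, sketched here as an assumption rather than a verified recipe for this particular model, is to put the retrieved passages into the prompt together with an explicit restriction and a fallback instruction:

```python
# Hedged sketch of a context-restricted RAG prompt; the wording is illustrative
# (and in practice would likely be in Russian, matching the user's prompts).
PROMPT_TEMPLATE = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(context: str, question: str) -> str:
    """Assemble the final prompt from the retrieved context and the user question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Lowering the sampling temperature may also help keep answers grounded in the supplied context.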
