Inference Endpoint Setup

#2
by JulianGerhard - opened

Awesome work!

May I ask which setup you chose to run the model? I tried to run it via Docker on a server with 2 x A100 80GB and 142GB RAM - but all I got was an infinite loop of "Waiting for shard 0/1 to be ready...".

A hint would be really cool!

Hugging Face H4 org

Hi @JulianGerhard ! We're currently running this on 2 x A100 (80GB) so it does indeed seem like there's an issue with Inference Endpoints. I've alerted the team internally - thanks!

Hugging Face H4 org

Hey @JulianGerhard I chatted with @philschmid and he showed me that one can deploy the 40B model on Inference Endpoints with 1 x A100 (80GB) by enabling quantization with the text-generation-inference container:

(Screenshot: Inference Endpoints configuration with quantization enabled for the text-generation-inference container)

If you're having trouble with text-generation-inference itself, I recommend opening an issue there.
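
For anyone trying the same thing on their own hardware rather than through the Inference Endpoints UI, here is a minimal sketch of an equivalent self-hosted launch of the text-generation-inference container with quantization enabled. The model id, data directory, and host port are assumptions/placeholders, not settings confirmed in this thread; the `--num-shard` and `--quantize` flags mirror the single-GPU-with-quantization setup described above.

```python
# Minimal sketch: shell out to Docker to start text-generation-inference
# with 8-bit quantization so a 40B model can fit on one A100 80GB.
import subprocess

model_id = "tiiuae/falcon-40b-instruct"  # assumption: substitute the repo you actually deploy
data_dir = "/opt/tgi-data"               # local cache dir for the downloaded weight shards

cmd = [
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",                  # TGI listens on port 80 inside the container
    "-v", f"{data_dir}:/data",
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", model_id,
    "--num-shard", "1",               # single GPU, so a single shard
    "--quantize", "bitsandbytes",     # 8-bit quantization (the setting shown in the screenshot)
]
subprocess.run(cmd, check=True)
```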

Hi @lewtun ,

first of all, thanks a lot for this detailed answer! Philipp told me that you have a limited number of A100s, and since my current use case is primarily for my own interest, I don't want to occupy valuable resources in the meantime.

I started experimenting with the library itself and successfully started my own inference endpoint. It may be worth noting for future readers that even on a capable system like mine, loading the shards with quantization takes about 30 minutes.

Kind regards and thanks again
Julian
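
For readers following along: once the "Waiting for shard 0/1 to be ready..." message clears, the server can be sanity-checked with a request against text-generation-inference's /generate route. This is only a sketch; the URL, token handling, and prompt are placeholders, and for a hosted Inference Endpoint you would add an Authorization header.

```python
# Quick check that the deployed text-generation-inference server is answering.
import requests

base_url = "http://localhost:8080"  # placeholder: or your Inference Endpoint URL
headers = {}                        # for Inference Endpoints: {"Authorization": f"Bearer {hf_token}"}

payload = {
    "inputs": "What does 8-bit quantization do?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}
resp = requests.post(f"{base_url}/generate", json=payload, headers=headers, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
```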

JulianGerhard changed discussion status to closed
