How to run Meta-Llama-3-70B-Instruct-FP8 using several devices?

Opened by Fertel

Please provide an exact script showing how to run the Meta-Llama-3-70B-Instruct-FP8 model across several devices.
It works well when I use only one device:
```python
from vllm import LLM

model = LLM(model=model_path, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

But I get CUDA errors when I run this:

```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

Neural Magic org

Hi @Fertel, what is the stack trace of the error you get? It should work fine with that usage. Also, there is no need to specify the quantization; it will be picked up from the checkpoint.
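For instance, a minimal sketch that drops the `quantization` argument (the `model_path` value here is just a placeholder; point it at wherever your FP8 checkpoint lives):

```python
from vllm import LLM

# Hypothetical path/ID for the FP8 checkpoint; adjust to your setup.
model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"

# No quantization="fp8" argument needed: vLLM reads the quantization
# config directly from the checkpoint.
model = LLM(model=model_path, tensor_parallel_size=2, max_model_len=100)
result = model.generate("Hello, my name is")
```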

Neural Magic org

Set `tensor_parallel_size=NUM_GPUS` when launching.
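One possible sketch of that, sizing the tensor-parallel group to however many GPUs the process can see (the `model_path` value is again a placeholder):

```python
import torch
from vllm import LLM

model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"  # hypothetical checkpoint location

# Use every GPU visible to this process as one tensor-parallel group.
num_gpus = torch.cuda.device_count()

model = LLM(model=model_path, tensor_parallel_size=num_gpus, max_model_len=100)
result = model.generate("Hello, my name is")
```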

> Set `tensor_parallel_size=NUM_GPUS` when launching.

It works now. I have 4xH100, but I had used only 2 of them, which caused the error.

So, the working code is:

```python
import os
from vllm import LLM

# Make only two of the four GPUs visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

or

```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=4, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

Neural Magic org

I don't quite follow.

> I don't quite follow.

In my case, `tensor_parallel_size` had to match the number of visible devices. I had 4 devices, so I should have either set `tensor_parallel_size=4`, or limited the visible devices with `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"` and set `tensor_parallel_size=2`.
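A small sketch of that rule of thumb as you describe it, failing fast if the requested tensor-parallel size does not match what the process can actually see (the checkpoint path and variable names are illustrative):

```python
import os

# Restrict the process to two GPUs; do this before anything initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
from vllm import LLM

tensor_parallel_size = 2
visible_gpus = torch.cuda.device_count()

# Fail fast if the tensor-parallel size does not match the visible devices.
assert tensor_parallel_size == visible_gpus, (
    f"expected {tensor_parallel_size} visible GPUs, found {visible_gpus}"
)

model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"  # hypothetical checkpoint location
model = LLM(model=model_path, tensor_parallel_size=tensor_parallel_size, max_model_len=100)
result = model.generate("Hello, my name is")
```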
