How to run Meta-Llama-3-70B-Instruct-FP8 using several devices?

Opened by Fertel

Please provide an exact script showing how to run the Meta-Llama-3-70B-Instruct-FP8 model across several devices.
It works well when I use only one device:
```python
from vllm import LLM

model = LLM(model=model_path, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

But I get CUDA errors when I run this:

```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

Neural Magic org

Hi @Fertel, what is the stack trace of the error you get? It should work fine with that usage. Also, there is no need to specify the quantization; it will be picked up from the checkpoint.
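For instance, a minimal sketch that drops the `quantization` argument (the `model_path` value here is just a placeholder; point it at wherever your FP8 checkpoint lives):

```python
from vllm import LLM

# Hypothetical path/ID for the FP8 checkpoint; adjust to your setup.
model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"

# No quantization="fp8" argument needed: vLLM reads the quantization
# config directly from the checkpoint.
model = LLM(model=model_path, tensor_parallel_size=2, max_model_len=100)
result = model.generate("Hello, my name is")
```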

Neural Magic org

Set `tensor_parallel_size=NUM_GPUS` when launching.
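One possible sketch of that, sizing the tensor-parallel group to however many GPUs the process can see (the `model_path` value is again a placeholder):

```python
import torch
from vllm import LLM

model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"  # hypothetical checkpoint location

# Use every GPU visible to this process as one tensor-parallel group.
num_gpus = torch.cuda.device_count()

model = LLM(model=model_path, tensor_parallel_size=num_gpus, max_model_len=100)
result = model.generate("Hello, my name is")
```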

> Set `tensor_parallel_size=NUM_GPUS` when launching.

It works now. I have 4xH100, but I had used only 2 of them, which caused the error.

So, the working code is:

```python
import os
from vllm import LLM

# Make only two of the four GPUs visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model = LLM(model=model_path, tensor_parallel_size=2, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

or

```python
from vllm import LLM

model = LLM(model=model_path, tensor_parallel_size=4, quantization="fp8", max_model_len=100)
result = model.generate("Hello, my name is")
```

Neural Magic org

I don't quite follow.

> I don't quite follow.

In my case, `tensor_parallel_size` had to match the number of visible devices. I had 4 devices, so I should have either set `tensor_parallel_size=4`, or limited the visible devices with `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"` and set `tensor_parallel_size=2`.
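A small sketch of that rule of thumb as you describe it, failing fast if the requested tensor-parallel size does not match what the process can actually see (the checkpoint path and variable names are illustrative):

```python
import os

# Restrict the process to two GPUs; do this before anything initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
from vllm import LLM

tensor_parallel_size = 2
visible_gpus = torch.cuda.device_count()

# Fail fast if the tensor-parallel size does not match the visible devices.
assert tensor_parallel_size == visible_gpus, (
    f"expected {tensor_parallel_size} visible GPUs, found {visible_gpus}"
)

model_path = "neuralmagic/Meta-Llama-3-70B-Instruct-FP8"  # hypothetical checkpoint location
model = LLM(model=model_path, tensor_parallel_size=tensor_parallel_size, max_model_len=100)
result = model.generate("Hello, my name is")
```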
