1M inference error on 4x A100 80GB system

#1 by shi3z - opened

Thank you for the excellent results. I immediately tried to run it on my 8x A100 80GB system, but I encountered this error. Do you know of any solutions?

>>> pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
Fetching 20 files: 100%|██████████████████████████████████████| 20/20 [00:00<00:00, 6944.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[... the warning above repeats 4 more times ...]
Exception in thread Thread-35:
Traceback (most recent call last):
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231 

[... the gemm_config warning and the same out-of-memory traceback repeat for Thread-37, Thread-36, and Thread-38 ...]

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/api.py", line 89, in pipeline
    return pipeline_class(model_path,
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 217, in __init__
    self.gens_set.add(self.engine.create_instance())
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 358, in create_instance
    return TurboMindInstance(self, cuda_stream_id)
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 390, in __init__
    t.join()
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt
InternLM org member replied:

Can you share the backend_config?
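For reference, here is a minimal sketch of the kind of backend_config this model usually needs, modeled on the multi-GPU example in the model card. The specific values (rope_scaling_factor, session_len, cache_max_entry_count, tp) are illustrative assumptions to adapt, not a confirmed fix:

from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch based on the model card's 4x A100 80GB example; adjust to the actual hardware.
backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,    # RoPE scaling used by the 1M long-context variant
    session_len=1048576,        # 1M-token context window
    max_batch_size=1,           # long contexts leave little headroom for batching
    cache_max_entry_count=0.7,  # fraction of free GPU memory given to the KV cache
    tp=4)                       # tensor parallelism across 4 GPUs
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)

If a config like this still runs out of memory, the usual knobs are lowering cache_max_entry_count (e.g. to 0.5) or session_len, since TurboMind reserves per-GPU buffers up front when it creates each model instance.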
