Flash Attention

#3
by Neman - opened

Do you plan to implement Flash Attention 2? Or maybe I am doing something wrong here:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).eval().cuda()

Getting this error:
Exception has occurred: ValueError
InternVLChatModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//mnt/disk2/LLM_MODELS/models/MULTIMODAL/Mini-InternVL-Chat-4B-V1-5/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new
File "/home/MULTIMODAL_TESTS/Mini-InternVL-Chat-4B-V1-5_test1.py", line 89, in
model = AutoModel.from_pretrained(
ValueError: InternVLChatModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//mnt/disk2/LLM_MODELS/models/MULTIMODAL/Mini-InternVL-Chat-4B-V1-5/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

OpenGVLab org

Hi, thank you for your interest.

The model is configured to enable Flash Attention by default, so no manual setup is required.
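As a minimal sketch (assuming path points to the same local model directory as in the snippet above), that means simply dropping the attn_implementation argument and letting the model's remote code enable Flash Attention on its own:

import torch
from transformers import AutoModel

# Omit attn_implementation; the model's bundled code turns on Flash Attention
# by default when it is available in the environment.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()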

czczup changed discussion status to closed
