Here is a simple multimodal-style training script to check that the model works.

#60
by besiktas - opened

https://github.com/grahamannett/finetune-fuyu/blob/main/train-simple.py

If anyone would like to test whether their machine can handle Fuyu, here is a small script that generates fake text and images but still runs a complete training loop. It is self-contained and only needs transformers, torch, and simple_parsing installed.
The idea is that, since you may not know whether the model will fit on your hardware, it is better to try this before digging into FSDP/QLoRA/Accelerate.
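
Before running the script at all, a back-of-the-envelope estimate can hint at whether full fine-tuning is even plausible. The sketch below is not part of the linked script; it assumes plain fp32 training with an Adam-style optimizer (weights, gradients, and two optimizer states per parameter) and ignores activations, so it is only a lower bound. The function name and the 8-billion parameter count are placeholders for illustration, not measured values for any Fuyu checkpoint.

```python
def estimate_training_memory_gib(
    num_params: int,
    bytes_per_param: int = 4,             # assumption: fp32 weights and gradients
    optimizer_states_per_param: int = 2,  # assumption: Adam keeps two moments
    bytes_per_state: int = 4,             # assumption: fp32 optimizer states
) -> float:
    """Rough lower bound on training memory in GiB, excluding activations."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    optimizer = num_params * optimizer_states_per_param * bytes_per_state
    return (weights + gradients + optimizer) / 2**30


# Placeholder parameter count, purely illustrative.
example_params = 8_000_000_000
print(f"~{estimate_training_memory_gib(example_params):.0f} GiB before activations")
```

If the estimate is far above your available memory, that is a sign the simple loop will not fit as-is and that sharding or quantization will be needed; actually running the script remains the real test.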

I can add an FSDP/Accelerate/QLoRA example as well, since those can be hard to get working with this model on limited resources.

Can FuyuProcessor be modified to handle both multi-resolution images and multiple images?
I looked through its code and noticed it only processes one image at a time and doesn't support this feature.

It would be great if the training process could support settings for both multi-resolution and multi-image processing.

FuyuProcessor handles multi-resolution images and multiple images as long as each image belongs to a different sample.

The current model does not allow multiple images per sample, but it does seem to work with them if you change gather_continuous_embeddings.
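
To make the idea concrete, here is a toy, list-based sketch of what such a change might look like. It is a guess at the general approach, not the actual transformers code and not necessarily the change described above: the real function works on tensors, and both helper names below are hypothetical. The assumption is that the patch embeddings of all images in a sample are concatenated into one list, and each image-placeholder token carries an index into that list (text tokens carry -1).

```python
def concat_image_patches(per_image_patches):
    """Flatten the patches of several images into one list (hypothetical helper).

    Returns the flat list and the starting offset of each image in it.
    """
    flat, offsets = [], []
    for patches in per_image_patches:
        offsets.append(len(flat))
        flat.extend(patches)
    return flat, offsets


def scatter_patch_embeddings(word_embeddings, patch_embeddings, patch_indices):
    """Replace placeholder token embeddings with image patch embeddings.

    patch_indices[i] is -1 for a text token, otherwise an index into
    patch_embeddings (all images of the sample, concatenated).
    """
    if len(word_embeddings) != len(patch_indices):
        raise ValueError("one index is required per token")
    output = list(word_embeddings)
    for position, index in enumerate(patch_indices):
        if index >= 0:
            if index >= len(patch_embeddings):
                raise ValueError("patch index out of range")
            output[position] = patch_embeddings[index]
    return output


# Two images in one sample: the first has two patches, the second has one.
patches, offsets = concat_image_patches([["A0", "A1"], ["B0"]])
tokens = ["<img>", "<img>", "text1", "<img>", "text2"]
indices = [0, 1, -1, offsets[1], -1]
merged = scatter_patch_embeddings(tokens, patches, indices)
```

The design choice here is that the indices run over the concatenated patch list, so the scatter step itself does not need to know how many images a sample contains.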
