SpaceVLMs
Collection
Features VLMs fine-tuned for enhanced spatial reasoning using a synthetic data pipeline similar to SpatialVLM.
SpaceMantis fine-tunes Mantis-8B-siglip-llama3 for enhanced spatial reasoning.
It is a LoRA fine-tune on the spacellava dataset, which was built with VQASynth to enhance spatial reasoning as in SpatialVLM.
This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM and enhance the spatial reasoning of multimodal models. With a pipeline of expert models, we can infer spatial relationships between objects in a scene and use them to create a VQA dataset for spatial reasoning.
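The last step of such a pipeline, turning inferred object positions into question-answer pairs, can be sketched roughly as follows. This is a minimal illustration and not the VQASynth implementation: it assumes the expert models (not shown) have already produced a label and a 3D centroid in metres for each object, and the names `SceneObject`, `distance_qa`, and `left_right_qa` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SceneObject:
    """Hypothetical output of the upstream expert models for one object."""
    label: str
    centroid_m: Tuple[float, float, float]  # (x, y, z) in metres, camera frame


def distance_qa(a: SceneObject, b: SceneObject) -> dict:
    """Build one distance question-answer pair from two object centroids."""
    d = math.dist(a.centroid_m, b.centroid_m)  # Euclidean distance
    return {
        "question": f"How far is the {a.label} from the {b.label}?",
        "answer": f"The {a.label} is about {d:.2f} meters from the {b.label}.",
        "distance_m": d,
    }


def left_right_qa(a: SceneObject, b: SceneObject) -> Optional[dict]:
    """Build a left/right pair, assuming x grows to the right of the image.

    Returns None when both objects share the same x coordinate.
    """
    ax, bx = a.centroid_m[0], b.centroid_m[0]
    if ax == bx:
        return None
    side = "left" if ax < bx else "right"
    return {
        "question": f"Is the {a.label} to the left or right of the {b.label}?",
        "answer": f"The {a.label} is to the {side} of the {b.label}.",
    }


chair = SceneObject("chair", (0.0, 0.0, 2.0))
table = SceneObject("table", (3.0, 0.0, 6.0))
samples = [distance_qa(chair, table), left_right_qa(chair, table)]
```

Each resulting dictionary would then be paired with the source image to form one training example. The templates here are placeholders, and a real dataset would likely vary the phrasing and units.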
@article{chen2024spatialvlm,
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
journal = {arXiv preprint arXiv:2401.12168},
year = {2024},
url = {https://arxiv.org/abs/2401.12168},
}
@article{jiang2024mantis,
title = {MANTIS: Interleaved Multi-Image Instruction Tuning},
author = {Jiang, Dongfu and He, Xuan and Zeng, Huaye and Wei, Cong and Ku, Max and Liu, Qian and Chen, Wenhu},
journal = {arXiv preprint arXiv:2405.01483},
year = {2024},
}