wchai/AuroraCap-7B-VID-xtuner

Resources

Features

AuroraCap is a multimodal large language model for image and video captioning.

Quick Start

See Docs.

FAQ

Q: Can I only use token merging during inference?

A: No. Our experiments show that token merging also accelerates training while maintaining similar performance. Besides AuroraCap, you can apply token merging to other LLaVA-like models.
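
To make the idea concrete, here is a minimal sketch of one token-merging step in the style of bipartite soft matching (as popularized by ToMe). It is an illustration under stated assumptions, not AuroraCap's actual implementation: the function name `merge_tokens`, the alternating split into two sets, and the unweighted averaging are all hypothetical choices made for brevity.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge up to r tokens into their most similar partners.

    tokens: array of shape (n, d), one row per visual token.
    Returns an array of shape (n - r_eff, d), where r_eff <= r.
    Hypothetical helper for illustration only.
    """
    tokens = np.asarray(tokens, dtype=float)
    a, b = tokens[0::2], tokens[1::2]      # split tokens into two alternating sets
    r = min(r, a.shape[0])                 # cannot merge more tokens than set A holds
    if r <= 0 or b.shape[0] == 0:
        return tokens.copy()

    # Cosine similarity between every token in A and every token in B.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T                      # shape (len(a), len(b))

    best_b = sim.argmax(axis=1)            # most similar B token for each A token
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim)          # A tokens, most similar first
    merged_idx = order[:r]                 # these A tokens are merged away
    kept_idx = np.sort(order[r:])          # these A tokens are kept unchanged

    # Average each merged A token into its B partner (unweighted mean per group).
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged_idx], a[merged_idx])
    np.add.at(counts, best_b[merged_idx], 1)
    b_out = sums / counts[:, None]

    # Note: this sketch returns kept A tokens followed by B tokens,
    # so the original token order is not preserved.
    return np.concatenate([a[kept_idx], b_out], axis=0)
```

Because the step only reduces the number of tokens passed to later layers, the same function could in principle be applied in a forward pass during training as well as at inference, which is the point the answer above makes.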

Q: Why do we provide both official LLaVA-format and Xtuner-format weights for AuroraCap?

A: Xtuner can save checkpoints in multiple formats, but it currently supports continued training only from the Xtuner format. We therefore provide the model in the Xtuner format for both continued training and inference. In the future, we will also provide the model in the official LLaVA format for both training and inference, enabling quicker SGLang deployment and integration with the transformers library.

Citation


Evaluation results

All scores are self-reported.

Benchmark   Metric     Score
VDC         VDCScore   38.210
VDC         VDD        48.330
VDC         CIDEr       9.510
VDC         BLEU@1     30.900
VDC         BLEU@4      4.060
VDC         METEOR     19.090
VDC         ROUGE-L    21.580
MSR-VTT     CIDEr      33.100
MSR-VTT     BLEU@1     58.600
MSR-VTT     BLEU@4     21.000
