---
license: apache-2.0
datasets:
- OpenFace-CQUPT/FaceCaption-15M
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: image-to-text
---

# Demonstration of Cross-modal Retrieval (FLIP-based model)

# FLIP (Facial Language Image Pretraining)

This repository is the official implementation of [FaceCaption-15M](https://arxiv.org/abs/2407.08515).

# Updates

**[24/07/20] Usage instructions for FLIP have been released: [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**

**[24/07/17] The FLIP model has been released: [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**

**Overview of the FLIP architecture.**

![image-20240318101027127](https://img.yutangli.net/img/202403181010116.png)

**Fig.1: (a) The same color represents shared parameters; "12x" stands for 12-layer transformer modules. (b), (c) and (d) The FLIP-based model applied to the tasks of text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**

## Training

Coming soon. (The training code is only meaningful once the dataset has been published.)

```shell
python pretrain.py > log.log
```

## Pre-trained Models

We provide pretrained model weights:

- FLIP Base: available [here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)
- FLIP Large: coming soon

## Datasets

Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).

## Results

### Task 1: Text-Image Retrieval

**Table 1:** Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned (see the linear-probe sketch at the end of this card). † denotes a model pretrained on the LAION-Face [86] dataset; * denotes a model pretrained on the FaceCaption dataset constructed without LLM text generation.

![](https://img.yutangli.net/img/202403181015142.png)

### Task 2: Facial Attribute Prediction

**Table 2:** Comparison with other classical models. † denotes a model pretrained on the original LAION-Face dataset.

![image-20240318101126897](https://img.yutangli.net/img/202403181011115.png)

### Task 3: Sketch-Less Facial Image Retrieval

**Table 3:** Comparative results with different baseline methods. † denotes a model pretrained on the LAION-Face dataset.

![image-20240318101633671](https://img.yutangli.net/img/202403181016876.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/snd-9JBKJnRuZpm0Wp38f.png)

**Fig.2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo within the top-5 list from a partial sketch; the proposed FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.**

## Contacts

Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com

## Citation

```tex
@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image-Text Dataset},
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515},
}
```
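
## Linear-Probe Evaluation (Sketch)

The retrieval and attribute-prediction results above (Tables 1 and 2) are obtained with the pretrained backbone frozen and only a linear layer fine-tuned. The snippet below is a minimal sketch of that linear-probe protocol, not the authors' evaluation code: the actual FLIP image encoder and its loading routine are provided in [FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo), so a torchvision ResNet-50 stands in for the backbone here, and the 40-attribute head is a CelebA-style assumption.

```python
# Minimal linear-probe sketch: freeze the backbone, train only a linear head.
# NOTE: ResNet-50 is a stand-in; substitute the FLIP image encoder loaded via
# the FLIP-demo code. The 40 attribute classes are a CelebA-style assumption.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=None)   # stand-in for the FLIP image encoder
backbone.fc = nn.Identity()                # expose 2048-d pooled features
for p in backbone.parameters():            # freeze every backbone parameter
    p.requires_grad = False
backbone.eval()

num_attributes = 40                        # assumed CelebA-style attribute count
head = nn.Linear(2048, num_attributes)     # the only trainable module

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()         # multi-label attribute prediction

def train_step(images, labels):
    """One linear-probe update: features come from the frozen backbone."""
    with torch.no_grad():
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch just to show the expected shapes.
loss = train_step(torch.randn(8, 3, 224, 224),
                  torch.randint(0, 2, (8, num_attributes)).float())
print(f"linear-probe loss: {loss:.4f}")
```

Because gradients never reach the frozen backbone, only the small linear head is updated, which keeps the comparison focused on the quality of the pretrained representations themselves.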