---
license: apache-2.0
datasets:
- OpenFace-CQUPT/FaceCaption-15M
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: image-to-text
---

# Demonstration of Cross-modal Retrieval (FLIP-based model)

# FLIP (Facial Language Image Pretraining)

This repository is the official implementation of [FaceCaption-15M](https://arxiv.org/abs/2407.08515).

# Updates

**[24/07/20] Usage instructions for FLIP have been released: [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**

**[24/07/17] The FLIP model has been released: [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**

**Overview of the FLIP architecture.**

![image-20240318101027127](https://img.yutangli.net/img/202403181010116.png)

**Fig.1: (a) The same color represents shared parameters; "12x" stands for 12-layer transformer modules. (b), (c) and (d) The FLIP-based model applied to the tasks of text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**

## Training

Coming soon. (The training code is only meaningful once the dataset has been published.)

```shell
python pretrain.py > log.log
```

## Pre-trained Models

We provide pretrained model weights:

- FLIP Base: available [here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)
- FLIP Large: coming soon

## Datasets

Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).

## Results

### Task 1: Text-Image Retrieval

**Table 1:** Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned (see the linear-probe sketch at the end of this card). † denotes a model pretrained on the LAION-Face [86] dataset; * denotes a model pretrained on the FaceCaption dataset constructed without LLM text generation.

![](https://img.yutangli.net/img/202403181015142.png)

### Task 2: Facial Attribute Prediction

**Table 2:** Comparison with other classical models. † denotes a model pretrained on the original LAION-Face dataset.

![image-20240318101126897](https://img.yutangli.net/img/202403181011115.png)

### Task 3: Sketch-Less Facial Image Retrieval

**Table 3:** Comparative results with different baseline methods. † denotes a model pretrained on the LAION-Face dataset.

![image-20240318101633671](https://img.yutangli.net/img/202403181016876.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/snd-9JBKJnRuZpm0Wp38f.png)

**Fig.2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo within the top-5 list from a partial sketch; the proposed FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.**

## Contacts

Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com

## Citation

```tex
@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image-Text Dataset},
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515},
}
```
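
## Linear-Probe Evaluation (Sketch)

The retrieval and attribute-prediction results above (Tables 1 and 2) are obtained with the pretrained backbone frozen and only a linear layer fine-tuned. The snippet below is a minimal sketch of that linear-probe protocol, not the authors' evaluation code: the actual FLIP image encoder and its loading routine are provided in [FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo), so a torchvision ResNet-50 stands in for the backbone here, and the 40-attribute head is a CelebA-style assumption.

```python
# Minimal linear-probe sketch: freeze the backbone, train only a linear head.
# NOTE: ResNet-50 is a stand-in; substitute the FLIP image encoder loaded via
# the FLIP-demo code. The 40 attribute classes are a CelebA-style assumption.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=None)   # stand-in for the FLIP image encoder
backbone.fc = nn.Identity()                # expose 2048-d pooled features
for p in backbone.parameters():            # freeze every backbone parameter
    p.requires_grad = False
backbone.eval()

num_attributes = 40                        # assumed CelebA-style attribute count
head = nn.Linear(2048, num_attributes)     # the only trainable module

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()         # multi-label attribute prediction

def train_step(images, labels):
    """One linear-probe update: features come from the frozen backbone."""
    with torch.no_grad():
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch just to show the expected shapes.
loss = train_step(torch.randn(8, 3, 224, 224),
                  torch.randint(0, 2, (8, num_attributes)).float())
print(f"linear-probe loss: {loss:.4f}")
```

Because gradients never reach the frozen backbone, only the small linear head is updated, which keeps the comparison focused on the quality of the pretrained representations themselves.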