gte-multilingual-mlm-base

We introduce the mGTE series: new generalized text encoder, embedding, and reranking models that support 75 languages and context lengths of up to 8192 tokens. The models are built on a Transformer++ encoder backbone (BERT + RoPE + GLU; for the code, see Alibaba-NLP/new-impl) and use the XLM-R vocabulary.

This text encoder (mGTE-MLM-8192 in our paper) outperforms the previous state of the art at a comparable size, XLM-R-base, on both GLUE (83.47 vs. 80.44) and XTREME-R (64.44 vs. 62.02).
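For orientation, here is a minimal sketch of querying the checkpoint for masked-token predictions with the Hugging Face transformers library. It assumes the Hub repository id Alibaba-NLP/gte-multilingual-mlm-base and assumes that `trust_remote_code=True` is required because the backbone relies on custom modeling code. The helper name `fill_mask` and the `[MASK]` placeholder convention are hypothetical and exist only for this illustration.

```python
MODEL_ID = "Alibaba-NLP/gte-multilingual-mlm-base"  # Hub repository id (assumed)
PLACEHOLDER = "[MASK]"  # hypothetical placeholder, swapped for the tokenizer's real mask token


def fill_mask(template: str, top_k: int = 5):
    """Return the top_k predictions for the single placeholder in `template`."""
    # Imported lazily so that defining the helper does not require transformers/torch.
    from transformers import pipeline

    # Assumption: the custom backbone code requires trust_remote_code=True.
    unmasker = pipeline("fill-mask", model=MODEL_ID, trust_remote_code=True)
    text = template.replace(PLACEHOLDER, unmasker.tokenizer.mask_token)
    return unmasker(text, top_k=top_k)
```

Calling `fill_mask("Paris is the [MASK] of France.")` downloads the checkpoint on first use and returns a list of dictionaries, each with `token_str` and `score` entries.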

Model list

| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|---|---|---|---|---|---|
| gte-multilingual-mlm-base | Multiple | 306M | 8192 | 83.47 | 64.44 |
| gte-en-mlm-base | English | - | 8192 | 85.61 | - |
| gte-en-mlm-large | English | - | 8192 | 87.58 | - |

Training Details

Training Data

  • Masked language modeling (MLM): c4-en, mc4, skypile, Wikipedia, CulturaX, etc. (see Appendix A.1 of the paper)

Training Procedure

To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter sequences. We then resample the data to reduce the proportion of short texts and continue MLM pre-training.
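The card does not spell out how the resampling is performed. As one possible reading, the sketch below down-weights short documents when drawing a training sample. The length threshold, the down-weighting factor, and the function name are all hypothetical and are not taken from the paper.

```python
import random


def resample_favoring_long(docs, lengths, n, short_threshold=2048, short_weight=0.25, seed=0):
    """Draw n documents with replacement, down-weighting those shorter than short_threshold.

    short_threshold and short_weight are illustrative values, not taken from the paper.
    """
    weights = [short_weight if length < short_threshold else 1.0 for length in lengths]
    rng = random.Random(seed)
    return rng.choices(docs, weights=weights, k=n)


# Toy corpus: half short, half long (lengths in tokens).
lengths = [256] * 500 + [6000] * 500
docs = list(range(len(lengths)))
sample = resample_favoring_long(docs, lengths, n=10_000)
short_share = sum(lengths[d] < 2048 for d in sample) / len(sample)
print(f"short-text share after resampling: {short_share:.2f}")
```

With these toy weights, the expected share of short texts drops from 0.50 in the corpus to 0.20 in the resampled data.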

The entire training process is as follows:

  • MLM-2048: lr 2e-4, mlm_probability 0.3, batch_size 8192, num_steps 250k, rope_base 10000
  • MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 2048, num_steps 30k, rope_base 160000
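To make the two stages easier to compare, the sketch below records the listed hyperparameters and derives two quantities from them: the number of sequences processed per stage (batch_size × num_steps, assuming batch_size counts sequences) and the longest RoPE wavelength implied by each rope_base under the standard RoPE frequency formula. The per-head dimension of 64 is an assumption, since the card does not state it, and the `max_len` values are inferred from the stage names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_len: int  # inferred from the stage name
    lr: float
    mlm_probability: float
    batch_size: int
    num_steps: int
    rope_base: int

    def sequences_seen(self) -> int:
        # Assumes batch_size is measured in sequences.
        return self.batch_size * self.num_steps

    def longest_rope_wavelength(self, head_dim: int = 64) -> float:
        # Standard RoPE: theta_i = base ** (-2i / head_dim), i = 0 .. head_dim/2 - 1.
        # The slowest-rotating pair (largest i) has wavelength 2*pi / theta_i.
        slowest_exponent = (head_dim - 2) / head_dim
        return 2 * math.pi * self.rope_base ** slowest_exponent


STAGES = [
    Stage("MLM-2048", 2048, 2e-4, 0.3, 8192, 250_000, 10_000),
    Stage("MLM-8192", 8192, 5e-5, 0.3, 2048, 30_000, 160_000),
]

for s in STAGES:
    print(f"{s.name}: {s.sequences_seen():,} sequences, "
          f"longest RoPE wavelength ~ {s.longest_rope_wavelength():,.0f} positions")
```

Raising rope_base stretches the low-frequency wavelengths, which is a common way to adapt RoPE-based models to longer inputs.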

Evaluation

| Models | Language | Model Size | Max Seq. Length | GLUE | XTREME-R |
|---|---|---|---|---|---|
| gte-multilingual-mlm-base | Multiple | 306M | 8192 | 83.47 | 64.44 |
| gte-en-mlm-base | English | - | 8192 | 85.61 | - |
| gte-en-mlm-large | English | - | 8192 | 87.58 | - |
| MosaicBERT-base | English | 137M | 128 | 85.4 | - |
| MosaicBERT-base-2048 | English | 137M | 2048 | 85 | - |
| JinaBERT-base | English | 137M | 512 | 85 | - |
| nomic-bert-2048 | English | 137M | 2048 | 84 | - |
| MosaicBERT-large | English | 434M | 128 | 86.1 | - |
| JinaBERT-large | English | 434M | 512 | 83.7 | - |
| XLM-R-base | Multiple | 279M | 512 | 80.44 | 62.02 |
| RoBERTa-base | English | 125M | 512 | 86.4 | - |
| RoBERTa-large | English | 355M | 512 | 88.9 | - |

Citation

If you find our paper or models helpful, please consider citing them as follows:

@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval}, 
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669}, 
}