SapBERT-biomedical-clinical model for Spanish

Model description

A Spanish SapBERT model trained with a procedure similar to that described by Liu et al. (2020). It was trained on the Spanish data from UMLS 2023AA, using PlanTL-GOB-ES/roberta-base-biomedical-clinical-es as the base model.

Intended uses and limitations

The model produces embeddings (numerical representations) of biomedical concepts in UMLS. These embeddings can be used for semantic similarity between biomedical concepts or for entity linking, among other tasks.

How to use

The following script, adapted from the original SapBERT model, converts a list of strings (entity names) into embeddings.

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

model_name = "BSC-NLP4BIA/SapBERT-parents-from-roberta-base-biomedical-clinical-es"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device).eval()

# replace with your own list of entity names in Spanish
all_names = ["cancer de pulmón", "fiebre", "cirugía torácica"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(range(0, len(all_names), bs)):
    toks = tokenizer(all_names[i:i+bs],
                     padding="max_length",
                     max_length=25,
                     truncation=True,
                     return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, so gradients are not needed
        cls_rep = model(**toks)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)  # shape: (number of names, hidden size)
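
Once embeddings are available, semantic similarity and entity linking both reduce to comparing vectors. The sketch below is one possible approach, not the procedure used in the evaluation: it links each mention to its nearest concept by cosine similarity. The function names, the 3-dimensional toy vectors and the placeholder identifiers are all invented for illustration; in practice the arrays would come from the script above.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def link_mentions(mention_embs, concept_embs, concept_ids, top_k=1):
    """For each mention, return the top_k (concept_id, cosine similarity) pairs."""
    sims = l2_normalize(mention_embs) @ l2_normalize(concept_embs).T
    top = np.argsort(-sims, axis=1)[:, :top_k]  # indices of the most similar concepts
    return [[(concept_ids[j], float(sims[i, j])) for j in row]
            for i, row in enumerate(top)]

# Toy stand-ins for real embeddings (hypothetical values, not model output).
concept_ids = ["CUI_A", "CUI_B", "CUI_C"]  # placeholder identifiers, not real UMLS CUIs
concept_embs = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
mention_embs = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.2, 0.8]])

links = link_mentions(mention_embs, concept_embs, concept_ids)
for mention_links in links:
    print(mention_links)
```

For a large concept dictionary, an approximate nearest-neighbour index may be preferable to the dense similarity matrix computed here.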

For more details about training and evaluation, see the SapBERT GitHub repository.

Training

Training was performed using the original SapBERT training repository. The training data consisted of the Spanish entries in UMLS, together with the commercial names of drugs (although these are in English), converted to lowercase. To train the model, a set of 15 pairs of synonymous terms was generated for each UMLS concept; the lexical entries of each concept were treated as synonyms.
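
The pair-generation step can be sketched as follows. This is a minimal illustration, not the code in the SapBERT repository: `make_synonym_pairs`, the input dictionary and the toy terms are hypothetical. It also assumes that a concept with fewer than 15 possible pairs simply contributes all of them, and it lowercases every term for simplicity.

```python
import itertools
import random

def make_synonym_pairs(concept_terms, max_pairs=15, seed=0):
    """Build (concept_id, term_a, term_b) training pairs, at most max_pairs per concept.

    concept_terms maps a concept identifier to its lexical entries, which are
    treated as synonyms of each other.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    pairs = []
    for cui, terms in concept_terms.items():
        unique_terms = sorted({t.lower() for t in terms})  # lowercase and deduplicate
        combos = list(itertools.combinations(unique_terms, 2))
        if len(combos) > max_pairs:
            combos = rng.sample(combos, max_pairs)  # cap the number of pairs per concept
        pairs.extend((cui, a, b) for a, b in combos)
    return pairs

# Invented toy data with placeholder identifiers (not real UMLS content).
concept_terms = {
    "CUI_A": ["fiebre", "pirexia", "hipertermia"],
    "CUI_B": ["Aspirina"],  # a single entry yields no pairs
}
pairs = make_synonym_pairs(concept_terms)
for pair in pairs:
    print(pair)
```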

Evaluation

Evaluation results for this model are reported in:

Gallego, F., López-García, G., Gasco-Sánchez, L., Krallinger, M., & Veredas, F. J. (2024, June). ClinLinker: Medical entity linking of clinical concept mentions in Spanish. In International Conference on Computational Science (pp. 266-280). Cham: Springer Nature Switzerland.

Additional information

Author

NLP4BIA at the Barcelona Supercomputing Center

Licensing information

Apache License, Version 2.0

Citation information

@inproceedings{gallego2024clinlinker,
  title={{ClinLinker}: Medical entity linking of clinical concept mentions in {Spanish}},
  author={Gallego, Fernando and L{\'o}pez-Garc{\'\i}a, Guillermo and Gasco-S{\'a}nchez, Luis and Krallinger, Martin and Veredas, Francisco J},
  booktitle={International Conference on Computational Science},
  pages={266--280},
  year={2024},
  organization={Springer}
}

Disclaimer

The models published in this repository are intended for general-purpose use and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
