sorryhyun's picture
Update README.md
36a21d1 verified
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
license: cc-by-sa-4.0
datasets:
- klue
language:
- ko
---
๋ณธ ๋ชจ๋ธ์€ multi-task loss (MultipleNegativeLoss -> AnglELoss) ๋กœ, KlueNLI ๋ฐ KlueSTS ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•™์Šต ์ฝ”๋“œ๋Š” ๋‹ค์Œ [Github hyperlink](https://github.com/comchobo/SFT_sent_emb?tab=readme-ov-file)์—์„œ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
## Usage (Huggingface inference API)
```python
import requests
API_URL = "https://api-inference.huggingface.co/models/sorryhyun/sentence-embedding-klue-large"
headers = {"Authorization": "your_HF_token"}
def query(payload):
response = requests.post(API_URL, headers=headers, json=payload)
return response.json()
output = query({
"inputs": {
"source_sentence": "์ข‹์•„์š”, ์ถ”์ฒœ, ์•Œ๋ฆผ์„ค์ •๊นŒ์ง€",
"sentences": [
"์ข‹์•„์š” ๋ˆŒ๋Ÿฌ์ฃผ์„ธ์š”!!",
"์ข‹์•„์š”, ์ถ”์ฒœ ๋“ฑ ์œ ํˆฌ๋ฒ„๋“ค์ด ์ข‹์•„ํ•ด์š”",
"์•Œ๋ฆผ์„ค์ •์„ ๋ˆŒ๋Ÿฌ์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌ๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค."
]
},
})
if __name__ == '__main__':
print(output)
```
## Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding
import torch
from torch.utils.data import DataLoader
device = torch.device('cuda')
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
collator = DataCollatorWithPadding(tokenizer)
model = AutoModel.from_pretrained('{MODEL_NAME}').to(device)
tokenized_data = tokenizer(sentences, padding=True, truncation=True)
tokenized_data = tokenized_data.remove_columns('text')
dataloader = DataLoader(tokenized_data, batch_size=batch_size, pin_memory=True, collate_fn=collator)
all_outputs = torch.zeros((len(tokenized_data), 1024)).to(device)
start_idx = 0
# I used mean-pool method for sentence representation
with torch.no_grad():
for inputs in tqdm(dataloader):
inputs = {k: v.to(device) for k, v in inputs.items()}
representations, _ = self.model(**inputs, return_dict=False)
attention_mask = inputs["attention_mask"]
input_mask_expanded = (attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype))
summed = torch.sum(representations * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
end_idx = start_idx + representations.shape[0]
all_outputs[start_idx:end_idx] = (summed / sum_mask)
start_idx = end_idx
```
## Evaluation Results
| Organization | Backbone Model | KlueSTS average | KorSTS average |
| -------- | ------- | ------- | ------- |
| team-lucid | DeBERTa-base | 54.15 | 29.72 |
| monologg | Electra-base | 66.97 | 40.98 |
| LMkor | Electra-base | 70.98 | 43.09 |
| deliciouscat | DeBERTa-base | - | 67.65 |
| BM-K | Roberta-base | 82.93 | **85.77** |
| Klue | Roberta-large | **86.71** | 71.70 |
| Klue (Hyperparameter searched) | Roberta-large | 86.21 | 75.54 |
๊ธฐ์กด ํ•œ๊ตญ์–ด ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์€ mnli, snli ๋“ฑ ์˜์–ด ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๊ณ„๋ฒˆ์—ญํ•˜์—ฌ ํ•™์Šต๋œ ์ ์„ ์ฐธ๊ณ ์‚ผ์•„ Klue ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋Œ€์‹  ํ•™์Šตํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.
๊ทธ ๊ฒฐ๊ณผ, Klue-Roberta-large ๋ชจ๋ธ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ–ˆ์„ ๊ฒฝ์šฐ KlueSTS ๋ฐ KorSTS ํ…Œ์ŠคํŠธ์…‹์— ๋ชจ๋‘์— ๋Œ€ํ•ด ์ค€์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ, ์ข€ ๋” elaborateํ•œ representation์„ ํ˜•์„ฑํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‚ฌ๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค.
๋‹ค๋งŒ ํ‰๊ฐ€ ์ˆ˜์น˜๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ธํŒ…, ์‹œ๋“œ ๋„˜๋ฒ„ ๋“ฑ์œผ๋กœ ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ฐธ๊ณ ํ•˜์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.
## Training
NegativeRank loss -> simcse loss ๋กœ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.