---
language:
- ru

tags:
- toxic comments classification

licenses:
- cc-by-nc-sa
---

BERT-based classifier (fine-tuned from [Conversational RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational)) trained on a merge of the Russian Language Toxic Comments [dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments/metadata) collected from 2ch.hk and the Toxic Russian Comments [dataset](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments) collected from ok.ru.

The datasets were merged, shuffled, and split into train, dev, and test sets in an 80/10/10 proportion.
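
The shuffle-and-split step can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `split_80_10_10`, the fixed seed, and the toy `merged` list are hypothetical and are not taken from the original training code.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle a copy of `samples` and cut it into train/dev/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10      # 80% for training
    n_dev = len(shuffled) // 10            # 10% for development
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remaining ~10% for testing
    return train, dev, test


# Toy stand-in for the merged corpus: (text, label) pairs (hypothetical data).
merged = [(f"comment {i}", i % 2) for i in range(1000)]
train, dev, test = split_80_10_10(merged)
print(len(train), len(dev), len(test))
```
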
The metrics obtained on the test set are as follows:

|              | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
|       0      |    0.98   |  0.99  |   0.98   | 21384   |
|       1      |    0.94   |  0.92  |   0.93   | 4886    |
|   accuracy   |           |        |   0.97   | 26270   |
| macro avg    | 0.96      | 0.96   | 0.96     | 26270   |
| weighted avg | 0.97      | 0.97   | 0.97     | 26270   |

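The two average rows can be approximately re-derived from the per-class rows. The sketch below assumes the table follows the usual convention (as in scikit-learn's `classification_report`, which the source does not name explicitly): the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, the results match the table only to within about 0.01.

```python
# Per-class rows copied from the table: (precision, recall, f1, support).
rows = {
    0: (0.98, 0.99, 0.98, 21384),
    1: (0.94, 0.92, 0.93, 4886),
}

total_support = sum(r[3] for r in rows.values())


def macro(idx):
    """Unweighted mean of one metric column over the classes."""
    return sum(r[idx] for r in rows.values()) / len(rows)


def weighted(idx):
    """Support-weighted mean of one metric column over the classes."""
    return sum(r[idx] * r[3] for r in rows.values()) / total_support


macro_avg = [macro(i) for i in range(3)]        # precision, recall, f1
weighted_avg = [weighted(i) for i in range(3)]  # precision, recall, f1

print(total_support)
print([round(v, 3) for v in macro_avg])
print([round(v, 3) for v in weighted_avg])
```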

## How to use
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('SkolkovoInstitute/russian_toxicity_classifier')

# prepare the input ('ты супер' means "you are great")
batch = tokenizer.encode('ты супер', return_tensors='pt')

# inference: gradients are not needed, so disable them
with torch.no_grad():
    logits = model(batch).logits  # raw, unnormalized scores per class

print(logits)
```
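
The model call returns raw logits rather than probabilities. A minimal sketch of the post-processing, written in plain Python so it runs without downloading the model: the logit values below are made up for illustration, and the class order `[0, 1]` is assumed to match the rows of the metrics table (the card does not state which index is the toxic class).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one comment, ordered as [class 0, class 1].
example_logits = [2.0, -1.0]

probs = softmax(example_logits)
# Index of the most probable class.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted_class)
```

With real model output, the same step could be done on the tensor directly, for example with `torch.softmax(logits, dim=-1)`.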


## Licensing Information

[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png