---
language:
- ru
license: apache-2.0
---

# Russian text (number) normalization

A fine-tuned version of [FRED-T5 large 820M](https://huggingface.co/ai-forever/FRED-T5-large).

The code is available in the [repo](https://github.com/saarus72/text_normalization/tree/dev).

Trained on sentences from [ficbook](https://huggingface.co/datasets/IlyaGusev/ficbook), [librusec](https://huggingface.co/datasets/IlyaGusev/librusec) and [pikabu](https://huggingface.co/datasets/IlyaGusev/pikabu), inverse text normalized with [NeMo Text Processing](https://github.com/NVIDIA/NeMo-text-processing). So far the model has been trained only for number normalization.

## Usage
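
In the example below, every span to normalize is wrapped in square brackets and followed by a sentinel token `<extra_id_N>`, and the model answers with the replacements listed after the matching sentinels. A minimal sketch of helpers for building such a prompt and merging the answer back might look like this. The names `build_prompt` and `fill_replacements` and the digit-matching regex are assumptions for illustration, not part of the repository; the regex only catches plain digit runs, space-separated groups and ranges, so spans with units such as `142 990 руб` would still need to be marked by hand.

```python
import re

# Hypothetical helpers, not part of the model repository.
# Assumption: a span to normalize is a run of digits, optionally continued
# by space-separated groups ("142 990") or a range ("2-3").
SPAN_RE = re.compile(r'\d+(?:[ -]\d+)*')
PREFIX = '<SC1>'


def build_prompt(text: str) -> str:
    """Wrap each numeric span in brackets and append a numbered sentinel token."""
    counter = 0

    def wrap(match):
        nonlocal counter
        wrapped = f'[{match.group(0)}]<extra_id_{counter}>'
        counter += 1
        return wrapped

    return PREFIX + SPAN_RE.sub(wrap, text)


def fill_replacements(prompt: str, decoded: str) -> str:
    """Substitute the replacements from the decoded model output into the prompt."""
    # Collect "sentinel index -> replacement text" pairs from the decoded output.
    pairs = re.findall(
        r'<extra_id_(\d+)>(.*?)(?=<extra_id_\d+>|</s>|$)', decoded, flags=re.S
    )
    replacements = {int(index): value.strip() for index, value in pairs}
    text = prompt[len(PREFIX):] if prompt.startswith(PREFIX) else prompt
    # Replace every "[span]<extra_id_N>" with its replacement, if one was produced.
    return re.sub(
        r'\[[^\]]*\]<extra_id_(\d+)>',
        lambda m: replacements.get(int(m.group(1)), m.group(0)),
        text,
    )
```

Here `build_prompt('Было у отца 3 сына')` yields `<SC1>Было у отца [3]<extra_id_0> сына`, and `fill_replacements` takes that prompt together with the decoded model output and returns the plain sentence with the bracketed spans replaced.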

```python
import torch
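from transformers import GPT2Tokenizer, T5ForConditionalGeneration


# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'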
tokenizer = GPT2Tokenizer.from_pretrained('saarus72/russian_text_normalizer', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('saarus72/russian_text_normalizer').to(device)
# Each span to normalize is wrapped in [] and followed by a sentinel token <extra_id_N>.
lm_text = '<SC1>Было у отца [3]<extra_id_0> сына, но не было даже [2-3]<extra_id_1> пиджаков с блёстками за [142 990 руб]<extra_id_2>.'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
# Skip the first generated token (the decoder start token) when decoding.
print(tokenizer.decode(outputs[0][1:]))
# <extra_id_0>  три  <extra_id_1>  двух-трех  <extra_id_2>  сто сорок две тысячи девятьсот рублей </s>