---
language:
- ru
license: apache-2.0
---

# Russian text (number) normalization

A fine-tuned version of [FRED-T5 large 820M](https://huggingface.co/ai-forever/FRED-T5-large). The code is in the [repo](https://github.com/saarus72/text_normalization/tree/dev).

Trained on sentences from [ficbook](https://huggingface.co/datasets/IlyaGusev/ficbook), [librusec](https://huggingface.co/datasets/IlyaGusev/librusec) and [pikabu](https://huggingface.co/datasets/IlyaGusev/pikabu) that were inverse text normalized using [NeMo Text Processing](https://github.com/NVIDIA/NeMo-text-processing).

So far the model has been trained only on number normalization.

## Usage

```python
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# '</s>' is the end-of-sequence token used by FRED-T5 tokenizers.
tokenizer = GPT2Tokenizer.from_pretrained('saarus72/russian_text_normalizer', eos_token='</s>')
model = T5ForConditionalGeneration.from_pretrained('saarus72/russian_text_normalizer').to(device)

# Spans to be normalized are wrapped in square brackets.
lm_text = 'Было у отца [3] сына, но не было даже [2-3] пиджаков с блёстками за [142 990 руб].'
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to(device)
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)

# Skip the first generated token (the decoder start token) when decoding.
print(tokenizer.decode(outputs[0][1:]))
# три двух-трех сто сорок две тысячи девятьсот рублей
```
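The usage example marks each span to normalize by hand with square brackets. Below is a minimal sketch of automating that step for plain digits, digit ranges and space-grouped thousands. `NUMBER_SPAN` and `mark_numbers` are hypothetical names, not part of the model's repository, and it is an assumption that bracket wrapping alone is enough preprocessing. Unlike the hand-marked example, this sketch leaves units such as `руб` outside the brackets.

```python
import re

# Hypothetical helper, not part of the model's repository.
# Matches a run of digits, optionally followed by space-separated
# three-digit groups ("142 990") and an optional range part ("2-3").
NUMBER_SPAN = re.compile(r'\d+(?: \d{3}(?!\d))*(?:-\d+)?')


def mark_numbers(text: str) -> str:
    """Wrap every numeric span in square brackets."""
    return NUMBER_SPAN.sub(lambda m: f'[{m.group(0)}]', text)


if __name__ == '__main__':
    raw = 'Было у отца 3 сына, но не было даже 2-3 пиджаков с блёстками за 142 990 руб.'
    print(mark_numbers(raw))
```

The result can be passed to the tokenizer as `lm_text`. Extending the pattern to pull currency or measurement units into the brackets is possible but depends on which units the model was trained on.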