---
license: mit
language:
- ru
library_name: transformers
tags:
- text-generation-inference
---
# Model Card for maximxls/text-normalization-ru-terrible

Text normalization for Russian. I couldn't find any existing solutions (besides rule-based algorithms, which I don't like), so I made this.

## Model Details

### Model Description

A tiny T5 trained from scratch for normalizing Russian text:
- translating numbers into words
- expanding abbreviations into phonetic letter combinations
- transliterating English into Russian letters
- whatever else was in the dataset (see below)

### Model Sources

- **Training code repository:** https://github.com/maximxlss/text_normalization
- **Main dataset:** https://www.kaggle.com/c/text-normalization-challenge-russian-language

## Uses

Useful in TTS, for example with Silero, to make it read numbers and English words (even if not perfectly, at least they aren't ignored; see the sketch after the Quick Start below).

### Quick Start

```python
from transformers import (
    T5ForConditionalGeneration,
    PreTrainedTokenizerFast,
)

model_path = "maximxls/text-normalization-ru-terrible"

# Load the tokenizer and model from the Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

example_text = "Я ходил в McDonald's 10 июля 2022 года."

# Tokenize, generate the normalized text, and decode it.
inp_ids = tokenizer(
    example_text,
    return_tensors="pt",
).input_ids
out_ids = model.generate(inp_ids, max_new_tokens=128)[0]
out = tokenizer.decode(out_ids, skip_special_tokens=True)

print(out)
```

`я ходил в макдоналд'эс десятого июля две тысячи двадцать второго года.`
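
Feeding that into Silero TTS, for example (a hedged sketch continuing from the Quick Start; the torch.hub interface follows the `snakers4/silero-models` README, and the model id, speaker, and sample rate here are illustrative):

```python
import torch

# `out` is the normalized text from the Quick Start above.
# Load a Russian Silero TTS model via torch.hub (see the
# snakers4/silero-models README; model id and speaker are illustrative).
tts_model, _ = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_tts",
    language="ru",
    speaker="v4_ru",
)

audio = tts_model.apply_tts(
    text=out,  # normalized text, so numbers and English get read properly
    speaker="xenia",
    sample_rate=48000,
)

# `audio` is a 1-D tensor of samples; save it e.g. with torchaudio:
# import torchaudio
# torchaudio.save("tts.wav", audio.unsqueeze(0), 48000)
```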

## Bias, Risks, and Limitations

**Very much unreliable:**
- For some reason, it sometimes skips over the first couple of tokens. It might be beneficial to add some extra padding so it's more stable (see the sketch below); I wasn't able to solve this in training.
- It's sometimes unstable, repeating or dropping words (especially when transliterating).
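
A minimal, untested sketch of that padding workaround, continuing from the Quick Start snippet (the filler word and the stripping logic are illustrative assumptions, not something I've verified):

```python
# Untested workaround sketch: prepend a throwaway filler word so that any
# dropped leading tokens hit the filler instead of the real text.
# `tokenizer`, `model` and `example_text` are from the Quick Start.
filler = "и "  # hypothetical filler; any short, stable word might do

inp_ids = tokenizer(filler + example_text, return_tensors="pt").input_ids
out_ids = model.generate(inp_ids, max_new_tokens=128)[0]
out = tokenizer.decode(out_ids, skip_special_tokens=True)

# Strip the filler back off if the model kept it.
if out.startswith(filler.strip()):
    out = out[len(filler.strip()):].lstrip()
print(out)
```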

## Training Details

### Training Data

Data from [this Kaggle challenge](https://www.kaggle.com/c/text-normalization-challenge-russian-language) (761,435 sentences), as well as a bit of extra data written by me.

### Training Procedure 

#### Preprocessing

See [`preprocessing.py`](https://github.com/maximxlss/text_normalization/blob/master/preprocess.py)

#### Training Hyperparameters

See [`train.py`](https://github.com/maximxlss/text_normalization/blob/master/train.py)

I manually reset the learning rate several times during training; see the metrics.
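
For reference, a minimal sketch of what such a manual reset looks like with a plain PyTorch optimizer (not the exact training code; see `train.py` for that, and the values here are illustrative):

```python
import torch

# Hypothetical setup just to make the snippet self-contained; in practice
# this is the optimizer already driving the training loop.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4)

# The reset itself: overwrite the lr of every parameter group mid-training.
new_lr = 1e-4  # illustrative; the real values show up in the metrics
for param_group in optimizer.param_groups:
    param_group["lr"] = new_lr
```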

#### Details

See [`README` on github](https://github.com/maximxlss/text_normalization) for a step-by-step overview of the training procedure.

## Technical Specifications

### Hardware

A few tens of hours of RTX 3090 Ti compute on my personal PC (21.65 epochs).